Scaling Performance of DNN Inference in FPGA clusters using InAccel orchestrator

Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference.

A great article by Xilinx Research Labs shows how to maximize the performance and scalability of FPGA-based pipeline dataflow DNN inference accelerators (DFAs) automatically on computing infrastructures consisting of multi-die, network-connected FPGAs. Xilinx as developed Elastic-DF, a novel resource partitioning tool which integrates with the DNN compiler FINN and utilizes 100Gbps Ethernet FPGA infrastructure, to achieve low-latency model-parallel inference without host involvement. Elastic-DF was applied to popular image classifiers ResNet50 and MobileNetV1 at XACC and provides significant throughput increase with no adverse impact on latency.

The paper, entitled “Elastic-DF : Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning” demonstrates that model-parallel multi-FPGA execution enables superlinear scaling of DFA performance in multi-FPGA systems, for both MobileNetV1 and ResNet-50. It also shows how constraints to Elastic-DF allows to implement software-transparent model parallelism (TMP) and also analyze how TMP compares with other forms of MP.

The MLPerf evaluation on multi-FPGA accelerators has been done using the InAccel Coral orchestrator and resource manager.

Coral is a fast and general-purpose accelerator orchestrator. It provides high-level APIs in C/C++, Java and Python which enable a user (the client) to make acceleration queries to a unified execution engine (the server) that supports every heterogeneous multi-accelerator platform. InAccel also
provides a runtime specification that vendors can use to advertise system hardware resources to Coral.

It aims to specify the configuration, and execution interface for the efficient management of any accelerator-based hardware resource (e.g. FPGAs), without customizing the code for Coral itself. In Coral, client applications call an accelerator on a local FPGA as if it were a software function, making it easier for the user to accelerate distributed applications and services. Coral users define a platform, i.e. the accelerators that can be called remotely with their arguments and configuration parameters. Coral clients and servers can run and communicate with each other in a variety of environments — from bare-metal Linux servers to containers inside a Kubernetes cluster — and applications can be written in any of Coral API’s supported languages.

The key concepts of the InAccel runtime specification are the following: resource, memory, buffer and compute unit. InAccel offers a default OpenCL-based Xilinx FPGA runtime implementation, where each resource object represents a single FPGA device with a list of memories and compute
units respectively, entities that is dynamically updated upon reconfiguration of the FPGA. Lastly, buffer objects belong to a specific memory inside the resource context, which corresponds to a physical memory region on the FPGA board’s DDR memories.

Dual-FPGA abstraction for Coral: To support multi-FPGA accelerators, we developed a custom InAccel runtime, enabling initialization-time configuration of the VNx and run-time synchronization between accelerator segments where applicable (HwMP, SwMP). In this new abstraction the resource object hosts the context of two devices (e.g. 2 Alveo U280), instead of one, but also a unified memory topology is announced to Coral. Each bitstream artifact is now a tarball that includes two named binaries, while the kernel metadata contain directives that indicate in which binary they reside.
By abstracting away all the lower level details, our Dual-FPGA Coral runtime is able to transform any dual-FPGA cluster to a single pool of accelerators. This enables us to run existing InAccel MLPerf test harnesses against our dual-FPGA accelerators with no changes to application code.
This runtime can be extended in principle to any number of FPGAs.

You can find the paper in this link.

Applications Acceleration instantly