Faster Inference: Real benchmarks on GPUs and FPGAs

Inference refers to the process of using a trained machine learning algorithm to make a prediction. After a neural network is trained, it is deployed to run inference — to classify, recognize, and process new inputs.

The performance of inference is critical to many applications. Each application has its unique requirements in terms of throughput (fps), latency (ms), energy efficiency (fps/watt) and cost efficiency (fps/$). This is why there are many available options when it comes to the ideal hardware platform.
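The four metrics above are tied together by simple arithmetic. The following is a minimal sketch of how they could be derived from raw measurements of one benchmark run. The class, its field names and the interpretation of fps/$ as throughput divided by an hourly price are all assumptions made for illustration, not part of any benchmark suite.

```python
from dataclasses import dataclass


@dataclass
class InferenceRun:
    """Raw measurements from one benchmark run (hypothetical field names)."""
    images: int              # number of inputs processed
    seconds: float           # wall-clock time for the whole run
    batch_size: int          # inputs per inference call
    avg_power_watts: float   # mean device power during the run
    price_per_hour: float    # device or instance price in dollars per hour

    def throughput_fps(self) -> float:
        return self.images / self.seconds

    def batch_latency_ms(self) -> float:
        # Mean time per batch, assuming batches run back to back without overlap.
        batches = self.images / self.batch_size
        return 1000.0 * self.seconds / batches

    def fps_per_watt(self) -> float:
        return self.throughput_fps() / self.avg_power_watts

    def fps_per_dollar_hour(self) -> float:
        # Assumption: cost efficiency is throughput divided by the hourly price.
        return self.throughput_fps() / self.price_per_hour
```

With made-up inputs such as InferenceRun(images=1000, seconds=2.0, batch_size=100, avg_power_watts=50.0, price_per_hour=2.0), this gives 500 fps, 200 ms per batch, 10 fps/watt and 250 fps per dollar-hour.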

Each vendor advertises its own benchmarks, and some of these present ideal cases that are of little practical use for identifying the most efficient platform. MLPerf is a great initiative that tries to compare the available platforms. However, the setup is still complex and there are different parameters that need to be taken into account.

In this article we present a realistic and practical benchmark of inference performance (a.k.a. real throughput) on two widely used platforms: GPUs and FPGAs.


GPUs are specialized processing units that were mainly designed to process images and videos. They are based on simpler processing units compared to CPUs, but they can host a much larger number of cores, making them ideal for applications in which data need to be processed in parallel, like the pixels of images or videos. However, GPUs are programmed in languages like CUDA and OpenCL and therefore provide limited flexibility compared to CPUs. They are also usually quite power hungry. For applications where latency is critical and small batch sizes are needed (batch size 1), GPUs have higher latency than FPGAs.
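The relation between batch size, latency and throughput mentioned above can be measured with a small timing harness. The sketch below is one possible way to do it and uses only the Python standard library. The function names are hypothetical, and dummy_predict is a stand-in that should be replaced with the predict call of the framework being tested.

```python
import time
from typing import Callable, List, Sequence, Tuple


def benchmark(predict: Callable[[Sequence], object],
              inputs: Sequence,
              batch_size: int) -> Tuple[float, float]:
    """Return (throughput in fps, mean per-batch latency in ms)."""
    if not inputs or batch_size < 1:
        raise ValueError("need at least one input and a positive batch size")
    latencies: List[float] = []
    start = time.perf_counter()
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i + batch_size]
        t0 = time.perf_counter()
        predict(batch)                      # one inference call per batch
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start   # includes batching overhead
    fps = len(inputs) / elapsed
    mean_latency_ms = 1000.0 * sum(latencies) / len(latencies)
    return fps, mean_latency_ms


def dummy_predict(batch: Sequence) -> list:
    # Stand-in for a real model; it only doubles each value.
    return [x * 2 for x in batch]


if __name__ == "__main__":
    data = list(range(1024))
    for bs in (1, 128):
        fps, latency_ms = benchmark(dummy_predict, data, bs)
        print(f"batch size {bs}: {fps:.0f} fps, {latency_ms:.4f} ms per batch")
```

Running the same harness with batch size 1 and with a large batch on each platform separates the latency-critical case from the throughput-oriented one.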


Of the two platforms compared here, FPGAs provide the lowest latency and the highest throughput in fps. They also provide better energy efficiency and lower cost per frame.

In the past, FPGAs were configurable chips that were mainly used to implement glue logic and custom functions. Today, FPGAs have emerged as very powerful processing units that can be configured to meet application requirements. In fact, with FPGAs we can build tailor-made architectures specialized for specific applications. That way we can achieve much higher performance, lower cost and lower power consumption compared to other options like CPUs and GPUs. FPGAs can now be programmed using OpenCL and High-Level Synthesis (HLS), which makes them much easier to program than in the past. However, they still offer limited flexibility compared to other platforms.

The best way to use FPGAs to train a model is through pre-configured architectures specialized for the applications you are interested in. That way you can achieve much higher performance than CPUs and GPUs, and at the same time you do not have to change your code at all. The pre-configured accelerated architectures provide all the required APIs and libraries for your programming languages (Python, Scala, Java, R) and for the Apache Spark framework, allowing you to overload the most computationally intensive tasks and offload them to the FPGAs. That way, you get the best performance and you don’t have to write your applications for a specific platform or framework like TensorFlow. This is the approach that we have followed at InAccel. We have developed an integrated suite that includes both the optimized FPGA architecture for ML training and the software stack that allows the seamless integration of hardware accelerators without the need to change your code at all. We have also integrated it into a Docker container, which makes it much easier to deploy and use.

ResNet50 Inference on ImageNet

The two hardware acceleration options were evaluated on a small part of the well-known ImageNet database, consisting of 200 thousand images. While native TensorFlow models can run transparently on a GPU, we also dug deeper and installed TensorRT, a newer GPU inference engine that provides higher performance. It achieves this by converting the TensorFlow graphs to lower precision (FP16 and INT8), but through a different and somewhat painful API.
On the other side, the FPGA inference runtime can be used through a Keras-like API that contains the pretrained model and the hidden, hardware-optimized architecture.
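To give an idea of what running in INT8 precision means, here is a minimal sketch of symmetric linear quantization in plain Python. It is a generic textbook scheme shown under simplifying assumptions (one scale factor for the whole list, no calibration data) and is not the procedure that TensorRT or the FPGA runtime actually uses. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; an all-zero input keeps scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, worst)
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per value.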

FPS for ResNet50 inference on ImageNet (200k images) with batch size 128. On p3.8x, one GPU was used for fairness.

Below, we list the configuration of the benchmarks.

Configuration for FPGAs:
Precision: Int8
InAccel-Keras: 2.3.1
Image: inaccel/jupyter:lab

Configuration for GPUs:
Precision: Int8
TensorRT version: 6.0.1
TensorFlow version: 2.2.0

As shown in the figure, FPGAs can provide much better performance than GPUs (up to 2550 fps real throughput).
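To put the peak figure in perspective, the short calculation below converts it into wall-clock time for the evaluation set. It uses only numbers stated in this article (200 thousand images, batch size 128, 2550 fps) and assumes, for illustration, that the peak throughput is sustained for the whole run.

```python
import math

IMAGES = 200_000      # size of the ImageNet subset used in the evaluation
BATCH_SIZE = 128      # batch size stated for the benchmark
PEAK_FPS = 2550       # best FPGA real throughput reported above

batches = math.ceil(IMAGES / BATCH_SIZE)        # number of inference calls
total_seconds = IMAGES / PEAK_FPS               # time for the whole subset
ms_per_batch = 1000.0 * BATCH_SIZE / PEAK_FPS   # mean time per full batch

print(f"{batches} batches, {total_seconds:.1f} s total, {ms_per_batch:.1f} ms per batch")
# prints: 1563 batches, 78.4 s total, 50.2 ms per batch
```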

The main challenge with FPGA deployment, however, has been the lack of an easy-to-use framework that would allow machine learning engineers and software developers to instantly utilize the power of FPGAs.

InAccel solves this problem by providing a ready-to-use platform for instant acceleration of inference in deep learning applications. For example, it provides an online platform where anyone can test their deep learning applications for free on a Jupyter notebook. Users can log in for free and evaluate their applications using the power of FPGAs with zero changes to their original source code. Ready-to-use examples are given for Keras ResNet50, Logistic Regression, K-means clustering and Naive Bayes.

Jupyter Notebook for acceleration of ResNet50 on Keras using FPGA instantly
InAccel Accelerated ML suite for instant acceleration of ML using the power of FPGAs

To learn more about how to speed up your inference applications, contact us at or check

Applications Acceleration instantly