Simplify AI Deployment at Scale on FPGA Clusters

Using the InAccel FPGA cluster manager allows higher utilization of FPGA resources.
  1. Designed to deploy any framework model as is: Developers can rapidly deliver their AI-integrated applications, and IT/DevOps can deploy retrained models without bringing down applications. Data scientists are free to choose their preferred framework and network, and both developers and IT operators can support these trained networks without incurring unnecessary complexity.
  2. Designed for production IT/DevOps infrastructure: The InAccel Coral FPGA cluster manager is a Docker container fully compatible with Kubernetes, the container management platform, for orchestration, metrics, and auto-scaling. It also integrates with Kubeflow and Kubeflow Pipelines for an end-to-end AI workflow, and it exports metrics for monitoring, giving better visibility into utilization. These integrations help IT deploy a standardized inference-in-production platform with lower complexity, higher visibility into resource utilization, and easy scalability.
  3. Designed to scale and maximize FPGA utilization: The InAccel Coral manager can run multiple models concurrently on a single FPGA for optimal FPGA utilization, and it scales easily to any number of servers to handle increasing inference loads for any model, eliminating inefficient single-model-per-FPGA deployment.
    The figure illustrates the single-model-per-FPGA scenario: there are 4 FPGAs, with one model (one user) running on each. When requests for the second and fourth models peak, the two corresponding FPGAs show very high utilization while the other FPGAs sit mostly idle, and no more requests can be handled.
    The figure below shows the same scenario with the InAccel cluster manager. Here, each of the four FPGAs can run any, or all, of the models. When requests for the second model peak, the requests are balanced equally across all the FPGAs, and the cluster easily scales to 3X higher utilization.
  4. Designed for real-time and batch inference: There are broadly two types of inference: real-time and batch. Real-time inference needs low latency, because end users are waiting for the results while the inference happens. Batch inference is generally done offline and can trade latency for high throughput. Some organizations need both types, as that expands their use of AI in their products and operations. The InAccel Coral manager supports batch inference to increase utilization while, at the same time, allowing latency limits for real-time inference.
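The utilization gap described in point 3 can be sketched with a toy simulation. This is not InAccel Coral's actual scheduler; the function names and the least-loaded policy are illustrative assumptions, chosen only to contrast static model pinning with shared scheduling across the cluster.

```python
# Toy comparison: static single-model-per-FPGA assignment vs. a shared
# scheduler where any FPGA can serve any model (illustrative only; the
# real Coral scheduling policy is not shown here).

def static_assignment(requests, n_fpgas):
    """Single-model-per-FPGA: model i is pinned to FPGA i, so a spike
    in one model overloads one device while others sit idle."""
    load = [0] * n_fpgas
    for model in requests:
        load[model % n_fpgas] += 1
    return load

def least_loaded(requests, n_fpgas):
    """Shared scheduling: any FPGA may serve any model, so each request
    goes to the currently least-loaded device."""
    load = [0] * n_fpgas
    for _ in requests:
        i = load.index(min(load))
        load[i] += 1
    return load

# A workload where requests for the second and fourth models peak:
requests = [1] * 40 + [3] * 40 + [0] * 5 + [2] * 5

print(static_assignment(requests, 4))  # [5, 40, 5, 40] -> two hot FPGAs
print(least_loaded(requests, 4))       # [23, 23, 22, 22] -> evenly balanced
```

With static pinning, two devices absorb the whole peak while the other two stay nearly idle; with shared scheduling the same workload spreads almost perfectly evenly, which is the balancing effect the figure above depicts.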
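Point 4's trade-off between batching for throughput and bounding latency is commonly implemented with a deadline-limited batch collector. The sketch below is a generic illustration of that pattern, not Coral's API; `collect_batch` and its parameters are hypothetical names.

```python
# Generic deadline-limited batching (illustrative pattern, not the
# InAccel Coral API): gather requests until the batch is full or the
# latency budget expires, whichever comes first.
import time
from queue import Queue, Empty

def collect_batch(q, max_batch=8, budget_s=0.010):
    """Return up to max_batch requests, waiting at most budget_s total.
    Real-time traffic uses a tight budget (small added latency);
    offline batch jobs can use a large one to maximize throughput."""
    batch = []
    deadline = time.monotonic() + budget_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived within the budget
    return batch

q = Queue()
for r in range(3):
    q.put(f"req-{r}")
print(collect_batch(q))  # ['req-0', 'req-1', 'req-2'] once the budget lapses
```

The same collector serves both modes: a millisecond-scale `budget_s` keeps real-time latency bounded, while a large budget lets offline jobs fill every batch and drive utilization up.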



