SemiconX:- AI Acceleration Hardware
(Technology Trends Case Study)
Artificial Intelligence (AI) is a powerful tool that will be ubiquitous in the coming decade, in applications spanning defense, automobiles, robotics, healthcare, the metaverse, and Industry 4.0. Growing AI model capacities demand scaled, high-throughput compute at iso-energy consumption, which calls for a fundamental rethinking of power savings in both compute and dataflow.
The basic difference between a custom AI hardware architecture and one built for general-purpose workloads is that deep learning computation and dataflow are structured, and the network is known prior to execution; the underlying architecture can therefore be optimized specifically for the AI execution datapath while the hardware overhead of the control path is minimized. Because of the enormous potential in this space, it has gained a lot of traction from investors, with several AI hardware startups raising ~$4B combined at a total valuation of ~$10B.
List of the highest-funded startups active in this space:-
Classification of AI accelerators based on the target market:-
The foundational differences in computing architectures can also be classified by target application: datacenter-scale AI (covering both training and high-precision inference workloads) versus edge computing, which deploys lightweight models at low-to-intermediate resolution.
- Datacenter-scale chips/systems implement reconfigurable-precision arithmetic, a large instruction set, and software scalability to realize general-purpose workloads. These systems should also support model training. (Examples:- SambaNova, Cerebras, Groq, Graphcore, Esperanto Technologies)
- Edge AI, embedded, and automotive chipsets typically focus on low-precision computation (INT4, INT8, and FP8) and are optimized for specific AI inference workloads. Since energy efficiency is key, these architectures target low control overhead and simple execution and dataflow pipelines. (Examples:- chips/cards from Untether AI, Hailo, NeuReality, Syntiant, SiMa.ai, Mythic, Analog Inference, MemryX)
The following graph compares the performance and power consumption of different AI accelerators, clustered by target market and energy efficiency. Datacenter-scale systems often come in a system form factor with up to FP64 precision and training support, while processors for the low-power edge, embedded, and autonomous-systems market offer only up to INT8/FP8 precision in a chip form factor, and most are inference-only solutions.
Types of AI Hardware architectures:-
i) General-purpose CPU with Custom Optimized ISA:- (Example:- Tesla Dojo microarchitecture)
Traditional CISC- or RISC-based architectures, when optimized for specific AI kernels, can devote more resources to high-throughput computation rather than control-heavy execution. Since datacenter-scale systems require flexible precision and must support training workloads, such custom ISAs need a sufficiently rich set of instructions. Typically, these architectures trade off energy efficiency for better programmability.
ii) Systolic Array Architecture:- (Example:- Google TPU microarchitecture)
Processing elements (PEs) are laid out in a systolic array to create an optimal datapath, i.e., intermediate data is easily forwarded between neighboring PEs. Each PE consists of a simple multiply-and-accumulate (MAC) unit and a small scratchpad. Inputs are pipelined across the rows so that partial sums flow from top to bottom along each column. Workloads can be mapped onto the PEs to support independent execution while reusing weights or input activations to save main-memory bandwidth. A weight-stationary or output-stationary dataflow can be implemented on these arrays, depending on whether weight reuse or input-feature-map reuse is more efficient for the given AI kernel.
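The row/column dataflow above can be sketched in a few lines of Python. This is a minimal, illustrative model (a real array pipelines these steps across cycles in hardware): each PE holds one stationary weight, activations stream along the rows, and partial sums accumulate down the columns.

```python
# Weight-stationary systolic-array sketch computing y = W^T x on an
# R x C grid of MAC-only PEs. Each PE at (r, c) holds weight W[r][c];
# the activation for row r is broadcast along that row, and partial
# sums accumulate down each column, as described in the text above.

def systolic_mvm(W, x):
    rows, cols = len(W), len(W[0])
    assert len(x) == rows
    psum = [0] * cols              # one partial sum flowing per column
    for r in range(rows):          # input enters one row per step
        a = x[r]                   # activation streamed along row r
        for c in range(cols):      # each PE performs one MAC
            psum[c] += W[r][c] * a
    return psum                    # column outputs = W^T x

# Example: 2x2 stationary weights, 2-element input vector.
W = [[1, 2],
     [3, 4]]
x = [10, 100]
print(systolic_mvm(W, x))  # [310, 420]
```

The key property this models is reuse: each weight is loaded once and multiplied against every activation that streams past it, so main memory is touched only once per weight.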
iii) At-Memory Compute:- (Example:- Untether AI Boqueria microarchitecture)
As AI model sizes are growing faster than the pace of hardware development, the memory bottleneck is a typical issue in implementing large-scale models. Low on-chip memory capacity hurts both the energy efficiency and the latency of the system. At-memory compute architectures therefore pack memory very close to compute, tightly integrating the computational units with on-chip memory. This allows more weight and input-feature-map reuse while providing high memory bandwidth, high throughput, and low latency.
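The bandwidth argument can be made concrete with a toy counting model (an assumption for illustration, not a real memory simulator): if weights stay resident near compute, each weight crosses the off-chip boundary once, rather than once per input.

```python
# Toy counting model of off-chip weight traffic for one layer
# processing a batch of inputs (illustrative assumption only).

def weight_fetches(num_weights, batch_size, weights_stay_on_chip):
    if weights_stay_on_chip:
        # At-memory compute: each weight is fetched once and then
        # reused across the entire batch from nearby memory.
        return num_weights
    # No on-chip residency: weights are re-fetched for every input.
    return num_weights * batch_size

layer_weights = 1_000_000
batch = 64
print(weight_fetches(layer_weights, batch, False))  # 64000000
print(weight_fetches(layer_weights, batch, True))   # 1000000
```

Under this model, on-chip residency cuts weight traffic by a factor equal to the batch size, which is exactly the reuse the paragraph above describes.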
iv) In-Memory Analog Compute:- (Example:- Mythic architecture)
Irrespective of memory organization, all the previous architectures are built around digital arithmetic computation, where an N-bit digital multiplier (for N-bit inputs and weights) and high-precision adders perform the MAC computations, with worst-case latency of up to O(N) cycles. The same MAC computation can be performed in the analog domain in O(1) time by driving all lines in the memory macro in parallel. This significantly improves the latency, throughput, and energy efficiency of matrix-vector multiplication (MVM) operations. To interact with a digital processor, the analog MVM engine must be equipped with A/D and D/A converters at its periphery.
As noise and variability are the main limitations of analog computing, these architectures are suitable only for edge applications that require no more than INT8 precision.
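The precision limit follows from the converters at the periphery. The sketch below is a hypothetical behavioral model (not any vendor's design): weights act as stored conductances, a DAC quantizes the input activations, the dot product forms in parallel in the "analog" domain, and an ADC quantizes the accumulated result. The DAC/ADC bit widths bound the usable precision.

```python
# Hypothetical behavioral model of an analog in-memory MVM engine.
# Assumes weights and inputs are normalized to [-1, 1].

def quantize(v, bits, v_max):
    """Uniform symmetric quantizer modeling a DAC or ADC of `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    step = v_max / levels
    q = max(-levels, min(levels, round(v / step)))
    return q * step

def analog_mvm(weights, x, dac_bits=8, adc_bits=8):
    # DAC stage: digitize input activations to dac_bits.
    x_a = [quantize(v, dac_bits, v_max=1.0) for v in x]
    outputs = []
    for row in weights:
        # "Analog" summation: all cells contribute simultaneously,
        # so the whole dot product forms in O(1) time.
        current = sum(w * v for w, v in zip(row, x_a))
        # ADC stage: read the accumulated value back at adc_bits.
        # Full-scale range is len(row) for [-1, 1] weights/inputs.
        outputs.append(quantize(current, adc_bits, v_max=len(row)))
    return outputs

W = [[0.5, -0.25], [1.0, 1.0]]
x = [0.5, 1.0]
print(analog_mvm(W, x))
```

Raising the converter resolution to cover higher precision makes the A/D and D/A stages dominate area and energy, which is one reason these engines stop at roughly INT8-class precision, as noted above.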
Hardware-Software Codesign:-
Most AI accelerator architectures are structured around making the underlying hardware flexible enough to adapt to workloads while keeping the architecture simple and pushing network-mapping complexity to the compiler. SambaNova's Reconfigurable Dataflow Unit (RDU) architecture clearly illustrates this: much of the control overhead is eliminated at the compiler level so that workload execution and dataflow can be seamless.
The compiler also handles several execution optimizations, such as data-level and model-level parallelism, choosing whichever suits the underlying hardware better. Smarter compilers also weigh both the compute cost and the data-movement cost within a die or between dies (in the case of a card/system form factor). Thus, the compiler plays a crucial role in optimally mapping the layers of an AI model onto the underlying hardware.
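The cost-driven choice described above can be sketched with a toy cost model (entirely hypothetical, not any vendor's compiler): each candidate mapping is scored as compute cycles plus a weighted data-movement term, and the cheapest mapping wins.

```python
# Toy mapping-selection sketch for a hypothetical compiler cost model.
# Each candidate is (name, compute_cycles, bytes_moved); data movement
# is weighted because inter-die transfers cost more than local compute.

def pick_mapping(candidates, bytes_moved_weight=2.0):
    def total_cost(c):
        _, compute, moved = c
        return compute + bytes_moved_weight * moved
    return min(candidates, key=total_cost)[0]

candidates = [
    ("data-parallel",  1000, 400),   # replicate weights, split inputs
    ("model-parallel", 1200, 100),   # split weights across dies
]
print(pick_mapping(candidates))  # model-parallel
```

Here data-parallel wins on raw compute (1000 vs. 1200 cycles) but loses once movement is priced in (1800 vs. 1400), mirroring how a dataflow compiler can prefer a mapping that keeps traffic on-die.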
Model-level Software Optimization:-
There are several other software-level optimizations that enhance hardware efficiency, which I briefly list below.
Quantization:- (https://pytorch.org/docs/stable/quantization.html)
- Post-training quantization
- Quantization-aware training
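To make the idea concrete, here is a pure-Python sketch of symmetric per-tensor INT8 post-training quantization (illustrative only; the PyTorch link above documents the production APIs): weights are mapped to int8 codes with a single scale, and dequantization recovers them with small rounding error.

```python
# Symmetric per-tensor INT8 post-training quantization sketch.
# scale maps the largest-magnitude weight to the int8 extreme (127).

def quantize_int8(weights):
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -0.5, 0.31, 1.27]
q, s = quantize_int8(w)
print(q)                 # [2, -50, 31, 127]
print(dequantize(q, s))  # reconstructed weights, small rounding error
```

Post-training quantization applies this mapping to an already-trained model; quantization-aware training instead simulates the same round-trip during training so the model learns to tolerate the rounding error.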
Model Compression:-
- Sparsity-aware pruning (https://pytorch.org/tutorials/intermediate/pruning_tutorial.html)
- Dropout (https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)
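Magnitude-based pruning, the simplest form of the pruning linked above, can be sketched in pure Python (illustrative only; the PyTorch tutorial covers the torch.nn.utils.prune APIs): zero out the smallest-magnitude fraction of weights so that sparsity-aware hardware can skip them.

```python
# Magnitude-based pruning sketch: zero the `sparsity` fraction of
# weights with the smallest absolute values. Ties at the threshold
# may zero slightly more than the requested fraction.

def prune_by_magnitude(weights, sparsity):
    k = int(len(weights) * sparsity)        # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
print(prune_by_magnitude(w, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The resulting zeros translate into skipped MACs and skipped weight fetches on hardware that exploits sparsity, which is why pruning pairs naturally with the accelerator architectures discussed earlier.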
Where we are headed:-
The demand for new, innovative, and energy-efficient processors is at its peak. Thus, it will be exciting to see upcoming transformations in this field in both the algorithm and computer-architecture space that jointly optimize hardware and software. We might see some of the above processors with optimized ISAs take off to deliver energy-efficient computation in data centers. New products in this space will need ~10x energy efficiency, flexible software support, and programmability to stand out in the long term against big players such as Nvidia.
Similarly, on the edge, we may see analog compute-in-memory processors, approximate at-memory computing, and neuromorphic computing play a crucial role in battery-powered devices. As A/D and D/A conversions are quite costly, analog computing could be further optimized using neuromorphic or stochastic computing principles to perform approximate computation and deliver ~10x-100x energy efficiency. (Startups active in the neuromorphic computing space include BrainChip, Syntiant, aiCTX, and Rain Neuromorphics.) I will cover the state of the art and market opportunities in the neuromorphic space in my next article.
Other reading references:-
https://www.eetimes.com/chip-startups-for-ai-in-edge-and-endpoint-applications/
https://www.eetimes.com/mega-investment-in-ai-chip-startups-is-justified-true-or-false/