Researchers at MIT’s Han Lab have introduced LEGO, a groundbreaking framework that behaves like a compiler for AI chips. This innovation automatically generates synthesizable Register Transfer Level (RTL) code for spatial accelerators, taking tensor workloads as input. Unlike existing methods that either analyze dataflows without generating hardware or rely on hand-tuned templates with fixed topologies, LEGO can target arbitrary dataflows and their combinations, generating both the architecture and its RTL from a high-level description.
Key Components of LEGO
1. Input IR: Affine, Relation-Centric Semantics (Deconstruct)
LEGO models tensor programs as loop nests with three index classes: temporal, spatial, and computation. It uses two affine relations to drive the compiler: a data mapping (f_{I→D}) from iteration space to tensor coordinates, and a dataflow mapping (f_{TS→I}) from temporal/spatial indices to the iteration space. This affine-only representation reduces reuse detection and address generation to linear-algebra problems. Moreover, LEGO decouples control flow from dataflow, enabling shared control across functional units (FUs) and reducing control-logic overhead.
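To make the affine view concrete, here is a minimal Python sketch (the matrices and helper names are illustrative, not LEGO’s actual C++ IR) for C[i, j] += A[i, k] * B[k, j]: each data mapping f_{I→D} is a plain integer matrix, and address generation collapses to a matrix-vector product.

```python
import numpy as np

# Illustrative only: affine data mappings f_{I->D} for the matmul
# C[i, j] += A[i, k] * B[k, j], with iteration vector idx = (i, j, k).
f_A = np.array([[1, 0, 0],   # A is indexed by (i, k)
                [0, 0, 1]])
f_B = np.array([[0, 0, 1],   # B is indexed by (k, j)
                [0, 1, 0]])
f_C = np.array([[1, 0, 0],   # C is indexed by (i, j)
                [0, 1, 0]])

def address(f, idx, strides):
    # Address generation is a matrix-vector product: strides . (f @ idx).
    return int(strides @ (f @ idx))

idx = np.array([2, 3, 1])                    # (i, j, k) = (2, 3, 1)
print(address(f_A, idx, np.array([4, 1])))   # A[2, 1] -> 9 in a 4-wide row-major layout
```

Because the mappings are plain matrices, questions like “do two iteration points touch the same element?” become solvable linear systems, which is exactly what the front end exploits next.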
2. Front End: FU Graph + Memory Co-Design (Architect)
LEGO’s front end aims to maximize reuse and on-chip bandwidth while minimizing interconnect and multiplexer (mux) overhead. It formulates reuse as solving linear systems over the affine relations to discover direct and delay (FIFO) connections between FUs. It then computes minimum-spanning arborescences to keep only necessary edges, reducing FIFO depth. For multiple dataflows, a Breadth-First Search (BFS)-based heuristic rewrites direct interconnects, prioritizing chain reuse and nodes already fed by delay connections to cut muxes and data nodes. LEGO also computes bank counts per tensor dimension and instantiates data-distribution switches to route between banks and FUs. Dataflow fusion combines interconnects for different spatial dataflows into a single FU-level Architecture Description Graph (ADG).
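A sketch of that reuse formulation, reusing the matmul mapping above (the SciPy-based check is illustrative, not LEGO’s API): reuse directions span the null space of the data-mapping matrix, and an FU-to-FU connection is justified whenever the spatial offset between two FUs lies in that null space.

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative reuse check: two iteration points i1, i2 read the same
# element of a tensor exactly when f @ (i1 - i2) = 0, so reuse
# directions span the null space of the data-mapping matrix.
f_A = np.array([[1, 0, 0],    # A[i, k] drawn from iteration vector (i, j, k)
                [0, 0, 1]])
print(null_space(f_A).round(3).ravel())  # spans the j axis: A is reusable across j

# A direct interconnect is justified when the spatial offset between two
# FUs lies in this null space; a constant shift of the temporal index
# instead yields a delay (FIFO) connection.
delta = np.array([0, 1, 0])              # neighbouring FUs along j
print(bool(np.all(f_A @ delta == 0)))    # True: forward A instead of refetching it
```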
3. Back End: Compile & Optimize to RTL (Compile & Optimize)
The ADG is lowered to a Detailed Architecture Graph (DAG) of primitives. LEGO applies several Linear Programming (LP) and graph passes to optimize the design. Delay matching via LP chooses output delays to minimize inserted pipeline registers, meeting timing alignment with minimal storage. Broadcast pin rewiring converts expensive broadcasts into forward chains, enabling register sharing and lower latency. Reduction tree extraction and pin reuse transform sequential adder chains into balanced trees, reducing logic depth and register count. These passes focus on the datapath, which dominates resources, producing ~35% area savings versus naïve generation.
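The delay-matching pass can be pictured as the following toy LP (an assumed formulation for illustration, not the paper’s exact objective): pick per-node output delays d so that every edge’s latency is covered, and charge each cycle of slack the edge’s bit-width in registers.

```python
import numpy as np
from scipy.optimize import linprog

# Toy delay-matching LP (assumed formulation): node v gets an output
# delay d[v]; an edge (u, v) with latency `lat` needs d[v] - d[u] >= lat,
# and each cycle of slack costs `bits` register bits. Minimising total
# slack bits aligns all paths with the fewest inserted pipeline registers.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1, 8), ("a", "c", 3, 8),    # (src, dst, latency, bit-width)
         ("b", "d", 1, 16), ("c", "d", 1, 16)]
ix = {n: i for i, n in enumerate(nodes)}

# Objective sum_e bits * (d[v] - d[u] - lat): the constant lat terms do
# not change the argmin, so only the d coefficients are encoded.
c = np.zeros(len(nodes))
A_ub, b_ub = [], []
for u, v, lat, bits in edges:
    c[ix[v]] += bits
    c[ix[u]] -= bits
    row = np.zeros(len(nodes))
    row[ix[u]], row[ix[v]] = 1.0, -1.0          # d[u] - d[v] <= -lat
    A_ub.append(row)
    b_ub.append(-lat)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
d = res.x.round().astype(int)
for u, v, lat, bits in edges:
    slack = d[ix[v]] - d[ix[u]] - lat
    print(f"{u}->{v}: {slack} inserted stage(s) x {bits} bits")
```

The solution places the two slack stages on the 8-bit edge rather than a 16-bit one; fittingly, SciPy’s method="highs" delegates to HiGHS, the same LP solver the paper reports using.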
Outcome and Importance
LEGO is implemented in C++ with HiGHS as the LP solver and emits SpinalHDL, which compiles to Verilog. Evaluated across foundation models and classic CNNs/Transformers, LEGO’s generated hardware achieves a 3.2× speedup and 2.4× higher energy efficiency than Gemmini under matched resources. For researchers, LEGO provides a mathematically grounded path from loop-nest specifications to spatial hardware with provable LP-based optimizations. For practitioners, it enables hardware-as-code, supporting multi-op pipelines without manual template redesign. For product leaders, LEGO lowers the barrier to custom silicon, enabling task-tuned, power-efficient edge accelerators that keep pace with fast-moving AI stacks.
How LEGO Works Step-by-Step
1. Deconstruct (Affine IR): Write the tensor op as loop nests and supply the affine data mapping (f_{I→D}), the dataflow mapping (f_{TS→I}), and the control-flow vector (c). This specifies what to compute and how it is spatialized, without templates.
2. Architect (Graph Synthesis): Solve reuse equations to discover FU interconnects (direct/delay), compute minimum-spanning arborescences for minimal edges and fused dataflows, and compute banked memory and distribution switches to satisfy concurrent accesses without conflicts.
3. Compile & Optimize (LP + Graph Transforms): Lower to a primitive DAG, run the delay-matching LP, broadcast rewiring, reduction-tree extraction, and pin-reuse ILP, perform bit-width inference, and apply optional power gating (see the reduction-tree sketch after this list). These passes deliver ~35% area and ~28% energy savings versus naïve codegen.
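As referenced in step 3, here is a small sketch of reduction-tree extraction (the `balance` helper is hypothetical, not LEGO’s implementation): a chain of n − 1 dependent adds with logic depth n − 1 is rewritten as a balanced tree of depth ⌈log₂ n⌉ over the same adders.

```python
# Hypothetical sketch of reduction-tree extraction: rebalance a
# sequential adder chain ((((p0 + p1) + p2) + p3) ...) of depth n - 1
# into a binary tree of depth ceil(log2 n), shortening the critical
# path and the registers needed to cover it.
def balance(terms):
    level = list(terms)
    while len(level) > 1:
        # Pair up neighbours; an odd leftover passes through unchanged.
        level = [f"({level[i]} + {level[i + 1]})" if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

terms = [f"p{i}" for i in range(8)]
print(balance(terms))   # depth log2(8) = 3 instead of 7 chained adders
```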
Ecosystem Positioning
Compared to analysis tools (Timeloop/MAESTRO) and template-bound generators (Gemmini, DNA, MAGNET), LEGO is template-free, supports arbitrary dataflows and their combinations, and emits synthesizable RTL. Results show comparable or better area/power versus expert-handwritten accelerators under similar dataflows and technologies, while offering one-architecture-for-many-models deployment.
Summary
LEGO operationalizes hardware generation as compilation for tensor programs: an affine front end for reuse-aware interconnect/memory synthesis and an LP-powered back end for datapath minimization. The framework’s measured 3.2× performance and 2.4× energy gains over a leading open generator, plus ~35% area reductions from back-end optimizations, position it as a practical path to application-specific AI accelerators at the edge and beyond.