Inspiration

Modern LLMs suck. They're slow, they paywall you all the time, and they make you feel like you're burning down the Amazon rainforest every time you send a prompt. Consumer hardware isn't built to run local LLMs well either: most people only have CPUs with zero purpose-built AI acceleration. So we decided to build a small, power-efficient Tensor Processing Unit to handle neural processing tasks.

What it does

The MiniTPU performs matrix multiplication using a systolic array of processing elements (PEs). Each PE repeatedly multiplies an A value and a B value and accumulates the result into a running sum, so over multiple cycles each PE computes one output of the matrix multiplication. Data is streamed into the array each cycle, and complete output diagonals emerge as the pipeline fills, then flush through the ReLU and out the GPIO pins.
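In software terms, the accumulation each PE performs is one dot product. A minimal Python reference model of what the whole array computes (our own naming, not taken from the RTL):

```python
def matmul_4x4(A, B):
    """Reference model: output C[i][j] is what one PE accumulates
    over 4 multiply-accumulate cycles."""
    n = 4
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):          # one MAC per cycle in PE (i, j)
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The hardware does the same arithmetic, but all 16 PEs run their MAC loops concurrently, staggered by the diagonal dataflow described below.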

How we built it

  1. Architecture We built a 4x4 systolic array of PEs. Each PE contains registers for inputs A and B, a multiplier, and a partial-sum register, plus forwarding paths that pass A to the right and B downward to its neighbors.
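One PE's behavior can be sketched in a few lines of Python (a behavioral model under our assumptions, not the RTL; names like `step` and `acc` are ours):

```python
class PE:
    """One processing element: registers A and B, multiply-accumulates,
    and forwards last cycle's A to the right and B downward."""
    def __init__(self):
        self.a = 0    # registered A input
        self.b = 0    # registered B input
        self.acc = 0  # partial-sum register

    def step(self, a_in, b_in):
        a_out, b_out = self.a, self.b   # forward previous cycle's values
        self.acc += a_in * b_in         # multiply-accumulate
        self.a, self.b = a_in, b_in     # register new inputs
        return a_out, b_out             # a_out goes right, b_out goes down
```

The one-cycle delay on the forwarding path is what creates the "wave" motion through the mesh.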

  2. Dataflow The inputs are two 4x4 matrices. A values enter from the left edge and move right; B values enter from the top edge and move down. Every 4 clock cycles, the input buffer triggers a "send" so the next wave of values ripples through the mesh.

The input buffer collects inputs first, filling 4 slots on each side: 4 row values (A side) and 4 column values (B side). The very first values sent are a11 and b11. After that, the rest of A and B are injected in "diagonal" order, and those values propagate through the PE grid while each PE keeps accumulating.
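The diagonal injection order amounts to skewing the matrix: row i of A is delayed by i cycles so that a11 enters alone and later anti-diagonals follow. A sketch of that schedule (our own helper, purely illustrative):

```python
def skewed_rows(A, n=4):
    """Skewed input schedule: stream[t][i] is the value entering
    array row i at cycle t (0 pads the skew). Row i is delayed i cycles,
    so entries with the same i + k enter on the same cycle."""
    total = 2 * n - 1                      # cycles to feed one edge
    stream = [[0] * n for _ in range(total)]
    for i in range(n):
        for k in range(n):
            stream[i + k][i] = A[i][k]
    return stream
```

B gets the mirror-image treatment on the top edge, with column j delayed by j cycles.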

Once the first PE finishes computing, its result is flushed to the ReLU and its accumulator is cleared. After that, a flush-and-reset cycle occurs every 4 clocks, following the diagonal pattern. The matrix-multiplication results are tracked in the output buffer and sent out once all 7 diagonals have been flushed.
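Why 7 diagonals: grouping the 4x4 output positions by anti-diagonal d = i + j gives 2*4 - 1 = 7 groups, and PEs on the same diagonal finish together. A quick illustration (our notation):

```python
# Group output positions (i, j) of a 4x4 result by anti-diagonal d = i + j.
# PEs sharing a diagonal finish together, giving 7 flush waves (d = 0..6).
n = 4
diagonals = {}
for i in range(n):
    for j in range(n):
        diagonals.setdefault(i + j, []).append((i, j))

assert len(diagonals) == 2 * n - 1   # 7 diagonals
```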

  3. RTL and verification The design is written in Verilog/SystemVerilog with clear module boundaries: PE, array, FSMs, and top-level wiring. We used cocotb to generate test cases and check the outputs, and we debugged waveforms in ModelSim and GTKWave.
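The heart of that verification strategy is comparing the DUT against a software golden model. Stripped of the cocotb driver/monitor plumbing (which needs a simulator to run), the check looks roughly like this, assuming the hardware applies ReLU on the way out as described above:

```python
import random

def golden(A, B, n=4):
    """Software golden model: plain matrix multiply."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def relu(x):
    return x if x > 0 else 0

# Randomized stimulus in the spirit of our cocotb tests:
random.seed(0)
A = [[random.randint(-8, 7) for _ in range(4)] for _ in range(4)]
B = [[random.randint(-8, 7) for _ in range(4)] for _ in range(4)]
expected = [[relu(v) for v in row] for row in golden(A, B)]
# In the real testbench, `expected` is compared against the values
# the DUT streams out of the GPIO pins diagonal by diagonal.
```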

Challenges we ran into

  1. Pipeline fill and flush timing
  2. Bit-width and overflow
  3. Output overlap due to pipelining

Accomplishments that we're proud of

We built a working systolic array that produces correct outputs while handling the fill, steady-state, and flush phases.

What we learned

Throughput comes from meticulous architecture and timing planning. A solid verification strategy beats manual, one-off debugging.

What's next for MiniTPU

  1. Robust tiling support
  2. Scaling to larger array sizes
  3. Support for MxN matrices
  4. Performance tracking
  5. Integration with a host interface

Built With

  • cocotb
  • modelsim
  • python
  • systemverilog
  • tinytapeout
  • verilog