StreamFIR — A Streaming FIR Filter Hardware Accelerator
Motivation
Many signal processing workloads are fundamentally streaming problems.
A CPU processes each sample sequentially: it fetches instructions, loads operands from memory, and branches through a loop. That per-sample overhead adds latency, costs power, and limits throughput.
Filtering, however, is not a control problem; it is the same arithmetic operation repeated on every sample.
The core question of this ASIC hackathon is:
Where does hardware beat a CPU?
We chose digital filtering as a minimal but clear demonstration of hardware acceleration.
What the Project Does
StreamFIR is a 4-tap real-time FIR (Finite Impulse Response) filter implemented entirely in synthesizable Verilog.
The module continuously receives 8-bit samples and produces one filtered output sample every clock cycle.
Supported operating modes:
- Bypass — direct signal output
- Moving Average — smoothing filter
- Weighted Low-Pass — noise reduction
- High-Pass / Edge Detection — transition detection
- User-Programmable Coefficients — custom filter behavior
The filter is a fixed streaming datapath rather than a program: samples flow through it continuously, with no instruction stream to fetch or decode.
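The operating modes listed above can be thought of as different coefficient sets fed to the same four taps. The Python sketch below illustrates that idea only. The preset values, the mode names, and the `select_coefficients` helper are all hypothetical, since this write-up does not list the coefficients used in the RTL.

```python
# Hypothetical coefficient presets (c0, c1, c2, c3) for each mode.
# These values are illustrative assumptions, not the project's actual settings.
PRESETS = {
    "bypass": (1, 0, 0, 0),             # output equals the newest sample
    "moving_average": (1, 1, 1, 1),     # equal weights; DC gain of 4 before any scaling
    "weighted_low_pass": (1, 2, 2, 1),  # heavier weight on the middle taps
    "high_pass": (1, -1, 0, 0),         # first difference, responds to transitions
}


def select_coefficients(mode, user_coeffs=None):
    """Return the four tap coefficients for a mode name.

    The "user" mode takes its coefficients from the caller, mirroring the
    user-programmable mode described in the text.
    """
    if mode == "user":
        if user_coeffs is None or len(user_coeffs) != 4:
            raise ValueError("user mode needs exactly four coefficients")
        return tuple(user_coeffs)
    if mode not in PRESETS:
        raise ValueError(f"unknown mode: {mode}")
    return PRESETS[mode]


def dc_gain(coeffs):
    """Gain applied to a constant input: the sum of the coefficients."""
    return sum(coeffs)
```

With these assumed presets, the high-pass set has a DC gain of 0, so a constant input produces a zero output once the delay line has filled, which is the expected behaviour for edge detection.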
Hardware Architecture
The design is a fully pipelined streaming datapath:
1. Delay Line
Stores the last four input samples and shifts every clock cycle.
2. Parallel Multiply-Accumulate (MAC)
Each of the four stored samples is multiplied by its own coefficient in the same clock cycle, and the four products are summed:
\[ y[n] = c_0 x[n] + c_1 x[n-1] + c_2 x[n-2] + c_3 x[n-3] \]
All multiplications occur in parallel in hardware.
3. Mode Controller
Selects preset filter behaviors or user-defined coefficients.
4. Register Interface
External logic can configure the filter at run time, with no change to the synthesized design.
The pipeline produces one output sample per clock cycle.
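The delay line and multiply-accumulate stages described above can be modelled in a few lines of Python. This is a behavioural sketch only, not the project's Verilog: the class name `StreamFIRModel` is hypothetical, it uses unbounded Python integers, and it ignores pipeline latency and any output scaling the RTL may apply.

```python
from collections import deque


class StreamFIRModel:
    """Behavioural model of a 4-tap FIR: one output per input sample."""

    def __init__(self, coeffs):
        if len(coeffs) != 4:
            raise ValueError("expected exactly four coefficients")
        self.coeffs = tuple(coeffs)
        # Delay line holding x[n], x[n-1], x[n-2], x[n-3]; starts cleared.
        self.delay = deque([0, 0, 0, 0], maxlen=4)

    def step(self, sample):
        """Shift in one sample and return y[n] = sum of c_k * x[n-k]."""
        # appendleft on a full bounded deque drops the oldest sample,
        # which mirrors the shift register moving every clock cycle.
        self.delay.appendleft(sample)
        return sum(c * x for c, x in zip(self.coeffs, self.delay))


# Usage: unscaled moving average over a short ramp.
fir = StreamFIRModel((1, 1, 1, 1))
outputs = [fir.step(x) for x in [4, 8, 12, 16, 20]]
print(outputs)  # [4, 12, 24, 40, 56]
```

The loop over taps in `step` is sequential in software; in the hardware described above, the same four products are formed at once by separate multipliers.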
Why Hardware Beats a CPU
CPU Implementation
- Instruction fetch
- Memory access
- Sequential multiply operations
- Scheduling overhead
StreamFIR Hardware
- All multiplications happen simultaneously
- No instruction overhead
- Deterministic latency
- Continuous processing
Qualitative comparison:
| Metric | CPU | StreamFIR |
|---|---|---|
| Throughput | Bounded by instructions per sample | 1 sample / cycle |
| Latency | Variable | Constant |
| Power Efficiency | Lower | Higher |
| Timing | Non-deterministic | Deterministic |
This project demonstrates that streaming DSP workloads map naturally to hardware pipelines.
Verification
We implemented a full ASIC-style verification flow:
- Cocotb Python testbench
- Directed tests for each mode
- Random input testing
- Impulse response testing
- Mode-switching validation
- Gate-level simulation
Waveforms were analyzed using GTKWave to confirm functional correctness.
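The impulse-response and random-input checks listed above rest on two simple properties, sketched here in plain Python rather than Cocotb so the example runs without a simulator. The function names are hypothetical and this is not the project's testbench. It assumes 8-bit signed samples and compares a streaming model against a direct evaluation of the FIR sum.

```python
import random


def fir_stream(samples, coeffs):
    """Streaming model: shift each sample into a delay line, then MAC."""
    taps = [0] * len(coeffs)
    out = []
    for s in samples:
        taps = [s] + taps[:-1]  # newest sample first, oldest dropped
        out.append(sum(c * x for c, x in zip(coeffs, taps)))
    return out


def fir_direct(samples, coeffs):
    """Golden reference: evaluate y[n] = sum_k c_k * x[n-k] directly."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


def check_impulse(coeffs):
    """A unit impulse must return the coefficients, then zeros."""
    response = fir_stream([1] + [0] * 7, coeffs)
    return response[:4] == list(coeffs) and all(v == 0 for v in response[4:])


def check_random(coeffs, count=200, seed=0):
    """Random signed 8-bit samples must match the golden reference."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    samples = [rng.randint(-128, 127) for _ in range(count)]
    return fir_stream(samples, coeffs) == fir_direct(samples, coeffs)
```

In a Cocotb test, the same golden outputs would be compared against values sampled from the simulated design, offset by whatever latency the pipeline introduces.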
Challenges
- Handling signed arithmetic in RTL
- Aligning pipeline latency with expected outputs
- Debugging gate-level vs behavioral mismatches
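The first two challenges above can be illustrated with a short Python sketch. It makes assumptions that this write-up does not confirm: coefficients are treated as 8-bit signed values, and the `latency` argument is a placeholder for whatever pipeline depth the real design has. All helper names are hypothetical.

```python
def to_signed(value, bits=8):
    """Interpret the low `bits` of an integer as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):  # sign bit set
        return value - (1 << bits)
    return value


def accumulator_bits(sample_bits=8, coeff_bits=8, taps=4):
    """Width needed to sum `taps` signed products without overflow.

    Each signed product needs sample_bits + coeff_bits bits, and adding
    `taps` of them needs ceil(log2(taps)) more.
    """
    return sample_bits + coeff_bits + (taps - 1).bit_length()


def align_expected(expected, latency, fill=0):
    """Delay the golden-model outputs by `latency` cycles so they line up
    with what a pipelined design would present on its output port."""
    if latency == 0:
        return list(expected)
    return [fill] * latency + list(expected[:-latency])
```

For example, the raw byte `0xFF` reads as -1 once interpreted as signed, and under the assumed 8-bit widths a 4-tap sum needs an 18-bit accumulator. Forgetting either detail is a common way for a behavioural model and the RTL to disagree.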
What We Learned
- Hardware parallelism vs CPU sequential execution
- Streaming datapath design
- FIR filter implementation in RTL
- ASIC verification workflow
- Why DSP is a classic hardware accelerator domain
Future Work
- More taps (8-tap / 16-tap filters)
- Audio input interface (I2S / ADC)
- Equalizer implementation
- Cascaded FIR/IIR filters
- Real-time audio processing
Built With
- Verilog
- TinyTapeout SKY130 PDK
- Icarus Verilog
- Cocotb
- Python
- GTKWave