Inspiration

The idea for this project originated from concepts learned in MAT290 and ECE216, where we studied how one signal affects another through convolution. In theory, convolution describes how an input signal is transformed by a system. We wanted to take that mathematical concept and physically implement it in hardware, translating continuous-time and discrete-time convolution into a real-time digital image processing engine.

What it does

This project implements a 3×3 streaming convolution engine in Verilog. It accepts pixel data sequentially, generates valid 3×3 windows using a multi-line buffer architecture, and performs convolution with a programmable kernel matrix.

The design supports zero-padding to preserve image dimensions (128×128), allowing proper edge handling. When tested with different kernels (e.g., edge detection, blur), we were able to clearly observe the visual transformation of the input image, demonstrating the physical effect of the kernel matrix on the signal.
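As a concrete reference, the operation the hardware performs can be modeled in a few lines of Python (the function and kernel names here are illustrative, not taken from our RTL):

```python
# Software reference model of the 3x3 zero-padded convolution the
# hardware implements. Names (convolve_3x3, edge_kernel) are illustrative.

def convolve_3x3(image, kernel):
    """Convolve a 2D image with a 3x3 kernel, zero-padding the borders
    so the output has the same dimensions as the input."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for kr in range(3):
                for kc in range(3):
                    ir, ic = r + kr - 1, c + kc - 1
                    # Out-of-range taps contribute zero (zero padding).
                    if 0 <= ir < rows and 0 <= ic < cols:
                        acc += kernel[kr][kc] * image[ir][ic]
            out[r][c] = acc
    return out

# A classic edge-detection kernel: zero response on flat regions,
# strong response at intensity changes.
edge_kernel = [[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]]

flat = [[5] * 4 for _ in range(4)]            # constant image
print(convolve_3x3(flat, edge_kernel)[1][1])  # interior pixel -> 0
```

On a constant image the interior response is zero (the kernel's coefficients sum to zero), which is exactly the behavior we used as a first sanity check on the hardware output.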

How we built it

1. Buffer Module

  • The Buffer module is responsible for storing incoming pixel data and generating valid 3×3 sliding windows.
  • Implemented multi-line buffering to store previous rows of pixel data.
  • Designed row and column counters to track streaming pixel position.
  • Generated valid 3×3 window outputs once sufficient pixels were received.
  • Integrated zero-padding logic to handle image boundaries without distorting output dimensions.
  • Managed synchronization between streaming input and window-valid signaling.

This module enables continuous streaming operation without storing the entire image in memory, significantly reducing hardware resource usage.
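The line-buffering idea above can be sketched behaviorally (an illustrative Python model, not our RTL; it shows only interior windows, while the hardware additionally inserts zero padding at the borders):

```python
# Behavioral sketch of multi-line buffering: only the two previous rows
# plus the current partial row are stored -- never the whole image.
# Names and structure are illustrative, not taken from our Verilog.

def stream_windows(pixel_stream, width):
    """Yield a 3x3 window for each streamed pixel position where a full
    interior window has become available."""
    rows = [[], [], []]  # row r-2, row r-1, current (partial) row r
    col = 0
    for px in pixel_stream:
        rows[2].append(px)
        col += 1
        # A window is complete once two full previous rows are buffered
        # and at least three pixels of the current row have arrived;
        # it is centered on the previous row, one column back.
        if len(rows[0]) == width and col >= 3:
            yield [r[col - 3:col] for r in rows]
        if col == width:
            rows = [rows[1], rows[2], []]  # shift line buffers down
            col = 0

# Stream a 3x4 image with pixel values 0..11: two interior windows form.
for w in stream_windows(range(12), 4):
    print(w)
```

Storage stays at three image rows regardless of image height, which is the resource saving the bullet points above describe.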

2. Convolve Module

  • The Convolve module performs the actual 3×3 convolution computation.
  • Implemented signed multipliers to support kernels containing negative coefficients.
  • Carefully handled signed vs. unsigned arithmetic and bit-width management to prevent overflow and incorrect sign extension.
  • Computed the weighted sum of the 3×3 window and kernel matrix to produce the output pixel.
  • Generated output-valid signaling aligned with window availability.

This module translates the mathematical convolution operation directly into hardware arithmetic.
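The bit-width management mentioned above reduces to a worst-case calculation. Assuming 8-bit unsigned pixels and 4-bit signed kernel coefficients (illustrative widths, not necessarily the ones in our design), the accumulator for the 9-term weighted sum must hold:

```python
# Back-of-the-envelope accumulator sizing for a signed 3x3 MAC.
# Widths here are assumptions for illustration.

PIX_MAX = 255          # 8-bit unsigned pixel
K_MIN, K_MAX = -8, 7   # 4-bit signed coefficient range

# Worst-case values of the 9-term weighted sum:
acc_min = 9 * K_MIN * PIX_MAX   # most negative possible sum
acc_max = 9 * K_MAX * PIX_MAX   # most positive possible sum

def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that can represent [lo, hi]."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n

print(signed_bits_needed(acc_min, acc_max))  # -> 16
```

Sizing the accumulator once for the worst case avoids per-kernel overflow checks in the datapath.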

3. Top-Level Project Module

  • The top-level module integrates the Buffer and Convolve modules into a complete streaming pipeline.
  • Routed streaming pixel input into the buffer subsystem.
  • Connected window outputs to the convolution datapath.
  • Managed global control signals including reset, enable, and clock.
  • Ensured proper timing alignment between window generation and convolution output.

The overall system processes pixels sequentially, triggering convolution only when a valid 3×3 window is formed.

Challenges we ran into

One of the biggest challenges was implementing correct padding logic while maintaining streaming behavior. Managing row/column boundaries and inserting zeros at the right time required careful synchronization with the line buffers.

Another major issue involved signed vs. unsigned arithmetic in Verilog. Since convolution kernels often contain negative coefficients, we had to carefully manage bit widths and sign extension to prevent incorrect results.
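The pitfall can be illustrated outside Verilog: the same 8-bit pattern produces very different products depending on whether it is read as signed or unsigned, which is exactly what happens in Verilog when a `signed` declaration or `$signed` cast is missing:

```python
# Why a missing signed declaration corrupts results: Verilog treats
# operands as unsigned by default, so the 8-bit pattern for -1 (0xFF)
# multiplies as 255 unless it is reinterpreted as two's complement.

def as_unsigned(bits8):
    return bits8 & 0xFF

def as_signed(bits8):
    v = bits8 & 0xFF
    return v - 256 if v & 0x80 else v  # two's-complement reinterpretation

coeff = -1 & 0xFF   # coefficient -1 stored as the 8-bit pattern 0xFF
pixel = 10

print(as_unsigned(coeff) * pixel)  # 2550: the buggy unsigned multiply
print(as_signed(coeff) * pixel)    # -10:  the intended signed result
```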

Simulation also posed challenges, particularly with clock synchronization mismatches and timing behavior between modules. Debugging required stepping through waveform outputs to verify window formation, valid signals, and arithmetic correctness.

Accomplishments that we're proud of

We successfully implemented a fully functional streaming convolution engine that produces correct output images. Seeing the visible effect of different kernel matrices (such as edge detection) on the processed image was a strong validation of both the mathematical theory and our hardware implementation.

Most importantly, we translated a concept from signals and systems into a working RTL design suitable for ASIC constraints.

What we learned

Verilog Design

  • On the design side, we gained hands-on experience implementing convolution as a hardware datapath rather than just a mathematical operation.
  • Learned how to design a streaming architecture using multi-line buffers instead of storing a full image in memory.
  • Implemented row/column tracking and boundary-aware zero-padding logic.
  • Managed signed vs. unsigned arithmetic in Verilog, including correct sign extension and bit-width control for negative kernel coefficients.
  • Understood how control logic and datapath must be synchronized to ensure valid window generation and correct output timing.

This strengthened our understanding of hardware dataflow, modular RTL design, and how theoretical signal processing maps directly into structured digital logic.

Verification & Simulation

  • On the verification side, we learned how to simulate and validate RTL designs using Python and Cocotb.
  • Wrote Python-based testbenches to generate clock signals, apply reset sequences, and drive input stimulus.
  • Programmatically set input values and control signals to test different scenarios.
  • Used assertions to verify expected output behavior.
  • Debugged timing mismatches and synchronization issues using waveform analysis.
  • Identified and resolved simulation issues related to clock alignment and signal validity.

This experience helped us understand how important verification is in hardware development and how automated simulation workflows are used to validate digital designs before fabrication.
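The checking pattern boils down to comparing sampled DUT outputs against a software golden model. A standalone sketch of that pattern (names are illustrative; in the actual cocotb testbench `dut_outputs` would be sampled from the simulator on output-valid cycles):

```python
# Golden-model checking pattern used in assertion-based verification:
# compute the expected result in software, then assert each DUT output.
# Here a hard-coded dut_outputs list stands in for values sampled
# from the simulator.

def golden_3x3(window, kernel):
    """Expected convolution result for one 3x3 window."""
    return sum(kernel[r][c] * window[r][c]
               for r in range(3) for c in range(3))

kernel = [[0, 0, 0],
          [0, 2, 0],
          [0, 0, 0]]          # scale-by-2 kernel: output = 2 * center

windows = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
           [[9, 8, 7], [6, 5, 4], [3, 2, 1]]]

dut_outputs = [10, 10]  # stand-in for sampled dut output values

for win, got in zip(windows, dut_outputs):
    expected = golden_3x3(win, kernel)
    assert got == expected, f"mismatch: got {got}, expected {expected}"
print("all outputs match")
```

Any mismatch fails the test with the offending value, which is far faster to triage than scanning waveforms pixel by pixel.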

What's next for Convolution Engine

Future improvements could include:

  • Parameterizable kernel sizes and image sizes
  • Customizable kernel properties
  • Higher throughput pipeline architecture
  • On-chip memory optimization
  • Support for multi-channel (RGB) image processing

Built With

  • verilog