Inspiration

Optimizing code for GPUs is notoriously difficult despite its great potential in AI training and high-performance computing. Writing efficient CUDA kernels requires deep expertise, making GPU acceleration inaccessible to many developers. We were inspired by the idea of democratizing GPU programming—leveraging LLMs to generate optimized GPU kernels from high-level descriptions. Our goal? Lower the barrier to entry, so that anyone can harness the power of NVIDIA GPUs without needing years of CUDA experience.

What it does

KERneL is an AI-powered kernel generation tool that takes high-level code descriptions and transforms them into optimized CUDA kernels. By leveraging large language models (LLMs), KERneL automates the process of writing efficient GPU code, allowing developers to:

✅ Generate CUDA kernels from simple prompts

✅ Optimize performance with AI-assisted tuning

✅ Reduce the learning curve for GPU acceleration

With KERneL, more developers can tap into GPU computing for AI, graphics, and scientific simulations effortlessly.

How we built it

Backend: We utilized NVIDIA Build Cloud to integrate LLMs like Qwen 2.5 7B and DeepSeek R1 for kernel preprocessing and generation, as well as OpenAI’s state-of-the-art tools for additional kernel optimization.
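As a rough illustration of the generation step, here is a minimal sketch of how a request to an OpenAI-compatible chat endpoint (like those on NVIDIA Build) might be assembled. The `build_kernel_request` helper, the system prompt wording, and the model identifier are illustrative assumptions, not our exact code:

```python
# Hypothetical sketch: building a chat-completion payload for an
# OpenAI-compatible endpoint such as NVIDIA Build. The helper name,
# prompt wording, and model id are illustrative assumptions.

SYSTEM_PROMPT = (
    "You are a CUDA expert. Given a PyTorch operation, emit an optimized "
    "CUDA kernel plus a launch wrapper. Return only code."
)

def build_kernel_request(op_description: str,
                         model: str = "qwen/qwen2.5-7b-instruct") -> dict:
    """Assemble the JSON body sent to the chat/completions endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Generate a CUDA kernel for: {op_description}"},
        ],
        "temperature": 0.2,  # low temperature for more deterministic code
        "max_tokens": 2048,
    }

req = build_kernel_request("element-wise vector addition of two float32 arrays")
print(req["model"])
```

The same payload shape works across the hosted models, which makes it easy to route preprocessing to one model and generation to another.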

Frontend: A user-friendly interface built with Streamlit allows seamless interaction for generating, testing, and refining CUDA kernels.

Cloud Computing: Hosted on Brev.dev Cloud with H100 GPUs for high-performance compute tasks, integrated through an ngrok tunnel for secure access.

Challenges we ran into

Backend-frontend integration: Coordinating data flow between the Flask API, compute instances, and the NVIDIA Build Cloud was complex.

Kernel validation: Ensuring the generated CUDA kernels met performance expectations required extensive testing and refinement.
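The core validation idea can be sketched as follows: run a candidate implementation against a trusted reference on random inputs and check both numerical agreement and latency. This is a simplified, CPU-only stand-in (the function names and the pure-Python "kernels" are placeholders, not our actual GPU harness):

```python
# Hedged sketch of kernel validation: compare a candidate implementation
# against a trusted reference for correctness (within a float tolerance)
# and average latency. All names here are illustrative placeholders.
import random
import time

def reference_add(a, b):
    # Trusted reference (stands in for the PyTorch baseline)
    return [x + y for x, y in zip(a, b)]

def candidate_add(a, b):
    # Stands in for the compiled, LLM-generated kernel
    return [x + y for x, y in zip(a, b)]

def validate(candidate, reference, n=10_000, tol=1e-6, trials=5):
    a = [random.random() for _ in range(n)]
    b = [random.random() for _ in range(n)]
    expected = reference(a, b)
    got = candidate(a, b)
    correct = all(abs(x - y) <= tol for x, y in zip(expected, got))
    start = time.perf_counter()
    for _ in range(trials):
        candidate(a, b)
    latency = (time.perf_counter() - start) / trials
    return correct, latency

ok, secs = validate(candidate_add, reference_add)
print(ok)
```

In the real pipeline, a kernel that fails the correctness check is sent back to the LLM with diagnostics rather than accepted on timing alone.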

LLM prompt engineering: Guiding the language models to understand PyTorch timing details and compiler logs was time-intensive.

Accomplishments that we're proud of

  1. Lower latency than a range of PyTorch Dynamo-compiled models
  2. An accessible entry point for beginners who want to maximize FLOP utilization on their GPUs
  3. Successful end-to-end generation and compilation of CUDA kernel code

We overcame these challenges! Above all, we are proud that we finished the project as a team, helping each other out and broadening our skill sets.

What's next for KERneL

Next updates will include the following features:

  1. Live interaction with the computational graph, letting users visualize which operations will be fused.
  2. A Copilot-style mode with an assistive agent.
