⚡ High-Performance Image Convolution Engine (CUDA)

"635x Faster than CPU." A bare-metal CUDA implementation of Gaussian Blur using Shared Memory Tiling and Memory Coalescing.

🎯 The Engineering Challenge

In high-performance computing (HPC) and computer vision, memory bandwidth is often the bottleneck. A naive GPU convolution re-reads the same pixels from VRAM (Global Memory) many times: with a 3x3 kernel, each input pixel is requested by up to nine different threads, which wastes bandwidth.

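This project implements a Tiled Convolution Kernel that stages each tile in Shared Memory (programmer-managed on-chip memory, as opposed to the hardware-managed L1 cache) so that each input pixel is read from Global Memory roughly once per block. For a 3x3 kernel this cuts Global Memory reads by up to ~9x (about 89%) before halo overhead is counted.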
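For orientation, the per-pixel work that every implementation here performs can be sketched as a plain C++ reference. This is a minimal sketch, not the project's code: the function name `blur3x3_reference`, the binomial weights, the single-channel 8-bit layout, and the clamp-to-edge border policy are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reference 3x3 blur on a single-channel, row-major 8-bit image.
// ASSUMPTION: binomial weights [1 2 1; 2 4 2; 1 2 1] / 16 and clamp-to-edge
// borders; the project's actual weights and border handling may differ.
std::vector<std::uint8_t> blur3x3_reference(const std::vector<std::uint8_t>& src,
                                            int width, int height) {
    static const int kWeights[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Each output pixel reads 9 inputs; neighbouring outputs re-read
            // most of the same inputs, which is the redundancy tiling removes.
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += kWeights[dy + 1][dx + 1] * src[sy * width + sx];
                }
            }
            // Weights sum to 16; add 8 before dividing to round to nearest.
            dst[y * width + x] = static_cast<std::uint8_t>((sum + 8) / 16);
        }
    }
    return dst;
}
```

A reference like this is also handy for checking GPU output pixel by pixel on a small image.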

🛠️ Tech Stack

Language: C++ (Host), CUDA C (Device)
Optimization: Shared Memory Tiling, Loop Unrolling
Libraries: No GPU or vision libraries (kernels written from scratch; stb_image is used only for image I/O)
Hardware: Optimized for NVIDIA Pascal/Ampere/Ada Architectures

📊 Performance Benchmarks

Tested on a 4K Image (3840 x 2160 pixels).

| Implementation | Execution Time | Speedup vs CPU | Engineering Note |
| --- | --- | --- | --- |
| CPU (Single Thread) | 947.05 ms | 1x | Baseline. Loop-heavy, cache-inefficient. |
| GPU (Naive Global Mem) | 1.79 ms | 529x | Massively parallel, but Memory Bound. |
| GPU (Shared Mem Tiled) | 1.49 ms | 635x | Optimized. Uses Shared Memory to handle "Halo" pixels. |

🧠 Key Optimizations Implemented

1. Shared Memory Tiling

Instead of every thread reading 9 pixels from VRAM (Global Memory), threads in a block cooperate to load a 16x16 Tile (plus border) into Shared Memory once.

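Result: Global Memory loads per block drop by up to ~9x for a 3x3 kernel. Once the 1-pixel border is counted (18x18 = 324 loads instead of 16x16x9 = 2304), the reduction is about 7x.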

2. Handling the "Halo" (Ghost Cells)

Convolution requires neighbor pixels. Threads at the edge of a block need data from the next block.

Solution: Implemented "Apron" loading logic where edge threads perform double-duty to load the halo into the Shared Memory cache.
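One common way to fill the tile and its halo is a strided cooperative load. The CPU-side emulation below sketches that variant. It is not necessarily the edge-thread scheme this project uses: here the extra loads fall to the lowest-numbered threads rather than to threads on the block's edge. The names `load_tile_with_halo`, `kTile`, `kRadius`, and `kShared`, along with the clamp-to-edge border policy, are assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

constexpr int kTile = 16;                     // block is kTile x kTile threads
constexpr int kRadius = 1;                    // 3x3 kernel needs a 1-pixel halo
constexpr int kShared = kTile + 2 * kRadius;  // 18 x 18 staged tile

// Emulates one thread block filling its shared tile, halo included.
// loads_per_thread records how many elements each emulated thread fetched.
std::vector<int> load_tile_with_halo(const std::vector<int>& image,
                                     int width, int height,
                                     int block_x, int block_y,
                                     std::vector<int>* loads_per_thread) {
    const int threads = kTile * kTile;
    const int cells = kShared * kShared;
    std::vector<int> shared(cells, 0);
    loads_per_thread->assign(threads, 0);
    // Top-left corner of the haloed tile in image coordinates.
    const int origin_x = block_x * kTile - kRadius;
    const int origin_y = block_y * kTile - kRadius;
    for (int tid = 0; tid < threads; ++tid) {   // parallel on a real GPU
        // Thread tid loads cells tid, tid + threads, ... so that 256 threads
        // cover all 324 cells; the first 68 threads load two cells each.
        for (int i = tid; i < cells; i += threads) {
            const int sx = i % kShared;
            const int sy = i / kShared;
            // Clamp so halo cells outside the image repeat the edge pixel.
            const int gx = std::min(std::max(origin_x + sx, 0), width - 1);
            const int gy = std::min(std::max(origin_y + sy, 0), height - 1);
            shared[i] = image[gy * width + gx];
            ++(*loads_per_thread)[tid];
        }
    }
    // A real kernel needs __syncthreads() here before any thread reads the tile.
    return shared;
}
```

In a kernel, the outer loop disappears (each thread runs its own `tid`) and `shared` becomes a `__shared__` array. The barrier after the load is what makes it safe for a thread to read cells that other threads wrote.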

3. Memory Coalescing

Input data is loaded in a coalesced manner (consecutive threads read consecutive memory addresses) to maximize memory bus utilization.
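The effect of the access pattern can be illustrated by counting how many memory sectors one warp touches. This is a simplified model under stated assumptions: a 32-thread warp, 32-byte sectors, and one element read per thread. The function name `sectors_touched` is hypothetical.

```cpp
#include <cstddef>
#include <set>

// Counts distinct 32-byte sectors touched when each of 32 lanes reads one
// element at index base_index + lane * stride_elems.
// ASSUMPTION: 32-byte sector granularity and a base pointer aligned to it.
int sectors_touched(std::size_t base_index, std::size_t stride_elems,
                    std::size_t elem_bytes) {
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        const std::size_t byte_addr = (base_index + lane * stride_elems) * elem_bytes;
        sectors.insert(byte_addr / 32);
    }
    return static_cast<int>(sectors.size());
}
```

With 4-byte pixels, consecutive lanes reading consecutive elements (`stride_elems = 1`) touch 4 sectors, the minimum for 128 bytes. Lanes that walk down a column of a 3840-wide image (`stride_elems = 3840`) touch 32 sectors for the same amount of useful data. A misaligned start (`base_index = 1`) costs one extra sector.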

🏃 Quick Start

Prerequisites: NVIDIA GPU, CUDA Toolkit 11+.

Clone:

git clone https://github.com/iamartyaa/cuda-convolution.git
cd cuda-convolution

Compile:
```
nvcc cuda_blur.cu -o gpu_blur
```
Run:
```
./gpu_blur
```

📈 Visualizing the Speedup

Run the included Python visualization script to see a simulated side-by-side comparison:

python visualize_video.py

(This renders a real-time comparison window showing the scanline difference between CPU and GPU speeds.)

Author

Amartya Yadav, GPU & High-Performance Computing Enthusiast
