Optimizing CUDA Kernels For High-Performance Canny Edge Detection

26 Aug

Authors: Harshul Gupta, Rishita Yadav

 

Abstract: We present a comprehensive optimization of the Canny Edge Detection algorithm using NVIDIA CUDA to achieve high-performance GPU acceleration. Our approach combines warp-level execution, shared memory tiling, memory coalescing, and page-locked transfers to improve computational efficiency. Evaluated on a subset of the ILSVRC 2017 dataset at six image resolutions (128×128 to 1024×1024), the optimized kernels achieve up to a 56× speedup over a baseline OpenCV implementation, with the shared memory version reaching 83.3% compute and memory throughput. Profiling with NVIDIA Nsight Compute shows that memory transfers remain the primary bottleneck, although asynchronous transfers and stream concurrency improve performance further. These results demonstrate the potential of advanced CUDA optimization techniques for real-time, high-resolution edge detection in large-scale image processing applications.

DOI: https://doi.org/10.5281/zenodo.16948653
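
Illustrative sketch (not the authors' code): the abstract names shared memory tiling, page-locked transfers, and asynchronous stream copies without showing them, so the following minimal CUDA example indicates how those pieces might fit together for the Sobel gradient stage of Canny. The names `sobelTiled`, `sobelOnGpu`, and `TILE`, the 16×16 tile size, and the clamp-to-edge border handling are all assumptions made for illustration; the paper's actual kernels may differ substantially.

```cuda
#include <cstring>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // hypothetical tile width, not taken from the paper

// Sobel gradient magnitude. Each block stages a (TILE+2)x(TILE+2) patch
// (tile plus 1-pixel halo) in shared memory so that the 3x3 stencil reads
// come from on-chip memory instead of global memory.
__global__ void sobelTiled(const unsigned char* in, float* out, int w, int h) {
    __shared__ unsigned char tile[TILE + 2][TILE + 2];
    const int x0 = blockIdx.x * TILE - 1;  // image x of tile column 0
    const int y0 = blockIdx.y * TILE - 1;  // image y of tile row 0

    // Cooperative load: TILE*TILE threads fill (TILE+2)^2 entries.
    // Consecutive threads read consecutive pixels within a patch row.
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        const int ty = i / (TILE + 2), tx = i % (TILE + 2);
        const int gx = min(max(x0 + tx, 0), w - 1);  // clamp to image border
        const int gy = min(max(y0 + ty, 0), h - 1);
        tile[ty][tx] = in[gy * w + gx];
    }
    __syncthreads();

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= w || y >= h) return;  // safe: no barrier follows

    const int cx = threadIdx.x + 1, cy = threadIdx.y + 1;
    const float gx = (tile[cy - 1][cx + 1] + 2 * tile[cy][cx + 1] + tile[cy + 1][cx + 1])
                   - (tile[cy - 1][cx - 1] + 2 * tile[cy][cx - 1] + tile[cy + 1][cx - 1]);
    const float gy = (tile[cy + 1][cx - 1] + 2 * tile[cy + 1][cx] + tile[cy + 1][cx + 1])
                   - (tile[cy - 1][cx - 1] + 2 * tile[cy - 1][cx] + tile[cy - 1][cx + 1]);
    out[y * w + x] = sqrtf(gx * gx + gy * gy);
}

// Host wrapper: stages data in page-locked (pinned) buffers and issues the
// copies and the kernel on one stream. Returns the first CUDA error, if any.
cudaError_t sobelOnGpu(const unsigned char* src, float* dst, int w, int h) {
    const size_t n = static_cast<size_t>(w) * h;
    unsigned char *hIn = nullptr, *dIn = nullptr;
    float *hOut = nullptr, *dOut = nullptr;
    cudaStream_t stream = nullptr;
    const dim3 block(TILE, TILE);
    const dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    cudaError_t err;

    if ((err = cudaMallocHost((void**)&hIn, n)) != cudaSuccess) goto done;
    if ((err = cudaMallocHost((void**)&hOut, n * sizeof(float))) != cudaSuccess) goto done;
    if ((err = cudaMalloc((void**)&dIn, n)) != cudaSuccess) goto done;
    if ((err = cudaMalloc((void**)&dOut, n * sizeof(float))) != cudaSuccess) goto done;
    if ((err = cudaStreamCreate(&stream)) != cudaSuccess) goto done;

    std::memcpy(hIn, src, n);
    if ((err = cudaMemcpyAsync(dIn, hIn, n, cudaMemcpyHostToDevice, stream)) != cudaSuccess) goto done;
    sobelTiled<<<grid, block, 0, stream>>>(dIn, dOut, w, h);
    if ((err = cudaGetLastError()) != cudaSuccess) goto done;
    if ((err = cudaMemcpyAsync(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost, stream)) != cudaSuccess) goto done;
    if ((err = cudaStreamSynchronize(stream)) != cudaSuccess) goto done;
    std::memcpy(dst, hOut, n * sizeof(float));

done:
    if (stream) cudaStreamDestroy(stream);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return err;
}
```

A caller would pass a row-major 8-bit grayscale image and a float buffer of the same pixel count, for example `sobelOnGpu(image, magnitude, 1024, 1024)`. With a single stream the copies and the kernel still run in order; the stream concurrency mentioned in the abstract would require several streams working on different images or image strips, which this sketch leaves out.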