Virtual Canvas: A Dual-Pipeline Benchmark of MediaPipe and YOLOv11-Pose

12 May

Authors: Jatin Jain, Dr. Sakshi Indolia

Abstract: This paper presents Virtual Canvas, a real-time touchless drawing application that simultaneously executes two hand-pose estimation pipelines — Google MediaPipe Hands and a pre-trained YOLOv11-Pose model — on every captured webcam frame. The dual-pipeline architecture eliminates input variance between models, enabling a controlled side-by-side experimental comparison under identical real-world conditions. Across 11 sessions spanning 10 days of live evaluation on CPU-only hardware, 1,391 performance samples were captured at 500 ms intervals via automated CSV logging, covering inference latency, frames per second, CPU utilisation, hand detection counts, and lighting conditions. Results demonstrate that MediaPipe achieves a 2.35× lower mean inference latency than YOLOv11-Pose (t = −120.68, p < 0.0001, Cohen’s d = 4.58). Under dim lighting, YOLOv11-Pose inference variance increased by 133% while MediaPipe remained stable, though MediaPipe latency itself rose by 18.2%. YOLOv11-Pose exhibited systematic over-detection, reporting two hands in 81.6% of single-hand frames. Exponential Moving Average (EMA) smoothing (α = 0.35) and 5-frame gesture debouncing enabled fluid drawing interaction despite sub-5 FPS dual-pipeline throughput. The system provides a practical, data-driven benchmarking framework for selecting between lightweight pre-trained detectors and heavier single-stage models in human-computer interaction applications.
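The two interaction-stabilisation techniques named in the abstract — EMA smoothing with α = 0.35 and 5-frame gesture debouncing — can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the class and method names are hypothetical, and only the two parameter values are taken from the abstract.

```python
ALPHA = 0.35          # EMA weight reported in the abstract
DEBOUNCE_FRAMES = 5   # consecutive frames required before a gesture change is accepted

class PointerStabiliser:
    """Illustrative sketch of EMA pointer smoothing plus gesture debouncing."""

    def __init__(self):
        self.smoothed = None          # last EMA-smoothed (x, y) position
        self.active_gesture = None    # gesture currently in effect
        self.candidate = None         # gesture awaiting confirmation
        self.count = 0                # consecutive frames showing the candidate

    def smooth(self, x, y):
        """Exponential moving average: new = alpha*raw + (1 - alpha)*previous."""
        if self.smoothed is None:
            self.smoothed = (x, y)
        else:
            px, py = self.smoothed
            self.smoothed = (ALPHA * x + (1 - ALPHA) * px,
                             ALPHA * y + (1 - ALPHA) * py)
        return self.smoothed

    def update_gesture(self, gesture):
        """Switch gestures only after DEBOUNCE_FRAMES consecutive observations."""
        if gesture == self.active_gesture:
            self.candidate, self.count = None, 0
        elif gesture == self.candidate:
            self.count += 1
            if self.count >= DEBOUNCE_FRAMES:
                self.active_gesture = gesture
                self.candidate, self.count = None, 0
        else:
            self.candidate, self.count = gesture, 1
        return self.active_gesture
```

The debounce suppresses single-frame misclassifications (such as the spurious extra-hand detections reported for YOLOv11-Pose), while the EMA keeps the drawn stroke fluid even at the sub-5 FPS dual-pipeline throughput the paper measures.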

DOI: https://doi.org/10.5281/zenodo.20135316