Authors: Deepak Varadam, Rishi Viswanatha Subramani, Anish Chhetri, Sharan Mathan Mattachotil, Elaizah Foning Ramsong
Abstract: Automatic Music Transcription (AMT) remains a fundamental challenge in Music Information Retrieval (MIR), particularly when generalizing across instruments with divergent acoustic signatures. This paper presents a hybrid deep learning architecture that performs polyphonic pitch estimation by leveraging the complementary strengths of Convolutional Neural Networks (CNNs) and Transformers. Our methodology uses the Constant-Q Transform (CQT) to provide a musically aligned time-frequency representation, followed by a CNN-based acoustic frontend that extracts local spectro-temporal features such as harmonic structures and percussive attacks. These features are then processed by a Transformer Encoder backend, whose multi-head self-attention mechanisms model long-range temporal dependencies and polyphonic relationships across an 88-key output space. To address the performance gap often observed in cross-domain scenarios, we evaluate the model on the MAESTRO (piano) and GuitarSet (guitar) datasets. Initial results indicate that models trained exclusively on piano data suffer a significant recall deficit when applied to string instruments, owing to fundamental differences in excitation and resonance patterns; however, they remain accurate at identifying notes and at detecting the end of a frame (note offsets). To mitigate the error in detecting the start of a frame (note onsets), we propose a joint-training strategy that artificially oversamples the smaller GuitarSet corpus to counteract dataset imbalance. Experimental results demonstrate that the proposed hybrid model achieves high F1-scores across both domains, benefiting from the CNN's local feature extraction and the Transformer's global context modeling. Furthermore, we provide a detailed computational profile of the architecture, demonstrating its efficiency for real-time inference applications.
The system is deployed as a web-based application that generates standardized sheet music and guitar tablature from raw audio input.
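The pipeline described above — CQT frames passed through a CNN frontend, self-attention over time, and an 88-key activation head — can be sketched at the shape level. The following is a minimal illustrative sketch, not the paper's implementation: all dimensions are hypothetical, a random linear map stands in for the CNN frontend, and a single attention head stands in for the multi-head Transformer encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T time frames, F CQT bins, D model width, 88 piano keys.
T, F, D, KEYS = 16, 84, 32, 88

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the time axis."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T) frame-to-frame weights
    return scores @ v

# Stand-in for the CQT of an audio clip: one F-dim spectrum per frame.
cqt = rng.standard_normal((T, F))

# Stand-in for the CNN acoustic frontend: project each frame to a D-dim feature.
W_in = rng.standard_normal((F, D)) * 0.1
frames = cqt @ W_in

# Transformer-style global context across frames.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
context = self_attention(frames, Wq, Wk, Wv)

# Frame-level 88-key head: independent sigmoid per key (piano-roll output).
W_out = rng.standard_normal((D, KEYS)) * 0.1
piano_roll = 1 / (1 + np.exp(-(context @ W_out)))
print(piano_roll.shape)  # (16, 88)
```

Each row of the attention matrix sums to 1, so every output frame is a weighted mixture of all input frames — this is the mechanism that lets the backend relate a note onset to context far away in time, which a purely convolutional model with a fixed receptive field cannot do.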
DOI: https://doi.org/10.5281/zenodo.19217634
International Journal of Science, Engineering and Technology