Authors: Aman Verma, Prince Aaryan, Mr. Ankur Kaushik
Abstract: Multilingual document processing presents considerable challenges, particularly when content exists in non-searchable visual formats across diverse scripts and languages. Traditional optical character recognition (OCR) systems frequently suffer from cascading errors in such complex environments, which limits accessibility and automated comprehension. The problem intensifies for languages with unique orthographies or those considered low-resource, where digital tools and datasets remain scarce. This paper introduces LangDoc, a hybrid artificial intelligence (AI)-powered framework designed to overcome these linguistic and accessibility barriers. LangDoc departs from conventional "flat" OCR systems by adopting a novel Script-First approach, which performs accurate visual script identification as an initial, critical step and thereby limits error propagation to subsequent processing stages. The architecture integrates a fine-tuned YOLOv8 model for robust visual script identification, whose output dynamically routes the input to a Tesseract OCR engine specialized for the detected script for precise text extraction. For multilingual interpretation, the system employs the M2M100 Many-to-Many Transformer, which translates directly between any pair of its 100 supported languages without pivoting through English. In addition, Google Gemini 2.5 Flash augments the framework with context-aware reasoning and a conversational interface for interactive document comprehension. Experimental evaluations demonstrate significant reductions in Word Error Rate (WER) and higher Bilingual Evaluation Understudy (BLEU) scores, particularly for regional Indian languages, validating the efficacy of this integrated approach.
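To make the Script-First idea and the WER metric concrete, the following is a minimal, dependency-free sketch, not the authors' implementation. It assumes the visual script identifier returns a script label with a confidence score; the label names, the `SCRIPT_ROUTES` table, the one-language-per-script mapping, the `route_script` helper, and the 0.5 confidence threshold with an English fallback are all hypothetical choices for illustration. The Tesseract codes (`hin`, `ben`, `tam`, `guj`, `eng`) and M2M100 codes (`hi`, `bn`, `ta`, `gu`, `en`) are the standard identifiers those tools use. WER is computed in its usual form, (S + D + I) / N, from a word-level Levenshtein alignment.

```python
from dataclasses import dataclass

# Hypothetical routing table: script label from the visual script identifier
# -> (Tesseract traineddata code, M2M100 source-language code).
# Mapping each script to a single language is a simplification (Devanagari,
# for example, is also used for Marathi and Nepali).
SCRIPT_ROUTES = {
    "Devanagari": ("hin", "hi"),
    "Bengali": ("ben", "bn"),
    "Tamil": ("tam", "ta"),
    "Gujarati": ("guj", "gu"),
    "Latin": ("eng", "en"),
}


@dataclass(frozen=True)
class Route:
    tesseract_lang: str   # value to pass as Tesseract's language setting
    m2m100_src_lang: str  # source-language code for the M2M100 tokenizer


def route_script(label: str, confidence: float, threshold: float = 0.5) -> Route:
    """Choose OCR and translation settings from the identified script.

    Hypothetical policy: fall back to English when the detector's confidence
    is below `threshold` or the script label is not in the routing table.
    """
    if confidence < threshold or label not in SCRIPT_ROUTES:
        return Route("eng", "en")
    tess, m2m = SCRIPT_ROUTES[label]
    return Route(tess, m2m)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds the previous DP row: edit distances from a prefix of ref
    # to every prefix of hyp (initially the empty ref prefix, so j insertions).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)  # i deletions to reach an empty hyp
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(route_script("Tamil", 0.91))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Here `route_script("Tamil", 0.91)` yields `Route(tesseract_lang='tam', m2m100_src_lang='ta')`, and the WER example is 2/6 (one substitution, one deletion). In a full pipeline the two codes would plausibly be passed to `pytesseract.image_to_string(image, lang=...)` and to the M2M100 tokenizer's `src_lang`; routing on the script before OCR is what keeps a wrong language model from corrupting every downstream stage.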
DOI: http://doi.org/
International Journal of Science, Engineering and Technology