LangDoc: A Hybrid AI-Powered Framework for Multi-lingual Document Understanding in Non-Searchable Visual Formats
Authors: Aman Verma, Prince Aaryan, Mr. Ankur Kaushik
Abstract: Multilingual document processing presents considerable challenges, particularly when content exists in non-searchable visual formats across diverse scripts and languages. Traditional optical character recognition (OCR) systems frequently encounter cascading errors in such complex environments, limiting accessibility and automated comprehension. The problem intensifies for languages with unique orthographies or those considered low-resource, where digital tools and datasets remain scarce. This document introduces LangDoc, a hybrid artificial intelligence (AI)-powered framework engineered to overcome these linguistic and accessibility barriers. LangDoc deviates from conventional "flat" OCR systems by adopting a novel Script-First approach. This methodology prioritizes accurate visual script identification as an initial, critical step, mitigating error propagation in subsequent processing stages. The architecture integrates a fine-tuned YOLOv8 model for robust visual script identification, dynamically routing to a specialized Tesseract OCR engine for precise text extraction. For multilingual interpretation, the system employs the M2M100 Many-to-Many Transformer, enabling direct translation across more than 100 languages. Furthermore, the incorporation of Google Gemini 2.5 Flash augments the framework with context-aware reasoning and a conversational interface, facilitating interactive document comprehension. Experimental evaluations demonstrate significant reductions in Word Error Rate (WER) and superior Bilingual Evaluation Understudy (BLEU) scores, particularly for regional Indian languages, thereby validating the efficacy of this integrated approach.
DOI: http://doi.org/
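The Script-First routing step described in the LangDoc abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the script labels, the confidence threshold, the fallback language, and the names `ScriptDetection` and `select_ocr_language` are all assumptions made for illustration. The only external facts relied on are the standard Tesseract language codes (`hin`, `ben`, `tam`, `eng`). The detector itself (a fine-tuned YOLOv8 model in the abstract) is represented here only by its output.

```python
from dataclasses import dataclass

# Hypothetical mapping from a detected script label to a Tesseract
# language code. The labels are illustrative; the codes are standard.
SCRIPT_TO_TESSERACT_LANG = {
    "Devanagari": "hin",
    "Bengali": "ben",
    "Tamil": "tam",
    "Latin": "eng",
}


@dataclass
class ScriptDetection:
    """Output of a (hypothetical) visual script detector for one region."""
    script: str
    confidence: float


def select_ocr_language(detection, min_confidence=0.5, fallback="eng"):
    """Choose the OCR language for a region from its detected script.

    Low-confidence or unknown scripts fall back to a default language
    (an assumed policy, not one stated in the abstract).
    """
    if detection.confidence < min_confidence:
        return fallback
    return SCRIPT_TO_TESSERACT_LANG.get(detection.script, fallback)
```

The returned code would then be passed to the OCR engine as its language setting, so that each region is read by a model matched to its script rather than by a single general-purpose configuration.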
Phishing Site URL Detection System
Authors: Khushi Agarwal, Vimal Kartik, Professor Shubhi Verma
Abstract: The rapid growth of internet usage has led to a significant rise in phishing attacks, posing serious threats to user security and data privacy. Traditional detection methods, such as blacklist-based approaches, are often ineffective against newly emerging and sophisticated phishing websites. This research presents a Machine Learning-Based Phishing Website URL Detection System designed to identify malicious URLs in real time with high accuracy. The proposed system utilizes multiple machine learning algorithms and extracts key features from URL structures and domain characteristics to effectively distinguish between legitimate and phishing websites. It is integrated with a user-friendly web interface that enables instant URL analysis and prediction. Experimental results demonstrate that the system achieves reliable performance across diverse scenarios, providing a scalable and efficient solution for enhancing web security. The proposed approach reduces reliance on traditional methods and offers proactive protection against evolving phishing threats.
DOI: http://doi.org/
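The phishing abstract mentions extracting features from URL structure and domain characteristics without listing them. The sketch below shows what such lexical feature extraction might look like using only the Python standard library. The specific features and the function name `extract_url_features` are assumptions chosen because they are common in the phishing-detection literature; they are not taken from the paper.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (range of each octet
# is not validated; this is a lexical check only).
_IPV4_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def extract_url_features(url):
    """Return a dict of simple lexical features for one URL.

    The feature set is illustrative, not the one used in the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "has_ip_host": bool(_IPV4_PATTERN.fullmatch(host)),
        "uses_https": parsed.scheme == "https",
    }
```

A feature dictionary like this could be converted to a numeric vector and passed to any standard classifier; which algorithms the authors used is not stated in the abstract.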
International Journal of Science, Engineering and Technology