Generative AI Models for Intelligent Data Quality Assessment and Integrity Repair in Big Data Engineering Workflows

15 Dec

Authors: Anika Deshpande, Vikram Chauhan, Priya Nair, Vasudev Sharma

Abstract: The rapid expansion of big data engineering pipelines has intensified challenges of data quality, consistency, and trustworthiness across distributed and heterogeneous environments. Traditional rule-based validation and manual remediation struggle to scale with the volume, velocity, and structural diversity of modern data workflows. This study proposes a generative AI-driven framework for intelligent data quality assessment and automated integrity repair within big data engineering ecosystems. Using large language models and sequence-to-sequence generative architectures, the framework performs contextual anomaly detection, semantic consistency validation, and adaptive correction of structural, syntactic, and referential data defects. The approach integrates into both batch and streaming pipelines, supporting proactive monitoring and self-healing across the ingestion, transformation, and storage layers. Experimental evaluations show improvements in data completeness, accuracy, and lineage consistency while significantly reducing manual intervention and remediation latency. The findings highlight the potential of generative AI to shift data quality management from reactive validation to autonomous, intelligence-driven governance, positioning it as a foundational capability for next-generation data engineering platforms.

DOI: https://doi.org/10.5281/zenodo.17937528