Authors: Sriram Ghanta
Abstract: Growing demand for instant model decisions inside Java-based services has exposed a persistent gap between modern machine learning workloads and the latency bounds expected in production environments. To address this challenge, this work investigates how targeted ONNX Runtime optimization can reshape inference behavior on Java platforms, where thread scheduling, memory pressure, and JNI transitions often limit responsiveness. A mixed-methods research design was adopted, blending quantitative benchmarking of diverse model architectures with qualitative examination of execution traces and runtime diagnostics captured under variable system load. The methodological approach incorporates controlled experiments, iterative tuning cycles, operator-level adjustments, and session-configuration refinement to capture both performance gains and behavioral patterns. The findings reveal that carefully structured ONNX Runtime adjustments can reduce inference delays, stabilize throughput, and deliver predictable real-time behavior without redesigning models or altering the overarching service architecture. The contribution lies in an applied tuning framework that is reproducible, environment-aware, and directly actionable for Java engineers deploying ML capabilities in microservices, event-streaming pipelines, or edge-oriented systems. Beyond its technical value, the work highlights how runtime-centric optimization can strengthen operational reliability in industries that require rapid and consistent decision responses, offering a strategic pathway for integrating machine intelligence into latency-constrained JVM ecosystems.
International Journal of Science, Engineering and Technology
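The session-configuration refinement the abstract refers to can be illustrated with a minimal sketch using the official ONNX Runtime Java API (`ai.onnxruntime`). The thread-count heuristic and the `"model.onnx"` path below are illustrative assumptions, not values or models from the paper; the paper's own tuning framework is derived empirically per environment.

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

public class TunedSession {
    // Builds SessionOptions tuned for low-latency JVM inference.
    public static OrtSession.SessionOptions buildOptions() throws OrtException {
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        // Hypothetical heuristic: cap intra-op threads at half the logical
        // cores to reduce contention with the service's own thread pools.
        int cores = Math.max(1, Runtime.getRuntime().availableProcessors() / 2);
        opts.setIntraOpNumThreads(cores);
        // Run graph nodes sequentially to avoid oversubscription under load.
        opts.setInterOpNumThreads(1);
        opts.setExecutionMode(OrtSession.SessionOptions.ExecutionMode.SEQUENTIAL);
        // Enable all graph-level optimizations (constant folding, fusion).
        opts.setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT);
        // Pre-plan allocations when input shapes are fixed.
        opts.setMemoryPatternOptimization(true);
        return opts;
    }

    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "model.onnx" is a placeholder path for illustration only.
        try (OrtSession.SessionOptions opts = buildOptions();
             OrtSession session = env.createSession("model.onnx", opts)) {
            System.out.println("Inputs: " + session.getInputNames());
        }
    }
}
```

The key trade-off sketched here is pinning parallelism explicitly rather than accepting ONNX Runtime defaults, so inference threads do not compete with the JVM's request-handling threads under variable system load.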