Modular Software Implementation of an Edge Voice Chatbot: Solving Latency and Privacy through Concurrency and Adaptive Listening
Thesis available as full text in PDF format.
Historically, conversational artificial intelligence systems have relied on continuous, cloud-based processing to handle computationally intensive workloads. While effective, this approach frequently introduces substantial communication latency and raises concerns about data privacy and security. In recent years, local edge computing has emerged as a viable alternative. Deploying language models and voice processing applications directly on consumer-grade hardware mitigates both the privacy and the latency issues, which in turn enables natural, responsive, and secure spoken communication with machines.
The primary objective of this research was to develop and implement a fully local, modular voice chatbot application in Python. The system integrated a pipeline comprising Speech-to-Text (STT) transcription, a Large Language Model (LLM), and Text-to-Speech (TTS) synthesis. To ensure practical viability, the architecture was implemented and evaluated on consumer-grade edge hardware. Specifically, an Nvidia RTX 3060 GPU with six gigabytes of video random access memory (VRAM) served as the practical minimum baseline for running all pipeline components simultaneously without memory failure. In the initial phase of the research, a baseline sequential pipeline used a rigid five-second recording loop, which inadvertently forced the transcription model to process ambient room silence. This caused the transcription model to hallucinate, producing nonsensical text and repeating numerical sequences. The continuous processing of silence in turn degraded the language model's context window and precipitated Out-of-Memory (OOM) crashes, critically undermining conversational naturalness.
To resolve these system failures, the architecture was developed further in modular phases. Phase 2 implemented adaptive listening by integrating a lightweight Voice Activity Detection (VAD) module. Running entirely on the central processing unit to preserve GPU memory, this module ended recording dynamically after approximately 640 milliseconds of consecutive silence. In Phase 3, the system was upgraded to a concurrent, multithreaded pipeline. The monolithic loop was decoupled into isolated threads that communicate asynchronously through bounded queues (queues created with a maxsize limit) to eliminate sequential processing bottlenecks.
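The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the 32 ms frame length, the queue size, and the sentinel-based shutdown are assumptions made for illustration, and the STT, LLM, and TTS stages are passed in as plain callables rather than real models. Only the 640 ms silence limit and the use of maxsize-bounded queues between threads come from the text.

```python
import queue
import threading

SILENCE_LIMIT_MS = 640  # from the text: stop after ~640 ms of consecutive silence
FRAME_MS = 32           # assumed frame length; not stated in the source


def record_until_silence(frames, is_speech, frame_ms=FRAME_MS,
                         silence_limit_ms=SILENCE_LIMIT_MS):
    """Collect frames from the first speech frame until enough silence follows.

    `frames` is any iterable of audio frames and `is_speech` is a hypothetical
    VAD callback returning True for speech. Leading and trailing silence are
    dropped so that silence is never handed to the transcription model.
    """
    needed = -(-silence_limit_ms // frame_ms)  # ceiling division
    captured, silent_run, heard_speech = [], 0, False
    for frame in frames:
        if is_speech(frame):
            heard_speech, silent_run = True, 0
        elif heard_speech:
            silent_run += 1
        if heard_speech:
            captured.append(frame)
            if silent_run >= needed:
                break
    return captured[:len(captured) - silent_run]


STOP = object()  # sentinel that tells each stage to shut down


def stage(worker, inbox, outbox):
    """Run one pipeline stage: take an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        outbox.put(worker(item))


def run_pipeline(utterances, transcribe, generate, synthesize, maxsize=4):
    """Connect three stage threads with bounded queues and collect the output."""
    q_audio, q_text, q_reply, q_speech = (
        queue.Queue(maxsize=maxsize) for _ in range(4)
    )

    def feed():
        # A blocking put() on a full queue applies backpressure to the producer.
        for utterance in utterances:
            q_audio.put(utterance)
        q_audio.put(STOP)

    workers = [
        threading.Thread(target=feed, daemon=True),
        threading.Thread(target=stage, args=(transcribe, q_audio, q_text), daemon=True),
        threading.Thread(target=stage, args=(generate, q_text, q_reply), daemon=True),
        threading.Thread(target=stage, args=(synthesize, q_reply, q_speech), daemon=True),
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = q_speech.get()
        if item is STOP:
            return results
        results.append(item)
```

Bounding each queue means a slow stage makes the upstream stages wait instead of letting unprocessed items pile up in memory, which matters on hardware with limited memory. The sketch omits error handling: an exception in one stage would leave the consumer waiting, so a real deployment would need to forward failures as well as results.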
Together, these changes eliminated the silence-induced hallucinations and the memory crashes. The final multithreaded deployment operated stably with a steady-state memory footprint of approximately 3.3 gigabytes of VRAM, demonstrating that responsive, natural, and low-latency conversational artificial intelligence can be deployed locally on constrained edge hardware.
