Building a Real-Time Voicebot with a Microservices Architecture 🗣️
Here's a detailed breakdown of the voicebot's architecture and workflow.
The system is a microservices application in which each part (the React frontend, the Node.js backend, and the FastAPI backend) specializes in a particular role. This separation of concerns is critical for the performance and scalability a real-time conversational AI requires.
The Frontend: The Client-Side Experience (React + Vite) 🤳
The React frontend is responsible for the entire user-facing interaction. It's built with a focus on a fluid, real-time experience.
Audio Capture and Processing: The process starts with the user clicking "Start." The useCallSession hook requests microphone access and sets up an AudioContext pipeline. Rather than shipping the raw audio stream, it uses a ScriptProcessorNode to process the incoming audio in chunks. A key part of this is the downsampleTo16k utility function, which converts the high-fidelity microphone input (typically 48kHz) down to 16kHz, the sample rate the Whisper ASR model expects. This significantly reduces both the bandwidth and the computational load on the backend.
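A minimal sketch of what a downsampler like downsampleTo16k might look like is below, assuming Float32 input from the Web Audio API, a simple averaging decimation, and signed 16-bit PCM output; those details are assumptions, and the project's actual implementation may differ.

```ts
// Illustrative sketch of a 48kHz -> 16kHz downsampler. The averaging strategy
// and the Int16 output format are assumptions, not the project's actual code.
const TARGET_RATE = 16_000;

function downsampleTo16k(input: Float32Array, inputRate: number): Int16Array {
  if (inputRate === TARGET_RATE) return floatTo16BitPCM(input);

  const ratio = inputRate / TARGET_RATE; // e.g. 48000 / 16000 = 3
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    // Average the source samples covered by this output sample
    // (a crude low-pass filter to reduce aliasing).
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), input.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    const sample = Math.max(-1, Math.min(1, sum / (end - start)));
    // Convert the clamped float sample to signed 16-bit PCM.
    out[i] = sample < 0 ? sample * 0x8000 : sample * 0x7fff;
  }
  return out;
}

function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```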
Voice Activity Detection (VAD): The frontend continuously monitors the audio stream for silence. The isSilent utility function calculates the Root Mean Square (RMS) energy of each audio chunk. If the RMS stays below a certain threshold for a defined period (SIL_MS), the frontend assumes the user has finished speaking and triggers a finalizeTurn event. This determines when to send the full transcript to the backend for LLM processing, giving the conversation a natural flow without requiring the user to press an "end" button.
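The silence check and timer could look roughly like the following; the names isSilent, SIL_MS, and finalizeTurn come from the description above, while the RMS threshold value and the timer logic are illustrative assumptions.

```ts
// Sketch of RMS-based silence detection. The threshold and SIL_MS values are
// assumptions and would be tuned empirically.
const SILENCE_RMS_THRESHOLD = 0.01;
const SIL_MS = 800; // how long silence must last before the turn is finalized

function isSilent(chunk: Float32Array, threshold = SILENCE_RMS_THRESHOLD): boolean {
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) sumSquares += chunk[i] * chunk[i];
  const rms = Math.sqrt(sumSquares / chunk.length);
  return rms < threshold;
}

// Inside the audio callback: track how long the user has been silent and
// finalize the turn once the silence exceeds SIL_MS.
let silentSince: number | null = null;

function onAudioChunk(chunk: Float32Array, finalizeTurn: () => void): void {
  if (isSilent(chunk)) {
    silentSince ??= Date.now();
    if (Date.now() - silentSince >= SIL_MS) {
      // A real implementation would also check that speech actually occurred
      // in this turn before finalizing.
      finalizeTurn();
      silentSince = null;
    }
  } else {
    silentSince = null; // speech resumed, reset the silence timer
  }
}
```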
Dual WebSocket Connections: The frontend uses two separate WebSockets to communicate with the Node.js backend: one for streaming audio (/audio) and one for control and response messages (/control). This is a deliberate architectural choice that keeps the two data streams clean and decoupled.
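A rough sketch of the dual-socket setup on the client side is below. The /audio and /control paths come from the architecture described here; the host URL and the idea of passing the client ID (cid) as a query parameter are assumptions.

```ts
// Sketch of opening the two sockets. Host and cid handshake are assumptions.
function openSessionSockets(cid: string, host = "wss://example.com") {
  // Binary stream: downsampled 16kHz PCM chunks go here.
  const audioWs = new WebSocket(`${host}/audio?cid=${cid}`);
  audioWs.binaryType = "arraybuffer";

  // Text stream: transcripts, LLM events, and session control messages.
  const controlWs = new WebSocket(`${host}/control?cid=${cid}`);

  controlWs.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data);
    // e.g. { type: "asr" | "opener" | "final" | "done" | "error", ... }
    console.log("control event:", msg.type);
  };

  const sendAudioChunk = (pcm: Int16Array) => {
    if (audioWs.readyState === WebSocket.OPEN) audioWs.send(pcm.buffer);
  };

  return { audioWs, controlWs, sendAudioChunk };
}
```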
UI State Management: The useCallSession hook uses useState and useRef to manage the conversation's state, including the connection status (idle, connecting, in_call, etc.), the live asr transcript, and the final opener and finalText from the AI. The orb component's dynamic styling gives the user intuitive visual feedback about the current state of the conversation.
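As a rough illustration, the state such a hook holds might be shaped as follows. The status values idle, connecting, and in_call and the asr, opener, and finalText fields come from the description above; everything else is assumed for the sketch.

```ts
import { useRef, useState } from "react";

// Sketch of the state a hook like useCallSession might hold. "ended" and
// "error" are assumed extra status values beyond those named in the write-up.
type CallStatus = "idle" | "connecting" | "in_call" | "ended" | "error";

export function useCallSession() {
  const [status, setStatus] = useState<CallStatus>("idle");
  const [asr, setAsr] = useState("");             // live transcript while the user speaks
  const [opener, setOpener] = useState("");       // quick first response from the LLM
  const [finalText, setFinalText] = useState(""); // full answer once generation finishes

  // Refs hold mutable objects (audio context, sockets) without causing re-renders.
  const audioCtxRef = useRef<AudioContext | null>(null);
  const controlWsRef = useRef<WebSocket | null>(null);

  // Microphone setup, the VAD loop, and WebSocket wiring would live here.

  return { status, asr, opener, finalText, setStatus, setAsr, setOpener, setFinalText };
}
```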
The Node.js Backend: The Real-Time Audio Gateway 🎧
The Node.js backend serves as a high-performance audio gateway and a powerful orchestrator for the ASR process. Its non-blocking, event-driven architecture makes it an excellent choice for handling multiple concurrent WebSocket connections.
Connection & Session Management: The server uses the ws library to listen for WebSocket connections on two different paths, /audio and /control. It maintains a clients map that uses a unique client ID (cid) to link the two separate connections from a single user into a single session state object.
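A condensed sketch of that session linkage is below, using the ws library's handleUpgrade API. The /audio and /control paths, the clients map, and the cid identifier come from the description above; the query-parameter handshake and the cleanup logic are assumptions.

```ts
import { WebSocketServer, WebSocket } from "ws";
import { createServer } from "http";

// One HTTP server, two WebSocket paths, and a clients map keyed by cid that
// links both connections from the same user into one session object.
interface ClientSession {
  audio?: WebSocket;      // binary PCM stream
  control?: WebSocket;    // JSON control/response messages
  audioChunks: Buffer[];  // buffered audio for the current turn
}

const clients = new Map<string, ClientSession>();
const server = createServer();
const wss = new WebSocketServer({ noServer: true });

server.on("upgrade", (req, socket, head) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const cid = url.searchParams.get("cid");
  if (!cid || !["/audio", "/control"].includes(url.pathname)) {
    socket.destroy();
    return;
  }
  wss.handleUpgrade(req, socket, head, (ws) => {
    const session = clients.get(cid) ?? { audioChunks: [] };
    if (url.pathname === "/audio") session.audio = ws;
    else session.control = ws;
    clients.set(cid, session);

    ws.on("close", () => {
      // Drop the session once both sockets are gone.
      if (session.audio?.readyState !== WebSocket.OPEN &&
          session.control?.readyState !== WebSocket.OPEN) {
        clients.delete(cid);
      }
    });
  });
});

server.listen(8080);
```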
ASR Processing Pipeline: This is where the core real-time processing happens. The server collects the 16kHz PCM chunks arriving on /audio, runs them through the Whisper ASR model, and pushes the resulting transcript to the client over /control; this pipeline drives both the live asr transcript shown in the UI and the final transcript that is handed off for LLM processing.
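The write-up doesn't spell out the pipeline's internals, so the following is only a hedged sketch of a turn-based flow: buffer PCM per client, then transcribe when the frontend finalizes the turn. transcribeWithWhisper is a hypothetical helper standing in for whatever ASR integration the project actually uses, and the real pipeline may well transcribe incrementally to drive the live transcript.

```ts
import type { WebSocket } from "ws";

// Hedged sketch of a buffer-then-transcribe ASR turn. The shape of this flow
// and the transcribeWithWhisper helper are assumptions.
interface TurnState {
  chunks: Buffer[];
  control: WebSocket;
}

function onAudioMessage(turn: TurnState, data: Buffer): void {
  turn.chunks.push(data); // accumulate 16kHz PCM for the current turn
}

async function onFinalizeTurn(
  turn: TurnState,
  transcribeWithWhisper: (pcm: Buffer) => Promise<string>, // hypothetical ASR call
): Promise<string> {
  const pcm = Buffer.concat(turn.chunks);
  turn.chunks = [];
  const transcript = await transcribeWithWhisper(pcm);
  // Push the transcript to the client over the /control socket.
  turn.control.send(JSON.stringify({ type: "asr", text: transcript }));
  return transcript;
}
```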
LLM & SSE Proxying: After generating the final transcript, the Node.js server acts as an intelligent proxy. It makes an HTTP request to the FastAPI backend's /message-stream endpoint. Since this is a Server-Sent Events (SSE) endpoint, the Node.js server receives a stream of events from the LLM pipeline (opener, final, done, error), and for each event, it forwards the data to the corresponding frontend client's /control WebSocket. This real-time proxying ensures a low-latency response for the user.
The FastAPI Backend: The LLM and RAG Brain 🧠
The FastAPI backend, written in Python, is the central "brain" of the voicebot. It's built on a fast, asynchronous framework perfect for orchestrating the LLM calls and retrieval tasks.
Multi-Stage LLM Pipeline: The ChatService orchestrates a multi-step LLM workflow. Instead of a single monolithic LLM call, it breaks the task into stages to optimize for both speed and accuracy, which is how the fast opener and the fuller final response described above are produced.
Retrieval-Augmented Generation (RAG): The system uses a Neo4j graph database as its retrieval store, so relevant knowledge can be pulled from the graph and added to the LLM's context before it generates a response.
Session State Management: The SessionService keeps the conversation history in an in-memory dictionary. This lets the LLM see previous turns, which is critical for maintaining context and coherence across multiple user queries.
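Conceptually, the store is just a map from session ID to an ordered list of turns. The real SessionService lives in the Python/FastAPI app; to stay consistent with the other sketches, the same idea is expressed here in TypeScript, with the Turn shape and function names assumed for illustration.

```ts
// Conceptual sketch of an in-memory session store (the actual service is Python).
interface Turn {
  role: "user" | "assistant";
  content: string;
}

const sessions = new Map<string, Turn[]>(); // session id -> ordered history

function appendTurn(sessionId: string, turn: Turn): void {
  const history = sessions.get(sessionId) ?? [];
  history.push(turn);
  sessions.set(sessionId, history);
}

function getHistory(sessionId: string): Turn[] {
  // The history is passed to the LLM so it can stay coherent across turns.
  return sessions.get(sessionId) ?? [];
}
```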