Step 1 — User interface layer
React Frontend
iFrame Embed
Chat Widget · React component, hooks, markdown renderer
Voice Widget · Mic button, waveform visualizer, VAD UI
iFrame Embed · postMessage bridge, origin validation
Session Store · Zustand / Redux, conversation history
Streaming Renderer · Token-by-token display, partial SSE flush
Audio Player · TTS playback, volume, pause controls
The chat widget is an isolated React component. It supports markdown rendering (react-markdown), typing indicators, file-attachment upload, infinite-scroll history, and per-message copy-to-clipboard. It communicates with the backend via WebSocket for real-time streaming responses and via REST for session initialization and history retrieval.
The voice button uses the browser MediaStream API (getUserMedia). It captures PCM audio → encodes it to Opus/WAV chunks → sends them to the STT endpoint over REST or WebSocket. A Web Audio API AnalyserNode drives the waveform visualizer. The widget handles mic permission prompts, errors (no mic detected), and browser compatibility (Chrome/Firefox/Safari), and auto-stops on silence using a client-side VAD threshold.
The bot is embedded on any third-party website via a single <script> tag that injects a floating iFrame widget. Parent ↔ iFrame communication uses window.postMessage with strict origin-whitelist validation. All auth is via short-lived JWTs in Authorization headers — no cross-origin cookies. The iFrame auto-resizes based on chat content height. Custom theming is supported via CSS variables passed in the postMessage config payload.
Global state is managed with Zustand (lightweight) or Redux Toolkit (if the team prefers). It stores: the current session ID, the conversation-turns array, the voice/text mode toggle, user preferences (language, voice speed), and WebSocket connection status. State is persisted to localStorage so users can resume conversations on page reload.
WebSocket (streaming) + REST (HTTP/HTTPS)
▼
Step 2 — API gateway & security
FastAPI Gateway
Python · Uvicorn
JWT Auth · Issue, verify, refresh tokens
Rate Limiter · Redis sliding window per IP/user
Input Sanitizer · XSS, SQL injection, prompt injection guard
Session Manager · PostgreSQL primary + Redis cache
CORS Handler · Origin whitelist, preflight responses
WebSocket Server · FastAPI + Starlette, async handlers
JWT tokens issued on bot initialization. Access tokens: 15 min TTL. Refresh tokens: 7 days. iFrame embed gets an anonymous token with read-only scope. Authenticated users get full history and personalization scope. Token payload includes: user_id, session_id, scope[], iat, exp. Refresh handled silently in the React layer.
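A minimal issuance sketch, assuming python-jose with an HS256 shared secret; the secret value and helper names are illustrative:

```python
# Token issuance/verification sketch using python-jose; SECRET would come from env config.
from datetime import datetime, timezone
from jose import jwt

SECRET = "replace-with-env-secret"   # assumption: injected via environment variable
ALGO = "HS256"

def issue_tokens(user_id: str, session_id: str, scope: list[str]) -> dict:
    now = int(datetime.now(timezone.utc).timestamp())
    base = {"user_id": user_id, "session_id": session_id, "iat": now}
    access = jwt.encode({**base, "scope": scope, "exp": now + 15 * 60},
                        SECRET, algorithm=ALGO)
    refresh = jwt.encode({**base, "scope": ["refresh"], "exp": now + 7 * 24 * 3600},
                         SECRET, algorithm=ALGO)
    return {"access_token": access, "refresh_token": refresh}

def verify_token(token: str) -> dict:
    # Raises jose.JWTError on a tampered or expired token
    return jwt.decode(token, SECRET, algorithms=[ALGO])
```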
Sliding window rate limiting per IP and per user_id stored in Redis. Default limits: 20 requests/minute for anonymous, 60/minute for authenticated. Separate limits for voice uploads (larger payloads). Returns 429 with Retry-After header. Configurable per deployment via environment variables.
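A sliding-window sketch over a Redis sorted set, assuming redis-py; key naming and defaults are illustrative:

```python
# One sorted-set member per request, scored by arrival time; count survivors in the window.
import time
import uuid
import redis

r = redis.Redis()

def allow_request(identity: str, limit: int = 20, window_s: int = 60) -> bool:
    key = f"ratelimit:{identity}"                    # identity = IP or user_id
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)    # evict entries older than the window
    pipe.zadd(key, {uuid.uuid4().hex: now})          # record this request
    pipe.zcard(key)                                  # count requests still in the window
    pipe.expire(key, window_s)                       # let idle keys die on their own
    _, _, count, _ = pipe.execute()
    return count <= limit
```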
Each conversation has a unique session_id (UUID). Session object stores: user_id, last_active timestamp, conversation history (last 20 turns), language preference, input mode (voice/text), and any collected user context (name, intent). PostgreSQL is the source of truth for conversation/session persistence. Redis is used only as a short-lived performance cache (TTL: 30 min idle) to speed up active sessions.
FastAPI WebSocket handler accepts connections at /ws/{session_id}. Authenticates the JWT on connection upgrade. Each message from client triggers the full pipeline (input → query → RAG → LLM) and streams the LLM response tokens back as JSON chunks: {"type":"token","content":"Hello"}. Heartbeat ping/pong every 30s to keep connection alive through proxies.
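A handler sketch, assuming FastAPI; verify_jwt and run_pipeline are stubs standing in for the auth and pipeline logic described above:

```python
# WebSocket chat endpoint sketch; stubs mark where real auth and pipeline code plug in.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def verify_jwt(token: str | None) -> dict:
    if token is None:
        raise ValueError("missing token")   # stub: real code decodes and validates the JWT
    return {"user_id": "anon"}

async def run_pipeline(session_id: str, text: str):
    yield "Hello"                           # stub: real code streams LLM tokens

@app.websocket("/ws/{session_id}")
async def chat_ws(ws: WebSocket, session_id: str):
    claims = verify_jwt(ws.query_params.get("token"))   # assumption: JWT sent on upgrade
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()
            async for chunk in run_pipeline(session_id, msg["text"]):
                await ws.send_json({"type": "token", "content": chunk})
            await ws.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```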
Dispatched to input processing
▼
Step 3 — Input processing (voice + text pipelines)
Voice Pipeline
STT Engine · OpenAI Whisper / Deepgram
Audio Preprocessor · Noise reduction, VAD trim
Language Detector · Auto-detect locale from audio
Punctuation Restorer · Post-STT transcript cleanup
Audio chunks (16kHz WAV or Opus) sent to OpenAI Whisper API (/audio/transcriptions) or self-hosted Whisper (whisper.cpp or faster-whisper). Deepgram is a lower-latency alternative for real-time streaming STT (Nova-2 model). Output: raw transcript string fed into the unified text pipeline.
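A transcription sketch against the hosted endpoint, assuming the openai Python SDK (v1+) with OPENAI_API_KEY set:

```python
# STT call sketch; whisper-1 is the hosted model behind /audio/transcriptions.
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```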
Server-side: trim silence from audio clip before STT (ffmpeg or pydub). Client-side: Web Audio API energy threshold to detect speech end. Noise reduction via noisereduce Python library for low-quality mic inputs. Resamples any non-16kHz input to 16kHz before passing to Whisper.
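A preprocessing sketch with pydub (which requires ffmpeg); the silence parameters are illustrative and need tuning per mic quality:

```python
# Resample to 16 kHz mono and trim leading/trailing silence before STT.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def prepare_for_stt(path: str) -> AudioSegment:
    audio = AudioSegment.from_file(path).set_frame_rate(16000).set_channels(1)
    spans = detect_nonsilent(audio, min_silence_len=300,
                             silence_thresh=audio.dBFS - 16)  # threshold relative to clip loudness
    if spans:
        audio = audio[spans[0][0]:spans[-1][1]]   # keep first to last non-silent span
    return audio
```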
Text Pipeline
Intent Classifier · FAQ / task / chitchat / escalate
Entity Extractor · Names, dates, order IDs, etc.
Language Detect · langdetect / OpenAI multi-lingual
Profanity Filter · Pre-LLM guard, configurable
A lightweight intent classifier (fine-tuned DistilBERT or zero-shot with OpenAI) categorizes each message before it hits the full LLM pipeline. Categories: FAQ (retrieve only), Task (tool use needed), Chitchat (direct LLM, no RAG), Escalate (hand off to a human agent). This saves tokens and latency for simple intents.
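A zero-shot routing sketch, assuming the openai SDK; the label set mirrors the four categories above:

```python
# Cheap classification call before the full pipeline; falls back to retrieval on parse failure.
from openai import OpenAI

client = OpenAI()
INTENTS = {"faq", "task", "chitchat", "escalate"}

def classify_intent(message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user message as exactly one of: faq, task, chitchat, "
                        "escalate. Reply with the label only."},
            {"role": "user", "content": message},
        ],
    )
    label = (resp.choices[0].message.content or "").strip().lower()
    return label if label in INTENTS else "faq"   # assumption: default to retrieval
```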
Unified text input → agentic query processor
▼
Step 4 — Agentic query processor
Query Processing Engine
Agentic loop
Query Rewriter · Disambiguate using conversation history
Context Injector · Select + append relevant history turns
Needs More Info? · Clarification agent — ask before searching
Query Classifier · Simple / complex / agentic / tool-use
Memory Manager · Short-term Redis + long-term vector
Source Router · Decide: RAG / direct LLM / tool call
A small prompt sends the last 3 conversation turns + raw user query to the LLM and asks it to produce a clean, self-contained retrieval query. Example: user says "what about the refund policy?" after discussing order cancellation → rewritten to "What is the refund policy for cancelled orders?" This dramatically improves vector DB retrieval accuracy.
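A rewrite sketch, assuming the openai SDK; the prompt wording is illustrative:

```python
# Turn a context-dependent message into a self-contained retrieval query.
from openai import OpenAI

client = OpenAI()

def rewrite_query(history: list[str], user_query: str) -> str:
    recent = "\n".join(history[-3:])   # last 3 turns, per the design above
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Rewrite the final user message as a single self-contained search "
                        "query. Resolve pronouns and references from the conversation. "
                        "Output the query only."},
            {"role": "user", "content": f"Conversation:\n{recent}\n\nFinal message: {user_query}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```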
Builds the full context window by: (1) taking last N turns from Redis session, (2) running embedding similarity search on the query against long-term memory, (3) appending any collected user profile data (name, preferences), (4) appending tool results from prior steps. Context window budget managed by token counter — trims oldest turns first when near limit.
Before hitting the RAG pipeline, an agent checks: "Do I have enough information to answer this accurately?" If the query is ambiguous (e.g. "check my order" with no order ID), the bot asks a clarifying question instead of hallucinating an answer. Clarification prompts are pre-defined templates to stay on-brand. Max 1 clarification per turn to avoid frustrating the user.
Short-term memory: last 20 turns in Redis (fast, TTL-based). Long-term memory: embeddings of summarized past conversations stored in the vector DB under a user namespace. On session start, retrieve top-3 relevant past sessions and inject a brief summary into system context. Enables "you mentioned last time that…" personalization without bloating every prompt.
Retrieval query dispatched to RAG + data sources
▼
Step 5 — RAG pipeline + data sources
RAG Pipeline
Doc Ingestion · Offline batch — crawler + uploader
Text Chunker · Semantic / sliding window split
Embedder · OpenAI text-embedding-3-small/large
Vector DB · Pinecone / ChromaDB / Weaviate
Semantic Retriever · Top-K cosine similarity search
Re-ranker · Cross-encoder rerank top results
Ingestion pipeline runs offline (scheduled nightly or triggered on content update). Steps: (1) Website crawler (Scrapy / sitemap parse) + document uploader (PDF, DOCX, MD). (2) HTML stripping, deduplication, metadata extraction (title, URL, last-modified). (3) Chunking → embedding → upsert to vector DB. Celery + Redis for async job queue. Supports incremental updates (only re-embed changed documents).
Chunking strategy: 512-token chunks with 50-token overlap (sliding window) for most content. Semantic chunking (split on sentence boundaries / headings) for structured documents. Each chunk stores metadata: source URL, document title, section heading, chunk index, and timestamp. LangChain's RecursiveCharacterTextSplitter is recommended for balanced splits.
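A chunking sketch, assuming the langchain-text-splitters package; note that chunk_size counts characters by default, so hitting an exact token budget needs a token-based length function:

```python
# Sliding-window split with per-chunk metadata, per the scheme above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,   # assumption: character proxy; swap in a tokenizer for exact budgets
)

def chunk_document(text: str, source_url: str, title: str) -> list[dict]:
    return [
        {"text": chunk, "source": source_url, "title": title, "chunk_index": i}
        for i, chunk in enumerate(splitter.split_text(text))
    ]
```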
OpenAI text-embedding-3-small (1536-dim, cost-efficient) for most deployments. Upgrade to text-embedding-3-large (3072-dim) for higher accuracy on technical content. Custom LLM path: sentence-transformers all-MiniLM-L6-v2 (384-dim) for fully local, zero-cost embedding. Batch embed during ingestion; single embed at query time.
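A batch-embedding sketch, assuming the openai SDK:

```python
# One API call embeds a whole batch; used per-document at ingestion, per-query at runtime.
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]
```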
Pinecone (managed, serverless) recommended for production — zero ops overhead. ChromaDB for local dev / self-hosted. Weaviate if you need hybrid (BM25 + vector) search. Each record: {id, vector, metadata: {text, source, title, timestamp}}. Namespaces enable multi-tenant isolation (one namespace per client/website). Index dimension must match embedding model output.
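An upsert/query sketch, assuming the Pinecone SDK (v3+); the index and namespace names are illustrative:

```python
# Vector DB round trip; index dimension must match the embedder (1536 for 3-small).
from pinecone import Pinecone

pc = Pinecone(api_key="...")     # assumption: key loaded from env in practice
index = pc.Index("chatbot-kb")

def upsert_chunks(chunks: list[dict], vectors: list[list[float]], namespace: str) -> None:
    index.upsert(
        vectors=[{
            "id": f"{c['source']}#{c['chunk_index']}",
            "values": vec,
            "metadata": {"text": c["text"], "source": c["source"], "title": c["title"]},
        } for c, vec in zip(chunks, vectors)],
        namespace=namespace,     # one namespace per client/website for tenant isolation
    )

def retrieve(query_vector: list[float], namespace: str, k: int = 15) -> list[dict]:
    res = index.query(vector=query_vector, top_k=k, include_metadata=True, namespace=namespace)
    return [{**m.metadata, "score": m.score} for m in res.matches]
```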
After top-K retrieval (K=15), a cross-encoder re-ranks to select top-5 most relevant chunks. Use Cohere Rerank API (hosted) or cross-encoder/ms-marco-MiniLM-L-6-v2 (local). Re-ranking significantly improves answer quality — the best chunk is not always the closest cosine match. Re-ranked chunks are passed as context to the LLM prompt.
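A re-rank sketch, assuming sentence-transformers and the local model named above:

```python
# Score (query, chunk) pairs with a cross-encoder and keep the top 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```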
Data Sources
Website Content · Crawler + sitemap + webhooks
Knowledge Base · Markdown / PDF / DOCX / Notion
External APIs · Tool-use via function calling
SQL / NoSQL DB · Product, order, user data
File Store · S3 / GCS — PDFs, manuals
Live Scrapers · On-demand fetch for dynamic data
Website crawler (Scrapy or Playwright for JS-rendered pages) runs on schedule. Sitemap.xml used as seed. Pages are deduped by canonical URL. Content change detection via content hash — only re-embed pages that changed since last run. Webhook support: CMS (Contentful, WordPress) can POST to /ingest endpoint to trigger on-demand re-indexing of updated pages.
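A change-detection sketch using a plain SHA-256 content hash; how per-page hashes are stored is elided:

```python
# Re-embed a page only when its extracted text actually changed since the last crawl.
import hashlib

def content_hash(page_text: str) -> str:
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def needs_reindex(page_text: str, stored_hash: str | None) -> bool:
    return stored_hash != content_hash(page_text)
```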
Tool-use pattern via OpenAI function calling. Tools defined as JSON schemas (name, description, parameters). Examples: get_order_status(order_id), get_product_price(sku), check_availability(product_id, location). Python backend executes the tool call, returns structured result, which is injected back into the LLM context for a final natural-language response. Tools are scoped per deployment config.
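A tool-use sketch, assuming the openai SDK; get_order_status here is a stand-in for a real database or API lookup:

```python
# Function-calling round trip: model picks a tool, backend executes, model phrases the answer.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}   # placeholder for a real lookup

def answer_with_tools(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if msg.tool_calls:
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_order_status(**args)            # real code dispatches by call name
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```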
Retrieved context chunks + rewritten query → LLM engine
▼
Step 6 — LLM engine
LLM Engine
OpenAI GPT-4o / Custom LLM
Prompt Builder · System prompt + context + query assembly
Token Counter · Context overflow guard + trim strategy
LLM Caller · OpenAI / vLLM / Ollama via LiteLLM
Streaming Handler · SSE / WebSocket chunk forwarding
Guardrails · Hallucination check + safety filter
Relevance Judge · Is answer on-topic? Loop or respond?
Prompt template (versioned in YAML config): [System persona + bot rules] + [Top-5 retrieved context chunks with source labels] + [Last N conversation turns] + [Rewritten user query]. System prompt defines: bot name/persona, response language, response format (concise/verbose), citation style, fallback instruction ("if unsure, say so — don't guess"), and tool descriptions. Prompt versioned and A/B testable.
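An assembly sketch mirroring that template; the YAML config loading is elided and the source-label format is illustrative:

```python
# Build the messages array: system persona + labeled context, then history, then the query.
def build_messages(system_prompt: str, chunks: list[dict],
                   history: list[dict], user_query: str) -> list[dict]:
    context = "\n\n".join(
        f"[Source: {c['title']} ({c['source']})]\n{c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": f"{system_prompt}\n\nContext:\n{context}"},
        *history,                                  # last N turns as role/content dicts
        {"role": "user", "content": user_query},   # rewritten query from Step 4
    ]
```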
tiktoken (for OpenAI models) counts tokens before each call. Budget allocation: 512 for system prompt, 1500 for context chunks, 800 for conversation history, remainder for response. If over budget: first trim oldest history turns, then trim lowest-ranked context chunks. Never trim system prompt or user query.
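A budget-trim sketch, assuming tiktoken; the numbers mirror the allocation above:

```python
# Count tokens per part and trim in priority order: oldest history, then weakest chunks.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")   # assumption: recent tiktoken knows this model

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_to_budget(history: list[str], chunks: list[str],
                   history_budget: int = 800, chunk_budget: int = 1500):
    while history and sum(n_tokens(t) for t in history) > history_budget:
        history.pop(0)    # drop oldest turns first
    while chunks and sum(n_tokens(c) for c in chunks) > chunk_budget:
        chunks.pop()      # then drop lowest-ranked context chunks
    return history, chunks
```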
LiteLLM wrapper abstracts the LLM provider — swap OpenAI for any model without changing business logic. Primary: OpenAI GPT-4o (best quality). Cost fallback: GPT-4o-mini or GPT-3.5-turbo. Custom LLM: vLLM server (GPU) with fine-tuned Mistral-7B or Llama-3. Fully local: Ollama with llama3 or phi-3. Model routing configurable per intent type (e.g. FAQ → cheap model, complex task → GPT-4o).
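A routing sketch, assuming litellm; the intent-to-model map is illustrative:

```python
# One call signature for every provider; swap models without touching business logic.
from litellm import completion

MODEL_BY_INTENT = {
    "faq": "gpt-4o-mini",
    "chitchat": "gpt-4o-mini",
    "task": "gpt-4o",
    "escalate": "gpt-4o",   # assumption: "ollama/llama3" would route fully local instead
}

def call_llm(intent: str, messages: list[dict], stream: bool = True):
    model = MODEL_BY_INTENT.get(intent, "gpt-4o")
    return completion(model=model, messages=messages, stream=stream)
```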
Post-generation checks: (1) Citation check — does the answer use facts from the provided context, or is it hallucinating? (2) Scope check — is the response about the configured topic domain? (3) Safety filter — does it contain harmful, offensive, or off-brand content? (4) Confidence check — if LLM expresses uncertainty, trigger clarification or human escalation. Guardrails implemented via a second LLM call with a classification prompt, or rule-based regex for cost efficiency.
After guardrails: a lightweight relevance judge evaluates "Does this response actually answer the user's query?" using a binary classification prompt. If NO → loops back to the Query Processor with a "needs_more_context" flag, which triggers a deeper RAG retrieval or a different source. Max 2 retry loops per turn to prevent infinite loops. After 2 failures → fallback response + human escalation offer.
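A judge-plus-retry sketch, assuming the openai SDK; generate stands in for the full RAG + LLM call:

```python
# Binary relevance check with the bounded retry loop described above.
from openai import OpenAI

client = OpenAI()

def answers_the_query(query: str, response: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Does the response actually answer the user's query? Reply YES or NO only."},
            {"role": "user", "content": f"Query: {query}\n\nResponse: {response}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def answer_with_retries(query: str, generate) -> str:
    answer = generate(query, deep=False)       # placeholder: RAG + LLM pipeline call
    for _ in range(2):                         # max 2 retry loops per turn
        if answers_the_query(query, answer):
            return answer
        answer = generate(query, deep=True)    # needs_more_context: deeper retrieval
    if answers_the_query(query, answer):
        return answer
    return "I'm not sure I can answer that. Would you like me to connect you with a team member?"
```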
↩ If relevance check fails → loops back to Step 4 with "needs_more_context" flag (max 2 retries per turn)
Validated response dispatched to output pipeline
▼
Step 7 — Response & output pipeline
Output Pipeline
Response Formatter · Markdown → HTML or plain text
TTS Engine · ElevenLabs (primary voice provider)
Audio Encoder · MP3 / Opus streaming chunks
Source Citations · Append source doc references
Stream Chunker · Flush tokens live via WebSocket
Fallback Handler · Escalate to human agent trigger
Text responses: rendered as Markdown in the chat widget (code blocks, bold, lists, links). Voice responses: stripped of Markdown before TTS (no reading "asterisk asterisk" aloud). Response includes metadata: source citations array, confidence score, session_id, turn_id. Citation format: each claim linked to its source chunk URL for transparency.
Voice mode: text is NOT sent to TTS as a full block — instead split into sentences (~50–80 words each) and TTS-encoded sentence-by-sentence to minimize first-audio latency. ElevenLabs (Turbo v2.5 or selected ElevenLabs voice model) is used as the production voice provider. Audio is streamed back as base64-encoded MP3 chunks via WebSocket, and the React audio player queues and plays chunks in order.
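A framing sketch of the sentence-by-sentence flow; synthesize stands in for the ElevenLabs call, and the splitter is deliberately naive:

```python
# Encode and flush audio one sentence at a time to minimize first-audio latency.
import base64
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def stream_tts(ws, text: str, synthesize) -> None:
    for sentence in split_sentences(text):
        mp3_bytes = synthesize(sentence)   # placeholder: ElevenLabs Turbo v2.5 request
        await ws.send_json({
            "type": "audio",
            "content": base64.b64encode(mp3_bytes).decode("ascii"),
        })
```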
LLM response is streamed token-by-token from the OpenAI API (stream=True) and forwarded immediately via WebSocket to the React client. Client renders tokens as they arrive — user sees text appearing in real-time, not waiting for full response. Each WebSocket message: {"type":"token","content":"word"}. Final message: {"type":"done","citations":[...],"turn_id":"..."}
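A forwarding sketch, assuming the openai SDK; forward_fn stands in for the WebSocket send:

```python
# Relay each delta as it arrives, then emit the terminal "done" frame.
from openai import OpenAI

client = OpenAI()

def stream_answer(messages: list[dict], forward_fn) -> str:
    parts = []
    stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)
    for event in stream:
        if not event.choices:
            continue
        delta = event.choices[0].delta.content
        if delta:
            parts.append(delta)
            forward_fn({"type": "token", "content": delta})
    forward_fn({"type": "done"})   # real payload also carries citations and turn_id
    return "".join(parts)
```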
Escalation triggers: (1) LLM confidence too low after 2 retries, (2) user explicitly asks for a human, (3) sensitive topics detected (legal, medical, billing disputes), (4) consecutive negative feedback signals. Escalation: sends handoff payload (full conversation history + user contact info) to a human agent queue (Zendesk / Intercom / custom queue). Bot informs user: "I'm connecting you with a team member."
Delivered to React client via WebSocket stream
▼
Step 8 — Monitoring, logging & analytics
Observability Stack
Conversation Logger · Full turn history → PostgreSQL
Latency Tracker · Per-step P50/P95 timing
Error Monitor · Sentry / Datadog alerts
Token Usage · Cost per session / per day
Feedback Capture · 👍 👎 ratings + free text
Analytics Dashboard · Common queries, drop-off, coverage
Every conversation turn is logged with: session_id, user_id, timestamp, raw input, rewritten query, retrieved chunk IDs, LLM model used, response text, latency_ms (per pipeline step), token counts (prompt + completion), guardrail results, and feedback score. Stored in PostgreSQL. Used for: debugging, prompt tuning, KB gap analysis, and compliance.
Token usage is tracked per session and aggregated per day and per client. LLM cost = (prompt_tokens × input_price) + (completion_tokens × output_price). STT cost = audio_minutes × Whisper rate. TTS cost = characters synthesized × the configured ElevenLabs per-character rate. The dashboard shows cost trends and alerts on anomalies (e.g. runaway loops generating 10× normal token volume).
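A cost-accounting sketch; every rate below is an illustrative placeholder, not a current price:

```python
# Per-turn cost roll-up for the three metered services.
PRICES_PER_TOKEN = {"gpt-4o": {"in": 2.50 / 1_000_000, "out": 10.00 / 1_000_000}}  # assumption

def llm_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES_PER_TOKEN[model]
    return prompt_tokens * p["in"] + completion_tokens * p["out"]

def stt_cost(audio_minutes: float, whisper_rate_per_min: float) -> float:
    return audio_minutes * whisper_rate_per_min

def tts_cost(characters: int, elevenlabs_rate_per_char: float) -> float:
    return characters * elevenlabs_rate_per_char
```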
Thumbs up/down widget shown after each bot response. Negative feedback triggers: immediate flag in the logs, optional free-text "what went wrong?" prompt, and adds the turn to a review queue for KB improvement. Positive feedback is used to identify high-confidence responses that can be cached or used as fine-tuning examples. Feedback data feeds the weekly bot quality review.
Complete Tech Stack Reference
Frontend: React 18, Zustand, react-markdown, WebSocket API, Web Audio API, MediaStream API, Vite
Backend: Python 3.11+, FastAPI, Uvicorn, Celery, PostgreSQL (primary DB), Redis (cache), SQLAlchemy
STT (Speech-to-Text): OpenAI Whisper API, faster-whisper (self-hosted), Deepgram Nova-2 (streaming)
TTS (Text-to-Speech): ElevenLabs Turbo v2.5 (primary)
LLM: OpenAI GPT-4o / GPT-4o-mini, LiteLLM wrapper, vLLM (custom), Ollama (local)
RAG Framework: LangChain / LlamaIndex, text-embedding-3-small, Pinecone / ChromaDB
Re-ranking: Cohere Rerank API, cross-encoder/ms-marco-MiniLM (local)
iFrame / Embed: Vanilla JS loader script, postMessage bridge, dynamic iFrame injection
Auth / Security: JWT (python-jose), bcrypt, Redis rate limiter, CORS, input sanitization
Ingestion Queue: Celery + Redis broker, Scrapy / Playwright crawler, Pandas for preprocessing
Monitoring: Sentry, Prometheus + Grafana, custom analytics DB, LangSmith (LLM tracing)
Deploy: Docker, Nginx reverse proxy, Kubernetes (prod), GitHub Actions CI/CD