Step 1 — User interface layer
React Frontend
iFrame Embed
Chat Widget · React component, hooks, markdown renderer
Voice Widget · Mic button, waveform visualizer, VAD UI
iFrame Embed · postMessage bridge, origin validation
Session Store · Zustand / Redux, conversation history
Streaming Renderer · Token-by-token display, partial SSE flush
Audio Player · TTS playback, volume, pause controls
The chat widget is an isolated React component. It supports markdown rendering (react-markdown), typing indicators, file-attachment upload, infinite-scroll history, and per-message copy-to-clipboard. It communicates with the backend via WebSocket for real-time streaming responses and via REST for session initialization and history retrieval.
The voice button uses the browser MediaStream API (getUserMedia). It captures PCM audio → encodes it to Opus/WAV chunks → sends them to the STT endpoint over REST or WebSocket. A Web Audio API AnalyserNode drives the waveform visualizer. The widget handles mic permission prompts, errors (no mic detected), and browser compatibility (Chrome/Firefox/Safari), and auto-stops on silence using a client-side VAD threshold.
The bot is embedded on any third-party website via a single <script> tag that injects a floating iFrame widget. Parent ↔ iFrame communication uses window.postMessage with strict origin-whitelist validation. All auth is via short-lived JWTs in Authorization headers — no cross-origin cookies. The iFrame auto-resizes based on chat content height. Custom theming is supported via CSS variables passed in the postMessage config payload.
Global state is managed with Zustand (lightweight) or Redux Toolkit (if the team prefers). It stores: the current session ID, the conversation-turns array, the voice/text mode toggle, user preferences (language, voice speed), and WebSocket connection status. State is persisted to localStorage so users can resume conversations on page reload.
WebSocket (streaming) + REST (HTTP/HTTPS)
▼
Step 2 — API gateway & security
FastAPI Gateway
Python · Uvicorn
JWT Auth · Issue, verify, refresh tokens
Rate Limiter · Redis sliding window per IP/user
Input Sanitizer · XSS, SQL injection, prompt injection guard
Session Manager · PostgreSQL primary + Redis cache
CORS Handler · Origin whitelist, preflight responses
WebSocket Server · FastAPI + Starlette, async handlers
JWT tokens issued on bot initialization. Access tokens: 15 min TTL. Refresh tokens: 7 days. iFrame embed gets an anonymous token with read-only scope. Authenticated users get full history and personalization scope. Token payload includes: user_id, session_id, scope[], iat, exp. Refresh handled silently in the React layer.
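A minimal issuance sketch, assuming python-jose with an HS256 shared secret; the secret value and helper names are illustrative:

```python
# Token issuance/verification sketch using python-jose; SECRET would come from env config.
from datetime import datetime, timezone
from jose import jwt

SECRET = "replace-with-env-secret"   # assumption: injected via environment variable
ALGO = "HS256"

def issue_tokens(user_id: str, session_id: str, scope: list[str]) -> dict:
    now = int(datetime.now(timezone.utc).timestamp())
    base = {"user_id": user_id, "session_id": session_id, "iat": now}
    access = jwt.encode({**base, "scope": scope, "exp": now + 15 * 60},
                        SECRET, algorithm=ALGO)
    refresh = jwt.encode({**base, "scope": ["refresh"], "exp": now + 7 * 24 * 3600},
                         SECRET, algorithm=ALGO)
    return {"access_token": access, "refresh_token": refresh}

def verify_token(token: str) -> dict:
    # Raises jose.JWTError on a tampered or expired token
    return jwt.decode(token, SECRET, algorithms=[ALGO])
```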
Sliding window rate limiting per IP and per user_id stored in Redis. Default limits: 20 requests/minute for anonymous, 60/minute for authenticated. Separate limits for voice uploads (larger payloads). Returns 429 with Retry-After header. Configurable per deployment via environment variables.
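A sliding-window sketch over a Redis sorted set, assuming redis-py; key naming and defaults are illustrative:

```python
# One sorted-set member per request, scored by arrival time; count survivors in the window.
import time
import uuid
import redis

r = redis.Redis()

def allow_request(identity: str, limit: int = 20, window_s: int = 60) -> bool:
    key = f"ratelimit:{identity}"                    # identity = IP or user_id
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)    # evict entries older than the window
    pipe.zadd(key, {uuid.uuid4().hex: now})          # record this request
    pipe.zcard(key)                                  # count requests still in the window
    pipe.expire(key, window_s)                       # let idle keys die on their own
    _, _, count, _ = pipe.execute()
    return count <= limit
```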
Each conversation has a unique session_id (UUID). Session object stores: user_id, last_active timestamp, conversation history (last 20 turns), language preference, input mode (voice/text), and any collected user context (name, intent). PostgreSQL is the source of truth for conversation/session persistence. Redis is used only as a short-lived performance cache (TTL: 30 min idle) to speed up active sessions.
FastAPI WebSocket handler accepts connections at /ws/{session_id}. Authenticates the JWT on connection upgrade. Each message from client triggers the full pipeline (input → query → RAG → LLM) and streams the LLM response tokens back as JSON chunks: {"type":"token","content":"Hello"}. Heartbeat ping/pong every 30s to keep connection alive through proxies.
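A handler sketch, assuming FastAPI; verify_jwt and run_pipeline are stubs standing in for the auth and pipeline logic described above:

```python
# WebSocket chat endpoint sketch; stubs mark where real auth and pipeline code plug in.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def verify_jwt(token: str | None) -> dict:
    if token is None:
        raise ValueError("missing token")   # stub: real code decodes and validates the JWT
    return {"user_id": "anon"}

async def run_pipeline(session_id: str, text: str):
    yield "Hello"                           # stub: real code streams LLM tokens

@app.websocket("/ws/{session_id}")
async def chat_ws(ws: WebSocket, session_id: str):
    claims = verify_jwt(ws.query_params.get("token"))   # assumption: JWT sent on upgrade
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()
            async for chunk in run_pipeline(session_id, msg["text"]):
                await ws.send_json({"type": "token", "content": chunk})
            await ws.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```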
Dispatched to input processing
▼
Step 3 — Input processing (voice + text pipelines)
Voice Pipeline
STT Engine · OpenAI Whisper / Deepgram
Audio Preprocessor · Noise reduction, VAD trim
Language Detector · Auto-detect locale from audio
Punctuation Restorer · Post-STT transcript cleanup
Audio chunks (16kHz WAV or Opus) sent to OpenAI Whisper API (/audio/transcriptions) or self-hosted Whisper (whisper.cpp or faster-whisper). Deepgram is a lower-latency alternative for real-time streaming STT (Nova-2 model). Output: raw transcript string fed into the unified text pipeline.
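A transcription sketch against the hosted endpoint, assuming the openai Python SDK (v1+) with OPENAI_API_KEY set:

```python
# STT call sketch; whisper-1 is the hosted model behind /audio/transcriptions.
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```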
Server-side: trim silence from audio clip before STT (ffmpeg or pydub). Client-side: Web Audio API energy threshold to detect speech end. Noise reduction via noisereduce Python library for low-quality mic inputs. Resamples any non-16kHz input to 16kHz before passing to Whisper.
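A preprocessing sketch with pydub (which requires ffmpeg); the silence parameters are illustrative and need tuning per mic quality:

```python
# Resample to 16 kHz mono and trim leading/trailing silence before STT.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def prepare_for_stt(path: str) -> AudioSegment:
    audio = AudioSegment.from_file(path).set_frame_rate(16000).set_channels(1)
    spans = detect_nonsilent(audio, min_silence_len=300,
                             silence_thresh=audio.dBFS - 16)  # threshold relative to clip loudness
    if spans:
        audio = audio[spans[0][0]:spans[-1][1]]   # keep first to last non-silent span
    return audio
```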
Text Pipeline
Intent Classifier · FAQ / task / chitchat / escalate
Entity Extractor · Names, dates, order IDs, etc.
Language Detect · langdetect / OpenAI multi-lingual
Profanity Filter · Pre-LLM guard, configurable
A lightweight intent classifier (fine-tuned DistilBERT or zero-shot with OpenAI) categorizes each message before it hits the full LLM pipeline. Categories: FAQ (retrieve only), Task (tool use needed), Chitchat (direct LLM, no RAG), Escalate (hand off to a human agent). This saves tokens and latency for simple intents.
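A zero-shot routing sketch, assuming the openai SDK; the label set mirrors the four categories above:

```python
# Cheap classification call before the full pipeline; falls back to retrieval on parse failure.
from openai import OpenAI

client = OpenAI()
INTENTS = {"faq", "task", "chitchat", "escalate"}

def classify_intent(message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user message as exactly one of: faq, task, chitchat, "
                        "escalate. Reply with the label only."},
            {"role": "user", "content": message},
        ],
    )
    label = (resp.choices[0].message.content or "").strip().lower()
    return label if label in INTENTS else "faq"   # assumption: default to retrieval
```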
Unified text input → agentic query processor
▼
Step 4 — Agentic query processor
Query Processing Engine
Agentic loop
Query Rewriter · Disambiguate using conversation history
Context Injector · Select + append relevant history turns
Needs More Info? · Clarification agent — ask before searching
Query Classifier · Simple / complex / agentic / tool-use
Memory Manager · Short-term Redis + long-term vector
Source Router · Decide: RAG / direct LLM / tool call
A small prompt sends the last 3 conversation turns + raw user query to the LLM and asks it to produce a clean, self-contained retrieval query. Example: user says "what about the refund policy?" after discussing order cancellation → rewritten to "What is the refund policy for cancelled orders?" This dramatically improves vector DB retrieval accuracy.
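A rewrite sketch, assuming the openai SDK; the prompt wording is illustrative:

```python
# Turn a context-dependent message into a self-contained retrieval query.
from openai import OpenAI

client = OpenAI()

def rewrite_query(history: list[str], user_query: str) -> str:
    recent = "\n".join(history[-3:])   # last 3 turns, per the design above
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Rewrite the final user message as a single self-contained search "
                        "query. Resolve pronouns and references from the conversation. "
                        "Output the query only."},
            {"role": "user", "content": f"Conversation:\n{recent}\n\nFinal message: {user_query}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```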
Builds the full context window by: (1) taking last N turns from Redis session, (2) running embedding similarity search on the query against long-term memory, (3) appending any collected user profile data (name, preferences), (4) appending tool results from prior steps. Context window budget managed by token counter — trims oldest turns first when near limit.
Before hitting the RAG pipeline, an agent checks: "Do I have enough information to answer this accurately?" If the query is ambiguous (e.g. "check my order" with no order ID), the bot asks a clarifying question instead of hallucinating an answer. Clarification prompts are pre-defined templates to stay on-brand. Max 1 clarification per turn to avoid frustrating the user.
Short-term memory: last 20 turns in Redis (fast, TTL-based). Long-term memory: embeddings of summarized past conversations stored in the vector DB under a user namespace. On session start, retrieve top-3 relevant past sessions and inject a brief summary into system context. Enables "you mentioned last time that…" personalization without bloating every prompt.
Retrieval query dispatched to RAG + data sources
▼
Step 5 — RAG pipeline + data sources
RAG Pipeline
Doc Ingestion · Offline batch — crawler + uploader
Text Chunker · Semantic / sliding window split
Embedder · OpenAI text-embedding-3-small/large
Vector DB · Pinecone / ChromaDB / Weaviate
Semantic Retriever · Top-K cosine similarity search
Re-ranker · Cross-encoder rerank top results
Ingestion pipeline runs offline (scheduled nightly or triggered on content update). Steps: (1) Website crawler (Scrapy / sitemap parse) + document uploader (PDF, DOCX, MD). (2) HTML stripping, deduplication, metadata extraction (title, URL, last-modified). (3) Chunking → embedding → upsert to vector DB. Celery + Redis for async job queue. Supports incremental updates (only re-embed changed documents).
Chunking strategy: 512-token chunks with 50-token overlap (sliding window) for most content. Semantic chunking (split on sentence boundaries / headings) for structured documents. Each chunk stores metadata: source URL, document title, section heading, chunk index, and timestamp. LangChain's RecursiveCharacterTextSplitter is recommended for balanced splits.
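A chunking sketch, assuming the langchain-text-splitters package; note that chunk_size counts characters by default, so hitting an exact token budget needs a token-based length function:

```python
# Sliding-window split with per-chunk metadata, per the scheme above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,   # assumption: character proxy; swap in a tokenizer for exact budgets
)

def chunk_document(text: str, source_url: str, title: str) -> list[dict]:
    return [
        {"text": chunk, "source": source_url, "title": title, "chunk_index": i}
        for i, chunk in enumerate(splitter.split_text(text))
    ]
```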
OpenAI text-embedding-3-small (1536-dim, cost-efficient) for most deployments. Upgrade to text-embedding-3-large (3072-dim) for higher accuracy on technical content. Custom LLM path: sentence-transformers all-MiniLM-L6-v2 (384-dim) for fully local, zero-cost embedding. Batch embed during ingestion; single embed at query time.
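A batch-embedding sketch, assuming the openai SDK:

```python
# One API call embeds a whole batch; used per-document at ingestion, per-query at runtime.
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]
```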
Pinecone (managed, serverless) recommended for production — zero ops overhead. ChromaDB for local dev / self-hosted. Weaviate if you need hybrid (BM25 + vector) search. Each record: {id, vector, metadata: {text, source, title, timestamp}}. Namespaces enable multi-tenant isolation (one namespace per client/website). Index dimension must match embedding model output.
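An upsert/query sketch, assuming the Pinecone SDK (v3+); the index and namespace names are illustrative:

```python
# Vector DB round trip; index dimension must match the embedder (1536 for 3-small).
from pinecone import Pinecone

pc = Pinecone(api_key="...")     # assumption: key loaded from env in practice
index = pc.Index("chatbot-kb")

def upsert_chunks(chunks: list[dict], vectors: list[list[float]], namespace: str) -> None:
    index.upsert(
        vectors=[{
            "id": f"{c['source']}#{c['chunk_index']}",
            "values": vec,
            "metadata": {"text": c["text"], "source": c["source"], "title": c["title"]},
        } for c, vec in zip(chunks, vectors)],
        namespace=namespace,     # one namespace per client/website for tenant isolation
    )

def retrieve(query_vector: list[float], namespace: str, k: int = 15) -> list[dict]:
    res = index.query(vector=query_vector, top_k=k, include_metadata=True, namespace=namespace)
    return [{**m.metadata, "score": m.score} for m in res.matches]
```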
After top-K retrieval (K=15), a cross-encoder re-ranks to select top-5 most relevant chunks. Use Cohere Rerank API (hosted) or cross-encoder/ms-marco-MiniLM-L-6-v2 (local). Re-ranking significantly improves answer quality — the best chunk is not always the closest cosine match. Re-ranked chunks are passed as context to the LLM prompt.
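A re-rank sketch, assuming sentence-transformers and the local model named above:

```python
# Score (query, chunk) pairs with a cross-encoder and keep the top 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```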
Data Sources
Website Content · Crawler + sitemap + webhooks
Knowledge Base · Markdown / PDF / DOCX / Notion
External APIs · Tool-use via function calling
SQL / NoSQL DB · Product, order, user data
File Store · S3 / GCS — PDFs, manuals
Live Scrapers · On-demand fetch for dynamic data
Website crawler (Scrapy or Playwright for JS-rendered pages) runs on schedule. Sitemap.xml used as seed. Pages are deduped by canonical URL. Content change detection via content hash — only re-embed pages that changed since last run. Webhook support: CMS (Contentful, WordPress) can POST to /ingest endpoint to trigger on-demand re-indexing of updated pages.
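A change-detection sketch using a plain SHA-256 content hash; how per-page hashes are stored is elided:

```python
# Re-embed a page only when its extracted text actually changed since the last crawl.
import hashlib

def content_hash(page_text: str) -> str:
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def needs_reindex(page_text: str, stored_hash: str | None) -> bool:
    return stored_hash != content_hash(page_text)
```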
Tool-use pattern via OpenAI function calling. Tools defined as JSON schemas (name, description, parameters). Examples: get_order_status(order_id), get_product_price(sku), check_availability(product_id, location). Python backend executes the tool call, returns structured result, which is injected back into the LLM context for a final natural-language response. Tools are scoped per deployment config.
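A tool-use sketch, assuming the openai SDK; get_order_status here is a stand-in for a real database or API lookup:

```python
# Function-calling round trip: model picks a tool, backend executes, model phrases the answer.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}   # placeholder for a real lookup

def answer_with_tools(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if msg.tool_calls:
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_order_status(**args)            # real code dispatches by call name
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```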
Retrieved context chunks + rewritten query → LLM engine
▼
Step 6 — LLM engine
LLM Engine
OpenAI GPT-4o / Custom LLM
Prompt Builder · System prompt + context + query assembly
Token Counter · Context overflow guard + trim strategy
LLM Caller · OpenAI / vLLM / Ollama via LiteLLM
Streaming Handler · SSE / WebSocket chunk forwarding
Guardrails · Hallucination check + safety filter
Relevance Judge · Is answer on-topic? Loop or respond?
Prompt template (versioned in YAML config): [System persona + bot rules] + [Top-5 retrieved context chunks with source labels] + [Last N conversation turns] + [Rewritten user query]. System prompt defines: bot name/persona, response language, response format (concise/verbose), citation style, fallback instruction ("if unsure, say so — don't guess"), and tool descriptions. Prompt versioned and A/B testable.
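An assembly sketch mirroring that template; the YAML config loading is elided and the source-label format is illustrative:

```python
# Build the messages array: system persona + labeled context, then history, then the query.
def build_messages(system_prompt: str, chunks: list[dict],
                   history: list[dict], user_query: str) -> list[dict]:
    context = "\n\n".join(
        f"[Source: {c['title']} ({c['source']})]\n{c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": f"{system_prompt}\n\nContext:\n{context}"},
        *history,                                  # last N turns as role/content dicts
        {"role": "user", "content": user_query},   # rewritten query from Step 4
    ]
```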
tiktoken (for OpenAI models) counts tokens before each call. Budget allocation: 512 for system prompt, 1500 for context chunks, 800 for conversation history, remainder for response. If over budget: first trim oldest history turns, then trim lowest-ranked context chunks. Never trim system prompt or user query.
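A budget-trim sketch, assuming tiktoken; the numbers mirror the allocation above:

```python
# Count tokens per part and trim in priority order: oldest history, then weakest chunks.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")   # assumption: recent tiktoken knows this model

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_to_budget(history: list[str], chunks: list[str],
                   history_budget: int = 800, chunk_budget: int = 1500):
    while history and sum(n_tokens(t) for t in history) > history_budget:
        history.pop(0)    # drop oldest turns first
    while chunks and sum(n_tokens(c) for c in chunks) > chunk_budget:
        chunks.pop()      # then drop lowest-ranked context chunks
    return history, chunks
```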
LiteLLM wrapper abstracts the LLM provider — swap OpenAI for any model without changing business logic. Primary: OpenAI GPT-4o (best quality). Cost fallback: GPT-4o-mini or GPT-3.5-turbo. Custom LLM: vLLM server (GPU) with fine-tuned Mistral-7B or Llama-3. Fully local: Ollama with llama3 or phi-3. Model routing configurable per intent type (e.g. FAQ → cheap model, complex task → GPT-4o).
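A routing sketch, assuming litellm; the intent-to-model map is illustrative:

```python
# One call signature for every provider; swap models without touching business logic.
from litellm import completion

MODEL_BY_INTENT = {
    "faq": "gpt-4o-mini",
    "chitchat": "gpt-4o-mini",
    "task": "gpt-4o",
    "escalate": "gpt-4o",   # assumption: "ollama/llama3" would route fully local instead
}

def call_llm(intent: str, messages: list[dict], stream: bool = True):
    model = MODEL_BY_INTENT.get(intent, "gpt-4o")
    return completion(model=model, messages=messages, stream=stream)
```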
Post-generation checks: (1) Citation check — does the answer use facts from the provided context, or is it hallucinating? (2) Scope check — is the response about the configured topic domain? (3) Safety filter — does it contain harmful, offensive, or off-brand content? (4) Confidence check — if LLM expresses uncertainty, trigger clarification or human escalation. Guardrails implemented via a second LLM call with a classification prompt, or rule-based regex for cost efficiency.
After guardrails: a lightweight relevance judge evaluates "Does this response actually answer the user's query?" using a binary classification prompt. If NO → loops back to the Query Processor with a "needs_more_context" flag, which triggers a deeper RAG retrieval or a different source. Max 2 retry loops per turn to prevent infinite loops. After 2 failures → fallback response + human escalation offer.
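A judge-plus-retry sketch, assuming the openai SDK; generate stands in for the full RAG + LLM call:

```python
# Binary relevance check with the bounded retry loop described above.
from openai import OpenAI

client = OpenAI()

def answers_the_query(query: str, response: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Does the response actually answer the user's query? Reply YES or NO only."},
            {"role": "user", "content": f"Query: {query}\n\nResponse: {response}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def answer_with_retries(query: str, generate) -> str:
    answer = generate(query, deep=False)       # placeholder: RAG + LLM pipeline call
    for _ in range(2):                         # max 2 retry loops per turn
        if answers_the_query(query, answer):
            return answer
        answer = generate(query, deep=True)    # needs_more_context: deeper retrieval
    if answers_the_query(query, answer):
        return answer
    return "I'm not sure I can answer that. Would you like me to connect you with a team member?"
```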
↩ If relevance check fails → loops back to Step 4 with "needs_more_context" flag (max 2 retries per turn)
Validated response dispatched to output pipeline
▼
Step 7 — Response & output pipeline
Output Pipeline
Response Formatter · Markdown → HTML or plain text
TTS Engine · ElevenLabs (primary voice provider)
Audio Encoder · MP3 / Opus streaming chunks
Source Citations · Append source doc references
Stream Chunker · Flush tokens live via WebSocket
Fallback Handler · Escalate to human agent trigger
Text responses: rendered as Markdown in the chat widget (code blocks, bold, lists, links). Voice responses: stripped of Markdown before TTS (no reading "asterisk asterisk" aloud). Response includes metadata: source citations array, confidence score, session_id, turn_id. Citation format: each claim linked to its source chunk URL for transparency.
Voice mode: text is NOT sent to TTS as a full block — instead split into sentences (~50–80 words each) and TTS-encoded sentence-by-sentence to minimize first-audio latency. ElevenLabs (Turbo v2.5 or selected ElevenLabs voice model) is used as the production voice provider. Audio is streamed back as base64-encoded MP3 chunks via WebSocket, and the React audio player queues and plays chunks in order.
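A framing sketch of the sentence-by-sentence flow; synthesize stands in for the ElevenLabs call, and the splitter is deliberately naive:

```python
# Encode and flush audio one sentence at a time to minimize first-audio latency.
import base64
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def stream_tts(ws, text: str, synthesize) -> None:
    for sentence in split_sentences(text):
        mp3_bytes = synthesize(sentence)   # placeholder: ElevenLabs Turbo v2.5 request
        await ws.send_json({
            "type": "audio",
            "content": base64.b64encode(mp3_bytes).decode("ascii"),
        })
```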
LLM response is streamed token-by-token from the OpenAI API (stream=True) and forwarded immediately via WebSocket to the React client. Client renders tokens as they arrive — user sees text appearing in real-time, not waiting for full response. Each WebSocket message: {"type":"token","content":"word"}. Final message: {"type":"done","citations":[...],"turn_id":"..."}
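A forwarding sketch, assuming the openai SDK; forward_fn stands in for the WebSocket send:

```python
# Relay each delta as it arrives, then emit the terminal "done" frame.
from openai import OpenAI

client = OpenAI()

def stream_answer(messages: list[dict], forward_fn) -> str:
    parts = []
    stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)
    for event in stream:
        if not event.choices:
            continue
        delta = event.choices[0].delta.content
        if delta:
            parts.append(delta)
            forward_fn({"type": "token", "content": delta})
    forward_fn({"type": "done"})   # real payload also carries citations and turn_id
    return "".join(parts)
```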
Escalation triggers: (1) LLM confidence too low after 2 retries, (2) user explicitly asks for a human, (3) sensitive topics detected (legal, medical, billing disputes), (4) consecutive negative feedback signals. Escalation: sends handoff payload (full conversation history + user contact info) to a human agent queue (Zendesk / Intercom / custom queue). Bot informs user: "I'm connecting you with a team member."
Delivered to React client via WebSocket stream
▼
Step 8 — Monitoring, logging & analytics
Observability Stack
Conversation Logger · Full turn history → PostgreSQL
Latency Tracker · Per-step P50/P95 timing
Error Monitor · Sentry / Datadog alerts
Token Usage · Cost per session / per day
Feedback Capture · 👍 👎 ratings + free text
Analytics Dashboard · Common queries, drop-off, coverage
Every conversation turn is logged with: session_id, user_id, timestamp, raw input, rewritten query, retrieved chunk IDs, LLM model used, response text, latency_ms (per pipeline step), token counts (prompt + completion), guardrail results, and feedback score. Stored in PostgreSQL. Used for: debugging, prompt tuning, KB gap analysis, and compliance.
Token usage is tracked per session and aggregated per day and per client. LLM cost = (prompt_tokens × input_price) + (completion_tokens × output_price). STT cost = audio_minutes × Whisper rate. TTS cost = characters synthesized × the configured ElevenLabs per-character rate. The dashboard shows cost trends and alerts on anomalies (e.g. runaway loops generating 10× normal token volume).
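A cost-accounting sketch; every rate below is an illustrative placeholder, not a current price:

```python
# Per-turn cost roll-up for the three metered services.
PRICES_PER_TOKEN = {"gpt-4o": {"in": 2.50 / 1_000_000, "out": 10.00 / 1_000_000}}  # assumption

def llm_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES_PER_TOKEN[model]
    return prompt_tokens * p["in"] + completion_tokens * p["out"]

def stt_cost(audio_minutes: float, whisper_rate_per_min: float) -> float:
    return audio_minutes * whisper_rate_per_min

def tts_cost(characters: int, elevenlabs_rate_per_char: float) -> float:
    return characters * elevenlabs_rate_per_char
```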
Thumbs up/down widget shown after each bot response. Negative feedback triggers: immediate flag in the logs, optional free-text "what went wrong?" prompt, and adds the turn to a review queue for KB improvement. Positive feedback is used to identify high-confidence responses that can be cached or used as fine-tuning examples. Feedback data feeds the weekly bot quality review.
Complete Tech Stack Reference
Frontend: React 18, Zustand, react-markdown, WebSocket API, Web Audio API, MediaStream API, Vite
Backend: Python 3.11+, FastAPI, Uvicorn, Celery, PostgreSQL (primary DB), Redis (cache), SQLAlchemy
STT (Speech-to-Text): OpenAI Whisper API, faster-whisper (self-hosted), Deepgram Nova-2 (streaming)
TTS (Text-to-Speech): ElevenLabs Turbo v2.5 (primary)
LLM: OpenAI GPT-4o / GPT-4o-mini, LiteLLM wrapper, vLLM (custom), Ollama (local)
RAG Framework: LangChain / LlamaIndex, text-embedding-3-small, Pinecone / ChromaDB
Re-ranking: Cohere Rerank API, cross-encoder/ms-marco-MiniLM (local)
iFrame / Embed: Vanilla JS loader script, postMessage bridge, dynamic iFrame injection
Auth / Security: JWT (python-jose), bcrypt, Redis rate limiter, CORS, input sanitization
Ingestion Queue: Celery + Redis broker, Scrapy / Playwright crawler, Pandas for preprocessing
Monitoring: Sentry, Prometheus + Grafana, custom analytics DB, LangSmith (LLM tracing)
Deploy: Docker, Nginx reverse proxy, Kubernetes (prod), GitHub Actions CI/CD