Q&A System (RAG)¶
Overview¶
This project implements a production-grade Retrieval-Augmented Generation (RAG) system for intelligent question-answering over a corpus of 35 political speeches (300,000+ words). The system combines semantic search, keyword matching, and large language models to provide accurate, well-sourced answers to natural language questions.
What It Does:
- Answers natural language questions about political speech content
- Retrieves relevant context from a 35-speech corpus
- Generates AI-powered answers with source citations
- Provides confidence scoring and explainability
- Extracts and analyzes entities mentioned in queries
Perfect For:
- Political speech research
- Policy position analysis
- Comparative speech analysis
- Entity-specific question answering
Architectural Highlights:
- Modular Design: Separated concerns with dedicated components for search, confidence, entities, and document loading
- Testable: 65%+ test coverage with component-level unit tests
- Type-Safe: Pydantic models for all RAG data structures
- Maintainable: Clear separation of concerns, easy to extend and debug
Installation & Setup¶
Prerequisites¶
Python Version: 3.11 or 3.12 (as specified in pyproject.toml)
Package Manager: This project uses uv for dependency management.
Quick Start¶
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/JustaKris/Trump-Rally-Speeches-NLP-Chatbot.git
cd Trump-Rally-Speeches-NLP-Chatbot
# Install dependencies (creates .venv automatically)
uv sync
# Configure environment
cp .env.example .env
# Edit .env: Set LLM_API_KEY and LLM_PROVIDER
# Run the server
uv run uvicorn speech_nlp.app:app --reload
API available at http://localhost:8000.
Dependencies¶
Core RAG dependencies (automatically installed with uv sync):
- `chromadb>=0.5.0` — Vector database for embeddings
- `sentence-transformers>=3.3.0` — MPNet embeddings (768d)
- `langchain>=0.3.0` — Document chunking utilities
- `rank-bm25>=0.2.2` — BM25 keyword search
- `google-generativeai>=0.8.0` — Gemini LLM (default)
Optional LLM Providers:
# Install OpenAI support
uv sync --group llm-openai
# Install Claude support
uv sync --group llm-anthropic
Set `LLM_PROVIDER=openai` or `LLM_PROVIDER=anthropic` in `.env` after installing.
System Architecture¶
Core Components¶
Orchestration:
- `RAGService` (`services/rag/service.py`) — Manages the ChromaDB collection and coordinates components
Specialized Services (services/rag/):
- `SearchEngine` (`search_engine.py`) — Hybrid search with semantic, BM25, and cross-encoder reranking
- `RAGGuardrails` (`guardrails.py`) — Three-layer pipeline protection: query validation, relevance filtering, grounding verification
- `ConfidenceCalculator` (`confidence.py`) — Multi-factor confidence scoring
- `EntityAnalyzer` (`entity_analyzer.py`) — Entity extraction, sentiment, and co-occurrence analysis
- `QueryRewriter` (`query_rewriter.py`) — LLM-powered query optimisation for improved search retrieval
- `DocumentLoader` (`document_loader.py`) — Semantic chunking with embedding-based topic boundary detection; extracts structured metadata (location, date, year) from speech filenames
Supporting Services:
- `GeminiLLM` (`services/llm_service.py`) — Answer generation with Google Gemini
Core Architecture¶
Vector Database¶
- ChromaDB with persistent storage
- MPNet embeddings (768 dimensions) for semantic understanding
- Efficient querying with smart deduplication
Search Engine¶
- Hybrid search combining dense embeddings with BM25 sparse retrieval
- Cross-encoder reranking for precision optimization
- Configurable weights for semantic vs keyword balance
- Deduplication removes duplicate results by ID
LLM Integration¶
- Pluggable LLM Providers: Gemini (default), OpenAI GPT, or Anthropic Claude
- Configuration: Via `LLM_PROVIDER` and `LLM_API_KEY` environment variables
- Model Selection: Configurable via `LLM_MODEL_NAME` (e.g., `gemini-2.0-flash-exp`, `gpt-4o-mini`, `claude-3-5-sonnet-20241022`)
- Context-aware prompting: Entity-focused generation for targeted queries
- Fallback extraction: Works without LLM (extraction-based answers)
Advanced Features¶
- Multi-factor confidence scoring
- Entity extraction and analytics
- Sentiment analysis for entities
- Co-occurrence analysis
- Source attribution with citations
Key Features¶
1. Intelligent Question Answering¶
Ask natural language questions and receive AI-generated answers with supporting evidence.
Example:
Response includes:
- Generated answer from Gemini
- 5 supporting context chunks
- Confidence score with explanation
- Source document attribution
- Entity statistics (if applicable)
2. Multi-Factor Confidence Scoring¶
Sophisticated confidence assessment handled by ConfidenceCalculator component.
Confidence Factors (weighted):
- Retrieval Quality (40%) — Semantic similarity of retrieved chunks
- Consistency (25%) — Low variance in scores = higher confidence
- Coverage (20%) — Number of supporting chunks (normalized 0-1)
- Entity Coverage (15%) — For entity queries, mention frequency
Confidence Levels:
- High: combined_score ≥ 0.7
- Medium: 0.4 ≤ combined_score < 0.7
- Low: combined_score < 0.4
Example output:
{
"confidence": "high",
"confidence_score": 0.87,
"confidence_explanation": "Overall confidence is HIGH (score: 0.87) based on excellent semantic match (similarity: 0.91), very consistent results (consistency: 0.93), 5 supporting context chunks",
"confidence_factors": {
"retrieval_score": 0.91,
"consistency": 0.93,
"chunk_coverage": 5,
"entity_coverage": 0.84
}
}
3. Entity Analytics & Confidence Explainability¶
The EntityAnalyzer component and confidence system provide transparency into how the system works.
Confidence Explanation¶
Every answer includes a human-readable explanation of why it has a certain confidence level:
Example:
"Overall confidence is MEDIUM (score: 0.59) based on weak semantic match (similarity: 0.22), very consistent results (consistency: 1.00), 5 supporting context chunks, 'Biden' mentioned in all retrieved chunks."
What It Explains:
- Retrieval quality (semantic similarity)
- Result consistency (variance in scores)
- Coverage (number of supporting chunks)
- Entity coverage (for entity queries)
Entity Detection & Statistics¶
Automatic entity detection with comprehensive analytics:
Features:
- Mention counts — How many times the entity appears across the entire corpus
- Speech coverage — Which specific speeches mention the entity
- Corpus percentage — Percentage of documents containing the entity
- Sentiment analysis — Average sentiment toward the entity using FinBERT
    - Analyzes up to 50 chunks containing the entity
    - Converts scores to a -1 (negative) to +1 (positive) scale
    - Classifies as Positive, Neutral, or Negative
- Co-occurrence analysis — Most common terms appearing near the entity
    - Extracts words from contexts containing the entity
    - Filters stopwords
    - Returns the top 5 associated terms
Example output:
{
"entity_statistics": {
"Biden": {
"mention_count": 524,
"speech_count": 30,
"corpus_percentage": 25.03,
"speeches": ["OhioSep21_2020.txt", "BemidjiSep18_2020.txt", ...],
"sentiment": {
"average_score": -0.61,
"classification": "Negative",
"sample_size": 50
},
"associations": ["socialism", "weakness", "failure", "china", "corrupt"]
}
}
}
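A minimal sketch of the co-occurrence step, using a deliberately small illustrative stopword set (the real analyzer's list is larger):

```python
import re
from collections import Counter

# Illustrative subset only; the actual stopword list is much longer
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "was", "he", "she", "it", "they", "that", "this", "for"}

def top_associations(contexts: list[str], entity: str, n: int = 5) -> list[str]:
    """Most common content words in chunks that mention the entity."""
    counts: Counter[str] = Counter()
    for text in contexts:
        if entity.lower() not in text.lower():
            continue  # only count words that co-occur with the entity
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words
                      if w not in STOPWORDS and w != entity.lower())
    return [w for w, _ in counts.most_common(n)]
```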
Use Cases:
- Research: "How often is Biden mentioned in these speeches?"
- Sentiment tracking: "What's the average sentiment about Biden?"
- Context discovery: "What topics are associated with healthcare?"
- Coverage analysis: "Which speeches mention climate change?"
4. Hybrid Search¶
SearchEngine component combines semantic and keyword search for optimal retrieval:
- Semantic search — Dense embeddings capture meaning and context (MPNet 768d)
- BM25 keyword search — Ensures exact term matches aren't missed
- Score combination — Configurable weights (default: 0.7 semantic, 0.3 BM25)
- Cross-encoder reranking — Optional final precision optimization
- Deduplication — Removes duplicate results by ID
Search Modes:
- `semantic` — Pure vector similarity
- `hybrid` — Combined semantic + BM25 (default)
- `reranking` — Adds a cross-encoder pass
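A minimal sketch of the score-combination step, assuming both score sets are already normalised to 0-1 (the real `SearchEngine` may normalise differently):

```python
def hybrid_scores(
    semantic: dict[str, float],
    bm25: dict[str, float],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> dict[str, float]:
    """Weighted combination of per-document semantic and BM25 scores.

    Documents missing from one ranking contribute 0 for that component.
    """
    ids = set(semantic) | set(bm25)
    return {doc_id: semantic_weight * semantic.get(doc_id, 0.0)
                    + keyword_weight * bm25.get(doc_id, 0.0)
            for doc_id in ids}
```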
5. Semantic Document Chunking¶
DocumentLoader component implements custom embedding-based semantic chunking — not LangChain's off-the-shelf splitter.
How It Works:
- Sentence tokenisation — NLTK splits document text into individual sentences
- Embedding — Each sentence is embedded with the same MPNet model used for search
- Similarity scoring — Cosine similarity is computed between consecutive sentence embeddings
- Breakpoint detection — Sentences where similarity drops below a percentile-based threshold (default: 90th percentile) mark topic boundaries
- Group merging — Sentences between breakpoints are merged into coherent chunks; groups smaller than `semantic_min_chunk_size` are folded into their neighbour
- Overflow splitting — Any group exceeding `chunk_size` falls back to `RecursiveCharacterTextSplitter` so no chunk is ever too large
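The breakpoint-detection step above can be sketched as follows. This interprets the 90th-percentile setting as the 90th percentile of consecutive cosine *distances*, which is an assumption about the actual implementation:

```python
import numpy as np

def find_breakpoints(embeddings: np.ndarray, percentile: float = 90.0) -> list[int]:
    """Return sentence indices that start a new topic segment.

    embeddings: (n_sentences, dim) array of sentence embeddings.
    A boundary is placed wherever the cosine distance between consecutive
    sentences exceeds the given percentile of all consecutive distances.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)       # consecutive cosine similarity
    distances = 1.0 - sims
    threshold = np.percentile(distances, percentile)  # percentile-based threshold
    return [i + 1 for i, d in enumerate(distances) if d > threshold]
```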
Why It Matters:
Fixed-size chunking cuts mid-paragraph, mid-sentence, even mid-word. Semantic chunking produces chunks that represent complete ideas — each chunk covers a coherent topic segment. This directly improves retrieval quality because the embedding for a coherent chunk is more meaningful than one for an arbitrary text slice.
Results: ~2,354 semantically coherent chunks from 35 speeches (vs ~1,082 with fixed 2048-char chunking).
Configuration:
rag:
chunking_strategy: "semantic" # "semantic" or "fixed"
semantic_min_chunk_size: 256 # Merge groups smaller than this
semantic_breakpoint_percentile: 90.0 # Percentile for topic-shift detection
# semantic_similarity_threshold: null # Override percentile with absolute threshold
6. Three-Layer RAG Guardrails¶
The RAGGuardrails component prevents hallucination and ensures answer quality through a three-layer protection pipeline.
Layer 1 — Pre-Retrieval Query Validation:
Rejects queries before any search is performed:
- Empty or whitespace-only queries
- Queries shorter than 3 characters
When triggered, returns a structured "no information" response immediately — no wasted compute on search/LLM.
Layer 2 — Post-Retrieval Relevance Filtering:
After search results are returned (and cross-encoder reranked), each result's relevance score is checked against a configurable threshold.
The scoring pipeline:
- Cross-encoder (`ms-marco-MiniLM-L-6-v2`) produces raw logits for each query–document pair
- Logits are sigmoid-normalised to a 0–1 probability: `score = 1 / (1 + exp(-logit))`
- Results below the threshold are dropped
The service fetches 2× the requested top_k candidates to provide filtering headroom. If all results fall below the threshold, the system returns "I don't have enough information" rather than passing irrelevant context to the LLM.
Threshold Calibration
The default threshold (0.01) is calibrated for the ms-marco-MiniLM-L-6-v2 cross-encoder on political speech transcripts. This model produces very negative logits (−3 to −7) for broad topical queries, making sigmoid values small. A threshold of 0.01 (logit ≈ −4.6) filters true noise while preserving the cross-encoder's relative ranking for legitimate queries. Tune via similarity_threshold in your config.
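The sigmoid-plus-threshold filter can be sketched as follows; the `(doc, logit)` pair format is an illustrative assumption, not the service's actual data structure:

```python
import math

def relevance_filter(results: list[tuple[str, float]],
                     threshold: float = 0.01) -> list[tuple[str, float]]:
    """Sigmoid-normalise raw cross-encoder logits and drop weak results."""
    kept = []
    for doc, logit in results:
        score = 1.0 / (1.0 + math.exp(-logit))  # map logit to a 0-1 probability
        if score >= threshold:
            kept.append((doc, score))
    return kept
```

With the default threshold, a logit of -3 (sigmoid ≈ 0.047) survives while a logit of -7 (sigmoid ≈ 0.0009) is dropped, which matches the calibration rationale above.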
Layer 3 — Post-Generation Grounding Verification:
After the LLM generates an answer, a token-overlap heuristic checks whether the answer content is grounded in the retrieved context:
- Extract content words from the answer (stop-word filtered, ~70+ common English words removed)
- Extract content words from all retrieved context chunks
- Compute overlap ratio: `grounding_score = |answer_words ∩ context_words| / |answer_words|`
- If the score falls below `grounding_threshold` (default 0.3), append a caveat warning
Refusal phrases ("I don't have enough information") always pass the grounding check.
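A sketch of the token-overlap heuristic, with an abbreviated stopword set (the actual implementation filters ~70+ common English words):

```python
# Illustrative subset only; the real stopword list is larger
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "was", "it", "that", "this", "for", "on", "with", "as"}

def grounding_score(answer: str, context_chunks: list[str]) -> float:
    """Fraction of the answer's content words that appear in the context."""
    answer_words = {w for w in answer.lower().split() if w not in STOPWORDS}
    context_words: set[str] = set()
    for chunk in context_chunks:
        context_words.update(w for w in chunk.lower().split()
                             if w not in STOPWORDS)
    if not answer_words:
        return 1.0  # nothing to verify
    return len(answer_words & context_words) / len(answer_words)
```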
Response Metadata:
Every response includes guardrails metadata:
{
"guardrails": {
"enabled": true,
"triggered": false,
"relevance_filtered": 3,
"grounding_score": 0.72,
"grounding_passed": true
}
}
Configuration:
rag:
guardrails_enabled: true # Enable/disable all guardrails
similarity_threshold: 0.01 # Min sigmoid-normalised relevance score
grounding_threshold: 0.3 # Min token-overlap for grounding check
7. LLM-Powered Query Rewriting¶
The QueryRewriter component optimises user queries before search to improve retrieval quality.
How It Works:
- User submits a natural language question
- The LLM rewrites it to be more search-friendly (fix typos, expand abbreviations, add synonyms)
- The rewritten query is used for semantic search
- The original query is preserved for entity extraction and LLM answer generation
Design Decisions:
- Deterministic rewrites: Uses `temperature=0.0` to ensure consistent results
- Safety guards: Empty-query passthrough, disabled passthrough via config, error fallback to the original query, and rejection of suspiciously long rewrites (>5× the original length)
- Separation of concerns: The rewritten query drives search; the original query drives entity analysis and answer generation. This preserves user intent while improving retrieval.
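The safety guards can be sketched as a wrapper around the LLM call; `rewrite_fn` here is a placeholder for the LLM invocation, not the project's actual API:

```python
from collections.abc import Callable

def safe_rewrite(original: str, rewrite_fn: Callable[[str], str]) -> str:
    """Apply an LLM rewrite, falling back to the original on any guard failure."""
    if not original.strip():
        return original                      # empty-query passthrough
    try:
        rewritten = rewrite_fn(original)
    except Exception:
        return original                      # error fallback to original query
    if not rewritten or len(rewritten) > 5 * len(original):
        return original                      # reject suspiciously long rewrites
    return rewritten
```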
Example:
| Original Query | Rewritten Query |
|---|---|
| "wut did trump say bout the wall" | "What did Trump say about the border wall and immigration?" |
| "economy" | "What were the economic policies and economy-related topics discussed?" |
| "What was said about China?" | "What was said about China?" (unchanged — already optimal) |
Response Metadata:
Every response includes query rewriting metadata when active:
{
"query_rewriting": {
"enabled": true,
"original_query": "wut about the wall",
"rewritten_query": "What did Trump say about the border wall and immigration?"
}
}
Configuration:
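The rewriter follows the same `rag:` config pattern as the other components; the exact key name below is an assumption based on the `query_rewriting_enabled` service parameter:

```yaml
rag:
  query_rewriting_enabled: true   # Enable/disable LLM query rewriting
```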
8. Extended Chunk Metadata¶
Each speech filename encodes the rally location and date. The extract_speech_metadata() function in DocumentLoader parses this automatically during document loading, enriching every chunk with structured metadata.
Filename Pattern: {Location}{MonthDay}_{Year}.txt
Extracted Fields:
| Field | Type | Example |
|---|---|---|
| `location` | `str` | `"Battle Creek"` |
| `year` | `int` | `2019` |
| `month` | `int` | `12` |
| `day` | `int` | `19` |
| `date` | `str` | `"2019-12-19"` (ISO format) |
Edge Cases Handled:
- CamelCase multi-word locations: `BattleCreek` → `"Battle Creek"`, `LasVegas` → `"Las Vegas"`
- Hyphenated locations: `Winston-Salem` → preserved as `"Winston-Salem"`
- Filenames that don't match the pattern: metadata is omitted gracefully (no error)
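A hypothetical reconstruction of the parser (not the project's actual `extract_speech_metadata()` code) that handles the edge cases above:

```python
import re

MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

PATTERN = re.compile(
    r"^(.*?)(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(\d{1,2})_(\d{4})\.txt$"
)

def extract_speech_metadata(filename: str) -> dict:
    """Parse '{Location}{MonthDay}_{Year}.txt' into structured fields.

    Returns an empty dict when the filename does not match the pattern.
    """
    m = PATTERN.match(filename)
    if not m:
        return {}
    raw_location, month_name, day, year = m.groups()
    # Split CamelCase into words; hyphenated names pass through untouched
    location = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw_location)
    month, day, year = MONTHS[month_name], int(day), int(year)
    return {
        "location": location,
        "year": year,
        "month": month,
        "day": day,
        "date": f"{year:04d}-{month:02d}-{day:02d}",
    }
```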
How Metadata Flows Through the Pipeline:
- Document loading — `extract_speech_metadata()` parses the filename; fields are merged into each chunk's metadata dict alongside `source`, `chunk_index`, `total_chunks`
- Vector storage — ChromaDB stores the enriched metadata; available for future metadata filtering
- Search results — `ContextChunk.from_search_result()` propagates `location`, `date`, `year` from search result metadata
- LLM context — Source labels in the prompt include location and date (e.g., `[Source 1: file.txt, Part 3, Battle Creek, 2019-12-19]`)
- API response — The `context` list in each response includes `location`, `date`, and `year` for every chunk
Example API response context entry:
{
"text": "The economy is doing tremendously well...",
"source": "BattleCreekDec19_2019.txt",
"chunk_index": 3,
"score": 0.82,
"location": "Battle Creek",
"date": "2019-12-19",
"year": 2019
}
API Usage¶
Basic Question¶
# cURL
curl -X POST "http://localhost:8000/rag/ask" \
-H "Content-Type: application/json" \
-d '{"question": "What was said about the economy?", "top_k": 5}'
# Python
import requests
response = requests.post(
"http://localhost:8000/rag/ask",
json={"question": "What was said about the economy?", "top_k": 5}
)
result = response.json()
print(result["answer"])
print(f"Confidence: {result['confidence']} ({result['confidence_score']:.2f})")
Entity Query¶
response = requests.post(
"http://localhost:8000/rag/ask",
json={"question": "What did Trump say about Biden?", "top_k": 10}
)
result = response.json()
# View entity statistics
if "entity_statistics" in result:
for entity, stats in result["entity_statistics"].items():
print(f"\n{entity}:")
print(f" Mentions: {stats['mention_count']}")
print(f" Sentiment: {stats['sentiment']['classification']}")
print(f" Associated: {', '.join(stats['associations'][:3])}")
Semantic Search¶
response = requests.post(
"http://localhost:8000/rag/search",
json={"query": "immigration policy", "top_k": 5}
)
results = response.json()["results"]
for i, result in enumerate(results, 1):
print(f"\n{i}. Source: {result['source']}")
print(f" Similarity: {result['similarity']:.3f}")
print(f" Preview: {result['text'][:100]}...")
Configuration¶
Environment Variables¶
# .env
LLM_API_KEY=your-api-key-here
LLM_PROVIDER=gemini # Options: gemini, openai, anthropic
LLM_MODEL_NAME=gemini-2.0-flash-exp
# Alternative: OpenAI
# LLM_API_KEY=sk-your-openai-key
# LLM_PROVIDER=openai
# LLM_MODEL_NAME=gpt-4o-mini
# Alternative: Claude
# LLM_API_KEY=sk-ant-your-key
# LLM_PROVIDER=anthropic
# LLM_MODEL_NAME=claude-3-5-sonnet-20241022
RAGService Parameters¶
from speech_nlp.services.rag.service import RAGService
rag = RAGService(
collection_name="speeches",
persist_directory="./data/chromadb",
embedding_model="all-mpnet-base-v2", # 768d embeddings
reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
chunk_size=2048, # ~512-768 tokens
chunk_overlap=150, # ~100-150 tokens
llm_service=llm_service, # Pluggable LLM provider
use_reranking=True, # Enable cross-encoder
use_hybrid_search=True, # Enable BM25 + semantic
query_rewriting_enabled=True, # Enable LLM query rewriting
)
Note: Hybrid search weights (semantic_weight, keyword_weight) are configured in the SearchEngine component, not at the service level.
Component Initialization¶
The RAG service automatically initializes all components:
# Initialized internally:
# - DocumentLoader (for chunking)
# - SearchEngine (for hybrid retrieval)
# - RAGGuardrails (for three-layer protection)
# - QueryRewriter (for LLM query optimisation)
# - ConfidenceCalculator (for scoring)
# - EntityAnalyzer (for entity extraction)
# - GeminiLLM (for answer generation, if use_llm=True)
API Endpoint Configuration¶
- Default `top_k`: 5 chunks
- Maximum `top_k`: 15 chunks
- Increase `top_k` for complex or entity queries
Performance¶
First Request¶
- ~30-60 seconds (model downloads + document indexing)
- Downloads ~1-2 GB of models (one-time)
Subsequent Requests¶
- ~1-3 seconds for typical queries
- ~2-5 seconds for entity analytics (sentiment analysis)
Optimization Opportunities¶
- Cache entity statistics
- Pre-compute embeddings
- Async sentiment analysis
- Redis for query caching
Technical Details¶
Models Used¶
- Embeddings: `sentence-transformers/all-mpnet-base-v2` (768d)
- Reranker: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- LLM: Google Gemini 2.5 Flash
- Sentiment: ProsusAI/finbert
Database¶
- ChromaDB 0.5.0 with SQLite persistence
- Vector index: HNSW for efficient similarity search
- Metadata filtering: Source, chunk index, timestamps
Prompt Engineering¶
- Context-limited to 4000 characters max
- Source attribution in context
- Entity-focused instructions when entities detected
- Structured output format
- Safety settings for political content
Limitations & Future Work¶
Current Limitations¶
- Entity extraction uses simple heuristics (capitalization)
- Sentiment analysis may show neutral for complex political text
- No query caching (every request recomputes)
- Synchronous processing (no async optimization)
Future Enhancements¶
- Integrate proper NER (spaCy or Hugging Face)
- Add query caching layer (Redis)
- Implement async processing
- Add temporal analysis (sentiment over time)
- Entity relationship graphs
- Fine-tune embeddings on domain data
Data Migration¶
If you're upgrading from a previous version with different embeddings, run the migration script. It will:
- Clear existing ChromaDB collection
- Reload documents with new embeddings
- Re-index all 35 speeches (~1082 chunks)
- Verify indexing completed successfully
When to run migration:
- After changing embedding models
- After updating ChromaDB version
- After modifying chunk size/overlap settings
- When experiencing search quality issues
Development Workflow¶
Running Tests¶
# Run RAG service tests
uv run pytest tests/test_rag_integration.py -v
# Run search engine tests
uv run pytest tests/test_search_engine.py -v
# Run entity analyzer tests
uv run pytest tests/test_entity_analyzer.py -v
# Run confidence calculator tests
uv run pytest tests/test_confidence.py -v
# Run all RAG-related tests with coverage
uv run pytest tests/test_*rag*.py tests/test_*search*.py tests/test_*entity*.py tests/test_*confidence*.py --cov=speech_nlp.services.rag
Code Quality¶
# Lint and format
uv run ruff check src/speech_nlp/services/rag/
uv run ruff format src/speech_nlp/services/rag/
# Type checking
uv run mypy src/speech_nlp/services/rag/
Local Development¶
# Run with hot reload
uv run uvicorn speech_nlp.app:app --reload --log-level debug
# Test RAG endpoint manually
curl -X POST http://localhost:8000/rag/ask \
-H "Content-Type: application/json" \
-d '{"question": "What was said about immigration?", "top_k": 5}'
# Check RAG statistics
curl http://localhost:8000/rag/stats
Debugging Tips¶
Enable verbose logging:
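One way to enable verbose logging, assuming the project uses Python's standard `logging` module (the `speech_nlp` logger name is an assumption based on the package name):

```python
import logging

# Show debug-level logs globally, and explicitly for the RAG package
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("speech_nlp").setLevel(logging.DEBUG)
```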
Inspect retrieved chunks:
from speech_nlp.services.rag.service import RAGService
rag = RAGService()
results = rag.search("immigration policy", top_k=5)
for i, result in enumerate(results, 1):
print(f"\n{i}. {result['source']} (similarity: {result['similarity']:.3f})")
print(f" {result['text'][:200]}...")
See Also¶
- Sentiment Analysis — Multi-model emotion and sentiment detection
- Topic Analysis — AI-powered topic extraction with semantic clustering
- Architecture — System architecture overview
- Configuration — Complete configuration reference
- API Documentation — Interactive API docs (Azure Free Tier: allow 1-5min cold start)
- Quickstart Guide — Local setup instructions
- Deployment Guide — Production deployment
- Testing Guide — Testing practices
- GitHub Repository — Source code