Q&A System (RAG)

Overview

This project implements a production-grade Retrieval-Augmented Generation (RAG) system for intelligent question-answering over a corpus of 35 political speeches (300,000+ words). The system combines semantic search, keyword matching, and large language models to provide accurate, well-sourced answers to natural language questions.

What It Does:

  • Answers natural language questions about political speech content
  • Retrieves relevant context from a 35-speech corpus
  • Generates AI-powered answers with source citations
  • Provides confidence scoring and explainability
  • Extracts and analyzes entities mentioned in queries

Perfect For:

  • Political speech research
  • Policy position analysis
  • Comparative speech analysis
  • Entity-specific question answering

Architectural Highlights:

  • Modular Design: Separated concerns with dedicated components for search, confidence, entities, and document loading
  • Testable: 65%+ test coverage with component-level unit tests
  • Type-Safe: Pydantic models for all RAG data structures
  • Maintainable: Clear separation of concerns, easy to extend and debug

Installation & Setup

Prerequisites

Python Version: 3.11 or 3.12 (as specified in pyproject.toml)

Package Manager: This project uses uv for dependency management.

Quick Start

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/JustaKris/Trump-Rally-Speeches-NLP-Chatbot.git
cd Trump-Rally-Speeches-NLP-Chatbot

# Install dependencies (creates .venv automatically)
uv sync

# Configure environment
cp .env.example .env
# Edit .env: Set LLM_API_KEY and LLM_PROVIDER

# Run the server
uv run uvicorn speech_nlp.app:app --reload

API available at http://localhost:8000.

Dependencies

Core RAG dependencies (automatically installed with uv sync):

  • chromadb>=0.5.0 — Vector database for embeddings
  • sentence-transformers>=3.3.0 — MPNet embeddings (768d)
  • langchain>=0.3.0 — Document chunking utilities
  • rank-bm25>=0.2.2 — BM25 keyword search
  • google-generativeai>=0.8.0 — Gemini LLM (default)

Optional LLM Providers:

# Install OpenAI support
uv sync --group llm-openai

# Install Claude support
uv sync --group llm-anthropic

Set LLM_PROVIDER=openai or LLM_PROVIDER=anthropic in .env after installing.


System Architecture

Core Components

Orchestration:

  • RAGService (services/rag/service.py) - Manages ChromaDB collection and coordinates components

Specialized Services (services/rag/):

  • SearchEngine (search_engine.py) - Hybrid search with semantic, BM25, and cross-encoder reranking
  • RAGGuardrails (guardrails.py) - Three-layer pipeline protection: query validation, relevance filtering, grounding verification
  • ConfidenceCalculator (confidence.py) - Multi-factor confidence scoring
  • EntityAnalyzer (entity_analyzer.py) - Entity extraction, sentiment, co-occurrence analysis
  • QueryRewriter (query_rewriter.py) - LLM-powered query optimisation for improved search retrieval
  • DocumentLoader (document_loader.py) - Semantic chunking with embedding-based topic boundary detection; extracts structured metadata (location, date, year) from speech filenames

Supporting Services:

  • GeminiLLM (services/llm_service.py) - Answer generation with Google Gemini

Core Architecture

Vector Database

  • ChromaDB with persistent storage
  • MPNet embeddings (768 dimensions) for semantic understanding
  • Efficient querying with deduplication of repeated results by ID

Search Engine

  • Hybrid search combining dense embeddings with BM25 sparse retrieval
  • Cross-encoder reranking for precision optimization
  • Configurable weights for semantic vs keyword balance
  • Deduplication removes duplicate results by ID

LLM Integration

  • Pluggable LLM Providers: Gemini (default), OpenAI GPT, or Anthropic Claude
  • Configuration: Via LLM_PROVIDER and LLM_API_KEY environment variables
  • Model Selection: Configurable via LLM_MODEL_NAME (e.g., gemini-2.0-flash-exp, gpt-4o-mini, claude-3-5-sonnet-20241022)
  • Context-aware prompting: Entity-focused generation for targeted queries
  • Fallback extraction: Works without LLM (extraction-based answers)

Advanced Features

  • Multi-factor confidence scoring
  • Entity extraction and analytics
  • Sentiment analysis for entities
  • Co-occurrence analysis
  • Source attribution with citations

Key Features

1. Intelligent Question Answering

Ask natural language questions and receive AI-generated answers with supporting evidence.

Example:

response = rag.ask("What economic policies were discussed?", top_k=5)

Response includes:

  • Generated answer from Gemini
  • 5 supporting context chunks
  • Confidence score with explanation
  • Source document attribution
  • Entity statistics (if applicable)

2. Multi-Factor Confidence Scoring

Sophisticated confidence assessment handled by ConfidenceCalculator component.

Confidence Factors (weighted):

  • Retrieval Quality (40%) — Semantic similarity of retrieved chunks
  • Consistency (25%) — Low variance in scores = higher confidence
  • Coverage (20%) — Number of supporting chunks (normalized 0-1)
  • Entity Coverage (15%) — For entity queries, mention frequency

Confidence Levels:

  • High: combined_score ≥ 0.7
  • Medium: 0.4 ≤ combined_score < 0.7
  • Low: combined_score < 0.4
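
Putting the weights and thresholds together, the combination can be sketched roughly as follows. This is an illustration only, not the actual ConfidenceCalculator code, and it assumes all four factors arrive pre-normalised to 0-1:

def combine_confidence(retrieval, consistency, coverage, entity_coverage):
    """Illustrative weighted combination of the confidence factors described above."""
    score = (0.40 * retrieval + 0.25 * consistency
             + 0.20 * coverage + 0.15 * entity_coverage)
    if score >= 0.7:
        level = "high"
    elif score >= 0.4:
        level = "medium"
    else:
        level = "low"
    return score, level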

Example output:

{
  "confidence": "high",
  "confidence_score": 0.87,
  "confidence_explanation": "Overall confidence is HIGH (score: 0.87) based on excellent semantic match (similarity: 0.91), very consistent results (consistency: 0.93), 5 supporting context chunks",
  "confidence_factors": {
    "retrieval_score": 0.91,
    "consistency": 0.93,
    "chunk_coverage": 5,
    "entity_coverage": 0.84
  }
}

3. Entity Analytics & Confidence Explainability

The EntityAnalyzer component and confidence system provide transparency into how the system works.

Confidence Explanation

Every answer includes a human-readable explanation of why it has a certain confidence level:

Example:

"Overall confidence is MEDIUM (score: 0.59) based on weak semantic match (similarity: 0.22), very consistent results (consistency: 1.00), 5 supporting context chunks, 'Biden' mentioned in all retrieved chunks."

What It Explains:

  • Retrieval quality (semantic similarity)
  • Result consistency (variance in scores)
  • Coverage (number of supporting chunks)
  • Entity coverage (for entity queries)

Entity Detection & Statistics

Automatic entity detection with comprehensive analytics:

Features:

  • Mention counts — How many times the entity appears across the entire corpus
  • Speech coverage — Which specific speeches mention the entity
  • Corpus percentage — Percentage of documents containing the entity
  • Sentiment analysis — Average sentiment toward the entity using FinBERT:
      • Analyzes up to 50 chunks containing the entity
      • Converts scores to a -1 (negative) to +1 (positive) scale
      • Classifies the result as Positive, Neutral, or Negative
  • Co-occurrence analysis — Most common terms appearing near the entity:
      • Extracts words from contexts containing the entity
      • Filters stopwords
      • Returns the top 5 associated terms
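
The sentiment and co-occurrence steps can be pictured roughly as follows. This is a simplified sketch, not the actual EntityAnalyzer code; the FinBERT label-to-score mapping, the chunk truncation, and the abbreviated stopword list are assumptions:

from collections import Counter
from transformers import pipeline

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "that"}  # abbreviated

def entity_sentiment(chunks, model_name="ProsusAI/finbert"):
    """Average FinBERT sentiment over up to 50 chunks, mapped to the [-1, +1] range."""
    classifier = pipeline("sentiment-analysis", model=model_name)
    scores = []
    for result in classifier([chunk[:512] for chunk in chunks[:50]]):
        sign = {"positive": 1, "negative": -1, "neutral": 0}[result["label"].lower()]
        scores.append(sign * result["score"])
    return sum(scores) / len(scores) if scores else 0.0

def top_associations(contexts, entity, n=5):
    """Most common stopword-filtered terms in contexts that mention the entity."""
    words = Counter()
    for text in contexts:
        words.update(w for w in text.lower().split()
                     if w.isalpha() and w not in STOPWORDS and w != entity.lower())
    return [word for word, _ in words.most_common(n)]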

Example output:

{
  "entity_statistics": {
    "Biden": {
      "mention_count": 524,
      "speech_count": 30,
      "corpus_percentage": 25.03,
      "speeches": ["OhioSep21_2020.txt", "BemidjiSep18_2020.txt", ...],
      "sentiment": {
        "average_score": -0.61,
        "classification": "Negative",
        "sample_size": 50
      },
      "associations": ["socialism", "weakness", "failure", "china", "corrupt"]
    }
  }
}

Use Cases:

  • Research: "How often is Biden mentioned in these speeches?"
  • Sentiment tracking: "What's the average sentiment about Biden?"
  • Context discovery: "What topics are associated with healthcare?"
  • Coverage analysis: "Which speeches mention climate change?"

4. Hybrid Search

The SearchEngine component combines semantic and keyword search for optimal retrieval:

  • Semantic search — Dense embeddings capture meaning and context (MPNet 768d)
  • BM25 keyword search — Ensures exact term matches aren't missed
  • Score combination — Configurable weights (default: 0.7 semantic, 0.3 BM25)
  • Cross-encoder reranking — Optional final precision optimization
  • Deduplication — Removes duplicate results by ID
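
The score-combination step can be sketched as follows. This is illustrative only; the actual SearchEngine score normalisation and data structures may differ, and the dict-of-scores inputs are an assumption:

def combine_scores(semantic_hits, bm25_hits, semantic_weight=0.7, keyword_weight=0.3):
    """Blend semantic and BM25 scores per chunk ID, deduplicating by ID."""
    combined = {}
    for chunk_id, score in semantic_hits.items():   # assumed {chunk_id: similarity in 0-1}
        combined[chunk_id] = semantic_weight * score
    for chunk_id, score in bm25_hits.items():       # assumed {chunk_id: normalised BM25 score in 0-1}
        combined[chunk_id] = combined.get(chunk_id, 0.0) + keyword_weight * score
    # Highest blended score first; each chunk ID appears exactly once
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)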

Search Modes:

  • semantic - Pure vector similarity
  • hybrid - Combined semantic + BM25 (default)
  • reranking - Adds cross-encoder pass

5. Semantic Document Chunking

DocumentLoader component implements custom embedding-based semantic chunking — not LangChain's off-the-shelf splitter.

How It Works:

  1. Sentence tokenisation — NLTK splits document text into individual sentences
  2. Embedding — Each sentence is embedded with the same MPNet model used for search
  3. Similarity scoring — Cosine similarity is computed between consecutive sentence embeddings
  4. Breakpoint detection — Sentences where similarity drops below a percentile-based threshold (default: 90th percentile) mark topic boundaries
  5. Group merging — Sentences between breakpoints are merged into coherent chunks; groups smaller than semantic_min_chunk_size are folded into their neighbour
  6. Overflow splitting — Any group exceeding chunk_size falls back to RecursiveCharacterTextSplitter so no chunk is ever too large
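
The breakpoint-detection step (steps 1-4) can be sketched like this. It is a simplified illustration of the idea, not the DocumentLoader implementation, and the exact percentile semantics are an assumption:

import numpy as np
from nltk.tokenize import sent_tokenize            # assumes NLTK 'punkt' data is installed
from sentence_transformers import SentenceTransformer

def semantic_breakpoints(text, breakpoint_percentile=90.0):
    """Return sentence indices where the similarity drop marks a likely topic boundary."""
    sentences = sent_tokenize(text)
    model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = model.encode(sentences, normalize_embeddings=True)
    # Cosine similarity between consecutive sentences (embeddings are unit-normalised)
    similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
    distances = 1.0 - similarities
    # Treat the largest (100 - percentile)% of drops as topic boundaries
    threshold = np.percentile(distances, breakpoint_percentile)
    return [i + 1 for i, distance in enumerate(distances) if distance > threshold]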

Why It Matters:

Fixed-size chunking cuts mid-paragraph, mid-sentence, even mid-word. Semantic chunking produces chunks that represent complete ideas — each chunk covers a coherent topic segment. This directly improves retrieval quality because the embedding for a coherent chunk is more meaningful than one for an arbitrary text slice.

Results: ~2,354 semantically coherent chunks from 35 speeches (vs ~1,082 with fixed 2048-char chunking).

Configuration:

rag:
  chunking_strategy: "semantic"            # "semantic" or "fixed"
  semantic_min_chunk_size: 256             # Merge groups smaller than this
  semantic_breakpoint_percentile: 90.0     # Percentile for topic-shift detection
  # semantic_similarity_threshold: null    # Override percentile with absolute threshold

6. Three-Layer RAG Guardrails

The RAGGuardrails component prevents hallucination and ensures answer quality through a three-layer protection pipeline.

Layer 1 — Pre-Retrieval Query Validation:

Rejects queries before any search is performed:

  • Empty or whitespace-only queries
  • Queries shorter than 3 characters

When triggered, returns a structured "no information" response immediately — no wasted compute on search/LLM.

Layer 2 — Post-Retrieval Relevance Filtering:

After search results are returned (and cross-encoder reranked), each result's relevance score is checked against a configurable threshold.

The scoring pipeline:

  1. Cross-encoder (ms-marco-MiniLM-L-6-v2) produces raw logits for each query–document pair
  2. Logits are sigmoid-normalised to a 0–1 probability: score = 1 / (1 + exp(-logit))
  3. Results below the threshold are dropped

The service fetches 2× the requested top_k candidates to provide filtering headroom. If all results fall below the threshold, the system returns "I don't have enough information" rather than passing irrelevant context to the LLM.
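
In code, the filtering step looks roughly like this. It is a sketch under the assumptions above, not the exact RAGGuardrails implementation; in particular, the "logit" field on each result is assumed:

import math

def filter_by_relevance(results, similarity_threshold=0.01):
    """Keep results whose sigmoid-normalised cross-encoder score clears the threshold."""
    kept = []
    for result in results:
        score = 1.0 / (1.0 + math.exp(-result["logit"]))   # sigmoid of the raw cross-encoder logit
        if score >= similarity_threshold:
            kept.append({**result, "relevance": score})
    return kept   # an empty list triggers the "I don't have enough information" response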

Threshold Calibration

The default threshold (0.01) is calibrated for the ms-marco-MiniLM-L-6-v2 cross-encoder on political speech transcripts. This model produces very negative logits (−3 to −7) for broad topical queries, making sigmoid values small. A threshold of 0.01 (logit ≈ −4.6) filters true noise while preserving the cross-encoder's relative ranking for legitimate queries. Tune via similarity_threshold in your config.

Layer 3 — Post-Generation Grounding Verification:

After the LLM generates an answer, a token-overlap heuristic checks whether the answer content is grounded in the retrieved context:

  1. Extract content words from the answer (stop-word filtered, ~70+ common English words removed)
  2. Extract content words from all retrieved context chunks
  3. Compute overlap ratio: grounding_score = |answer_words ∩ context_words| / |answer_words|
  4. If the score falls below grounding_threshold (default 0.3), append a caveat warning

Refusal phrases ("I don't have enough information") always pass the grounding check.
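
A minimal sketch of the token-overlap heuristic (the stopword list here is abbreviated and the function shape is an assumption, not the production code):

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "that"}  # abbreviated

def grounding_score(answer: str, context_chunks: list[str]) -> float:
    """Fraction of answer content words that also appear in the retrieved context."""
    answer_words = {w for w in answer.lower().split() if w.isalpha() and w not in STOPWORDS}
    context_words = {w for text in context_chunks
                     for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}
    if not answer_words:
        return 1.0                      # nothing to verify
    return len(answer_words & context_words) / len(answer_words)

# If the score falls below grounding_threshold (default 0.3), a caveat is appended to the answer.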

Response Metadata:

Every response includes guardrails metadata:

{
  "guardrails": {
    "enabled": true,
    "triggered": false,
    "relevance_filtered": 3,
    "grounding_score": 0.72,
    "grounding_passed": true
  }
}

Configuration:

rag:
  guardrails_enabled: true        # Enable/disable all guardrails
  similarity_threshold: 0.01      # Min sigmoid-normalised relevance score
  grounding_threshold: 0.3        # Min token-overlap for grounding check

7. LLM-Powered Query Rewriting

The QueryRewriter component optimises user queries before search to improve retrieval quality.

How It Works:

  1. User submits a natural language question
  2. The LLM rewrites it to be more search-friendly (fix typos, expand abbreviations, add synonyms)
  3. The rewritten query is used for semantic search
  4. The original query is preserved for entity extraction and LLM answer generation

Design Decisions:

  • Deterministic rewrites: Uses temperature=0.0 to ensure consistent results
  • Safety guards: Empty-query passthrough, passthrough when rewriting is disabled via config, fallback to the original query on errors, and rejection of suspiciously long rewrites (>5× the original length)
  • Separation of concerns: Rewritten query drives search; original query drives entity analysis and answer generation. This preserves user intent while improving retrieval.
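
The guard logic can be pictured roughly as follows. This is illustrative only; llm.generate() is a stand-in for the project's LLM interface, not its actual API:

def rewrite_query(original: str, llm, enabled: bool = True) -> str:
    """Rewrite a query for retrieval, falling back to the original on any problem."""
    if not enabled or not original.strip():
        return original                              # passthrough when disabled or empty
    try:
        rewritten = llm.generate(
            f"Rewrite this search query to be clearer and more specific: {original}",
            temperature=0.0,                         # deterministic rewrites
        ).strip()
    except Exception:
        return original                              # error fallback to the original query
    if not rewritten or len(rewritten) > 5 * len(original):
        return original                              # reject empty or suspiciously long rewrites
    return rewritten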

Example:

  • "wut did trump say bout the wall" → "What did Trump say about the border wall and immigration?"
  • "economy" → "What were the economic policies and economy-related topics discussed?"
  • "What was said about China?" → "What was said about China?" (unchanged — already optimal)

Response Metadata:

Every response includes query rewriting metadata when active:

{
  "query_rewriting": {
    "enabled": true,
    "original_query": "wut about the wall",
    "rewritten_query": "What did Trump say about the border wall and immigration?"
  }
}

Configuration:

rag:
  query_rewriting_enabled: true   # Enable/disable query rewriting

8. Extended Chunk Metadata

Each speech filename encodes the rally location and date. The extract_speech_metadata() function in DocumentLoader parses this automatically during document loading, enriching every chunk with structured metadata.

Filename Pattern: {Location}{MonthDay}_{Year}.txt

Extracted Fields:

  • location (str): "Battle Creek"
  • year (int): 2019
  • month (int): 12
  • day (int): 19
  • date (str): "2019-12-19" (ISO format)

Edge Cases Handled:

  • CamelCase multi-word locations: BattleCreek → "Battle Creek", LasVegas → "Las Vegas"
  • Hyphenated locations: Winston-Salem → preserved as "Winston-Salem"
  • Filenames that don't match the pattern: metadata is omitted gracefully (no error)
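
As an illustration of the parsing idea (a sketch only; the real extract_speech_metadata() may differ in details):

import re

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def parse_speech_filename(filename: str) -> dict:
    """Parse '{Location}{MonthDay}_{Year}.txt' into metadata; return {} if it doesn't match."""
    match = re.match(r"^(.*?)([A-Z][a-z]{2})(\d{1,2})_(\d{4})\.txt$", filename)
    if not match or match.group(2) not in MONTHS:
        return {}                                    # non-matching filenames are skipped gracefully
    raw_location, month_abbr, day, year = match.groups()
    # Split CamelCase ('BattleCreek' -> 'Battle Creek') while leaving hyphens ('Winston-Salem') intact
    location = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw_location)
    month, day, year = MONTHS[month_abbr], int(day), int(year)
    return {"location": location, "year": year, "month": month,
            "day": day, "date": f"{year:04d}-{month:02d}-{day:02d}"}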

How Metadata Flows Through the Pipeline:

  1. Document loading — extract_speech_metadata() parses the filename; fields are merged into each chunk's metadata dict alongside source, chunk_index, total_chunks
  2. Vector storage — ChromaDB stores the enriched metadata; available for future metadata filtering
  3. Search results — ContextChunk.from_search_result() propagates location, date, year from search result metadata
  4. LLM context — Source labels in the prompt include location and date (e.g., [Source 1: file.txt, Part 3, Battle Creek, 2019-12-19])
  5. API response — The context list in each response includes location, date, and year for every chunk

Example API response context entry:

{
  "text": "The economy is doing tremendously well...",
  "source": "BattleCreekDec19_2019.txt",
  "chunk_index": 3,
  "score": 0.82,
  "location": "Battle Creek",
  "date": "2019-12-19",
  "year": 2019
}

API Usage

Basic Question

# cURL
curl -X POST "http://localhost:8000/rag/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What was said about the economy?", "top_k": 5}'
# Python
import requests

response = requests.post(
    "http://localhost:8000/rag/ask",
    json={"question": "What was said about the economy?", "top_k": 5}
)

result = response.json()
print(result["answer"])
print(f"Confidence: {result['confidence']} ({result['confidence_score']:.2f})")

Entity Query

response = requests.post(
    "http://localhost:8000/rag/ask",
    json={"question": "What did Trump say about Biden?", "top_k": 10}
)

result = response.json()

# View entity statistics
if "entity_statistics" in result:
    for entity, stats in result["entity_statistics"].items():
        print(f"\n{entity}:")
        print(f"  Mentions: {stats['mention_count']}")
        print(f"  Sentiment: {stats['sentiment']['classification']}")
        print(f"  Associated: {', '.join(stats['associations'][:3])}")

Direct Search

response = requests.post(
    "http://localhost:8000/rag/search",
    json={"query": "immigration policy", "top_k": 5}
)

results = response.json()["results"]
for i, result in enumerate(results, 1):
    print(f"\n{i}. Source: {result['source']}")
    print(f"   Similarity: {result['similarity']:.3f}")
    print(f"   Preview: {result['text'][:100]}...")

Configuration

Environment Variables

# .env
LLM_API_KEY=your-api-key-here
LLM_PROVIDER=gemini  # Options: gemini, openai, anthropic
LLM_MODEL_NAME=gemini-2.0-flash-exp

# Alternative: OpenAI
# LLM_API_KEY=sk-your-openai-key
# LLM_PROVIDER=openai
# LLM_MODEL_NAME=gpt-4o-mini

# Alternative: Claude
# LLM_API_KEY=sk-ant-your-key
# LLM_PROVIDER=anthropic
# LLM_MODEL_NAME=claude-3-5-sonnet-20241022

RAGService Parameters

from speech_nlp.services.rag.service import RAGService

rag = RAGService(
    collection_name="speeches",
    persist_directory="./data/chromadb",
    embedding_model="all-mpnet-base-v2",      # 768d embeddings
    reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    chunk_size=2048,                          # ~512-768 tokens
    chunk_overlap=150,                        # ~100-150 tokens
    llm_service=llm_service,                  # Pluggable LLM provider
    use_reranking=True,                       # Enable cross-encoder
    use_hybrid_search=True,                   # Enable BM25 + semantic
    query_rewriting_enabled=True,             # Enable LLM query rewriting
)

Note: Hybrid search weights (semantic_weight, keyword_weight) are configured in the SearchEngine component, not at the service level.

Component Initialization

The RAG service automatically initializes all components:

# Initialized internally:
# - DocumentLoader (for chunking)
# - SearchEngine (for hybrid retrieval)
# - RAGGuardrails (for three-layer protection)
# - QueryRewriter (for LLM query optimisation)
# - ConfidenceCalculator (for scoring)
# - EntityAnalyzer (for entity extraction)
# - GeminiLLM (for answer generation, if use_llm=True)

API Endpoint Configuration

  • Default top_k: 5 chunks
  • Maximum top_k: 15 chunks
  • Increase for complex/entity queries

Performance

First Request

  • ~30-60 seconds (model downloads + document indexing)
  • Downloads ~1-2 GB of models (one-time)

Subsequent Requests

  • ~1-3 seconds for typical queries
  • ~2-5 seconds for entity analytics (sentiment analysis)

Optimization Opportunities

  • Cache entity statistics
  • Pre-compute embeddings
  • Async sentiment analysis
  • Redis for query caching

Technical Details

Models Used

  • Embeddings: sentence-transformers/all-mpnet-base-v2 (768d)
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
  • LLM: Google Gemini 2.5 Flash
  • Sentiment: ProsusAI/finbert

Database

  • ChromaDB 0.5.0 with SQLite persistence
  • Vector index: HNSW for efficient similarity search
  • Metadata filtering: Source, chunk index, timestamps

Prompt Engineering

  • Context-limited to 4000 characters max
  • Source attribution in context
  • Entity-focused instructions when entities detected
  • Structured output format
  • Safety settings for political content
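
A rough sketch of how such a prompt could be assembled (illustrative only; the field names and prompt wording are assumptions, not the project's actual template):

MAX_CONTEXT_CHARS = 4000

def build_prompt(question: str, chunks: list[dict], entities: list[str] | None = None) -> str:
    """Assemble a context-limited, source-attributed prompt for the LLM."""
    context_parts, used = [], 0
    for i, chunk in enumerate(chunks, 1):
        block = f"[Source {i}: {chunk['source']}, Part {chunk['chunk_index']}]\n{chunk['text']}\n"
        if used + len(block) > MAX_CONTEXT_CHARS:
            break                                    # keep total context under the character budget
        context_parts.append(block)
        used += len(block)
    focus = f"Focus on what is said about {', '.join(entities)}.\n" if entities else ""
    return (
        "Answer the question using only the context below. Cite sources like [Source 1].\n\n"
        + "\n".join(context_parts)
        + f"\n{focus}Question: {question}\nAnswer:"
    )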

Limitations & Future Work

Current Limitations

  • Entity extraction uses simple heuristics (capitalization)
  • Sentiment analysis may show neutral for complex political text
  • No query caching (every request recomputes)
  • Synchronous processing (no async optimization)

Future Enhancements

  • Integrate proper NER (spaCy or Hugging Face)
  • Add query caching layer (Redis)
  • Implement async processing
  • Add temporal analysis (sentiment over time)
  • Entity relationship graphs
  • Fine-tune embeddings on domain data

Data Migration

If you're upgrading from a previous version with different embeddings:

uv run python scripts/migrate_rag_embeddings.py

This migration script will:

  1. Clear existing ChromaDB collection
  2. Reload documents with new embeddings
  3. Re-index all 35 speeches (~1082 chunks)
  4. Verify indexing completed successfully

When to run migration:

  • After changing embedding models
  • After updating ChromaDB version
  • After modifying chunk size/overlap settings
  • When experiencing search quality issues

Development Workflow

Running Tests

# Run RAG service tests
uv run pytest tests/test_rag_integration.py -v

# Run search engine tests
uv run pytest tests/test_search_engine.py -v

# Run entity analyzer tests
uv run pytest tests/test_entity_analyzer.py -v

# Run confidence calculator tests
uv run pytest tests/test_confidence.py -v

# Run all RAG-related tests with coverage
uv run pytest tests/test_*rag*.py tests/test_*search*.py tests/test_*entity*.py tests/test_*confidence*.py --cov=speech_nlp.services.rag

Code Quality

# Lint and format
uv run ruff check src/speech_nlp/services/rag/
uv run ruff format src/speech_nlp/services/rag/

# Type checking
uv run mypy src/speech_nlp/services/rag/

Local Development

# Run with hot reload
uv run uvicorn speech_nlp.app:app --reload --log-level debug

# Test RAG endpoint manually
curl -X POST http://localhost:8000/rag/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What was said about immigration?", "top_k": 5}'

# Check RAG statistics
curl http://localhost:8000/rag/stats

Debugging Tips

Enable verbose logging:

# In configs/development.yaml
logging:
  level: DEBUG
  format: pretty  # Colored console output

Inspect retrieved chunks:

from speech_nlp.services.rag.service import RAGService

rag = RAGService()
results = rag.search("immigration policy", top_k=5)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result['source']} (similarity: {result['similarity']:.3f})")
    print(f"   {result['text'][:200]}...")

See Also