Configuration Guide¶
This project uses Pydantic Settings v2 for type-safe configuration, combining YAML config files with environment variable overrides. This is a modern, cloud-friendly pattern that works well for local development and deployments on AWS or other platforms.
Configuration Architecture¶
Core Components¶
- src/speech_nlp/config/settings.py - Central configuration module with the Settings class
- YAML config files in configs/ (e.g., configs/development.yaml, configs/production.yaml)
- .env file - Environment variables for sensitive values and overrides
- Validation - Automatic type checking and validation via Pydantic
Benefits¶
- ✅ Type-safe - Configuration values are checked against type hints when settings load
- ✅ Environment-aware - Different configs for dev/staging/prod
- ✅ Cloud-friendly - Works seamlessly with Azure, AWS, GCP
- ✅ Validated - Invalid configs fail fast with clear error messages
- ✅ Documented - Self-documenting with type hints and descriptions
Quick Start¶
1. Choose Your Environment (YAML)¶
Configuration defaults live in YAML files under configs/:
- configs/development.yaml – for local development
- configs/production.yaml – for production deployments (AWS, Azure, etc.)
By default, the app uses the development environment. You can override this via the ENVIRONMENT environment variable:
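For example, in a shell:

```bash
export ENVIRONMENT=production  # selects configs/production.yaml
```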
The active environment name is used to pick configs/<environment>.yaml.
2. Create Your .env File¶
Copy the example file:
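Assuming the repository ships a .env.example at the project root:

```bash
cp .env.example .env
```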
Use .env for secrets and overrides only (API keys, tokens, one-off tweaks). All non-sensitive defaults should live in YAML.
3. Set Your LLM Provider¶
Edit .env and configure your preferred LLM provider (sensitive values like API keys stay here):
Option A: Google Gemini (Default)¶
Get a free key at: https://ai.google.dev/
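A minimal .env for Gemini (the model name shown matches the YAML default):

```bash
LLM_PROVIDER="gemini"
LLM_API_KEY="your_gemini_api_key_here"
LLM_MODEL_NAME="gemini-2.5-flash"
```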
Option B: OpenAI¶
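Requires the optional dependency group (uv sync --group llm-openai). A minimal .env:

```bash
LLM_PROVIDER="openai"
LLM_API_KEY="sk-your_openai_api_key"
LLM_MODEL_NAME="gpt-4o-mini"
```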
Option C: Anthropic (Claude)¶
```bash
LLM_PROVIDER=anthropic
LLM_API_KEY=sk-ant-your_anthropic_api_key_here
LLM_MODEL_NAME=claude-3-5-sonnet-20241022
```
4. Run the Application¶
The app will automatically:
- Load base defaults from configs/<ENVIRONMENT>.yaml
- Apply Pydantic model defaults for any missing values
- Override with environment variables / .env values
- Validate all configuration values
- Initialize services with configured parameters
- Display startup configuration in logs
Configuration Options¶
Application Settings¶
Core metadata and logging live primarily in YAML:
```yaml
# configs/development.yaml
environment: development
log_level: DEBUG
app_name: "Trump Speeches NLP Chatbot API (Development)"
app_version: "0.1.0"
```
You can still override via .env or environment variables if needed:
```bash
ENVIRONMENT="production"  # selects configs/production.yaml
LOG_LEVEL="INFO"          # overrides YAML
APP_NAME="Custom Name"    # overrides YAML
```
LLM Provider (Multi-Provider Support)¶
Configure which LLM provider to use for answer generation, sentiment interpretation, and topic analysis.
General LLM Settings¶
In YAML we configure non-sensitive defaults under the llm section:
```yaml
llm:
  provider: "gemini"  # gemini | openai | anthropic | none
  enabled: true
  model_name: "gemini-2.5-flash"
  temperature: 0.3
  max_output_tokens: 1024
```
Sensitive values like API keys are supplied via environment variables / .env:
```bash
LLM_PROVIDER="gemini"          # optional override for provider
LLM_API_KEY="your_api_key"     # single API key for the active provider
LLM_MODEL_NAME="model-name"    # optional override for model
LLM_TEMPERATURE="0.7"          # optional override for temperature
LLM_MAX_OUTPUT_TOKENS="2048"   # optional override for max tokens
LLM_ENABLED="true"             # optional override
```
Provider-Specific Examples¶
Gemini (Default - Always Available):
```bash
LLM_PROVIDER="gemini"
LLM_API_KEY="your_gemini_api_key"
LLM_MODEL_NAME="gemini-2.0-flash-exp"  # or gemini-1.5-pro
LLM_TEMPERATURE="0.7"
LLM_MAX_OUTPUT_TOKENS="2048"
```
OpenAI (Optional - Install with uv sync --group llm-openai):
```bash
LLM_PROVIDER="openai"
LLM_API_KEY="sk-your_openai_api_key"
LLM_MODEL_NAME="gpt-4o-mini"  # or gpt-4o, gpt-4-turbo
LLM_TEMPERATURE="0.7"
LLM_MAX_OUTPUT_TOKENS="2048"
```
Anthropic (Optional - Install with uv sync --group llm-anthropic):
```bash
LLM_PROVIDER="anthropic"
LLM_API_KEY="sk-ant-your_anthropic_api_key"
LLM_MODEL_NAME="claude-3-5-sonnet-20241022"  # or claude-3-opus-20240229
LLM_TEMPERATURE="0.7"
LLM_MAX_OUTPUT_TOKENS="2048"
```
Disable LLM:
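Either setting works, per the options above:

```bash
LLM_PROVIDER="none"
# or
LLM_ENABLED="false"
```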
Switching Providers¶
1. Install the optional provider (if not already installed):

   ```bash
   uv sync --group llm-openai     # For OpenAI
   uv sync --group llm-anthropic  # For Anthropic
   uv sync --group llm-all        # For all providers
   ```

2. Update the .env file with the new provider settings.

3. Restart the application.
The application will automatically use the new provider without code changes.
ML Models¶
Configure which models to use for different tasks via YAML:
```yaml
models:
  sentiment_model_name: "ProsusAI/finbert"
  embedding_model_name: "all-mpnet-base-v2"
  reranker_model_name: "cross-encoder/ms-marco-MiniLM-L-6-v2"
  emotion_model_name: "j-hartmann/emotion-english-distilroberta-base"
```
You can override any of them via environment variables if needed:
```bash
SENTIMENT_MODEL_NAME="ProsusAI/finbert"
EMBEDDING_MODEL_NAME="all-mpnet-base-v2"
RERANKER_MODEL_NAME="cross-encoder/ms-marco-MiniLM-L-6-v2"
EMOTION_MODEL_NAME="j-hartmann/emotion-english-distilroberta-base"
```
RAG Configuration¶
These live under the rag section in YAML:
```yaml
rag:
  chromadb_persist_directory: "./data/chromadb"
  chromadb_collection_name: "speeches"
  chunk_size: 2048
  chunk_overlap: 150
  default_top_k: 5
  use_reranking: true
  use_hybrid_search: true
```
Environment variables can override them if necessary (e.g. for a one-off deployment):
```bash
CHROMADB_PERSIST_DIRECTORY="./data/chromadb"
CHROMADB_COLLECTION_NAME="speeches"
CHUNK_SIZE="2048"
CHUNK_OVERLAP="150"
DEFAULT_TOP_K="5"
USE_RERANKING="true"
USE_HYBRID_SEARCH="true"
```
Semantic Chunking¶
Control how documents are split into chunks:
```yaml
rag:
  chunking_strategy: "semantic"          # "semantic" or "fixed"
  semantic_min_chunk_size: 256           # Merge groups smaller than this (bytes)
  semantic_breakpoint_percentile: 90.0   # Percentile for topic-shift detection
  # semantic_similarity_threshold: null  # Override percentile with an absolute threshold
```
When chunking_strategy is "semantic", the DocumentLoader embeds each sentence, computes
consecutive cosine similarities, and splits at topic boundaries. Groups smaller than
semantic_min_chunk_size are merged with their neighbour; groups exceeding chunk_size fall
back to RecursiveCharacterTextSplitter. Set "fixed" to use traditional character-based splitting.
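The split logic can be sketched roughly as follows. This is an illustrative stand-in, not the actual DocumentLoader code: the embed callable is supplied by the caller, and the min-size merge and oversized-group fallback described above are omitted for brevity.

```python
# Illustrative sketch of percentile-based semantic chunking (assumed logic,
# simplified from the behaviour described in this guide).
import numpy as np

def split_semantic(sentences, embed, breakpoint_percentile=90.0):
    """Group consecutive sentences, splitting where neighbour similarity
    falls into the lowest (100 - breakpoint_percentile) percent."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vecs = np.array([embed(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)  # cosine similarity of neighbours
    threshold = np.percentile(sims, 100.0 - breakpoint_percentile)
    groups, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # topic shift detected: start a new group
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return [" ".join(group) for group in groups]
```

With breakpoint_percentile set to 90, roughly the lowest 10% of neighbour similarities become split points.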
RAG Guardrails¶
Three-layer quality gates that prevent hallucination and ensure answer grounding:
```yaml
rag:
  guardrails_enabled: true     # Master switch for all guardrail layers
  similarity_threshold: 0.01   # Min sigmoid-normalised relevance score (0-1)
  grounding_threshold: 0.3     # Min token-overlap ratio for grounding check
```
| Setting | Default | Description |
|---|---|---|
| guardrails_enabled | true | Enables pre-retrieval validation, post-retrieval relevance filtering, and post-generation grounding verification |
| similarity_threshold | 0.01 | Minimum sigmoid-normalised cross-encoder score. Results below this are dropped before reaching the LLM. Calibrated for the ms-marco-MiniLM-L-6-v2 model on speech transcripts |
| grounding_threshold | 0.3 | Minimum token-overlap ratio between the generated answer and the retrieved context. Answers below this get a caveat warning appended |
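The grounding check can be illustrated with a simple token-overlap ratio. This is a sketch of the idea only; the project's actual tokenisation and normalisation may differ.

```python
# Sketch of a token-overlap grounding check (illustrative, not the project's code).
def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

An answer whose score falls below grounding_threshold (0.3 by default) gets the caveat warning appended.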
Environment variable overrides:
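Following the naming pattern used for the other RAG settings in this guide (the exact variable names below are assumed from that pattern):

```bash
GUARDRAILS_ENABLED="true"
SIMILARITY_THRESHOLD="0.01"
GROUNDING_THRESHOLD="0.3"
```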
Query Rewriting¶
LLM-powered query optimisation that rewrites user queries before search to fix typos, expand abbreviations, and improve retrieval quality:
| Setting | Default | Description |
|---|---|---|
| query_rewriting_enabled | true | Enables LLM-powered query rewriting before search. Requires an active LLM provider. Falls back to the original query on error |
Environment variable override:
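Following the same naming pattern as the other overrides (the variable name is assumed from that pattern):

```bash
QUERY_REWRITING_ENABLED="true"
```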
Data Directories¶
Configured under paths in YAML:
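For example (the key names below are illustrative; check configs/development.yaml for the actual ones):

```yaml
paths:
  data_dir: "./data"
  speeches_dir: "./data/speeches"
```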
API Settings¶
These are grouped under the api section in YAML:
In production (e.g. AWS) you might use something like:
```yaml
# configs/production.yaml
api:
  host: "0.0.0.0"
  port: 8000
  reload: false
  cors_origins:
    - "https://your-domain.com"
```
Environment-Specific Configs¶
Development¶
```yaml
# configs/development.yaml
environment: development
log_level: DEBUG
api:
  host: "0.0.0.0"
  port: 8000
  reload: true
  cors_origins:
    - "*"
```
Production example¶
```yaml
# configs/production.yaml
environment: production
log_level: INFO
app_name: "Trump Speeches NLP Chatbot API"
api:
  host: "0.0.0.0"
  port: 8000
  reload: false
  cors_origins:
    - "https://your-domain.com"
```
Using Configuration in Code¶
Accessing Settings¶
```python
from speech_nlp.config.settings import get_settings

settings = get_settings()

# Access values from nested sections
print(settings.llm.provider)
print(settings.rag.chunk_size)
print(settings.log_level)
```
Type-Safe Access¶
```python
# All settings are type-checked
settings.rag.chunk_size   # int
settings.llm.temperature  # float
settings.rag.use_reranking  # bool
settings.llm.provider     # Literal["gemini", "openai", "anthropic", "none"]
```
Helper Methods¶
```python
# Check if LLM is configured
if settings.is_llm_configured():
    api_key = settings.get_llm_api_key()
    model = settings.get_llm_model_name()

# Get Path objects
speeches_path = settings.get_speeches_path()
chromadb_path = settings.get_chromadb_path()

# Setup logging
settings.setup_logging()
```
Logging Configuration¶
The project uses src/speech_nlp/config/logging.py for production-ready logging with automatic format detection.
Log Levels¶
- DEBUG: Detailed diagnostic information for troubleshooting
- INFO: Important application events (default, recommended for production)
- WARNING: Unexpected but recoverable situations
- ERROR: Application errors requiring attention
- CRITICAL: System-critical failures
Log Formats¶
Development (Colored)¶
Automatically enabled when ENVIRONMENT=development:
```
2025-11-04 12:34:56 | INFO  | speech_nlp.app          | Application startup complete
2025-11-04 12:34:57 | DEBUG | speech_nlp.services.rag | Performing hybrid search
```
- ANSI colors by level (green=INFO, red=ERROR, etc.)
- Human-readable timestamps
- Module names right-aligned
Production (JSON)¶
Automatically enabled when ENVIRONMENT=production:
```json
{"timestamp": "2025-11-04 12:34:56", "level": "INFO", "name": "speech_nlp.app", "message": "Application startup complete"}
{"timestamp": "2025-11-04 12:34:57", "level": "DEBUG", "name": "speech_nlp.services.rag", "message": "Performing hybrid search"}
```
- Machine-parseable JSON
- Compatible with Azure Application Insights, CloudWatch, ELK stack
- Automatic exception field for errors
Changing Log Settings¶
Edit .env:
```bash
# Log level
LOG_LEVEL="INFO"   # Recommended for production
LOG_LEVEL="DEBUG"  # Verbose for debugging

# Environment (affects format)
ENVIRONMENT="development"  # Colored logs
ENVIRONMENT="production"   # JSON logs
```
The logging system automatically:
- Detects environment and chooses appropriate format
- Suppresses noisy third-party loggers (chromadb, httpx, transformers)
- Configures uvicorn logs
- Filters ChromaDB telemetry errors
For detailed logging documentation, see docs/development/logging.md.
Azure Deployment¶
Azure App Service automatically loads environment variables. Configure them in:
- Azure Portal: App Service → Configuration → Application Settings
- Azure CLI:
```bash
az webapp config appsettings set --name myapp --resource-group mygroup \
  --settings GEMINI_API_KEY="your_key" LOG_LEVEL="INFO"
```
Docker Deployment¶
Using .env file¶
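Pass the whole file to the container at runtime (the image name your-image is a placeholder):

```bash
docker run --env-file .env -p 8000:8000 your-image
```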
Using environment variables¶
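Pass individual variables instead (again, your-image is a placeholder):

```bash
docker run -e GEMINI_API_KEY="your_key" -e LOG_LEVEL="INFO" -p 8000:8000 your-image
```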
Docker Compose¶
```yaml
services:
  api:
    build: .
    environment:
      - GEMINI_API_KEY=${GEMINI_API_KEY}
      - LOG_LEVEL=${LOG_LEVEL:-INFO}
    env_file:
      - .env
    ports:
      - "8000:8000"
```
Validation¶
Pydantic automatically validates configuration:
Example Validation Errors¶
```bash
# Invalid log level
LOG_LEVEL="INVALID"
# ❌ Error: Invalid log level. Must be one of: DEBUG, INFO, WARNING, ERROR, CRITICAL

# Invalid chunk size
CHUNK_SIZE="not_a_number"
# ❌ Error: Input should be a valid integer

# Missing required API key (when LLM enabled)
LLM_ENABLED="true"
GEMINI_API_KEY=""
# ❌ Error: API key appears to be too short
```
Best Practices¶
- Never commit .env - Add it to .gitignore
- Use .env.example - Document all available options
- Validate early - Settings load at startup and fail fast
- Environment-specific - Different configs for dev/prod
- Security - Use Azure Key Vault for sensitive values in production
- Logging - Use appropriate log levels for each environment
Troubleshooting¶
Settings not loading¶
Check:
- .env file exists in the project root (for secrets/overrides)
- A YAML config exists at configs/<ENVIRONMENT>.yaml (or configs/development.yaml by default)
- File encoding is UTF-8
- No syntax errors in .env or YAML files
Invalid configuration¶
Check logs at startup:
```
ERROR: ValidationError: 1 validation error for Settings
Invalid log level. Must be one of: DEBUG, INFO, WARNING, ERROR, CRITICAL
```
API key issues¶
```bash
# Check if API key is set
python -c "from speech_nlp.config.settings import get_settings; print(get_settings().get_llm_api_key())"
```
Migration from Old Code¶
If you were using environment variables directly:
Before:
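A typical direct-access pattern looked something like this (a hypothetical example of the old style, not code from this repository):

```python
# Old style: read environment variables directly, with no validation
import os

api_key = os.environ.get("GEMINI_API_KEY")  # may silently be None
```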
After:
```python
from speech_nlp.config import get_settings

settings = get_settings()
api_key = settings.gemini_api_key  # Type-safe!
```