OMDB API Client

The OMDB (Open Movie Database) API client provides async functionality for fetching detailed movie ratings, awards, and metadata to enrich the core TMDB data.

Overview

The OMDB client complements TMDB data with additional ratings and metadata. Key features:

Async/await using httpx.AsyncClient for concurrent requests
Rate limiting via shared AsyncRateLimiter (default: 2 requests/second)
Batch operations for efficient multi-movie fetching
Automatic retries with exponential backoff
Data normalization to consistent storage format
Graceful error handling for missing or invalid IMDb IDs

API Endpoint

OMDB uses a single, simple endpoint for all queries:

Movie by IMDb ID

Base URL: http://www.omdbapi.com/

Method: GET

Required Parameters:

apikey: Your OMDB API key
i: IMDb ID (e.g., "tt0111161" for The Shawshank Redemption)

Optional Parameters:

plot: "short" (default) or "full"
type: "movie", "series", or "episode"

Response: JSON object with movie details

Rate Limit: Free tier typically allows 1,000 requests/day
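As a rough illustration of the parameters listed above, the sketch below assembles a lookup URL using only the standard library. The helper name build_omdb_url and the media_type argument are hypothetical and not part of the client, which sends its requests through httpx.

```python
from typing import Optional
from urllib.parse import urlencode

OMDB_BASE_URL = "http://www.omdbapi.com/"  # base URL quoted in this section


def build_omdb_url(api_key: str, imdb_id: str, plot: str = "short",
                   media_type: Optional[str] = None) -> str:
    """Build a GET URL for a lookup by IMDb ID (hypothetical helper)."""
    if plot not in ("short", "full"):
        raise ValueError("plot must be 'short' or 'full'")
    # Required parameters first, then the optional ones.
    params = {"apikey": api_key, "i": imdb_id, "plot": plot}
    if media_type is not None:
        params["type"] = media_type  # "movie", "series", or "episode"
    return f"{OMDB_BASE_URL}?{urlencode(params)}"


print(build_omdb_url("your_api_key", "tt0111161", plot="full"))
```

The API key travels in the query string, so URLs built this way should be kept out of logs.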

Data We Collect

Core OMDB Fields

The OMDB API provides rich data that complements TMDB:

Ratings & Scores:

imdbRating: IMDb user rating (0-10)
imdbVotes: Number of IMDb votes (e.g., "2,500,000")
Metascore: Metacritic score (0-100)
Ratings: Array of ratings from multiple sources:
IMDb
Rotten Tomatoes (percentage)
Metacritic

Awards & Recognition:

Awards: Text description (e.g., "Won 2 Oscars. 12 wins & 15 nominations")
Useful for analyzing critical acclaim vs. commercial success
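Because Awards arrives as free text, any analysis needs to pull numbers out of it first. The following is a minimal sketch, assuming the strings follow the shape of the example above; parse_awards is a hypothetical helper, not part of the client, and real responses may use wordings it does not recognize.

```python
import re
from typing import Dict


def parse_awards(text: str) -> Dict[str, int]:
    """Pull rough counts out of an OMDB-style awards string (illustrative only)."""

    def first_int(pattern: str) -> int:
        # Return the first captured number, or 0 when the phrase is absent.
        match = re.search(pattern, text, flags=re.IGNORECASE)
        return int(match.group(1)) if match else 0

    return {
        "oscars_won": first_int(r"won (\d+) oscars?"),
        "wins": first_int(r"(\d+) wins?"),
        "nominations": first_int(r"(\d+) nominations?"),
    }


print(parse_awards("Won 2 Oscars. 12 wins & 15 nominations"))
```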

Additional Metadata:

Rated: MPAA rating (G, PG, PG-13, R, etc.)
Runtime: Length in minutes (returned as a string, e.g., "142 min")
Genre: Comma-separated genres
Director: Director name(s)
Writer: Writer name(s)
Actors: Main cast (comma-separated)
Plot: Synopsis
Language: Primary language(s)
Country: Country of origin
BoxOffice: Box office earnings (US format, e.g., "$28,341,469")
Released: Release date

Usage Examples

Basic Initialization

from pathlib import Path

from ayne.data_collection.omdb import OMDBClient

# Initialize with default settings
client = OMDBClient()

# Initialize with custom settings
client = OMDBClient(
    api_key="your_api_key",
    requests_per_second=1.0,  # Slower rate for free tier
    max_concurrent=3,         # Fewer concurrent requests
    output_dir=Path("custom/path")
)

Fetch Single Movie

# Fetch by IMDb ID (await must run inside an async function or an asyncio REPL)
movie = await client.get_movie_by_imdb_id("tt0111161")

if movie:
    print(f"Title: {movie['title']}")
    print(f"IMDb Rating: {movie['imdb_rating']}")
    print(f"Metascore: {movie['metascore']}")
    print(f"Awards: {movie['awards']}")

Batch Fetching

# Fetch multiple movies
imdb_ids = ["tt0111161", "tt0068646", "tt0071562", "tt0468569"]
movies = await client.get_batch_movies(imdb_ids)

print(f"Successfully fetched {len(movies)} movies")

With Progress Tracking

def progress_callback(current, total):
    print(f"Progress: {current}/{total} ({current/total*100:.1f}%)")

imdb_ids = ["tt0111161", "tt0068646", "tt0071562"]
movies = await client.get_batch_movies(
    imdb_ids,
    progress_callback=progress_callback
)

Rate Limiting Strategy

Why Conservative Limits?

OMDB has stricter rate limits than TMDB:

Free tier: 1,000 requests/day
Paid tier: Higher limits available

Our Default Configuration

# Default: 2 requests/second, 5 concurrent
client = OMDBClient(
    requests_per_second=2.0,
    max_concurrent=5
)

This translates to:

~7,200 requests/hour (theoretical max)
~172,800 requests/day (theoretical max)
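In practice, throughput can be lower: with max_concurrent=5 the rate is also capped at 5 divided by the average response time. On the free tier, the 1,000 requests/day quota is reached long before the theoretical daily maximum.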
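The figures above follow directly from the configured rate. The short calculation below reproduces them and also shows how quickly a sustained rate would use up the free-tier quota quoted in this document; the function and constant names are illustrative only.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400
FREE_TIER_DAILY_QUOTA = 1000  # free-tier figure quoted in this document


def throughput(requests_per_second: float) -> dict:
    """Theoretical request ceilings for a sustained request rate."""
    return {
        "per_hour": requests_per_second * SECONDS_PER_HOUR,
        "per_day": requests_per_second * SECONDS_PER_DAY,
        "seconds_to_spend_free_quota": FREE_TIER_DAILY_QUOTA / requests_per_second,
    }


print(throughput(2.0))
```

At the default 2 requests/second, the free quota lasts 500 seconds of sustained fetching, which is why the daily quota rather than the per-second rate is the practical constraint on the free tier.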

Token Bucket Implementation

Uses shared AsyncRateLimiter:

Tokens replenish at requests_per_second rate
Semaphore limits concurrent requests
Async context manager ensures proper resource handling
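To make the three points above concrete, here is a minimal token bucket combined with a semaphore. It is a sketch under stated assumptions, not the project's AsyncRateLimiter: the class name, the burst capacity, and the injectable clock are all hypothetical choices made for illustration.

```python
import asyncio
import time
from typing import Callable


class TokenBucketSketch:
    """Token bucket plus semaphore. Illustrative; not the project's AsyncRateLimiter."""

    def __init__(self, requests_per_second: float, max_concurrent: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = requests_per_second
        self.capacity = max(1.0, requests_per_second)  # assumed burst size
        self.tokens = self.capacity
        self._clock = clock
        self._last_refill = clock()
        self._lock = asyncio.Lock()
        self._semaphore = asyncio.Semaphore(max_concurrent)

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self._clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last_refill) * self.rate)
        self._last_refill = now

    async def __aenter__(self) -> "TokenBucketSketch":
        await self._semaphore.acquire()  # cap the number of in-flight requests
        try:
            async with self._lock:  # one waiter at a time takes a token
                self._refill()
                while self.tokens < 1.0:
                    await asyncio.sleep((1.0 - self.tokens) / self.rate)
                    self._refill()
                self.tokens -= 1.0
        except BaseException:
            self._semaphore.release()  # do not leak the slot on cancellation
            raise
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._semaphore.release()


async def main() -> None:
    # Create the limiter inside the running loop for older Python versions.
    limiter = TokenBucketSketch(requests_per_second=2.0, max_concurrent=5)
    async with limiter:
        pass  # an HTTP request would go here


asyncio.run(main())
```

Taking the semaphore before the token means a request waiting for a token already occupies a concurrency slot; the real limiter may order these differently.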

Error Handling

Automatic Retries

# Configured in _request method
await retry_with_backoff(
    make_request,
    retry_count=3,      # Try up to 3 times
    base_delay=1.0,     # Start with 1s delay
    max_delay=10.0,     # Max 10s between retries
    exceptions=(httpx.HTTPError, httpx.TimeoutException)
)
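The call above passes a retry count, a base delay, and a delay cap. One way such a helper could behave is sketched below, assuming the delay doubles after each failed attempt; the project's retry_with_backoff may differ (for example, by adding jitter), and the names retry_sketch and backoff_delay are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float, max_delay: float) -> float:
    """Delay after failed attempt number `attempt` (0-based): base * 2**attempt, capped."""
    return min(max_delay, base_delay * (2 ** attempt))


async def retry_sketch(
    func: Callable[[], Awaitable[T]],
    retry_count: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call `func` up to `retry_count` times, sleeping between failed attempts."""
    for attempt in range(retry_count):
        try:
            return await func()
        except exceptions:
            if attempt == retry_count - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff_delay(attempt, base_delay, max_delay))
    raise ValueError("retry_count must be at least 1")
```

With the settings shown above (three attempts, 1s base, 10s cap), this schedule would wait 1s after the first failure and 2s after the second, so the cap only matters for larger retry counts.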

Graceful Degradation

# Individual failures don't stop batch processing
movies = await client.get_batch_movies(imdb_ids)

# Returns only successful fetches
# Failures are logged but don't raise exceptions

Common Error Scenarios

Invalid IMDb ID: Returns None, logs error
Network timeout: Retries up to 3 times
Rate limit exceeded: Backs off automatically
Movie not found: Returns None, not an error
Daily quota exceeded: Saves partial data and raises APIRateLimitExceeded

Quota Exceeded Handling

When the OMDB daily quota is exceeded (the API responds with 401 Unauthorized), get_batch_movies raises APIRateLimitExceeded:

try:
    movies = await client.get_batch_movies(imdb_ids)
except APIRateLimitExceeded as e:
    # Exception includes partial data that was successfully fetched
    print(f"Fetched {e.items_processed}/{e.total_requested} movies")
    print(f"Partial data available: {len(e.partial_data)} movies")

    # Orchestrator automatically saves partial_data before re-raising
    # No data is lost when quota is hit

Key Features:

Successfully fetched movies are included in the exception
Orchestrator saves all partial data to database before re-raising
CLI reports accurate progress (e.g., "990/1,100 movies successfully enriched")
Timestamps are updated for saved movies
No data loss when quota limit is reached
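The pattern of carrying partial results inside the exception can be sketched as follows. Everything here is hypothetical and only mirrors the attribute names used in the example above (partial_data, items_processed, total_requested); it is not the project's APIRateLimitExceeded or its batch loop.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Movie = Dict[str, Any]


class QuotaSignal(Exception):
    """Raised by the hypothetical fetcher when the API reports a spent quota."""


class QuotaExceededSketch(Exception):
    """Carries whatever was fetched before the quota ran out."""

    def __init__(self, partial_data: List[Movie], total_requested: int) -> None:
        super().__init__(
            f"quota exceeded after {len(partial_data)}/{total_requested} items"
        )
        self.partial_data = partial_data
        self.items_processed = len(partial_data)
        self.total_requested = total_requested


async def fetch_until_quota(
    imdb_ids: List[str],
    fetch_one: Callable[[str], Awaitable[Optional[Movie]]],
) -> List[Movie]:
    """Fetch sequentially; on a quota signal, raise with the results so far."""
    results: List[Movie] = []
    for imdb_id in imdb_ids:
        try:
            movie = await fetch_one(imdb_id)
        except QuotaSignal:
            raise QuotaExceededSketch(results, len(imdb_ids)) from None
        if movie is not None:  # missing movies are skipped, not errors
            results.append(movie)
    return results
```

A caller can then persist exc.partial_data in its except block before re-raising, which is the behavior the orchestrator is described as providing.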

Data Normalization

Raw OMDB responses are normalized before storage.

Input (Raw OMDB Response)

{
  "Response": "True",
  "imdbID": "tt0111161",
  "Title": "The Shawshank Redemption",
  "imdbRating": "9.3",
  "imdbVotes": "2,500,000",
  "Metascore": "80",
  "Awards": "Nominated for 7 Oscars...",
  "Ratings": [
    {"Source": "Internet Movie Database", "Value": "9.3/10"},
    {"Source": "Rotten Tomatoes", "Value": "91%"},
    {"Source": "Metacritic", "Value": "80/100"}
  ],
  "BoxOffice": "$28,341,469",
  ...
}

Output (Normalized)

{
    "imdb_id": "tt0111161",
    "title": "The Shawshank Redemption",
    "imdb_rating": 9.3,              # Converted to float
    "imdb_votes": 2500000,           # Cleaned and converted
    "metascore": 80,                 # Converted to int
    "rotten_tomatoes_score": 91,    # Extracted and converted
    "awards": "Nominated for 7 Oscars...",
    "box_office": 28341469,          # Cleaned and converted
    # ... other normalized fields
}
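The conversion from the raw response to the normalized record can be sketched roughly as below. This is an illustrative approximation covering only the fields shown in the two examples, not the code in the project's normalizers module; the helper names are hypothetical, and treating "N/A" as missing reflects the placeholder OMDB uses for absent values.

```python
from typing import Any, Dict, List, Optional


def _clean(value: Optional[str]) -> Optional[str]:
    """Map empty strings and OMDB's "N/A" placeholder to None."""
    if value is None or value.strip() in ("", "N/A"):
        return None
    return value.strip()


def _to_int(value: Optional[str]) -> Optional[int]:
    """Strip commas, currency and percent signs, then convert to int."""
    cleaned = _clean(value)
    if cleaned is None:
        return None
    digits = cleaned.replace(",", "").replace("$", "").replace("%", "")
    return int(digits) if digits.isdigit() else None


def _to_float(value: Optional[str]) -> Optional[float]:
    cleaned = _clean(value)
    try:
        return float(cleaned) if cleaned is not None else None
    except ValueError:
        return None


def _rotten_tomatoes(ratings: List[Dict[str, str]]) -> Optional[int]:
    """Find the Rotten Tomatoes entry in the nested Ratings array."""
    for rating in ratings:
        if rating.get("Source") == "Rotten Tomatoes":
            return _to_int(rating.get("Value"))
    return None


def normalize_sketch(data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a normalized record, or None when OMDB reports no result."""
    if data.get("Response") != "True":
        return None
    return {
        "imdb_id": data.get("imdbID"),
        "title": _clean(data.get("Title")),
        "imdb_rating": _to_float(data.get("imdbRating")),
        "imdb_votes": _to_int(data.get("imdbVotes")),
        "metascore": _to_int(data.get("Metascore")),
        "rotten_tomatoes_score": _rotten_tomatoes(data.get("Ratings") or []),
        "awards": _clean(data.get("Awards")),
        "box_office": _to_int(data.get("BoxOffice")),
    }
```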

Normalization Logic

Located in src/data_collection/omdb/normalizers.py:

from typing import Any, Dict, Optional


def normalize_movie_response(data: Dict) -> Optional[Dict[str, Any]]:
    """Convert OMDB API response to normalized format.

    Handles:
    - Type conversions (strings to numbers)
    - Cleaning (remove commas, currency symbols)
    - Extraction (ratings from nested structure)
    - Null handling (empty strings to None)
    """

Integration with Data Collection

Typical Workflow

OMDB data is fetched after TMDB data:

TMDB Discovery: Find movies by year/criteria
TMDB Details: Fetch full details including imdb_id
OMDB Enrichment: Use IMDb IDs to fetch ratings/awards
Storage: Save to omdb_movies table

Orchestrator Integration

from ayne.data_collection.orchestrator import DataCollectionOrchestrator

# Orchestrator manages the workflow
orchestrator = DataCollectionOrchestrator(db)

# Refresh workflow automatically:
# 1. Determines which movies need OMDB updates
# 2. Extracts their IMDb IDs from tmdb_movies table
# 3. Fetches OMDB data in batches
# 4. Updates omdb_movies table
# 5. Updates timestamps in movies table
stats = await orchestrator.refresh_movie_data(
    movies_df,
    fetch_omdb=True
)

Why After TMDB?

OMDB requires IMDb IDs
TMDB provides IMDb IDs in movie details
Allows us to skip OMDB for movies without IMDb IDs

Database Schema

Storage Table: `omdb_movies`

CREATE TABLE omdb_movies (
    imdb_id VARCHAR PRIMARY KEY,
    title VARCHAR,
    year INTEGER,
    rated VARCHAR,              -- MPAA rating
    released DATE,
    runtime INTEGER,            -- minutes
    genre VARCHAR,
    director VARCHAR,
    writer VARCHAR,
    actors VARCHAR,
    plot TEXT,
    language VARCHAR,
    country VARCHAR,
    awards TEXT,
    imdb_rating DOUBLE,
    imdb_votes INTEGER,
    metascore INTEGER,
    rotten_tomatoes_score INTEGER,
    box_office BIGINT,          -- USD, whole dollars (e.g., 28341469 for "$28,341,469")
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
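Since imdb_id is the primary key, re-fetching a movie should update its row rather than insert a duplicate. The sketch below shows one way to do that, using the standard library's sqlite3 purely as a stand-in (the project's actual database engine and storage code are not shown here), a trimmed copy of the schema, and a hypothetical upsert_movie helper. It assumes an SQLite build with UPSERT support (3.24 or later).

```python
import sqlite3
from datetime import datetime, timezone
from typing import Any, Dict

# Trimmed version of the schema above, for illustration only.
SCHEMA = """
CREATE TABLE omdb_movies (
    imdb_id VARCHAR PRIMARY KEY,
    title VARCHAR,
    imdb_rating DOUBLE,
    box_office BIGINT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
)
"""


def upsert_movie(conn: sqlite3.Connection, movie: Dict[str, Any]) -> None:
    """Insert a movie, or refresh it while leaving created_at untouched."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO omdb_movies
            (imdb_id, title, imdb_rating, box_office, created_at, updated_at)
        VALUES (:imdb_id, :title, :imdb_rating, :box_office, :now, :now)
        ON CONFLICT (imdb_id) DO UPDATE SET
            title = excluded.title,
            imdb_rating = excluded.imdb_rating,
            box_office = excluded.box_office,
            updated_at = excluded.updated_at
        """,
        {**movie, "now": now},
    )


demo = sqlite3.connect(":memory:")
demo.execute(SCHEMA)
upsert_movie(demo, {"imdb_id": "tt0111161", "title": "The Shawshank Redemption",
                    "imdb_rating": 9.3, "box_office": 28341469})
print(demo.execute("SELECT COUNT(*) FROM omdb_movies").fetchone()[0])
```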

Relationship to Other Tables

movies (main table)
  ├─ tmdb_id → tmdb_movies (TMDB details)
  │            └─ imdb_id → omdb_movies (OMDB enrichment)
  ├─ imdb_id (copied from tmdb_movies)
  └─ last_omdb_update (timestamp tracking)

Best Practices

1. Always Check for IMDb IDs

# Filter out movies without IMDb IDs before calling OMDB
imdb_ids = [imdb_id for imdb_id in movie_ids if imdb_id and imdb_id.startswith('tt')]

2. Respect Rate Limits

# For free tier, be extra conservative
client = OMDBClient(requests_per_second=1.0)

3. Handle Missing Data

# OMDB data is optional enrichment
movie = await client.get_movie_by_imdb_id(imdb_id)
if movie:
    ...  # Use OMDB data
else:
    ...  # Continue without OMDB data

4. Batch Operations

# Always prefer batch operations over individual requests
# Bad: Multiple individual calls
for imdb_id in imdb_ids:
    await client.get_movie_by_imdb_id(imdb_id)

# Good: Single batch call
await client.get_batch_movies(imdb_ids)

5. Monitor API Quota

Track daily request count
Set up alerts for quota thresholds
Consider paid tier if exceeding free limits
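A request counter with an alert threshold is enough to cover the first two points. The class below is a hypothetical sketch, not something the client provides; it assumes the quota resets with the local calendar day, which may not match when OMDB actually resets it.

```python
from datetime import date
from typing import Callable


class DailyQuotaTracker:
    """Count requests per calendar day and flag when a threshold is crossed."""

    def __init__(self, daily_limit: int = 1000, alert_ratio: float = 0.9,
                 today: Callable[[], date] = date.today) -> None:
        self.daily_limit = daily_limit
        self.alert_ratio = alert_ratio
        self._today = today  # injectable so the rollover can be tested
        self._day = today()
        self.count = 0

    def record(self, n: int = 1) -> None:
        """Register n requests, resetting the counter when the day changes."""
        current = self._today()
        if current != self._day:
            self._day = current
            self.count = 0
        self.count += n

    @property
    def remaining(self) -> int:
        return max(0, self.daily_limit - self.count)

    @property
    def should_alert(self) -> bool:
        return self.count >= self.alert_ratio * self.daily_limit


tracker = DailyQuotaTracker(daily_limit=1000, alert_ratio=0.9)
tracker.record(4)  # e.g. after a batch of four lookups
print(tracker.remaining, tracker.should_alert)
```

The counter lives in memory only, so a long-running collection job would need to persist it between runs to stay accurate.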

Comparison: OMDB vs TMDB

Feature	TMDB	OMDB
Primary Use	Discovery & details	Ratings & awards
Rate Limit (our setting)	4/sec	2/sec
Free Tier	40,000/day	1,000/day
Identifier	`tmdb_id`	`imdb_id`
Ratings	User votes only	IMDb, RT, Metacritic
Box Office	Budget/revenue	US box office
Awards	❌ No	✅ Yes
Collection Order	First	Second

Troubleshooting

OMDB Data Not Updating

Check:

Is the movie in tmdb_movies with valid imdb_id?
Does the movie pass refresh interval check?
Are OMDB API credentials configured?

# Debug: Check if movie has IMDb ID
query = "SELECT tmdb_id, imdb_id FROM tmdb_movies WHERE tmdb_id = ?"
result = db.query(query, [12345])

Rate Limit Errors

Solution: Reduce requests_per_second

client = OMDBClient(requests_per_second=0.5)  # Slower

Missing Ratings

Some movies may not have all rating sources:

IMDb rating usually present
Rotten Tomatoes may be missing
Metacritic may be missing

This is normal and handled by normalization.

API Reference

For complete OMDB API documentation, see the OMDB API Docs at http://www.omdbapi.com/.

Related Documentation

TMDB Client - Primary data source
Data Orchestration - Workflow coordination
Refresh Strategy - When to update OMDB data
Rate Limiting - Token bucket implementation