Skip to content

Database Monitoring and Auditing

Guide for monitoring database health, auditing data quality, and troubleshooting refresh strategy issues.

Overview

The project includes two Jupyter notebooks for database monitoring:

  • 00_database_monitoring.ipynb: Daily operational monitoring
  • 01_database_audit.ipynb: Comprehensive quality auditing

Both notebooks are located in notebooks/database/.

Daily Monitoring (00_database_monitoring.ipynb)

Purpose

Quick health check of database state and recent activity.

What It Covers

  1. Database Overview
  2. Total movie count
  3. Date range coverage
  4. ID coverage (TMDB/IMDb)
  5. Enrichment percentages

  6. Enrichment Status by Year

  7. Movies per year with TMDB/OMDB data
  8. Visual charts of enrichment coverage
  9. Completion rates by year

  10. Movies Due for Update

  11. Movies needing refresh by year
  12. Never-refreshed vs overdue counts
  13. Top years needing attention

  14. Data Quality Metrics

  15. Field completeness percentages
  16. Missing data identification
  17. Quality trends over time

  18. Recent Update Activity

  19. TMDB/OMDB updates (last 14 days)
  20. Full refresh counts
  21. Activity trends and patterns

  22. Custom Queries

  23. Top genres
  24. Monthly growth
  25. Ad-hoc analysis

When to Use

  • Daily: Quick health check before/after collection runs
  • After major updates: Verify expected changes
  • Before ML training: Ensure data quality
  • Troubleshooting: Identify gaps or issues

Example Usage

# Run the notebook to see:
# - Total movies: 32,534
# - TMDB enriched: 100%
# - OMDB enriched: 11.6%
# - Movies due for update: 462
# - Recent activity: 150 updates in last 14 days

Quality Auditing (01_database_audit.ipynb)

Purpose

Comprehensive data quality and refresh strategy validation.

What It Audits

1. Refresh Strategy Validation

Refresh Flag Consistency:

  • Verifies last_full_refresh is set correctly
  • Checks for movies with both updates but missing flag
  • Validates flag matches actual enrichment state

Freezing Logic Validation:

  • Confirms only archived movies (>365 days) are frozen
  • Verifies no recent/established/mature movies frozen
  • Analyzes frozen percentage by age category

Consecutive Unchanged Tracking:

  • Distribution of unchanged cycle counts
  • Validates movies frozen at correct threshold (3+ cycles)
  • Identifies movies frozen prematurely

2. Data Completeness

Core Movie Fields:

  • Title completeness
  • Release date coverage
  • TMDB/IMDb ID coverage
  • Update timestamp presence

TMDB Enrichment Fields:

  • Overview text
  • Genres
  • Production companies/countries
  • Runtime and vote counts
  • Spoken languages

OMDB Enrichment Fields:

  • Age ratings
  • IMDb ratings and votes
  • Metascores
  • Awards
  • Box office data
  • Language/country info

3. Timestamp Consistency

  • Verifies update timestamps match data presence
  • Identifies orphaned timestamps (timestamp but no data)
  • Finds missing timestamps (data but no timestamp)
  • Ensures referential integrity

4. API Enrichment Coverage

  • Coverage percentages by year (last 20 years)
  • Identifies years with low enrichment
  • Visual trends of TMDB/OMDB coverage
  • Highlights gaps needing attention

5. Data Quality Outliers

Suspicious Ratings:

  • Large discrepancies between TMDB and IMDb ratings (>2.5 points)
  • Suspiciously high ratings (>9.5)
  • Potential data quality issues

Missing Critical Fields:

  • Fully refreshed movies missing key data
  • Movies flagged for manual review
  • Incomplete enrichment identification

6. Overall Health Score

Calculates a 0-100 health score based on:

  • Refresh flag consistency (target: 100%)
  • Freezing accuracy (target: 100%)
  • Data completeness (target: >95%)
  • Timestamp consistency (target: 100%)

Generates prioritized recommendations for improvements.

When to Use

  • Weekly: Regular health checks during active development
  • Monthly: Production monitoring and maintenance
  • After migrations: Verify database state post-update
  • Before releases: Ensure data quality meets standards
  • Troubleshooting: Deep-dive into suspected issues

Example Output

OVERALL HEALTH SCORE: 96.8/100

✅ Refresh Consistency:      100.0/100
✅ Freezing Accuracy:        100.0/100
⚠️  Data Completeness:        89.5/100
✅ Timestamp Consistency:    99.0/100

RECOMMENDED ACTIONS:
1. 📊 Prioritize enrichment for movies with incomplete core fields
2. 📅 Focus OMDB enrichment on 3 years with <70% coverage

Key Metrics to Monitor

Health Indicators

1. Enrichment Coverage

Target: 90%+ for recent movies (last 5 years)

SELECT
    COUNT(CASE WHEN last_tmdb_update IS NOT NULL THEN 1 END) * 100.0 / COUNT(*) as tmdb_pct,
    COUNT(CASE WHEN last_omdb_update IS NOT NULL THEN 1 END) * 100.0 / COUNT(*) as omdb_pct
FROM movies
WHERE release_date >= DATE_SUB(CURRENT_DATE, INTERVAL 5 YEAR);

2. Refresh Flag Accuracy

Target: 100% (all movies with both updates have last_full_refresh set)

SELECT
    COUNT(*) as has_both,
    COUNT(CASE WHEN last_full_refresh IS NOT NULL THEN 1 END) as has_flag,
    COUNT(CASE WHEN last_full_refresh IS NULL THEN 1 END) as missing_flag
FROM movies
WHERE last_tmdb_update IS NOT NULL AND last_omdb_update IS NOT NULL;

3. Frozen Movie Percentage

Target: <10% of total movies

SELECT
    COUNT(CASE WHEN data_frozen = TRUE THEN 1 END) as frozen,
    COUNT(*) as total,
    COUNT(CASE WHEN data_frozen = TRUE THEN 1 END) * 100.0 / COUNT(*) as frozen_pct
FROM movies;

4. Movies Due for Update

Target: <5% of database

-- Movies overdue for refresh
SELECT COUNT(*) FROM movies
WHERE data_frozen = FALSE
AND (
    (DATEDIFF('day', release_date, CURRENT_DATE) <= 60 AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) > 5)
    OR (DATEDIFF('day', release_date, CURRENT_DATE) > 60 AND DATEDIFF('day', release_date, CURRENT_DATE) <= 180
        AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) > 15)
    -- ... etc for other age categories
);

Performance Indicators

1. Daily API Call Estimate

Track to stay within API limits:

-- TMDB calls needed today
SELECT COUNT(*) as tmdb_calls_needed
FROM movies
WHERE data_frozen = FALSE
AND (
    last_tmdb_update IS NULL
    OR (DATEDIFF('day', release_date, CURRENT_DATE) <= 60 AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) >= 5)
    OR (DATEDIFF('day', release_date, CURRENT_DATE) BETWEEN 61 AND 180 AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) >= 15)
    OR (DATEDIFF('day', release_date, CURRENT_DATE) BETWEEN 181 AND 365 AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) >= 30)
    OR (DATEDIFF('day', release_date, CURRENT_DATE) > 365 AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) >= 90)
);

2. Unchanged Refresh Distribution

Monitor freezing efficiency:

SELECT
    consecutive_unchanged_refreshes as cycles,
    COUNT(*) as count,
    COUNT(CASE WHEN data_frozen = TRUE THEN 1 END) as frozen
FROM movies
GROUP BY consecutive_unchanged_refreshes
ORDER BY consecutive_unchanged_refreshes;
SELECT
    DATE(last_tmdb_update) as date,
    COUNT(*) as updates
FROM movies
WHERE last_tmdb_update >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
GROUP BY date
ORDER BY date DESC;

Troubleshooting Guide

Issue: High Frozen Movie Count

Symptom: >20% of movies frozen

Diagnosis:

-- Check frozen distribution by age
SELECT
    CASE
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 60 THEN 'Recent'
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 180 THEN 'Established'
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 365 THEN 'Mature'
        ELSE 'Archived'
    END as age_category,
    COUNT(CASE WHEN data_frozen = TRUE THEN 1 END) as frozen,
    COUNT(*) as total
FROM movies
GROUP BY age_category;

Solutions:

  1. Unfreeze non-archived movies (should never happen):
UPDATE movies SET data_frozen = FALSE, consecutive_unchanged_refreshes = 0
WHERE DATEDIFF('day', release_date, CURRENT_DATE) < 365;
  1. Reset archived movies if needed:
-- Unfreeze all archived movies
UPDATE movies SET data_frozen = FALSE, consecutive_unchanged_refreshes = 0
WHERE DATEDIFF('day', release_date, CURRENT_DATE) >= 365;
  1. Force refresh frozen movies:
ayne collect daily --include-frozen

Issue: Missing last_full_refresh Flags

Symptom: Movies have both updates but last_full_refresh is NULL

Diagnosis:

SELECT COUNT(*) as affected
FROM movies
WHERE last_tmdb_update IS NOT NULL
AND last_omdb_update IS NOT NULL
AND last_full_refresh IS NULL;

Solution: Run the migration to backfill:

uv run python scripts/migration_add_unchanged_counter.py

Or manually:

UPDATE movies
SET last_full_refresh = GREATEST(last_tmdb_update, last_omdb_update)
WHERE last_tmdb_update IS NOT NULL
AND last_omdb_update IS NOT NULL
AND last_full_refresh IS NULL;

Issue: Low Enrichment Coverage

Symptom: <80% OMDB coverage for recent years

Diagnosis:

SELECT
    EXTRACT(YEAR FROM release_date) as year,
    COUNT(*) as total,
    COUNT(CASE WHEN last_omdb_update IS NOT NULL THEN 1 END) as enriched,
    COUNT(CASE WHEN last_omdb_update IS NOT NULL THEN 1 END) * 100.0 / COUNT(*) as pct
FROM movies
WHERE release_date >= DATE_SUB(CURRENT_DATE, INTERVAL 5 YEAR)
GROUP BY year
ORDER BY year DESC;

Solutions:

  1. Target recent years:
ayne omdb enrich --min-year 2023 --max-movies 1000
  1. Increase daily limits:
ayne collect daily --omdb-limit 500
  1. Check for missing IMDb IDs:
-- Movies that can't be enriched (no IMDb ID)
SELECT COUNT(*) FROM movies
WHERE imdb_id IS NULL AND last_tmdb_update IS NOT NULL;

Issue: Timestamp Inconsistencies

Symptom: Data present but no update timestamp, or vice versa

Diagnosis:

-- Orphaned TMDB timestamps
SELECT COUNT(*) FROM movies m
LEFT JOIN tmdb_movies t ON m.tmdb_id = t.tmdb_id
WHERE m.last_tmdb_update IS NOT NULL AND t.tmdb_id IS NULL;

-- Missing TMDB timestamps
SELECT COUNT(*) FROM movies m
LEFT JOIN tmdb_movies t ON m.tmdb_id = t.tmdb_id
WHERE t.tmdb_id IS NOT NULL AND m.last_tmdb_update IS NULL;

Solutions:

  1. Remove orphaned timestamps:
UPDATE movies SET last_tmdb_update = NULL
WHERE movie_id IN (
    SELECT m.movie_id FROM movies m
    LEFT JOIN tmdb_movies t ON m.tmdb_id = t.tmdb_id
    WHERE m.last_tmdb_update IS NOT NULL AND t.tmdb_id IS NULL
);
  1. Add missing timestamps from data:
UPDATE movies SET last_tmdb_update = CURRENT_TIMESTAMP
WHERE movie_id IN (
    SELECT m.movie_id FROM movies m
    LEFT JOIN tmdb_movies t ON m.tmdb_id = t.tmdb_id
    WHERE t.tmdb_id IS NOT NULL AND m.last_tmdb_update IS NULL
);

Best Practices

1. Regular Monitoring Schedule

  • Daily: Quick check with 00_database_monitoring.ipynb before collection runs
  • Weekly: Full audit with 01_database_audit.ipynb during active development
  • Monthly: Comprehensive audit in production

2. Proactive Maintenance

  • Monitor frozen movie percentage (keep <15%)
  • Track enrichment coverage (maintain >90% for recent movies)
  • Watch for API quota issues (daily call estimates)
  • Review consecutive unchanged distributions

3. Documentation

  • Log audit results and trends
  • Document any manual interventions
  • Track health score over time
  • Note recommendations implemented

4. Automation

Consider scheduling monitoring:

# Weekly audit report (Linux/Mac cron)
0 8 * * 1 cd /path/to/project && uv run jupyter nbconvert --execute notebooks/database/01_database_audit.ipynb --to html --output /path/to/reports/audit_$(date +\%Y\%m\%d).html

# Daily health check before collection
0 1 * * * cd /path/to/project && uv run jupyter nbconvert --execute notebooks/database/00_database_monitoring.ipynb --to html --output /path/to/reports/daily_$(date +\%Y\%m\%d).html

Quick Reference

Health Check Checklist

  • [ ] Overall health score >90
  • [ ] Refresh flag consistency 100%
  • [ ] Freezing accuracy 100%
  • [ ] Data completeness >95%
  • [ ] Timestamp consistency >98%
  • [ ] Frozen movies <15% of total
  • [ ] Recent movies (last 5 years) >90% OMDB enriched
  • [ ] Movies due for update <5% of database
  • [ ] No suspicious ratings requiring review
  • [ ] No critical fields missing in enriched movies

Common Queries

-- Quick health summary
SELECT
    COUNT(*) as total_movies,
    COUNT(CASE WHEN data_frozen = TRUE THEN 1 END) * 100.0 / COUNT(*) as frozen_pct,
    COUNT(CASE WHEN last_full_refresh IS NOT NULL THEN 1 END) * 100.0 / COUNT(*) as fully_refreshed_pct,
    AVG(consecutive_unchanged_refreshes) as avg_unchanged_cycles
FROM movies;

-- Recent enrichment activity
SELECT
    DATE(last_tmdb_update) as date,
    COUNT(*) as tmdb_updates,
    COUNT(CASE WHEN last_omdb_update = last_tmdb_update THEN 1 END) as omdb_updates
FROM movies
WHERE last_tmdb_update >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
GROUP BY date
ORDER BY date DESC;

-- Top issues to address
SELECT
    'Missing full refresh flags' as issue,
    COUNT(*) as count
FROM movies
WHERE last_tmdb_update IS NOT NULL AND last_omdb_update IS NOT NULL AND last_full_refresh IS NULL

UNION ALL

SELECT
    'Non-archived frozen movies' as issue,
    COUNT(*) as count
FROM movies
WHERE data_frozen = TRUE AND DATEDIFF('day', release_date, CURRENT_DATE) < 365

UNION ALL

SELECT
    'Movies due for update' as issue,
    COUNT(*) as count
FROM movies
WHERE data_frozen = FALSE
AND last_full_refresh IS NOT NULL
AND DATEDIFF('day', last_tmdb_update, CURRENT_DATE) >
    CASE
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 60 THEN 5
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 180 THEN 15
        WHEN DATEDIFF('day', release_date, CURRENT_DATE) <= 365 THEN 30
        ELSE 90
    END;