IMDB scraper broken: AWS WAF blocks all HTML requests — migration to API endpoints needed #1966

@Arny80Hexa

Problem

IMDB has enabled AWS WAF JavaScript challenges on all www.imdb.com HTML endpoints. Non-browser HTTP clients (including MediaElch) receive HTTP 202 with an empty response body. The response header x-amzn-waf-action: challenge confirms the block.

This affects all HTML-based IMDB functionality:

  • Search (/find?q=...) — no results
  • Title page (/title/ttXXXX/) — no details
  • Reference page (/title/ttXXXX/reference/) — no additional data
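The challenge response described above could be recognised in code before any parsing is attempted, so that users see a clear error instead of empty results. The following is a minimal sketch in Python for illustration only (MediaElch itself is C++); the function name `is_waf_challenge` is hypothetical, and treating a 202 with an empty body as a challenge when the header is absent is an assumption.

```python
from typing import Mapping


def is_waf_challenge(status: int, headers: Mapping[str, str], body: bytes) -> bool:
    """Return True when a response looks like the AWS WAF challenge."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    if lowered.get("x-amzn-waf-action") == "challenge":
        return True
    # Fallback heuristic (assumption): HTTP 202 with an empty body.
    return status == 202 and not body.strip()
```

A caller would run this check on every www.imdb.com response and report "blocked by WAF" rather than "no results" when it returns True.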

The issue has been reported before as intermittent (#1952), but as of March 20, 2026 it appears to be permanent. The current IMDB scraper is completely non-functional.

Working alternatives

Two IMDB API endpoints remain accessible and return JSON directly (no HTML parsing needed):

1. Suggest API (for search)

  • URL: https://v3.sg.media-imdb.com/suggestion/x/{query}.json
  • Method: GET, no authentication
  • Returns: IMDB ID, title, year, type (movie/tv/short), poster URL, top cast
  • Example: Searching "Inception" returns tt1375666, year 2010, type "movie", poster, cast
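As a rough illustration of how a search against the Suggest API might look, the sketch below builds the request URL and maps a decoded response to simple result records. It is Python for illustration only. The JSON key names (`d`, `id`, `l`, `y`, `qid`, `s`, `i.imageUrl`) are assumptions about the response layout and should be verified against a live response; the function names are hypothetical.

```python
from urllib.parse import quote

SUGGEST_URL = "https://v3.sg.media-imdb.com/suggestion/x/{query}.json"


def suggest_url(query: str) -> str:
    """Build the Suggest API URL, percent-encoding the search term."""
    return SUGGEST_URL.format(query=quote(query.strip(), safe=""))


def parse_suggestions(payload: dict) -> list:
    """Map a decoded Suggest API response to a list of title records.

    All key names below are assumptions about the response format.
    """
    results = []
    for item in payload.get("d", []):
        imdb_id = item.get("id", "")
        # Keep only title entries; other ID prefixes (e.g. people) are skipped.
        if not imdb_id.startswith("tt"):
            continue
        results.append({
            "id": imdb_id,
            "title": item.get("l"),
            "year": item.get("y"),
            "type": item.get("qid"),
            "cast": item.get("s"),
            "poster": (item.get("i") or {}).get("imageUrl"),
        })
    return results
```

The URL builder and the parser are kept separate so that the parser can be unit-tested against stored JSON without network access.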

2. GraphQL API (for details)

  • URL: https://graphql.imdb.com/
  • Method: POST with JSON body, no authentication
  • Returns: Virtually all title metadata — ratings, plot, genres, runtime, cast, crew, Metacritic score, etc.
  • Example query:
    {
      title(id: "tt1375666") {
        titleText { text }
        releaseYear { year }
        ratingsSummary { aggregateRating voteCount }
        plot { plotText { plainText } }
        genres { genres { text } }
        metacritic { metascore { score } }
        runtime { seconds }
      }
    }
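To show how the query above might be sent and its result consumed, here is a minimal Python sketch (illustration only, not MediaElch code). It builds the POST request without sending it and flattens a response that mirrors the query shape. The helper names are hypothetical, the `Content-Type: application/json` header and the standard GraphQL `{"query": ...}` body are assumptions about what the endpoint expects, and the `tt` plus digits ID check is an assumed validation rule.

```python
import json
import re
import urllib.request

GRAPHQL_URL = "https://graphql.imdb.com/"

# A shortened version of the example query; %s is replaced by the validated ID.
TITLE_QUERY = """{ title(id: "%s") {
    titleText { text }
    releaseYear { year }
    ratingsSummary { aggregateRating voteCount }
    genres { genres { text } }
    metacritic { metascore { score } }
    runtime { seconds }
} }"""


def build_title_request(imdb_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one title."""
    # Validate before embedding the ID in the query string.
    if not re.fullmatch(r"tt\d+", imdb_id):
        raise ValueError("not an IMDb title ID: %r" % imdb_id)
    body = json.dumps({"query": TITLE_QUERY % imdb_id}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def dig(obj, *keys):
    """Follow nested dict keys, returning None if any level is missing or null."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def extract_summary(response: dict) -> dict:
    """Flatten a decoded response that mirrors the shape of TITLE_QUERY."""
    title = dig(response, "data", "title")
    return {
        "title": dig(title, "titleText", "text"),
        "year": dig(title, "releaseYear", "year"),
        "rating": dig(title, "ratingsSummary", "aggregateRating"),
        "genres": [dig(g, "text") for g in dig(title, "genres", "genres") or []],
        "metascore": dig(title, "metacritic", "metascore", "score"),
        "runtime_seconds": dig(title, "runtime", "seconds"),
    }
```

The request would be sent with `urllib.request.urlopen(build_title_request("tt1375666"))`. The `dig` helper is there because any field in a GraphQL response may be null, for example when a title has no Metacritic score.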

Note on terms of use

The GraphQL API response includes a disclaimer: "Public, commercial, and/or non-private use of the IMDb data provided by this API is not allowed." MediaElch is LGPL-licensed and non-commercial, but this restriction should still be weighed before the scraper relies on the endpoint.

Affected code

  • src/scrapers/imdb/ImdbApi.cpp — URL construction, HTTP requests
  • src/scrapers/imdb/ImdbSearchPage.cpp — search result parsing (HTML-based)
  • src/scrapers/imdb/ImdbJsonParser.cpp — title detail parsing from __NEXT_DATA__
  • src/scrapers/imdb/ImdbReferencePage.cpp — reference page parsing
  • All movie and TV scraper jobs that depend on these classes

Proposed approach

Replace the HTML-based scraper with API-based requests:

  1. Search: Replace ImdbSearchPage with a Suggest API parser
  2. Details: Replace ImdbJsonParser + ImdbReferencePage with GraphQL API queries
  3. Preserve the existing interface: ImdbApi remains the entry point, and only the internal implementation changes
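The idea in step 3 can be sketched as a thin facade whose backends are swapped without touching callers. This is Python for illustration only and does not reflect the actual C++ interface of ImdbApi in MediaElch; the method names and the injected-callable design are assumptions.

```python
from typing import Callable, List


class ImdbApi:
    """Hypothetical facade: callers use search() and load_title() only."""

    def __init__(self,
                 search_backend: Callable[[str], List[dict]],
                 details_backend: Callable[[str], dict]) -> None:
        # The backends are injected, so an HTML-based implementation can be
        # replaced by Suggest/GraphQL implementations without changing callers.
        self._search_backend = search_backend
        self._details_backend = details_backend

    def search(self, query: str) -> List[dict]:
        return self._search_backend(query)

    def load_title(self, imdb_id: str) -> dict:
        return self._details_backend(imdb_id)
```

Injecting the backends also makes it possible to test scraper jobs against canned data instead of live IMDB endpoints.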

This would also resolve or improve several existing issues:

Closing PRs #1955 and #1956 as they are based on the now-blocked HTML approach.

Analyzed with AI assistance (Claude Code / Opus 4.6).