A Python script to scrape CMA (Competition and Markets Authority) merger case documents, specifically Initial Enforcement Orders (IEOs), Derogations, and Revocations.
Install required dependencies:
pip install requests beautifulsoup4 lxml pandas openpyxl

Run the script:
# IEO-related cases only
python Scrape.py --out ./output --query-ieo-only
# All merger cases
python Scrape.py --out ./output --all-merger-cases
# Limit the run to 10 cases
python Scrape.py --out ./output --max-cases 10 --query-ieo-only

The script creates:
- CSV Index: cma_ieo_derogs_revocations_index.csv
- Excel Index: cma_ieo_derogs_revocations_index.xlsx
- ZIP Bundle: cma_initial_orders_derogs_revocations.zip
  - Contains index files (CSV & XLSX)
  - Contains PDFs organized by: {Case}/{Category}/{filename}.pdf
  - Categories: IEOs, Derogations, Revocations, Other
- Downloads folder: downloads/ (intermediate storage for PDFs)
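
The {Case}/{Category}/{filename}.pdf layout can be illustrated with a small path builder. This is a minimal sketch, not the code in Scrape.py: the helper names `safe_name` and `zip_member_path`, the character-cleaning rule, and the fallback to Other for unrecognized categories are all assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

# The four categories listed above.
CATEGORIES = ("IEOs", "Derogations", "Revocations", "Other")


def safe_name(text: str) -> str:
    """Replace characters that are awkward in archive paths with underscores.

    Hypothetical helper: the real script may sanitize names differently.
    """
    cleaned = re.sub(r"[^\w.\- ]+", "_", text).strip(" ._")
    return cleaned or "unnamed"


def zip_member_path(case: str, category: str, filename: str) -> str:
    """Build a {Case}/{Category}/{filename}.pdf path for use inside the ZIP."""
    if category not in CATEGORIES:
        category = "Other"  # assumption: unknown categories fall back to Other
    stem = safe_name(filename)
    if stem.lower().endswith(".pdf"):
        stem = stem[:-4]  # avoid producing "name.pdf.pdf"
    return str(PurePosixPath(safe_name(case), category, stem + ".pdf"))
```

ZIP member names always use forward slashes, which is why the sketch builds the path with `PurePosixPath` rather than the platform-specific `Path`.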
- If a download fails, it will appear in the manifest (CSV/XLSX) with an empty local_path
- Failed downloads are NOT included in the ZIP file
- Warning messages are printed to stderr for failed downloads
- The manifest provides complete visibility into all discovered documents
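
The behaviour described in this list can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual implementation: the function names `record_download` and `bundle`, the `url` column, and the injected `fetch` callable are hypothetical. Only the empty local_path convention, the stderr warning, and the ZIP exclusion come from the description above.

```python
import sys
import zipfile
from pathlib import Path
from typing import Callable


def record_download(url: str, dest: Path, fetch: Callable[[str], bytes]) -> dict:
    """Try one download and return a manifest row whether or not it succeeds."""
    try:
        data = fetch(url)  # hypothetical callable that returns the PDF bytes
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
        return {"url": url, "local_path": str(dest)}
    except Exception as exc:  # broad on purpose: any failure still yields a row
        print(f"WARNING: download failed for {url}: {exc}", file=sys.stderr)
        return {"url": url, "local_path": ""}


def bundle(rows: list[dict], zip_path: Path, root: Path) -> int:
    """Add only successfully downloaded files to the ZIP; return how many."""
    added = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for row in rows:
            if not row["local_path"]:
                continue  # failed download: listed in the manifest, not zipped
            path = Path(row["local_path"])
            zf.write(path, arcname=path.relative_to(root).as_posix())
            added += 1
    return added
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes the failure path easy to exercise without a network connection.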
The script reuses the --out directory. To avoid confusion:
- Remove old artifacts with rm -rf ./output before running fresh tests
- Or use a different output directory each time
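
One way to follow the second option is to derive a new directory name for every run. The sketch below assumes a timestamp-based naming scheme; the helpers `fresh_out_dir` and `build_command` are hypothetical and are not part of Scrape.py. Only the `--out` and `--query-ieo-only` flags come from the usage shown in this README.

```python
from datetime import datetime
from pathlib import Path


def fresh_out_dir(base: Path, now: datetime) -> Path:
    """Return a per-run directory such as output/run_20240102_030405."""
    return base / now.strftime("run_%Y%m%d_%H%M%S")


def build_command(out_dir: Path) -> list[str]:
    """Assemble the scraper command line for the given output directory."""
    return ["python", "Scrape.py", "--out", str(out_dir), "--query-ieo-only"]


# Example: build (but do not run) the command for a new run directory.
command = build_command(fresh_out_dir(Path("output"), datetime.now()))
```

The resulting list could be passed to `subprocess.run(command, check=True)` to launch a run whose artifacts never mix with those of an earlier one.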
Issue: When all downloads failed (e.g., network issues), the ZIP file only contained index files with no PDFs.
Root Cause: Failed downloads were not tracked in the manifest.
Fix: Failed downloads now appear in the manifest with empty local_path, providing visibility into what failed while keeping the ZIP clean (only successful downloads).
# Quick test with 5 cases
python Scrape.py --out ./test --max-cases 5 --query-ieo-only
# Full scrape of all IEO-related cases
python Scrape.py --out ./full_output --query-ieo-only
# Full scrape of ALL merger cases (comprehensive but slow)
python Scrape.py --out ./complete --all-merger-cases

- Missing dependencies: Run pip install requests beautifulsoup4 lxml pandas openpyxl
- Empty ZIP: Check the manifest CSV/XLSX to see if downloads failed (empty local_path)
- Network errors: The script prints warnings to stderr; check for connection issues
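
To check the manifest for failed downloads without opening it in a spreadsheet, a short script along these lines may help. It is a sketch that assumes only that the CSV index has a local_path column, as described above; the function name `failed_downloads` is hypothetical, and any other columns are passed through untouched.

```python
import csv
from pathlib import Path


def failed_downloads(manifest_csv: Path) -> list[dict]:
    """Return the manifest rows whose local_path is empty or missing."""
    with manifest_csv.open(newline="", encoding="utf-8") as fh:
        return [
            row
            for row in csv.DictReader(fh)
            if not (row.get("local_path") or "").strip()
        ]
```

For example, `failed_downloads(Path("output/cma_ieo_derogs_revocations_index.csv"))` would list the failed documents, assuming the index was written directly into the `--out` directory.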