
SQLCompare

SQLCompare is a Python CLI and data diff tool for comparing SQL tables, SQL query results, CSV files, and Excel/XLSX files.

It is built for analytics engineers and data engineers who need to validate migrations, backfills, model rewrites, vendor file drops, and other data pipeline changes without relying on one-off SQL checks.

Use SQLCompare to compare datasets from PostgreSQL, Snowflake, Databricks, DuckDB, CSV, and XLSX sources with the same review workflow.


What SQLCompare Does

  • Compare SQL tables before and after a logic change
  • Compare SQL query results when tables are not materialized yet
  • Compare CSV and Excel files for local QA and vendor deliveries
  • Review row-level diffs with before/after values and missing-row detection
  • Run analytics regression testing with a repeatable saved diff_id

SQLCompare supports two complementary comparison modes:

  1. Row-by-row comparison with an ID: detect missing rows, identify changed columns, and inspect before/after values.
  2. Statistical comparison: compare null counts, distinct counts, and other aggregates to understand overall impact.
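The two modes can be sketched in plain Python on toy keyed rows. This is only a conceptual illustration of what each mode reports, not SQLCompare's implementation (the tool pushes this work into SQL in your database):

```python
# Conceptual sketch of SQLCompare's two comparison modes.
# NOT the tool's implementation; just an illustration of what each mode reports.

def row_diff(previous, current):
    """Row-by-row mode: detect missing rows and changed cells, keyed by ID."""
    missing_in_current = sorted(set(previous) - set(current))
    missing_in_previous = sorted(set(current) - set(previous))
    changed = {}
    for key in set(previous) & set(current):
        for col, before in previous[key].items():
            after = current[key].get(col)
            if before != after:
                changed.setdefault(key, {})[col] = (before, after)
    return missing_in_current, missing_in_previous, changed

def column_stats(rows, column):
    """Statistical mode: aggregate view of a single column."""
    values = [r[column] for r in rows.values()]
    return {
        "count": len(values),
        "nulls": sum(v is None for v in values),
        "distinct": len(set(values)),
    }

previous = {1: {"revenue": 10}, 2: {"revenue": 20}}
current = {1: {"revenue": 10}, 2: {"revenue": 25}, 3: {"revenue": None}}
```

Row-by-row answers "which rows and cells changed"; statistical answers "how big is the overall impact" without needing to inspect individual rows.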

Use Cases

  • Validate a dbt model rewrite by comparing the old and new SQL outputs on the same key.
  • Compare warehouse tables before deploying a migration or backfill.
  • Compare SQL query results when testing filters, joins, or business logic changes.
  • Diff vendor CSV or XLSX deliveries before loading them into your warehouse.
  • Run data validation checks for analytics regression testing and release QA.

Supported Connectors and File Types

Source type       | Examples                                  | Supported workflow
SQL tables        | PostgreSQL, Snowflake, Databricks, DuckDB | sqlcompare run table ...
SQL query results | Inline SQL or .sql files                  | sqlcompare run query ...
CSV files         | Local .csv datasets                       | sqlcompare run file ...
Excel files       | Local .xlsx datasets                      | sqlcompare run file ...
Dataset configs   | YAML definitions for SQL or file sources  | sqlcompare run dataset ...

Install

Recommended:

uv tool install sqlcompare
# or
uv tool install "sqlcompare[snowflake] @ git+https://github.com/luisggc/sqlcompare" --python 3.11

Install optional connector extras as needed:

uv tool install "sqlcompare[<connector>]"

Examples:

uv tool install "sqlcompare[snowflake]"
uv tool install "sqlcompare[databricks]"

Quick Start

Compare SQL Tables

Compare two tables on a key:

export SQLCOMPARE_CONN_DEFAULT="postgresql://<user>:<pass>@<host>/<db>"
sqlcompare run table analytics.fact_sales analytics.fact_sales_new id

That command prints a diff_id. Use it for follow-up analysis:

sqlcompare review stats <diff_id>
sqlcompare review diff <diff_id> --column revenue --limit 100
sqlcompare review missing <diff_id> --side current
sqlcompare review meta <diff_id>
sqlcompare review export <diff_id> --mode summary
sqlcompare review export <diff_id> --mode complete --output ./reports/full_diff.xlsx

Compare SQL Query Results

Use this when tables are not materialized yet or when you want to compare a filtered slice.

Inline SQL:

sqlcompare run query \
  --previous "SELECT * FROM analytics.orders WHERE order_date < '2024-01-01'" \
  --current "SELECT * FROM analytics.orders WHERE order_date >= '2024-01-01'" \
  --index order_id \
  -c snowflake_prod

SQL files:

sqlcompare run query --previous queries/previous.sql --current queries/current.sql --index order_id -c snowflake_prod

Compare CSV and Excel Files

Use the same diff workflow for local file validation, vendor drops, and ad hoc QA.

sqlcompare run file path/to/previous.csv path/to/current.xlsx id

Review Report Export (XLSX)

You can export review results as a multi-tab Excel report using review export.

Modes:

  • --mode summary (default): creates Overview + per-column tabs (top 200 rows each) + SQL Reference
  • --mode complete: the same tabs without the 200-row cap (subject only to XLSX row limits)

Examples:

# Save summary report in current directory with generated timestamped filename
sqlcompare review export <diff_id> --mode summary

# Save full report to a specific location
sqlcompare review export <diff_id> --mode complete --output ./reports/full_diff.xlsx

# Save a single-column summary report
sqlcompare review export <diff_id> --mode summary --column revenue --output ./reports/revenue_diff.xlsx

Notes:

  • --output is optional; if omitted, SQLCompare generates a readable timestamped filename.
  • review export is for the standard diff view and should not be combined with review stats, review missing, or review columns.

Analytics Regression Testing and Data Validation

SQLCompare is useful when row counts are not enough and manual joins are too fragile. Instead of writing ad hoc SQL every time you change a model or pipeline, you can save a diff once and review it repeatedly with the generated diff_id.

Common validation workflows:

  • release QA for transformed tables
  • migration and backfill verification
  • regression testing after SQL logic changes
  • warehouse-to-file or file-to-file comparison during onboarding

Why SQLCompare Instead of Manual SQL Diff Checks

Manual SQL joins, notebooks, and spreadsheets usually answer one question once. SQLCompare keeps the comparison reusable:

  • run a compare once, then inspect stats, missing rows, and changed values
  • use the same workflow across warehouse tables and local files
  • export Excel review reports for debugging and handoff
  • avoid rebuilding the same outer join logic for every validation task

review meta for AI-Friendly Metadata

Use review meta to get a JSON payload describing the queryable tables and ready-to-run SQL templates for a given diff_id. This is especially useful for AI agents that need structured context before running analysis queries.

sqlcompare review meta <diff_id>

Output (JSON):

  • diff_id, resolved_diff_id, connection
  • index_columns, common_columns
  • tables: previous/current/join table names and their columns
  • queries: SQL templates for row-only checks and per-column diffs

Examples

See examples/ for datasets, commands, and captured outputs.


Core Idea: Compare Once, Analyze Many Times

SQLCompare does two things:

  1. Compare two datasets using an index (single or composite key)
  2. Persist comparison results and return a diff_id you can use to:
     • get overall stats
     • drill into a specific column’s changes
     • list missing rows (previous-only / current-only)
     • pull samples for debugging

Usage by Use Case

1) Compare SQL tables

Best for production validation and regression checks across supported connectors.

sqlcompare run table analytics.users analytics.users_new user_id,tenant_id

Why it’s useful:

  • Handles composite keys
  • Produces a saved diff ID for repeatable review

2) Compare SQL query results

Use this when tables are not materialized yet or you want a filtered slice.

Inline SQL:

sqlcompare run query \
  --previous "SELECT * FROM analytics.orders WHERE order_date < '2024-01-01'" \
  --current "SELECT * FROM analytics.orders WHERE order_date >= '2024-01-01'" \
  --index order_id \
  -c snowflake_prod

SQL files:

sqlcompare run query --previous queries/previous.sql --current queries/current.sql --index order_id -c snowflake_prod

Or create a dataset config:

previous:
  select_sql: "SELECT * FROM analytics.orders WHERE order_date < '2024-01-01'"
  index:
    - ORDER_ID

new:
  select_sql: "SELECT * FROM analytics.orders WHERE order_date >= '2024-01-01'"
  index:
    - ORDER_ID

Run the compare:

sqlcompare run dataset path/to/dataset.yaml

Why it’s useful:

  • Compare slices without touching production tables
  • Keep an auditable diff_id

3) Compare CSV and Excel files (DuckDB)

Great for ad hoc QA, one-off deliveries, or vendor drops. SQLCompare uses DuckDB under the hood — no DB server required.

Create a dataset config (supports {{here}} for relative paths):

previous:
  file_name: "{{here}}/previous.csv"
  index:
    - id

new:
  file_name: "{{here}}/current.xlsx"
  index:
    - id

Run the compare using the dataset config, or pass the files directly:

sqlcompare run dataset path/to/dataset.yaml
sqlcompare run file path/to/previous.csv path/to/current.xlsx id

Why it’s useful:

  • Same diff workflow as warehouses
  • Fast local comparisons, zero infra
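The `{{here}}` placeholder in the config above presumably resolves relative to the dataset YAML's own directory; a minimal sketch of that substitution, under that assumption (verify against your SQLCompare version):

```python
# Sketch of {{here}} substitution, ASSUMING it expands to the directory
# containing the dataset YAML file. Paths here are illustrative.
from pathlib import Path

def resolve_here(value: str, config_path: str) -> str:
    """Replace the {{here}} placeholder with the config file's directory."""
    return value.replace("{{here}}", str(Path(config_path).parent))

resolved = resolve_here("{{here}}/previous.csv", "/data/configs/dataset.yaml")
```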

4) Compare tables inside a DuckDB file

Use this when your data lives in a local .duckdb file.

export SQLCOMPARE_CONN_LOCAL="duckdb:////absolute/path/to/warehouse.duckdb"
sqlcompare run table raw.customers staged.customers id -c local

Why it’s useful:

  • Local + fast, without Snowflake/warehouse costs
  • Easy to integrate into lightweight pipelines

Configuration

SQLCompare resolves connectors in this order:

  1. default connector
  2. direct URL
  3. environment variables
  4. YAML files

Environment variables

  • SQLCOMPARE_CONN_DEFAULT — default connector URL (used when -c is omitted)
  • SQLCOMPARE_CONN_{NAME} — SQLAlchemy URL for a named connector
  • SQLCOMPARE_COMPARISON_SCHEMA — schema for comparison tables (default: sqlcompare)

Example:

export SQLCOMPARE_CONN_DEFAULT="postgresql://..."
export SQLCOMPARE_CONN_LOCAL="duckdb:////abs/path/to/db.duckdb"

YAML connections file (optional)

Location:

~/.sqlcompare/connections.yml

Example:

snowflake:
  drivername: snowflake
  username: my_user
  password: my_password
  host: my_account
  database: ANALYTICS
  schema: PUBLIC
  query:
    warehouse: COMPUTE_WH

Operational notes

  • SQLCompare creates a physical join table in the comparison schema. Ensure your connector has CREATE SCHEMA and CREATE TABLE privileges.
  • Table names are passed through directly to SQL. Provide fully-qualified or quoted names when required by your database.
  • Index columns must exist in both datasets; otherwise the command fails with a clear error.
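The join table mentioned above is, conceptually, a full outer join of the two datasets on the index columns. A hedged sketch of the kind of SQL involved (this is not the SQL SQLCompare actually generates; table and alias names are illustrative):

```python
# Conceptual sketch of the outer-join comparison that SQLCompare
# automates. NOT the tool's actual generated SQL.

def join_sql(previous: str, current: str, index_columns: list[str]) -> str:
    """Build an illustrative full outer join over the index columns."""
    on = " AND ".join(f"p.{c} = c.{c}" for c in index_columns)
    return (
        f"SELECT p.*, c.* FROM {previous} AS p "
        f"FULL OUTER JOIN {current} AS c ON {on}"
    )

sql = join_sql("analytics.users", "analytics.users_new", ["user_id", "tenant_id"])
```

Rows present on only one side surface as NULL keys on the other side of the join, which is how previous-only / current-only rows are detected; this is exactly the boilerplate the saved diff_id lets you stop rewriting by hand.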

Development

Run tests:

uv run pytest
