
Commit afe0f3a

Merge pull request #774 from ramyashreeshetty/add-bigquery-pipeline-audit-prompt
Add bigquery pipeline audit prompt
2 parents 3d0b777 + f1ff12f commit afe0f3a

File tree

2 files changed: +131 additions, −0 deletions

docs/README.prompts.md

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ Ready-to-use prompt templates for specific development scenarios and tasks, defi
| [Azure Cosmos DB NoSQL Data Modeling Expert System Prompt](../prompts/cosmosdb-datamodeling.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcosmosdb-datamodeling.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcosmosdb-datamodeling.prompt.md) | Step-by-step guide for capturing key application requirements for NoSQL use-case and produce Azure Cosmos DB Data NoSQL Model design using best practices and common patterns, artifacts_produced: "cosmosdb_requirements.md" file and "cosmosdb_data_model.md" file |
| [Azure Cost Optimize](../prompts/az-cost-optimize.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Faz-cost-optimize.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Faz-cost-optimize.prompt.md) | Analyze Azure resources used in the app (IaC files and/or resources in a target rg) and optimize costs - creating GitHub issues for identified optimizations. |
| [Azure Resource Health & Issue Diagnosis](../prompts/azure-resource-health-diagnose.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fazure-resource-health-diagnose.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fazure-resource-health-diagnose.prompt.md) | Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems. |
| [BigQuery Pipeline Audit: Cost, Safety and Production Readiness](../prompts/bigquery-pipeline-audit.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fbigquery-pipeline-audit.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fbigquery-pipeline-audit.prompt.md) | Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations. |
| [Boost Prompt](../prompts/boost-prompt.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fboost-prompt.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fboost-prompt.prompt.md) | Interactive prompt refinement workflow: interrogates scope, deliverables, constraints; copies final markdown to clipboard; never writes code. Requires the Joyride extension. |
| [C# Async Programming Best Practices](../prompts/csharp-async.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-async.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-async.prompt.md) | Get best practices for C# async programming |
| [C# Documentation Best Practices](../prompts/csharp-docs.prompt.md)<br />[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-docs.prompt.md)<br />[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-docs.prompt.md) | Ensure that C# types are documented with XML comments and follow best practices for documentation. |
prompts/bigquery-pipeline-audit.prompt.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
---
agent: 'agent'
tools: ['search/codebase', 'edit/editFiles', 'search']
description: 'Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.'
---

# BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script.
Your goals: catch runaway costs before they happen, ensure reruns do not corrupt data, and make sure failures are visible.

Analyze the codebase and respond in the structure below (A to F + Final). Reference exact function names and line locations. Suggest minimal fixes, not rewrites.
---

## A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (`client.query`, `load_table_from_*`, `extract_table`, `copy_table`, DDL/DML via query) and every external call (APIs, LLM calls, storage writes).

For each, answer:
- Is this inside a loop, retry block, or async gather?
- What is the realistic worst-case call count?
- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set? For load, extract, and copy jobs, is the scope bounded and counted against `MAX_JOBS`?
- Is the same SQL with the same params executed more than once in a single run? Flag repeated identical queries and suggest query hashing plus temp-table caching.

**Flag immediately if:**
- Any BQ query runs once per date or once per entity in a loop
- Worst-case BQ job count exceeds 20
- `maximum_bytes_billed` is missing on any `client.query` call
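The query-hashing and job-cap checks above can be sketched in plain Python. This is a minimal illustration, not the audited pipeline's code: `CachedRunner` and its names are hypothetical, and `execute` stands in for a real `client.query(sql, job_config=...).result()` call that would also set `QueryJobConfig(maximum_bytes_billed=...)`.

```python
import hashlib


class CachedRunner:
    """Deduplicate identical queries within one run and enforce a hard job cap.

    Hypothetical sketch: `execute` is any callable that actually runs the query.
    """

    def __init__(self, execute, max_jobs=20):
        self.execute = execute
        self.max_jobs = max_jobs
        self.cache = {}       # query-hash -> cached result
        self.jobs_run = 0

    def run(self, sql, params=()):
        # Hash SQL + params: identical pairs reuse the cached result.
        key = hashlib.sha256(repr((sql, params)).encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]
        if self.jobs_run >= self.max_jobs:
            raise RuntimeError(f"job cap of {self.max_jobs} exceeded")
        result = self.execute(sql, params)
        self.jobs_run += 1
        self.cache[key] = result
        return result
```

In a real pipeline the cache value could be a temp table name rather than an in-memory result.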
---

## B) DRY RUN AND EXECUTION MODES

Verify a `--mode` flag exists with at least `dry_run` and `execute` options.

- `dry_run` must print the plan and estimated scope with zero billed BQ execution (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)
- Prod must not be the default environment

If missing, propose a minimal `argparse` patch with safe defaults.
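A minimal `argparse` patch with the safe defaults described above might look like this (flag names follow the section's examples; the environment choices are illustrative):

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(description="pipeline runner")
    # Safe defaults: dry_run mode and a non-prod environment.
    p.add_argument("--mode", choices=["dry_run", "execute"], default="dry_run")
    p.add_argument("--env", choices=["dev", "staging", "prod"], default="dev")
    p.add_argument("--confirm", action="store_true",
                   help="required when --mode=execute and --env=prod")
    return p


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    # Refuse to touch prod without an explicit confirmation flag.
    if args.mode == "execute" and args.env == "prod" and not args.confirm:
        raise SystemExit("refusing to execute in prod without --confirm")
    return args
```

Running with no arguments stays in `dry_run` against `dev`; prod execution requires both `--env=prod` and `--confirm`.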
---

## C) BACKFILL AND LOOP DESIGN

**Hard fail if:** the script runs one BQ query per date or per entity in a loop.

Check that date-range backfills use one of:
1. A single set-based query with `GENERATE_DATE_ARRAY`
2. A staging table loaded with all dates, then one join query
3. Explicit chunks with a hard `MAX_CHUNKS` cap

Also check:
- Is the date range bounded by default (suggest 14 days max without `--override`)?
- If the script crashes mid-run, is it safe to re-run without double-writing?
- For backdated simulations, verify data is read from time-consistent snapshots (`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables). Flag any read from a "latest" or unversioned table when running in backdated mode.

Suggest a concrete rewrite if the current approach is row-by-row.
---

## D) QUERY SAFETY AND SCAN SIZE

For each query, check:
- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or any function that prevents pruning
- **No `SELECT *`**: only columns actually used downstream
- **Joins will not explode**: verify join keys are unique or appropriately scoped, and flag any potential many-to-many
- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after partition filtering, not on full table scans

Provide a specific SQL fix for any query that fails these checks.
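The first two checks lend themselves to a quick lint pass. This toy sketch is regex-based rather than a real SQL parser, and `partition_col` is a hypothetical column name, so treat it only as an illustration of the anti-patterns named above:

```python
import re


def audit_sql(sql, partition_col="event_date"):
    """Flag SELECT * and function-wrapped partition columns (toy lint sketch)."""
    findings = []
    if re.search(r"SELECT\s+\*", sql, re.IGNORECASE):
        findings.append("SELECT * found: project only needed columns")
    # A DATE(...) or CAST(...) wrapper around the partition column defeats pruning.
    if re.search(rf"\b(DATE|CAST)\s*\(\s*[^)]*{partition_col}", sql, re.IGNORECASE):
        findings.append(f"function wraps {partition_col}: prevents partition pruning")
    return findings
```

A query passing both checks filters directly on the raw partition column and projects explicit columns.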
---

## E) SAFE WRITES AND IDEMPOTENCY

Identify every write operation. Flag plain `INSERT`/append with no dedup logic.

Each write should use one of:
1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)
2. Write to a staging table scoped to the run, then swap or merge into final
3. Append-only with a dedupe view: `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1`

Also check:
- Will a re-run create duplicate rows?
- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional and documented?
- Is `run_id` being used as part of the merge or dedupe key? If so, flag it. `run_id` should be stored as a metadata column, not as part of the uniqueness key, unless you explicitly want multi-run history.

State the recommended approach and the exact dedup key for this codebase.
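The MERGE-on-deterministic-key semantics can be modeled in pure Python to show why re-runs stay idempotent. This is an assumption-laden sketch (the key columns come from the example above; `table` as a dict stands in for the target BigQuery table):

```python
def merge_rows(table, rows, key_cols=("entity_id", "date", "model_version")):
    """Model MERGE on a deterministic key: re-running the same batch
    overwrites in place instead of duplicating rows."""
    for row in rows:
        # run_id is deliberately excluded from the key; it stays metadata.
        key = tuple(row[c] for c in key_cols)
        table[key] = row  # insert if new, update if the key already exists
    return table
```

If `run_id` were part of `key_cols`, every re-run would mint a new key and the same logical row would be stored twice.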
---

## F) OBSERVABILITY: Can you debug a failure?

Verify:
- Failures raise exceptions and abort, with no silent `except: pass` or warn-only handlers
- Each BQ job logs: job ID, bytes processed or billed when available, slot milliseconds, and duration
- A run summary is logged or written at the end containing: `run_id, env, mode, date_range, tables written, total BQ jobs, total bytes`
- `run_id` is present and consistent across all log lines

If `run_id` is missing, propose a one-line fix:
`run_id = run_id or datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%S')`
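The end-of-run summary can be emitted as a single structured log line. A minimal sketch, assuming the field set listed above (function and logger names are illustrative):

```python
import json
import logging


def log_run_summary(run_id, env, mode, date_range, tables, jobs, total_bytes):
    """Emit the run summary as one JSON log line and return it for callers."""
    summary = {
        "run_id": run_id,
        "env": env,
        "mode": mode,
        "date_range": date_range,
        "tables_written": tables,
        "total_bq_jobs": jobs,
        "total_bytes": total_bytes,
    }
    logging.getLogger("pipeline").info("run_summary %s", json.dumps(summary))
    return summary
```

One JSON line keyed by `run_id` makes the summary greppable and joinable against per-job log lines.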
---

## Final

**1. PASS / FAIL** with specific reasons per section (A to F).
**2. Patch list** ordered by risk, referencing exact functions to change.
**3. If FAIL: Top 3 cost risks** with a rough worst-case estimate (e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").
