From 03a5f86605aa1befd4c8b8d2594f90b8f7ee4b0b Mon Sep 17 00:00:00 2001 From: yiouli Date: Fri, 17 Apr 2026 15:24:30 -0700 Subject: [PATCH 1/3] update eval-driven-dev skill --- docs/README.skills.md | 2 +- skills/eval-driven-dev/SKILL.md | 120 +++++-- .../references/1-a-project-analysis.md | 102 ++++++ ...{1-a-entry-point.md => 1-b-entry-point.md} | 6 +- .../references/1-b-eval-criteria.md | 82 ----- .../references/1-c-eval-criteria.md | 128 +++++++ .../references/2-wrap-and-trace.md | 260 -------------- .../references/2a-instrumentation.md | 134 +++++++ .../references/2b-implement-runnable.md | 145 ++++++++ .../references/2c-capture-and-verify-trace.md | 118 +++++++ .../references/3-define-evaluators.md | 116 +++--- .../references/4-build-dataset.md | 194 ++++++++-- .../eval-driven-dev/references/5-run-tests.md | 60 ++-- .../references/6-analyze-outcomes.md | 332 ++++++++++++++++++ .../references/6-investigate.md | 164 --------- .../eval-driven-dev/references/evaluators.md | 70 +++- .../references/runnable-examples/cli-app.md | 64 ++++ .../runnable-examples/fastapi-web-server.md | 126 +++++++ .../runnable-examples/standalone-function.md | 60 ++++ .../eval-driven-dev/references/testing-api.md | 25 +- skills/eval-driven-dev/references/wrap-api.md | 19 +- skills/eval-driven-dev/resources/setup.sh | 79 ++++- 22 files changed, 1711 insertions(+), 695 deletions(-) create mode 100644 skills/eval-driven-dev/references/1-a-project-analysis.md rename skills/eval-driven-dev/references/{1-a-entry-point.md => 1-b-entry-point.md} (85%) delete mode 100644 skills/eval-driven-dev/references/1-b-eval-criteria.md create mode 100644 skills/eval-driven-dev/references/1-c-eval-criteria.md delete mode 100644 skills/eval-driven-dev/references/2-wrap-and-trace.md create mode 100644 skills/eval-driven-dev/references/2a-instrumentation.md create mode 100644 skills/eval-driven-dev/references/2b-implement-runnable.md create mode 100644 
skills/eval-driven-dev/references/2c-capture-and-verify-trace.md create mode 100644 skills/eval-driven-dev/references/6-analyze-outcomes.md delete mode 100644 skills/eval-driven-dev/references/6-investigate.md create mode 100644 skills/eval-driven-dev/references/runnable-examples/cli-app.md create mode 100644 skills/eval-driven-dev/references/runnable-examples/fastapi-web-server.md create mode 100644 skills/eval-driven-dev/references/runnable-examples/standalone-function.md diff --git a/docs/README.skills.md b/docs/README.skills.md index 6041106cc..5c81c4b56 100644 --- a/docs/README.skills.md +++ b/docs/README.skills.md @@ -134,7 +134,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to | [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None | | [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None | | [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None | -| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`
`references/1-b-eval-criteria.md`
`references/2-wrap-and-trace.md`
`references/3-define-evaluators.md`
`references/4-build-dataset.md`
`references/5-run-tests.md`
`references/6-investigate.md`
`references/evaluators.md`
`references/testing-api.md`
`references/wrap-api.md`
`resources` | +| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Improve AI applications with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate application runs, analyze results, and produce a concrete action plan for improvements. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-project-analysis.md`
`references/1-b-entry-point.md`
`references/1-c-eval-criteria.md`
`references/2a-instrumentation.md`
`references/2b-implement-runnable.md`
`references/2c-capture-and-verify-trace.md`
`references/3-define-evaluators.md`
`references/4-build-dataset.md`
`references/5-run-tests.md`
`references/6-analyze-outcomes.md`
`references/evaluators.md`
`references/runnable-examples`
`references/testing-api.md`
`references/wrap-api.md`
`resources` | | [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`
`references/excalidraw-schema.md`
`scripts/.gitignore`
`scripts/README.md`
`scripts/add-arrow.py`
`scripts/add-icon-to-diagram.py`
`scripts/split-excalidraw-library.py`
`templates` | | [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`
`references/pyspark.md` | | [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None | diff --git a/skills/eval-driven-dev/SKILL.md b/skills/eval-driven-dev/SKILL.md index 71da823ca..3dc40fbbc 100644 --- a/skills/eval-driven-dev/SKILL.md +++ b/skills/eval-driven-dev/SKILL.md @@ -1,26 +1,27 @@ --- name: eval-driven-dev description: > - Set up eval-based QA for Python LLM applications: instrument the app, - build golden datasets, write and run eval tests, and iterate on failures. + Improve AI applications with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate application runs, analyze results, and produce a concrete action plan for improvements. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. license: MIT -compatibility: Python 3.11+ +compatibility: Python 3.10+ metadata: - version: 0.6.1 - pixie-qa-version: ">=0.6.1,<0.7.0" + version: 0.8.1 + pixie-qa-version: ">=0.8.1,<0.9.0" pixie-qa-source: https://github.com/yiouli/pixie-qa/ --- # Eval-Driven Development for Python LLM Applications -You're building an **automated QA pipeline** that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`. +You're building an **automated evaluation pipeline** that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`. 
**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM. During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path. +**Rule: The app's LLM calls must go to a real LLM.** Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable. + **The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset. This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline. @@ -29,8 +30,16 @@ This skill is about doing the work, not describing it. Read code, edit files, ru ## Before you start -**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, then run the setup.sh included in the skill's resources. 
-The script updates the `eval-driven-dev` skill and `pixie-qa` python package to the latest version, initializes the pixie working directory if it's not already initialized, and starts a web server in the background to show user updates. If the skill or package update fails, continue — do not let these failures block the rest of the workflow. +**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the setup.sh included in the skill's resources. +The script updates the `eval-driven-dev` skill and `pixie-qa` python package to the latest version, initializes the pixie working directory if it's not already initialized, and starts a web server in the background to show user updates. + +**Setup error handling — what you can skip vs. what must succeed:** + +- **Skill update fails** → OK to continue. The existing skill version is sufficient. +- **pixie-qa upgrade fails but was already installed** → OK to continue with the existing version. +- **pixie-qa is NOT installed and installation fails** → **STOP.** Ask the user for help. The workflow cannot proceed without the `pixie` package. +- **`pixie init` fails** → **STOP.** Ask the user for help. +- **`pixie start` (web server) fails** → **STOP.** Ask the user for help. Check `server.log` in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results. --- @@ -45,6 +54,16 @@ Follow Steps 1–6 straight through without stopping. Do not ask the user for co - **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything. - **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. 
Don't plan future steps while verifying the current one. +**When to stop and ask for help:** + +Some blockers cannot and should not be worked around. When you encounter any of the following, **stop immediately and ask the user for help** — do not attempt workarounds: + +- **Application won't run due to missing environment variables or configuration**: The app requires environment variables or configuration that are not set and cannot be inferred. Do NOT work around this by mocking, faking, or replacing application components — the eval must exercise real production code. Ask the user to fix the environment setup. +- **App import failures that indicate a broken project**: If the app's core modules cannot be imported due to missing system dependencies or incompatible Python versions (not just missing pip packages you can install), ask the user to fix the project setup. +- **Ambiguous entry point**: If the app has multiple equally plausible entry points and the project analysis doesn't clarify which one matters most, ask the user which to target. + +Blockers you SHOULD resolve yourself (do not ask): missing Python packages (install them), missing `pixie` package (install it), port conflicts (pick a different port), file permission issues (fix them). + **Run Steps 1–6 in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1. --- @@ -59,33 +78,61 @@ Follow Steps 1–6 straight through without stopping. Do not ask the user for co If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding. -Step 1 has two sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.** +Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. 
**Complete each sub-step fully before starting the next.** -#### Sub-step 1a: Entry point & execution flow +#### Sub-step 1a: Project analysis -> **Reference**: Read `references/1-a-entry-point.md` now. +> **Reference**: Read `references/1-a-project-analysis.md` now. -Read the source code to understand how the app starts and how a real user invokes it. Write your findings to `pixie_qa/01-entry-point.md` before moving on. +Before looking at code structure or entry points, understand what this software does in the real world — its purpose, its users, the complexity of real inputs, and where it fails. This understanding drives every downstream decision: which entry points matter most, what eval criteria to define, what trace inputs to use, and what dataset entries to create. Write both the detailed and summary versions of your findings before moving on. **Note**: the project may contain `tests/`, `fixtures/`, `examples/`, mock servers, and documentation — these are the project's own development infrastructure, NOT data sources for your eval pipeline. Ignore them when sourcing trace inputs and dataset content. -> **Checkpoint**: `pixie_qa/01-entry-point.md` written with entry point, execution flow, user-facing interface, and env requirements. +> **Checkpoint**: `pixie_qa/00-project-analysis.md` (detailed, with code references and reasoning chains) and `pixie_qa/00-project-analysis-summary.md` (concise human-readable TLDR) written — both covering what the software does, target users, capability inventory (at least 3 capabilities if the project has them), realistic input characteristics, and hard problems / failure modes (at least 2). -#### Sub-step 1b: Eval criteria +#### Sub-step 1b: Entry point & execution flow -> **Reference**: Read `references/1-b-eval-criteria.md` now. +> **Reference**: Read `references/1-b-entry-point.md` now. -Define the app's use cases and eval criteria. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). 
Write your findings to `pixie_qa/02-eval-criteria.md` before moving on. +Read the source code to understand how the app starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize entry points — focus on the entry point(s) that exercise the most valuable capabilities, not just the first one found. Write both the detailed and summary versions before moving on. -> **Checkpoint**: `pixie_qa/02-eval-criteria.md` written with use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet. +> **Checkpoint**: `pixie_qa/01-entry-point.md` (detailed, with code pointers and execution flow traces) and `pixie_qa/01-entry-point-summary.md` (concise human-readable TLDR) written — both covering entry point, execution flow, user-facing interface, and env requirements. + +#### Sub-step 1c: Eval criteria + +> **Reference**: Read `references/1-c-eval-criteria.md` now. + +Define the app's use cases and eval criteria. Derive use cases from the **capability inventory** in `pixie_qa/00-project-analysis.md`. Derive eval criteria from the **hard problems / failure modes** — not generic quality dimensions. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write both the detailed and summary versions before moving on. + +> **Checkpoint**: `pixie_qa/02-eval-criteria.md` (detailed, with failure-mode traceability and observability chains) and `pixie_qa/02-eval-criteria-summary.md` (concise human-readable TLDR) written — both covering use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet. --- -### Step 2: Instrument with `wrap` and capture a reference trace +### Step 2: Instrument, run application, and capture a reference trace + +Step 2 has three sub-steps. Each reads its own reference file. 
**Complete each sub-step before starting the next.** + +#### Sub-step 2a: Instrument with `wrap` -> **Reference**: Read `references/2-wrap-and-trace.md` now for the detailed sub-steps. +> **Reference**: Read `references/2a-instrumentation.md` now. -**Goal**: Make the app testable by controlling its external data and capturing its outputs. `wrap()` calls at data boundaries let the test harness inject controlled inputs (replacing real DB/API calls) and capture outputs for scoring. The `Runnable` class provides the lifecycle interface that `pixie test` uses to set up, invoke, and tear down the app. A reference trace captured with `pixie trace` proves the instrumentation works and provides the exact data shapes needed for dataset creation in Step 4. +Add `wrap()` calls at the app's data boundaries so the eval harness can inject controlled inputs and capture outputs. This makes the app testable without changing its logic. -> **Checkpoint**: `pixie_qa/scripts/run_app.py` written and verified. `pixie_qa/reference-trace.jsonl` exists and all expected data points appear when formatted with `pixie format`. Do NOT read Step 3 instructions yet. +> **Checkpoint**: `wrap()` calls added at all data boundaries. Every eval criterion from `pixie_qa/02-eval-criteria.md` has a corresponding data point. + +#### Sub-step 2b: Implement the Runnable + +> **Reference**: Read `references/2b-implement-runnable.md` now. + +Write a Runnable class that lets the eval harness invoke the app exactly as a real user would. The Runnable should be simple — it just wires up the app's real entry point to the harness interface. If it's getting complicated, something is wrong. + +> **Checkpoint**: `pixie_qa/run_app.py` written. The Runnable calls the app's real entry point with real LLM configuration — no mocking, no faking, no component replacement. + +#### Sub-step 2c: Capture and verify a reference trace + +> **Reference**: Read `references/2c-capture-and-verify-trace.md` now. 
+ +Run the app through the Runnable and capture a trace. The trace proves instrumentation and the Runnable are working correctly, and provides the data shapes needed for dataset creation in Step 4. + +> **Checkpoint**: `pixie_qa/reference-trace.jsonl` exists. All expected `wrap` entries and `llm_span` entries appear. `pixie format` shows all data points needed for evaluation. Do NOT read Step 3 instructions yet. --- @@ -93,9 +140,9 @@ Define the app's use cases and eval criteria. Use cases drive dataset creation ( > **Reference**: Read `references/3-define-evaluators.md` now for the detailed sub-steps. -**Goal**: Turn the qualitative eval criteria from Step 1b into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator or a custom one you implement. The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer. +**Goal**: Turn the qualitative eval criteria from Step 1c into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator, an **agent evaluator** (the default for any semantic or qualitative criterion), or a manual custom function (only for mechanical/deterministic checks like regex or field existence). The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer. Select evaluators that measure the **hard problems** identified in `pixie_qa/00-project-analysis.md` — not just generic quality dimensions. -> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` written with criterion-to-evaluator mapping. Do NOT read Step 4 instructions yet. +> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` (detailed, with decision rationale) and `pixie_qa/03-evaluator-mapping-summary.md` (concise human-readable TLDR) written with criterion-to-evaluator mapping. Do NOT read Step 4 instructions yet. 
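For the manual custom function category, a mechanical check really can be a plain function. A minimal sketch of a regex-based "cites a source" check — the `(output) -> dict` signature and the score/passed/reason result shape are illustrative assumptions, not the actual pixie-qa evaluator interface:

```python
import re

# Hypothetical mechanical evaluator: passes when the app's final answer
# contains at least one source URL. The signature and result shape are
# illustrative assumptions, not the pixie-qa API.
def cites_a_source(output: str) -> dict:
    urls = re.findall(r"https?://\S+", output)
    passed = len(urls) >= 1
    return {
        "score": 1.0 if passed else 0.0,
        "passed": passed,
        "reason": f"found {len(urls)} URL(s)" if passed else "no source URLs in output",
    }
```

Checks like this belong in the mechanical bucket precisely because they are deterministic — anything requiring semantic judgment should go to an agent evaluator instead.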
--- @@ -103,32 +150,33 @@ Define the app's use cases and eval criteria. Use cases drive dataset creation ( > **Reference**: Read `references/4-build-dataset.md` now for the detailed sub-steps. -**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names. +**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names. Cover entries from the **capability inventory** in `pixie_qa/00-project-analysis.md` and include entries targeting the **failure modes** identified there. **Do NOT use the project's own test fixtures, mock servers, or example data as dataset `eval_input` content** — source real-world data instead. **Every `wrap(purpose="input")` in the app must have pre-captured content in each entry's `eval_input`** — do NOT leave `eval_input` empty when the app has input wraps. -> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/.json` with diverse entries covering all use cases. Do NOT read Step 5 instructions yet. +> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/.json` with diverse entries covering all use cases. **Dataset realism audit passed** — entries use real-world data at representative scale, no project test fixtures contamination, at least one entry targets a failure mode with uncertain outcome, and every `eval_input` has captured content for all input wraps. Do NOT read Step 5 instructions yet. 
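To make the `eval_input` requirement concrete, here is a sketch of a single dataset entry — the field names and nesting below are assumptions for illustration, not the actual pixie-qa dataset schema; only the rule that every input wrap gets pre-captured content comes from the text above:

```python
import json

# Illustrative dataset entry (field names are assumptions, not the
# real pixie-qa schema). Every wrap(purpose="input") in the app gets
# a pre-captured payload under eval_input.
entry = {
    "name": "billing-question-with-plan-lookup",
    "input": "Why did my bill go up this month?",  # sent to the app's real entry point
    "eval_input": {
        # one pre-captured payload per input wrap in the app
        "crm.get_plan": {"plan": "Pro", "monthly_usd": 49, "promo_expired": True},
    },
    "evaluators": ["grounded-in-plan-details", "concise-for-support-chat"],
}

print(json.dumps(entry, indent=2))
```

The reference trace from Step 2 — not guesswork — should dictate the shapes inside `eval_input`, so field names match what the app actually reads.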
--- -### Step 5: Run evaluation-based tests +### Step 5: Run `pixie test` and fix mechanical issues > **Reference**: Read `references/5-run-tests.md` now for the detailed sub-steps. -**Goal**: Execute the full pipeline end-to-end and verify it produces real scores. This step is about getting the machinery running — fixing any setup or data issues until every dataset entry runs and gets scored. Once tests produce results, run `pixie analyze` for pattern analysis. +**Goal**: Execute the full pipeline end-to-end and get it running without mechanical errors. This step is strictly about fixing setup and data issues in the pixie QA components (dataset, runnable, custom evaluators) — NOT about fixing the application itself or evaluating result quality. Once `pixie test` completes without errors and produces real evaluator scores for every entry, this step is done. -> **Checkpoint**: Tests run and produce real scores. Analysis generated. -> -> If the test errors out, that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable. -> -> **STOP GATE — read this before doing anything else after tests produce scores:** +> **Checkpoint**: `pixie test` runs to completion. Every dataset entry has evaluator scores (real `EvaluationResult` or `PendingEvaluation`). No setup errors, no import failures, no data validation errors. > -> - If the user's original prompt asks only for setup ("set up QA", "add tests", "add evals", "set up evaluations"), **STOP HERE**. Report the test results to the user: "QA setup is complete. Tests show N/M passing. [brief summary]. Want me to investigate the failures and iterate?" Do NOT proceed to Step 6. -> - If the user's original prompt explicitly asks for iteration ("fix", "improve", "debug", "iterate", "investigate failures", "make tests pass"), proceed to Step 6. +> If the test errors out, that's a mechanical bug in your QA components — fix and re-run. But once tests produce scores, move on. 
Do NOT assess result quality here — that's Step 6. + +**Always proceed to Step 6 after tests produce scores.** Analysis is the essential final step — without it, pending evaluations are never completed and the user gets uninterpreted raw scores with no actionable insights. Do NOT stop here and ask the user whether to continue. --- -### Step 6: Investigate and iterate +### Step 6: Analyze outcomes + +> **Reference**: Read `references/6-analyze-outcomes.md` now — it has the complete three-phase analysis process, writing guidelines, and output format requirements. + +**Goal**: Analyze `pixie test` results in a structured, data-driven process to produce actionable insights on test case quality, evaluator quality, and application quality. This step completes pending evaluations, writes per-entry and per-dataset analysis, and produces a prioritized action plan. Every statement must be backed by concrete data from the evaluation run — no speculation, no hand-waving. -> **Reference**: Read `references/6-investigate.md` now — it has the stop/continue decision, analysis review, root-cause patterns, and investigation procedures. **Follow its instructions before doing any investigation work.** +**Dual-variant output**: Every analysis artifact in this step is produced as two files — a **detailed version** (for agent consumption: data points, evidence trails, reasoning chains) and a **summary version** (for human review: concise TLDR readable in under 2 minutes). Always write the detailed version first, then derive the summary from it. The summary must be a strict subset of the detailed version's content — it should never contain claims not supported in the detailed version. 
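One way to keep every statement backed by concrete data is to compute mechanical aggregates before interpreting anything. A self-contained sketch of a per-evaluator pass-rate rollup — the per-result record shape here is an assumed example, not the real `pixie test` output format:

```python
from collections import defaultdict

# Roll raw per-entry results up into per-evaluator pass rates so the
# analysis starts from numbers, not impressions. The record shape
# (entry/evaluator/passed) is an illustrative assumption.
def pass_rates(results: list[dict]) -> dict[str, float]:
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # evaluator -> [passed, total]
    for r in results:
        tally[r["evaluator"]][0] += int(r["passed"])
        tally[r["evaluator"]][1] += 1
    return {name: passed / total for name, (passed, total) in tally.items()}

results = [
    {"entry": "e1", "evaluator": "grounded", "passed": True},
    {"entry": "e2", "evaluator": "grounded", "passed": False},
    {"entry": "e1", "evaluator": "concise", "passed": True},
]
print(pass_rates(results))  # grounded: 0.5, concise: 1.0
```

Aggregates like these belong in the detailed variant; the summary then quotes only the headline numbers.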
--- diff --git a/skills/eval-driven-dev/references/1-a-project-analysis.md b/skills/eval-driven-dev/references/1-a-project-analysis.md new file mode 100644 index 000000000..92a7d68f7 --- /dev/null +++ b/skills/eval-driven-dev/references/1-a-project-analysis.md @@ -0,0 +1,102 @@ +# Step 1a: Project Analysis + +Before looking at code structure, entry points, or writing any instrumentation, understand what this software does in the real world. This analysis is the foundation for every subsequent step — it determines which entry points to prioritize, what eval criteria to define, what trace inputs to use, and what dataset entries to build. + +--- + +## What to investigate + +Read the project's README, documentation, and top-level source files. You're looking for answers to five questions: + +### 1. What does this software do? + +Write a one-paragraph plain-language summary. What problem does it solve? What does a successful run look like? + +### 2. Who uses it and why? + +Who are the target users? What's the primary use case? What problem does this solve that alternatives don't? This helps you understand what "quality" means for this app — a chatbot that chats with customers has different quality requirements than a research agent that synthesizes multi-source reports. + +### 3. Capability inventory + +List the distinct capabilities, modes, or features the app offers. Be specific. For example: + +- For a scraping library: single-page scraping, multi-page scraping, search-based scraping, speech output, script generation +- For a voice agent: greeting, FAQ handling, account lookup, transfer to human, call summarization +- For a research agent: topic research, multi-source synthesis, citation generation, report formatting + +Each capability may need its own entry point, its own trace, and its own dataset entries. This list directly feeds Step 1c (use cases) and Step 4 (dataset diversity). + +### 4. What are realistic inputs? 
+ +Characterize the real-world inputs the app processes — not toy examples: + +- For a web scraper: "messy HTML pages with navigation, ads, dynamic content, tables, nested structures — typically 5KB-500KB of HTML" +- For a research agent: "open-ended research questions requiring multi-source synthesis, with 3-10 sub-questions" +- For a voice agent: "multi-turn conversations with background noise, interruptions, and ambiguous requests" + +Be specific about **scale** (how large), **complexity** (how messy/diverse), and **variety** (what kinds). This directly feeds trace input selection (Step 2) — if you don't characterize realistic inputs here, you'll end up using toy inputs that bypass the app's real logic. + +**This section is an operational constraint, not just documentation.** Steps 2c (trace input) and 4c (dataset entries) will cross-reference these characteristics to verify that trace inputs and dataset entries match real-world scale and complexity. Be concrete and quantitative — write "5KB–500KB HTML pages," not "various HTML pages." + +### 5. What are the hard problems / failure modes? + +What makes this app's job difficult? Where does it fail in practice? These become the most valuable eval scenarios: + +- For a scraper: "malformed HTML, dynamic JS-rendered content, complex nested schemas, very large pages that exceed context windows" +- For a research agent: "conflicting sources, questions requiring multi-step reasoning, hallucinating citations" +- For a voice agent: "ambiguous caller intent, account lookup failures, simultaneous tool calls" + +Each failure mode should map to at least one eval criterion (Step 1c) and at least one dataset entry (Step 4). + +--- + +## Output: `pixie_qa/00-project-analysis.md` + +Write your findings to this file. **Complete all five sections before moving to sub-step 1b.** This document is referenced by every subsequent step. 
+ +### Template + +```markdown +# Project Analysis + +## What this software does + +<one-paragraph plain-language summary> + +## Target users and value proposition + +<who uses it, why, and what "quality" means for them> + +## Capability inventory + +1. <capability>: <what it does> +2. <capability>: <what it does> +3. ... + +## Realistic input characteristics + +<scale, complexity, and variety of real-world inputs, with concrete numbers> + +## Hard problems and failure modes + +1. <failure mode>: <why it is hard> +2. <failure mode>: <why it is hard> +3. ... +``` + +### Quality check + +Before moving on, verify: + +- The "What this software does" section describes the app's purpose in terms a non-technical user would understand — not just "it runs a graph" or "it calls OpenAI" +- The capability inventory lists at least 3 capabilities (if the project has them) — if you only found 1, you may have only looked at one part of the codebase +- The realistic input characteristics describe real-world scale and complexity, not the simplest possible input +- The failure modes are specific to this app's domain, not generic ("bad input" is not a failure mode; "malformed HTML with unclosed tags that breaks the parser" is) + +### What to ignore in the project + +The project may contain directories and files that are part of its own development/test infrastructure — `tests/`, `fixtures/`, `examples/`, `mock_server/`, `docs/`, demo scripts, etc. These exist for the project's developers, not for your eval pipeline. + +**Critical**: Do NOT use the project's test fixtures, mock servers, example data, or unit test infrastructure as inputs for your eval traces or dataset entries. They are designed for development speed and isolation — small, clean, deterministic data that bypasses every real-world difficulty. Using them produces trivially easy evaluations that cannot catch real quality issues. + +When you encounter these directories during analysis, note their existence but treat them as implementation details of the project — not as data sources for your QA pipeline. Your QA pipeline must test the app against real-world conditions, not against the project's own test shortcuts. 
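The "complete all five sections" checkpoint can be verified mechanically before moving to sub-step 1b. A small illustrative helper (not part of pixie-qa) that checks the template headings are present in the written analysis:

```python
# Mechanical completeness check for pixie_qa/00-project-analysis.md:
# all five template headings must be present. Illustrative helper only,
# not part of pixie-qa.
REQUIRED_SECTIONS = [
    "## What this software does",
    "## Target users and value proposition",
    "## Capability inventory",
    "## Realistic input characteristics",
    "## Hard problems and failure modes",
]

def missing_sections(markdown_text: str) -> list[str]:
    return [h for h in REQUIRED_SECTIONS if h not in markdown_text]

draft = "# Project Analysis\n## What this software does\n...\n## Capability inventory\n"
print(missing_sections(draft))  # three headings still missing
```

A check like this catches an incomplete analysis early, before downstream steps start consuming the document.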
diff --git a/skills/eval-driven-dev/references/1-a-entry-point.md b/skills/eval-driven-dev/references/1-b-entry-point.md similarity index 85% rename from skills/eval-driven-dev/references/1-a-entry-point.md rename to skills/eval-driven-dev/references/1-b-entry-point.md index c5576333c..c70adc752 100644 --- a/skills/eval-driven-dev/references/1-a-entry-point.md +++ b/skills/eval-driven-dev/references/1-b-entry-point.md @@ -1,6 +1,6 @@ -# Step 1a: Entry Point & Execution Flow +# Step 1b: Entry Point & Execution Flow -Identify how the application starts and how a real user invokes it. +Identify how the application starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize — focus on the entry point(s) that exercise the most valuable and frequently-used capabilities, not just the first one you find. --- @@ -27,7 +27,7 @@ How does a real user or client invoke the app? This is what the eval must exerci ### 3. Environment and configuration -- What env vars does the app require? (API keys, database URLs, feature flags) +- What env vars does the app require? (service endpoints, database URLs, feature flags) - What config files does it read? - What has sensible defaults vs. what must be explicitly set? diff --git a/skills/eval-driven-dev/references/1-b-eval-criteria.md b/skills/eval-driven-dev/references/1-b-eval-criteria.md deleted file mode 100644 index 0550c5681..000000000 --- a/skills/eval-driven-dev/references/1-b-eval-criteria.md +++ /dev/null @@ -1,82 +0,0 @@ -# Step 1b: Eval Criteria - -Define what quality dimensions matter for this app — based on the entry point (`01-entry-point.md`) you've already documented. - -This document serves two purposes: - -1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset. -2. 
**Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them. - -Keep this concise — it's a planning artifact, not a comprehensive spec. - ---- - -## What to define - -### 1. Use cases - -List the distinct scenarios the app handles. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria. - -**Good use case descriptions:** - -- "Reroute to human agent on account lookup difficulties" -- "Answer billing question using customer's plan details from CRM" -- "Decline to answer questions outside the support domain" -- "Summarize research findings including all queried sub-topics" - -**Bad use case descriptions (too vague):** - -- "Handle billing questions" -- "Edge case" -- "Error handling" - -### 2. Eval criteria - -Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3. - -**Good criteria are specific to the app's purpose.** Examples: - -- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?" -- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?" -- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?" - -**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app. - -At this stage, don't pick evaluator classes or thresholds. That comes in Step 3. - -### 3. 
Check criteria applicability and observability - -For each criterion: - -1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because: - - **Universal criteria** → become dataset-level default evaluators - - **Case-specific criteria** → become item-level evaluators on relevant rows only - -2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2. - - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")` - - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")` - - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")` - ---- - -## Output: `pixie_qa/02-eval-criteria.md` - -Write your findings to this file. **Keep it short** — the template below is the maximum length. - -### Template - -```markdown -# Eval Criteria - -## Use cases - -1. : -2. ... - -## Eval criteria - -| # | Criterion | Applies to | Data to capture | -| --- | --------- | ------------- | --------------- | -| 1 | ... | All | wrap name: ... | -| 2 | ... | Use case 1, 3 | wrap name: ... | -``` diff --git a/skills/eval-driven-dev/references/1-c-eval-criteria.md b/skills/eval-driven-dev/references/1-c-eval-criteria.md new file mode 100644 index 000000000..9f932f175 --- /dev/null +++ b/skills/eval-driven-dev/references/1-c-eval-criteria.md @@ -0,0 +1,128 @@ +# Step 1c: Eval Criteria + +Define what quality dimensions matter for this app — based on the project analysis (`00-project-analysis.md`) and the entry point (`01-entry-point.md`) you've already documented. 
+ +This document serves two purposes: + +1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset. +2. **Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them. + +**Derive use cases from the capability inventory** in `pixie_qa/00-project-analysis.md`. **Derive eval criteria from the hard problems / failure modes** — not generic quality dimensions like "factuality" or "relevance". + +Keep this concise — it's a planning artifact, not a comprehensive spec. + +--- + +## What to define + +### 1. Use cases + +List the distinct scenarios the app handles. Derive these from the **capability inventory** in `pixie_qa/00-project-analysis.md` — each capability should map to at least one use case. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria. + +When possible, indicate the **expected difficulty level** for each use case — e.g., "routine" for straightforward cases, "challenging" for edge cases or failure-mode scenarios. This guides dataset creation (Step 4) to include entries across a range of difficulty levels rather than clustering at easy cases. + +**Good use case descriptions:** + +- "Reroute to human agent on account lookup difficulties" +- "Answer billing question using customer's plan details from CRM" +- "Decline to answer questions outside the support domain" +- "Summarize research findings including all queried sub-topics" + +**Bad use case descriptions (too vague):** + +- "Handle billing questions" +- "Edge case" +- "Error handling" + +### 2. 
Eval criteria + +Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3. + +**Good criteria are specific to the app's purpose** and derived from the **hard problems / failure modes** in `pixie_qa/00-project-analysis.md`. Examples: + +- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?" +- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?" +- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?" +- Web scraper: "Does the extracted data match the requested schema fields?", "Does it handle malformed HTML without crashing or losing data?" + +**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app. If your criteria could apply to any chatbot (e.g., "Groundedness", "PromptRelevance"), they're too generic — go back to the failure modes in `00-project-analysis.md` and derive criteria from those. + +At this stage, don't pick evaluator classes or thresholds. That comes in Step 3. + +### 3. Check criteria applicability and observability + +For each criterion: + +1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because: + - **Universal criteria** → become dataset-level default evaluators + - **Case-specific criteria** → become item-level evaluators on relevant rows only + +2. 
**Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2. + - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")` + - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")` + - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")` + +--- + +## Projects with multiple capabilities + +If the project analysis (`pixie_qa/00-project-analysis.md`) lists multiple capabilities, you should evaluate at minimum the **2-3 most important / commonly used** capabilities. Don't limit the dataset to a single capability when the project's value comes from breadth. + +For each additional capability beyond the first: + +- Add use cases in `02-eval-criteria.md` +- Plan for a separate trace (run `pixie trace` with different entry points / configs) in Step 2 +- Plan dataset entries covering that capability in Step 4 + +If time or context constraints make it impractical to cover all capabilities, **document which ones you covered and which you skipped** (with rationale) at the end of `02-eval-criteria.md`. + +--- + +## Criteria quality gate (mandatory self-check) + +Before writing `02-eval-criteria.md`, run this check on every criterion: + +> **For each criterion, ask: "If the app returned a structurally correct but semantically wrong or hallucinated answer, would this criterion catch it?"** + +- If the answer is "no" for ALL criteria, your criteria set is **structural-only** — it checks plumbing (fields exist, data flowed through) but not quality (content is correct, complete, non-hallucinated). **You must add at least one semantic criterion** that evaluates the _content_ of the app's output, not just its shape. +- Structural criteria (field existence, JSON validity, format checks) are useful but insufficient. 
They pass even when the app returns fabricated or incorrect data. + +**Examples of structural vs semantic criteria:** + +| Structural (checks shape) | Semantic (checks quality) | +| ------------------------------------------- | -------------------------------------------------------------------------- | +| "Required fields are present in the output" | "Extracted values match the source content — no hallucinated data" | +| "Source type matches expected type" | "The app correctly interpreted noisy input without losing key facts" | +| "Output is valid JSON" | "The summary accurately captures the main points of the document" | +| "Response contains at least N characters" | "The response addresses the user's specific question, not a generic topic" | + +A good criteria set has **both** structural and semantic criteria. Structural criteria catch gross failures (app crashed, returned empty output). Semantic criteria catch quality failures (app ran but returned wrong/hallucinated/incomplete content). + +--- + +## Output: `pixie_qa/02-eval-criteria.md` + +Write your findings to this file. **Keep it short** — the template below is the maximum length. + +### Template + +```markdown +# Eval Criteria + +## Use cases + +1. : +2. ... + +## Eval criteria + +| # | Criterion | Applies to | Data to capture | +| --- | --------- | ------------- | --------------- | +| 1 | ... | All | wrap name: ... | +| 2 | ... | Use case 1, 3 | wrap name: ... | + +## Capability coverage + +Capabilities covered: +Capabilities skipped (with rationale): +``` diff --git a/skills/eval-driven-dev/references/2-wrap-and-trace.md b/skills/eval-driven-dev/references/2-wrap-and-trace.md deleted file mode 100644 index 3efeb3abb..000000000 --- a/skills/eval-driven-dev/references/2-wrap-and-trace.md +++ /dev/null @@ -1,260 +0,0 @@ -# Step 2: Instrument with `wrap` and capture a reference trace - -> For the full `wrap()` API, the `Runnable` class, and CLI commands, see `wrap-api.md`. 
- -**Why this step**: You need to see the actual data flowing through the app before you can build anything. This step adds `wrap()` calls to mark data boundaries, implements a `Runnable` class, captures a reference trace with `pixie trace`, and verifies all eval criteria can be evaluated. - -This step consolidates three things: (1) data-flow analysis, (2) instrumentation, and (3) writing the runnable. - ---- - -## 2a. Data-flow analysis and `wrap` instrumentation - -Starting from LLM call sites, trace backwards and forwards through the code to find: - -- **Entry input**: what the user sends in (via the entry point) -- **Dependency input**: data from external systems (databases, APIs, caches) -- **App output**: data going out to users or external systems -- **Intermediate state**: internal decisions relevant to evaluation (routing, tool calls) - -For each data point found, **immediately add a `wrap()` call** in the application code: - -```python -import pixie - -# External dependency data — value form (result of a DB/API call) -profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile", - description="Customer profile fetched from database") - -# External dependency data — function form (for lazy evaluation / avoiding the call) -history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history", - description="Conversation history from Redis")(session_id) - -# App output — what the user receives -response = pixie.wrap(response_text, purpose="output", name="response", - description="The assistant's response to the user") - -# Intermediate state — internal decision relevant to evaluation -selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision", - description="Which agent was selected to handle this request") -``` - -### Rules for wrapping - -1. **Wrap at the data boundary** — where data enters or exits the application, not deep inside utility functions -2. 
**Names must be unique** across the entire application (they are used as registry keys and dataset field names) -3. **Use `lower_snake_case`** for names -4. **Don't wrap LLM call arguments or responses** — those are already captured by OpenInference auto-instrumentation -5. **Don't change the function's interface** — `wrap()` is purely additive, returns the same type - -### Value vs. function wrapping - -```python -# Value form: wrap a data value (result already computed) -profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile") - -# Function form: wrap the callable itself — in eval mode the original function -# is NOT called; the registry value is returned instead. -profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id) -``` - -Use function form when you want to prevent the external call from happening in eval mode (e.g., the call is expensive, has side-effects, or you simply want a clean injection point). In tracing mode, the function is called normally and the result is logged. - -### Coverage check - -After adding `wrap()` calls, go through each eval criterion from `pixie_qa/02-eval-criteria.md` and verify that every required data point has a corresponding wrap call. If a criterion needs data that isn't captured, add the wrap now — don't defer. - -## 2b. Implement the Runnable class - -The `Runnable` class replaces the plain function from older versions of the skill. It exposes three lifecycle methods: - -- **`setup()`** — async, called once before any `run()` call; initialize shared resources here (e.g., an async HTTP client, a DB connection, pre-loaded configuration). Optional — has a default no-op. -- **`run(args)`** — async, called **concurrently** for each dataset entry (up to 4 in parallel); invoke the app's real entry point with `args` (a validated Pydantic model built from `entry_kwargs`). **Must be concurrency-safe** — see below. 
-- **`teardown()`** — async, called once after all `run()` calls; clean up resources. Optional — has a default no-op. - -**Import resolution**: The project root is automatically added to `sys.path` when your runnable is loaded, so you can use normal `import` statements (e.g., `from app import service`) — no `sys.path` manipulation needed. - -Place the class in `pixie_qa/scripts/run_app.py`: - -```python -# pixie_qa/scripts/run_app.py -from __future__ import annotations -from pydantic import BaseModel -import pixie - - -class AppArgs(BaseModel): - user_message: str - - -class AppRunnable(pixie.Runnable[AppArgs]): - """Runnable that drives the application for tracing and evaluation. - - wrap(purpose="input") calls in the app inject dependency data from the - test registry automatically. wrap(purpose="output"/"state") calls - capture data for evaluation. No manual mocking needed. - """ - - @classmethod - def create(cls) -> AppRunnable: - return cls() - - async def run(self, args: AppArgs) -> None: - from myapp import handle_request - await handle_request(args.user_message) -``` - -**For web servers**, initialize an async HTTP client in `setup()` and use it in `run()`: - -```python -import httpx -from pydantic import BaseModel -import pixie - - -class AppArgs(BaseModel): - user_message: str - - -class AppRunnable(pixie.Runnable[AppArgs]): - _client: httpx.AsyncClient - - @classmethod - def create(cls) -> AppRunnable: - return cls() - - async def setup(self) -> None: - self._client = httpx.AsyncClient(base_url="http://localhost:8000") - - async def run(self, args: AppArgs) -> None: - await self._client.post("/chat", json={"message": args.user_message}) - - async def teardown(self) -> None: - await self._client.aclose() -``` - -**For FastAPI/Starlette apps** (in-process testing without starting a server), use `httpx.ASGITransport` to run the ASGI app directly. 
This is faster and avoids port management: - -```python -import asyncio -import httpx -from pydantic import BaseModel -import pixie - - -class AppArgs(BaseModel): - user_message: str - - -class AppRunnable(pixie.Runnable[AppArgs]): - _client: httpx.AsyncClient - _sem: asyncio.Semaphore - - @classmethod - def create(cls) -> AppRunnable: - inst = cls() - inst._sem = asyncio.Semaphore(1) # serialise if app uses shared mutable state - return inst - - async def setup(self) -> None: - from myapp.main import app # your FastAPI/Starlette app instance - - # ASGITransport runs the app in-process — no server needed - transport = httpx.ASGITransport(app=app) - self._client = httpx.AsyncClient(transport=transport, base_url="http://test") - - async def run(self, args: AppArgs) -> None: - async with self._sem: - await self._client.post("/chat", json={"message": args.user_message}) - - async def teardown(self) -> None: - await self._client.aclose() -``` - -Choose the right pattern: - -- **Direct function call**: when the app exposes a simple async function (no web framework) -- **`httpx.AsyncClient` with `base_url`**: when you need to test against a running HTTP server -- **`httpx.ASGITransport`**: when the app is FastAPI/Starlette — fastest, no server needed, most reliable for eval - -**Rules**: - -- The `run()` method receives a Pydantic model whose fields are populated from the dataset's `entry_kwargs`. Define a `BaseModel` subclass with the fields your app needs. -- All lifecycle methods (`setup`, `run`, `teardown`) are **async**. -- `run()` must call the app through its real entry point — never bypass request handling. -- Place the file at `pixie_qa/scripts/run_app.py` — name the class `AppRunnable` (or anything descriptive). -- The dataset's `"runnable"` field references the class: `"pixie_qa/scripts/run_app.py:AppRunnable"`. - -**Concurrency**: `run()` is called concurrently for multiple dataset entries (up to 4 in parallel). 
If the app uses shared mutable state — SQLite, file-based DBs, global caches — you must synchronise access: - -```python -import asyncio - -class AppRunnable(pixie.Runnable[AppArgs]): - _sem: asyncio.Semaphore - - @classmethod - def create(cls) -> AppRunnable: - inst = cls() - inst._sem = asyncio.Semaphore(1) # serialise DB access - return inst - - async def run(self, args: AppArgs) -> None: - async with self._sem: - await call_app(args.message) -``` - -Common concurrency pitfalls: - -- **SQLite**: `sqlite3` connections are not safe for concurrent async writes. Use `Semaphore(1)` to serialise, or switch to `aiosqlite` with WAL mode. -- **Global mutable state**: module-level dicts/lists modified in `run()` need a lock. -- **Rate-limited external APIs**: add a semaphore to avoid 429 errors. - -## 2c. Capture the reference trace with `pixie trace` - -Use the `pixie trace` CLI command to run your `Runnable` and capture a trace file. Pass the entry input as a JSON file: - -```bash -# Create a JSON file with entry kwargs -echo '{"user_message": "a realistic sample input"}' > pixie_qa/sample-input.json - -pixie trace --runnable pixie_qa/scripts/run_app.py:AppRunnable \ - --input pixie_qa/sample-input.json \ - --output pixie_qa/reference-trace.jsonl -``` - -The `--input` flag takes a **file path** to a JSON file (not inline JSON). The JSON object keys become the kwargs passed to the Pydantic model. - -The command calls `AppRunnable.create()`, then `setup()`, then `run(args)` once with the given input, then `teardown()`. The resulting trace is written to the output file. 
- -The JSONL trace file will contain one line per `wrap()` event and one line per LLM span: - -```jsonl -{"type": "kwargs", "value": {"user_message": "What are your hours?"}} -{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {...}, ...} -{"type": "llm_span", "request_model": "gpt-4o", "input_messages": [...], ...} -{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are...", ...} -``` - -## 2d. Verify wrap coverage with `pixie format` - -Run `pixie format` on the trace file to see the data in dataset-entry format. This shows you both the data shapes and what a real app output looks like: - -```bash -pixie format --input reference-trace.jsonl --output dataset-sample.json -``` - -The output is a formatted dataset entry template — it contains: - -- `entry_kwargs`: the exact keys/values for the runnable arguments -- `eval_input`: the data for all dependencies (from `wrap(purpose="input")` calls) -- `eval_output`: the **actual app output** captured from the trace (this is the real output — use it to understand what the app produces, not as a dataset `eval_output` field) - -For each eval criterion from `pixie_qa/02-eval-criteria.md`, verify the format output contains the data needed to evaluate it. If a data point is missing, go back and add the `wrap()` call. - ---- - -## Output - -- `pixie_qa/scripts/run_app.py` — the `Runnable` class -- `pixie_qa/reference-trace.jsonl` — the reference trace with all expected wrap events diff --git a/skills/eval-driven-dev/references/2a-instrumentation.md b/skills/eval-driven-dev/references/2a-instrumentation.md new file mode 100644 index 000000000..f8b9389fd --- /dev/null +++ b/skills/eval-driven-dev/references/2a-instrumentation.md @@ -0,0 +1,134 @@ +# Step 2a: Instrument with `wrap` + +> For the full `wrap()` API reference, see `wrap-api.md`. 
+ +**Goal**: Add `wrap()` calls at data boundaries so the eval harness can (1) inject controlled inputs in place of real external dependencies, and (2) capture outputs for scoring. + +--- + +## Data-flow analysis + +Starting from LLM call sites, trace backwards and forwards through the code to find: + +- **Dependency input**: data from external systems (databases, APIs, caches, file systems, network fetches) +- **App output**: data going out to users or external systems +- **Intermediate state**: internal decisions relevant to evaluation (routing, tool calls) + +You do **not** need to wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation. + +## Adding `wrap()` calls + +For each data point found, add a `wrap()` call in the application code: + +```python +import pixie + +# External dependency data — function form (prevents the real call in eval mode) +profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile", + description="Customer profile fetched from database")(user_id) + +# External dependency data — function form (prevents the real call in eval mode) +history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history", + description="Conversation history from Redis")(session_id) + +# App output — what the user receives +response = pixie.wrap(response_text, purpose="output", name="response", + description="The assistant's response to the user") + +# Intermediate state — internal decision relevant to evaluation +selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision", + description="Which agent was selected to handle this request") +``` + +### Value vs. 
function wrapping + +```python +# Value form: wrap a data value (result already computed) +profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile") + +# Function form: wrap the callable — in eval mode the original function is +# NOT called; the registry value is returned instead. +profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id) +``` + +**CRITICAL: Always use function form for `purpose="input"` wraps on external calls** — HTTP requests, database queries, API calls, file reads, cache lookups. Function form prevents the real call from executing in eval mode, so the dataset value is returned directly without making a live network request or database query. Value form still executes the real call first and only replaces the result afterwards — this wastes time, creates flaky tests, and makes evals dependent on external service availability. + +The only case where value form is acceptable for `purpose="input"` is when the wrapped value is a local computation (no I/O, no side effects) that is cheap to recompute. + +### Placement rules + +1. **Wrap at the data boundary** — where data enters or exits the application, not deep inside utility functions. +2. **Names must be unique** across the entire application (used as registry keys and dataset field names). +3. **Use `lower_snake_case`** for names. +4. **Don't change the function's interface** — `wrap()` is purely additive, returns the same type. + +### Placement by purpose + +#### `purpose="input"` — where external data enters + +Place input wraps at the **boundary where external data enters the app**, not at intermediate processing stages. 
In a pipeline architecture (fetch → process → extract → format): + +- **Correct**: `wrap(fetch_page, purpose="input", name="fetched_page")(url)` using **function form** at the HTTP fetch boundary — in eval mode, the fetch is skipped entirely and the dataset value is returned; in trace mode, the real fetch runs and the result is captured. +- **Incorrect**: `wrap(html_content, purpose="input", name="fetched_page")` using value form — the HTTP fetch still runs in eval mode (wasting time and creating flaky tests), and only the result is replaced afterwards. +- **Incorrect**: `wrap(processed_chunks, purpose="input", name="chunks")` after parsing — eval mode bypasses parsing and chunking entirely. + +**Principle**: `wrap(purpose="input")` replaces the _minimum external dependency_ while exercising the _maximum internal logic_. Push the boundary as far upstream as possible. **Always use function form** for input wraps on external calls — this prevents the real call from executing in eval mode. + +#### `purpose="output"` — where processed data exits + +Track **downstream** from the LLM response to find where data leaves the app — sent to the user, written to storage, rendered in UI, or passed to an external system. Wrap at that exit boundary. + +- Don't wrap raw LLM responses — those are already captured by OpenInference auto-instrumentation as `llm_span` entries. +- Wrap the app's **final processed result** — after any post-processing, formatting, or transformation the app applies to the LLM output. +- If the app has multiple output channels (e.g., a response to the user AND a side-effect write to a database), wrap each one separately. 
+ +```python +# Final response after the app's formatting pipeline +response = pixie.wrap(formatted_response, purpose="output", name="response", + description="Final response sent to the user") + +# Side-effect output — data written to external storage +pixie.wrap(saved_record, purpose="output", name="saved_summary", + description="Summary record saved to the database") +``` + +**Principle**: output wraps are observation-only — they capture what the app produced so evaluators can score it. They are never mocked or injected during eval runs. + +#### `purpose="state"` — internal decisions relevant to evaluation + +Some eval criteria need to judge the app's internal reasoning — not just what went in or came out, but _how_ the app made decisions. Wrap internal state when an eval criterion requires it and the data isn't visible in inputs or outputs. + +Common examples: + +- **Agent routing**: which sub-agent or tool was selected to handle a request +- **Plan/step decisions**: what steps the agent chose to execute +- **Memory updates**: what the agent added to or removed from its working memory +- **Retrieval results**: which documents/chunks were retrieved before being fed to the LLM + +```python +# Agent routing decision +selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision", + description="Which agent was selected to handle this request") + +# Retrieved context fed to LLM +pixie.wrap(retrieved_chunks, purpose="state", name="retrieved_context", + description="Document chunks retrieved by RAG before LLM call") +``` + +**Principle**: only wrap state that an eval criterion actually needs. Don't wrap every variable — state wraps are for internal data that evaluators must see but that doesn't appear in the app's inputs or outputs. + +### Coverage check + +After adding all `wrap()` calls, go through each eval criterion from `pixie_qa/02-eval-criteria.md` and verify: + +1. 
Every criterion that judges **what went in** has a corresponding `input` or `entry` wrap. +2. Every criterion that judges **what came out** has a corresponding `output` wrap. +3. Every criterion that judges **how the app decided** has a corresponding `state` wrap. + +If a criterion needs data that isn't captured, add the wrap now — don't defer. + +--- + +## Output + +Modified application source files with `wrap()` calls at data boundaries. diff --git a/skills/eval-driven-dev/references/2b-implement-runnable.md b/skills/eval-driven-dev/references/2b-implement-runnable.md new file mode 100644 index 000000000..cb443d33e --- /dev/null +++ b/skills/eval-driven-dev/references/2b-implement-runnable.md @@ -0,0 +1,145 @@ +# Step 2b: Implement the Runnable + +> For the full `Runnable` protocol and `wrap()` API, see `wrap-api.md`. + +**Goal**: Write a Runnable class that lets the eval harness invoke the application exactly as a real user would. + +--- + +## The core idea + +The Runnable is how `pixie test` and `pixie trace` run your application. Think of it as a programmatic stand-in for a real user: it starts the app, sends it a request, and lets the app do its thing. The eval harness calls `run()` for each test case, passing in the user's input parameters. The app processes those parameters through its real code — real routing, real prompt assembly, real LLM calls, real response formatting — and the harness observes what happens via the `wrap()` instrumentation from Step 2a. + +**This means the Runnable should be simple.** It just wires up the app's real entry point to the harness interface. If your Runnable is getting complicated — if you're building custom logic, reimplementing app behavior, or replacing components — something is wrong. + +## Four requirements + +### 1. Run the real production code + +The Runnable calls the app's actual entry point — the same function, class, or endpoint a real user would trigger. 
It does not reimplement, shortcut, or substitute any part of the application. + +This includes the LLM. The app's LLM calls must go through the real code path — do not mock, fake, or replace application components. The whole point of eval-based testing is that LLM outputs are non-deterministic, so you use evaluators (not assertions) to score them. If you replace any component with a fake, you've eliminated the real behavior and the eval measures nothing. + +**If the app won't run due to missing environment variables or configuration that you cannot resolve, stop and ask the user to fix the environment setup.** Do not work around it by mocking components. + +### 2. Represent start-up args with a Pydantic BaseModel + +The `run()` method receives a Pydantic `BaseModel` whose fields are populated from the dataset's `input_data`. Define a subclass with the fields the app needs: + +```python +from pydantic import BaseModel + +class AppArgs(BaseModel): + user_message: str + # Add more fields as the app's entry point requires. + # These map 1:1 to the dataset input_data keys. +``` + +**The fields must reflect what a real user actually provides.** Read `pixie_qa/00-project-analysis.md` — the "Realistic input characteristics" section describes the complexity, scale, and variety of real inputs. Design the model to accept inputs at that level of realism, not simplified toy versions. + +Understand the boundary between user-provided parameters and world data: + +- **User-provided parameters** (fields on the BaseModel): what a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions. +- **World data** (handled by `wrap(purpose="input")` in Step 2a): content the app fetches from external sources during execution — web pages, database records, API responses. This is NOT part of the BaseModel. 
+
+| App type | BaseModel fields (user provides) | World data (wrap provides) |
+| -------------------- | ------------------------------------- | ------------------------------------------------------------------ |
+| Web scraper | URL + prompt + schema definition | The HTML page content |
+| Research agent | Research question + scope constraints | Source documents, search results |
+| Customer support bot | Customer's spoken message | Customer profile from CRM, conversation history from session store |
+| Code review tool | PR URL + review criteria | The actual diff, file contents, CI results |
+
+If a field ends up holding data the app would normally fetch itself, it probably belongs in a `wrap(purpose="input")` call instead of on the BaseModel.
+
+### 3. Be concurrency-safe
+
+`run()` is called concurrently for multiple dataset entries (up to 4 in parallel). If the app uses shared mutable state — SQLite, file-based DBs, global caches — protect access with `asyncio.Semaphore`:
+
+```python
+import asyncio
+
+class AppRunnable(pixie.Runnable[AppArgs]):
+    _sem: asyncio.Semaphore
+
+    @classmethod
+    def create(cls) -> "AppRunnable":
+        inst = cls()
+        inst._sem = asyncio.Semaphore(1)
+        return inst
+
+    async def run(self, args: AppArgs) -> None:
+        async with self._sem:
+            await call_app(args.user_message)
+```
+
+Only add the semaphore when the app actually has shared mutable state. If the app uses per-request state (keyed by unique IDs) or is inherently stateless, concurrent calls are naturally isolated.
+
+### 4. Adhere to the Runnable interface
+
+```python
+class AppRunnable(pixie.Runnable[AppArgs]):
+    @classmethod
+    def create(cls) -> "AppRunnable": ...            # construct instance
+    async def setup(self) -> None: ...               # once, before first run()
+    async def run(self, args: AppArgs) -> None: ...  # per dataset entry, concurrent
+    async def teardown(self) -> None: ...            # once, after last run()
+```
+
+- `create()` — class method, returns a new instance.
Use a quoted return type (`-> "AppRunnable"`) to avoid forward reference errors. +- `setup()` — optional async; initialize shared resources (HTTP clients, DB connections, servers). +- `run(args)` — async; called per dataset entry. Invoke the app's real entry point here. +- `teardown()` — optional async; clean up resources from `setup()`. + +## Minimal example + +```python +# pixie_qa/run_app.py +from pydantic import BaseModel +import pixie + + +class AppArgs(BaseModel): + user_message: str + + +class AppRunnable(pixie.Runnable[AppArgs]): + """Drives the application for tracing and evaluation.""" + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def run(self, args: AppArgs) -> None: + from myapp import handle_request + await handle_request(args.user_message) +``` + +That's it. The Runnable imports the app's real entry point and calls it. No custom logic, no component replacement, no clever workarounds. + +## Architecture-specific examples + +Based on how the application runs, read the corresponding example file: + +| App type | Entry point | Example file | +| ----------------------------------- | ----------------------- | ---------------------------------------------------------- | +| **Standalone function** (no server) | Python function | Read `references/runnable-examples/standalone-function.md` | +| **Web server** (FastAPI, Flask) | HTTP/WebSocket endpoint | Read `references/runnable-examples/fastapi-web-server.md` | +| **CLI application** | Command-line invocation | Read `references/runnable-examples/cli-app.md` | + +Read **only** the example file that matches your app type. + +## File placement + +- Place the file at `pixie_qa/run_app.py`. +- The dataset's `"runnable"` field references: `"pixie_qa/run_app.py:AppRunnable"`. +- The project root is automatically on `sys.path`, so use normal imports (`from app import service`). 
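
For reference, the wiring in the dataset JSON looks like this — a minimal sketch showing only the `runnable` field; the full dataset schema is covered in Step 4 and `testing-api.md`:

```json
{
  "runnable": "pixie_qa/run_app.py:AppRunnable"
}
```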
+
+## Technical note
+
+Do NOT use `from __future__ import annotations` in runnable files — it breaks Pydantic's model resolution for nested models. Use quoted return types where needed instead.
+
+---
+
+## Output
+
+`pixie_qa/run_app.py` — the Runnable class.
diff --git a/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md b/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md
new file mode 100644
index 000000000..93816dd26
--- /dev/null
+++ b/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md
@@ -0,0 +1,118 @@
+# Step 2c: Capture and verify a reference trace
+
+**Goal**: Run the app through the Runnable, capture a trace, and verify that instrumentation and the Runnable are working correctly. The trace proves everything is wired up and provides the exact data shapes needed for dataset creation in Step 4.
+
+---
+
+## Choose the trace input
+
+The trace input determines what code paths are captured. A trivial input produces a trivial trace that misses the app's real behavior.
+
+The input must reflect the "Realistic input characteristics" section of `pixie_qa/00-project-analysis.md`, which you read in Step 2b.
+
+The input has two parts — understand the boundary between them:
+
+- **User-provided parameters** (you author): What a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions. Write these to be representative of real usage.
+- **World data** (captured from production code, not fabricated): Content the app fetches from external sources during execution — database records, API responses, files, etc. Run the production code once to capture this data into the trace.
Only resort to synthetic data generation when:
+  - The user explicitly instructs you to use synthetic data, OR
+  - Fetching from real sources is impractical (too many fetches, incurs real monetary cost, or takes unreasonably long — more than ~30 minutes)
+
+**Quick check before writing input**: "Would a real user create this data, or would the app get it from somewhere else?" If the app gets it, let the production code run and capture it.
+
+| App type | User provides (you author) | World provides (you source) |
+| -------------------- | ------------------------------------- | ------------------------------------------------------------------ |
+| Web scraper | URL + prompt + schema definition | The HTML page content |
+| Research agent | Research question + scope constraints | Source documents, search results |
+| Customer support bot | Customer's spoken message | Customer profile from CRM, conversation history from session store |
+| Code review tool | PR URL + review criteria | The actual diff, file contents, CI results |
+
+### Capture multiple traces
+
+Capture **at least 2 traces** with different input characteristics before building the dataset:
+
+- Different complexity (simple case vs. complex case)
+- Different capabilities (see `00-project-analysis.md` capability inventory)
+- Different edge conditions (missing optional data, unusually large input)
+
+This calibration prevents dataset homogeneity — you see what the app actually does with varied inputs.
+
+---
+
+## Run `pixie trace`
+
+**First**, verify the app can be imported: `python -c "from <module> import <entry_point>"`. Catch missing packages before entering a trace-install-retry loop.
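
A programmatic version of the same import check — a generic sketch; substitute your app's top-level package for the stdlib stand-in used here:

```python
import importlib.util

def entry_point_importable(module_name: str) -> bool:
    """Return True if the module can be located on sys.path without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Substitute the app's top-level package for "json" (a stdlib stand-in here).
print(entry_point_importable("json"))  # True
```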
+ +```bash +# Create a JSON file with input data +echo '{"user_message": "a realistic sample input"}' > pixie_qa/sample-input.json + +uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \ + --input pixie_qa/sample-input.json \ + --output pixie_qa/reference-trace.jsonl +``` + +The `--input` flag takes a **file path** to a JSON file (not inline JSON). The JSON keys become kwargs for the Pydantic model. + +For additional traces: + +```bash +uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \ + --input pixie_qa/sample-input-complex.json \ + --output pixie_qa/trace-complex.jsonl +``` + +--- + +## Verify the trace + +### Quick inspection + +The trace JSONL contains one line per `wrap()` event and one line per LLM span: + +```jsonl +{"type": "kwargs", "value": {"user_message": "What are your hours?"}} +{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {...}, ...} +{"type": "llm_span", "request_model": "gpt-4o", "input_messages": [...], ...} +{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are...", ...} +``` + +Check that: + +- Expected `wrap` entries appear (one per `wrap()` call in the code) +- At least one `llm_span` entry appears (confirms real LLM calls were made) +- Missing entries indicate the execution path was different than expected — fix before continuing + +### Format and verify coverage + +Run `pixie format` to see the data in dataset-entry format: + +```bash +uv run pixie format +``` + +The output shows: + +- `input_data`: the exact keys/values for runnable arguments +- `eval_input`: data from `wrap(purpose="input")` calls +- `eval_output`: the actual app output (from `wrap(purpose="output")`) + +For each eval criterion from `pixie_qa/02-eval-criteria.md`, verify the format output contains the data needed. If a data point is missing, go back to Step 2a and add the `wrap()` call. + +### Trace audit + +Before proceeding to Step 3, audit every trace: + +1. 
**World data check**: For each `wrap(purpose="input")` field, is the data realistically complex? Compare against `00-project-analysis.md` "Realistic input characteristics." If the analysis says inputs are 5KB–500KB and yours is under 5KB, it's not representative. + +2. **LLM span check**: Do `llm_span` entries appear? If not, the app's LLM calls didn't fire — the Runnable may be misconfigured or the LLM may be mocked/faked. Fix this before continuing. + +3. **Complexity check**: Does the trace exercise the hard problems from `00-project-analysis.md`? If it only exercises the happy path, capture an additional trace with harder inputs. + +If any check fails, go back and fix the input or Runnable, then re-capture. + +--- + +## Output + +- `pixie_qa/reference-trace.jsonl` — reference trace with all expected wrap events and LLM spans +- Additional trace files for varied inputs diff --git a/skills/eval-driven-dev/references/3-define-evaluators.md b/skills/eval-driven-dev/references/3-define-evaluators.md index 20390e7c4..4212d80ec 100644 --- a/skills/eval-driven-dev/references/3-define-evaluators.md +++ b/skills/eval-driven-dev/references/3-define-evaluators.md @@ -6,81 +6,70 @@ ## 3a. Map criteria to evaluators -**Every eval criterion from Step 1b — including any dimensions specified by the user in the prompt — must have a corresponding evaluator.** If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension. +**Every eval criterion from Step 1c — including any dimensions specified by the user in the prompt — must have a corresponding evaluator.** If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension. 
Prioritize evaluators that measure the **hard problems / failure modes** identified in `pixie_qa/00-project-analysis.md` — these are more valuable than generic quality evaluators. -For each eval criterion, decide how to evaluate it: +For each eval criterion, choose an evaluator using this decision order: -- **Can it be checked with a built-in evaluator?** (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`) -- **Does it need a custom evaluator?** Most app-specific criteria do — use `create_llm_evaluator` with a prompt that operationalizes the criterion. -- **Is it universal or case-specific?** Universal criteria apply to all dataset items. Case-specific criteria apply only to certain rows. +1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog. +2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?" +3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead. 
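
To make the litmus test concrete, here is a check that genuinely qualifies as mechanical — a generic sketch, not tied to any specific app:

```python
import re

def looks_like_iso_date(value: str) -> bool:
    # Deterministic: the regex gives the same, definitively correct answer every time.
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

print(looks_like_iso_date("2026-04-17"))   # True  -> manual custom evaluator territory
print(looks_like_iso_date("next Friday"))  # False
# By contrast, "is this summary faithful to the source?" has no such rule —
# that judgment belongs to an agent evaluator.
```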
+ +**Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural. For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are non-deterministic. -`AnswerRelevancy` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt. +`AnswerRelevancy` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance, use an agent evaluator with clear criteria. ## 3b. Implement custom evaluators If any criterion requires a custom evaluator, implement it now. Place custom evaluators in `pixie_qa/evaluators.py` (or a sub-module if there are many). -### `create_llm_evaluator` factory - -Use when the quality dimension is domain-specific and no built-in evaluator fits. +### Agent evaluators (`create_agent_evaluator`) — the default -The return value is a **ready-to-use evaluator instance**. Assign it to a module-level variable — `pixie test` will import and use it directly (no class wrapper needed): +Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling. ```python -from pixie import create_llm_evaluator - -concise_voice_style = create_llm_evaluator( - name="ConciseVoiceStyle", - prompt_template=""" - You are evaluating whether this response is concise and phone-friendly. 
+from pixie import create_agent_evaluator + +extraction_accuracy = create_agent_evaluator( + name="ExtractionAccuracy", + criteria="The extracted data accurately reflects the source content. All fields " + "contain correct values from the source — no hallucinated, fabricated, or " + "placeholder values. Compare the final_answer against the fetched_content " + "and parsed_content to verify every claimed fact.", +) - Input: {eval_input} - Response: {eval_output} +noise_handling = create_agent_evaluator( + name="NoiseHandling", + criteria="The app correctly ignored navigation chrome, boilerplate, ads, and other " + "non-content elements from the source. The extracted data contains only " + "information relevant to the user's prompt, not noise from the page structure.", +) - Score 1.0 if the response is concise (under 3 sentences), directly addresses - the question, and uses conversational language suitable for a phone call. - Score 0.0 if it's verbose, off-topic, or uses written-style formatting. - """, +schema_compliance = create_agent_evaluator( + name="SchemaCompliance", + criteria="The output contains all fields requested in the prompt with appropriate " + "types and non-trivial values. Missing fields, null values for required data, " + "or fields with generic placeholder text indicate failure.", ) ``` -Reference the evaluator in your dataset JSON by its `filepath:callable_name` reference (e.g., `"pixie_qa/evaluators.py:concise_voice_style"`). - -**How template variables work**: `{eval_input}`, `{eval_output}`, `{expectation}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field: - -- **Single-item** `eval_input` / `eval_output` → the item's value (string, JSON-serialized dict/list) -- **Multi-item** `eval_input` / `eval_output` → a JSON dict mapping `name → value` for every item +Reference agent evaluators in the dataset via `filepath:callable_name` (e.g., `"pixie_qa/evaluators.py:extraction_accuracy"`). 
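
For illustration, a dataset might list such evaluators alongside built-ins like this — a sketch only: the key name `evaluators` is an assumption here, and the authoritative dataset schema is in `testing-api.md` and Step 4:

```json
{
  "evaluators": [
    "pixie_qa/evaluators.py:extraction_accuracy",
    "Factuality"
  ]
}
```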
-The LLM judge sees the full serialized value. +During `pixie test`, agent evaluators show as `⏳` in the console. They are graded in Step 5d. -**Rules**: +**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 5d. Make it specific and actionable: -- **Only `{eval_input}`, `{eval_output}`, `{expectation}`** — no nested access like `{eval_input[key]}` (this will crash with a `ValueError`) -- **Keep templates short and direct** — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria. -- **Don't instruct the LLM to "parse" or "extract" data** — just present the values and state the criteria. The LLM can read JSON naturally. +- **Bad**: "Check if the output is good" — too vague to grade consistently +- **Bad**: "The response should be accurate" — doesn't say what to compare against +- **Good**: "Compare the extracted fields against the source HTML/document. Each field must have a corresponding passage in the source. Flag any field whose value cannot be traced back to the source content." +- **Good**: "The app should preserve the structural hierarchy of the source document. If the source has sections/subsections, the extraction should reflect that nesting, not flatten everything into a single level." -**Non-RAG response relevance** (instead of `AnswerRelevancy`): +### Manual custom evaluator — for mechanical checks only -```python -response_relevance = create_llm_evaluator( - name="ResponseRelevance", - prompt_template=""" - You are evaluating whether a customer support response is relevant and helpful. - - Input: {eval_input} - Response: {eval_output} - Expected: {expectation} - - Score 1.0 if the response directly addresses the question and meets expectations. - Score 0.5 if partially relevant but misses important aspects. - Score 0.0 if off-topic, ignores the question, or contradicts expectations. 
- """, -) -``` +Use manual custom evaluators **only** for deterministic, programmatic checks where a simple function definitively gives the right answer. Examples: field existence, regex matching, JSON schema validation, numeric range checks, type verification. -### Manual custom evaluator +**Do NOT use manual custom evaluators for semantic quality.** If the check requires _judgment_ about whether content is correct, relevant, complete, or well-written, use an agent evaluator instead. The litmus test: "Could a regex, string match, or comparison operator implement this check perfectly?" If not, it's semantic — use an agent evaluator. Custom evaluators can be **sync or async functions**. Assign them to module-level variables in `pixie_qa/evaluators.py`: @@ -119,9 +108,13 @@ def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation: ) ``` +### ValidJSON and string expectations conflict + +`ValidJSON` treats the dataset entry's `expectation` field as a JSON Schema when present. If your entries use **string** expectations (e.g., for `Factuality`), adding `ValidJSON` as a dataset-level default evaluator will cause failures — it cannot validate a plain string as a JSON Schema. Either apply `ValidJSON` only to entries with object/boolean expectations, or omit it when the dataset relies on string expectations. + ## 3c. Produce the evaluator mapping artifact -Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. This artifact bridges between the eval criteria (Step 1b) and the dataset (Step 4). +Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. This artifact bridges between the eval criteria (Step 1c) and the dataset (Step 4). **CRITICAL**: Use the exact evaluator names as they appear in the `evaluators.md` reference — built-in evaluators use their short name (e.g., `Factuality`, `ClosedQA`), and custom evaluators use `filepath:callable_name` format (e.g., `pixie_qa/evaluators.py:ConciseVoiceStyle`). 
@@ -137,15 +130,22 @@ Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. | Factuality | Factual accuracy | All items | | ClosedQA | Answer correctness | Items with expected_output | -## Custom evaluators +## Agent evaluators + +| Evaluator name | Criterion it covers | Applies to | Source file | +| ------------------------------------------ | ---------------------------- | ---------- | ---------------------- | +| pixie_qa/evaluators.py:extraction_accuracy | Content accuracy vs source | All items | pixie_qa/evaluators.py | +| pixie_qa/evaluators.py:noise_handling | Navigation/boilerplate noise | All items | pixie_qa/evaluators.py | + +## Manual custom evaluators (mechanical checks only) -| Evaluator name | Criterion it covers | Applies to | Source file | -| ---------------------------------------- | ------------------- | ---------- | ---------------------- | -| pixie_qa/evaluators.py:ConciseVoiceStyle | Phone-friendly tone | All items | pixie_qa/evaluators.py | +| Evaluator name | Criterion it covers | Applies to | Source file | +| ---------------------------------------------- | -------------------- | ---------- | ---------------------- | +| pixie_qa/evaluators.py:required_fields_present | Required field check | All items | pixie_qa/evaluators.py | ## Applicability summary -- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:ConciseVoiceStyle +- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:extraction_accuracy - **Item-specific** (apply to subset): ClosedQA (only items with expected_output) ``` @@ -156,6 +156,6 @@ Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. --- -> **Evaluator selection guide**: See `evaluators.md` for the full evaluator catalog, selection guide (which evaluator for which output type), and `create_llm_evaluator` reference. 
+> **Evaluator selection guide**: See `evaluators.md` for the full built-in evaluator catalog and `create_agent_evaluator` reference. > > **If you hit an unexpected error** when implementing evaluators (import failures, API mismatch), read `evaluators.md` for the authoritative evaluator reference and `wrap-api.md` for API details before guessing at a fix. diff --git a/skills/eval-driven-dev/references/4-build-dataset.md b/skills/eval-driven-dev/references/4-build-dataset.md index c43983946..c6549db3d 100644 --- a/skills/eval-driven-dev/references/4-build-dataset.md +++ b/skills/eval-driven-dev/references/4-build-dataset.md @@ -1,24 +1,23 @@ # Step 4: Build the Dataset -**Why this step**: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `entry_kwargs`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs. +**Why this step**: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `input_data`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs. + +**Before building entries**, review: + +- **`pixie_qa/00-project-analysis.md`** — the capability inventory and failure modes. Dataset entries should cover entries from the capability inventory and include entries targeting the listed failure modes. +- **`pixie_qa/02-eval-criteria.md`** — use cases and their capability coverage. Ensure every listed use case has representative entries. 
--- -## Understanding `entry_kwargs`, `eval_input`, and `expectation` +## Understanding `input_data`, `eval_input`, and `expectation` Before building the dataset, understand what these terms mean: -- **`entry_kwargs`** = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`. +- **`input_data`** = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the input data (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`. - **`eval_input`** = a list of `{"name": ..., "value": ...}` objects corresponding to `wrap(purpose="input")` calls in the app. At test time, these are injected automatically by the wrap registry; `wrap(purpose="input")` calls in the app return the registry value instead of calling the real external dependency. - **CRITICAL**: `eval_input` must have **at least one item** (enforced by `min_length=1` validation). If the app has no `wrap(purpose="input")` calls, you must still include at least one `eval_input` item — use the primary entry-point argument as a synthetic input: - - ```json - "eval_input": [ - { "name": "user_input", "value": "What are your business hours?" } - ] - ``` + `eval_input` **may be an empty list** only when the app has no `wrap(purpose="input")` calls. **If the app HAS input wraps, every dataset entry MUST provide corresponding `eval_input` values with pre-captured content** — otherwise the app makes live external calls during eval, which is slow, flaky, and non-reproducible. See section 4b′ for how to capture this content. Each item is a `NamedData` object with `name` (str) and `value` (any JSON-serializable value). 
@@ -29,7 +28,7 @@ Before building the dataset, understand what these terms mean: The **reference trace** at `pixie_qa/reference-trace.jsonl` is your primary source for data shapes: - Filter it to see the exact serialized format for `eval_input` values -- Read the `kwargs` record to understand the `entry_kwargs` structure +- Read the `kwargs` record to understand the `input_data` structure - Read `purpose="output"/"state"` events to understand what outputs the app produces, so you can write meaningful `expectation` values --- @@ -46,14 +45,14 @@ The eval criteria artifact (`pixie_qa/02-eval-criteria.md`) maps each criterion Use `pixie format` on the reference trace to see the exact data shapes **and** the real app output in dataset-entry format: ```bash -pixie format --input reference-trace.jsonl --output dataset-sample.json +uv run pixie format --input reference-trace.jsonl --output dataset-sample.json ``` The output looks like: ```json { - "entry_kwargs": { + "input_data": { "user_message": "What are your business hours?" }, "eval_input": [ @@ -75,20 +74,157 @@ The output looks like: **Important**: The `eval_output` in this template is the **full real output** produced by the running app. Do NOT copy `eval_output` into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead: -- Use `entry_kwargs` and `eval_input` as exact templates for data keys and format +- Use `input_data` and `eval_input` as exact templates for data keys and format - Look at `eval_output` to understand what the app produces — then write a **concise `expectation` description** that captures the key quality criteria for each scenario **Example**: if `eval_output.response` is `"Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM."`, write `expectation` as `"Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours"` — a short description a human or LLM evaluator can compare against. +## 4b′. 
Capture external content for `eval_input` (mandatory) + +**CRITICAL**: If the app has ANY `wrap(purpose="input")` calls, every dataset entry MUST provide corresponding `eval_input` values with **pre-captured real content**. An empty `eval_input` list means the app will make live external calls (HTTP requests, database queries, API calls) during every eval run — this makes evals slow, flaky, and non-reproducible. + +### Why this matters + +During `pixie test`, each `wrap(purpose="input", name="X")` call in the app checks the wrap registry for a value named `"X"`: + +- **If found**: the registered value is returned directly (no external call) +- **If not found**: the real external call executes (non-deterministic, slow, may fail) + +An `eval_input: []` entry means NOTHING is in the registry, so every external dependency runs live. This defeats the purpose of instrumentation. + +### How to capture content + +For each `wrap(purpose="input", name="X")` in the app, you must capture the real data once and embed it in the dataset. Choose one of these approaches: + +**Option A — Use the reference trace** (preferred): + +The reference trace from Step 2c already contains captured values for every `purpose="input"` wrap. Extract them: + +```bash +# View the reference trace to find input wrap values +grep '"purpose": "input"' pixie_qa/reference-trace.jsonl +``` + +Or use `pixie format` to see the data in dataset-entry format — the `eval_input` array in the output already has the captured values with correct names and shapes. 
+
+**Option B — Fetch content directly** (for new entries with different inputs):
+
+When creating dataset entries with different input sources (e.g., different URLs, different queries), capture the content by running the dependency code once:
+
+```python
+# Example: for a web scraper, run the app's own fetch logic once
+from myapp.fetcher import fetch_page
+page_content = fetch_page(target_url)  # use the app's real code path
+```
+
+Then include the captured content in the entry's `eval_input`:
+
+```json
+{
+  "eval_input": [
+    {
+      "name": "fetch_result",
+      "value": "<captured page content from fetch_page>"
+    }
+  ]
+}
+```
+
+**Option C — Run `pixie trace` with each input** (most thorough):
+
+For each set of `input_data`, write the input to a JSON file and run `pixie trace` to execute the app with real dependencies and capture all values (remember, `--input` takes a file path, not inline JSON):
+
+```bash
+echo '{"prompt": "...", "source": "..."}' > pixie_qa/entry-input.json
+uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \
+    --input pixie_qa/entry-input.json \
+    --output pixie_qa/entry-trace.jsonl
+```
+
+Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.
+
+### Content format
+
+The `eval_input` value must match the **exact type and format** that the `wrap()` call returns. Check the reference trace to see what format the app produces:
+
+- If the wrap captures a string (e.g., HTML content, markdown text), the value is a string
+- If the wrap captures a dict (e.g., database record), the value is a JSON object
+- If the wrap captures a list, the value is a JSON array
+
+**Do NOT skip this step.** Every `wrap(purpose="input")` in the app must have a corresponding `eval_input` entry in every dataset row. If you proceed with empty `eval_input` when the app has input wraps, evals will be unreliable.
+
+## 4c.
Generate dataset items Create diverse entries guided by the reference trace and use cases: -- **`entry_kwargs` keys** must match the fields of the Pydantic model used in `Runnable.run(args: T)` +- **`input_data` keys** must match the fields of the Pydantic model used in `Runnable.run(args: T)` - **`eval_input`** must be a list of `{"name": ..., "value": ...}` objects matching the `name` values of `wrap(purpose="input")` calls in the app - **Cover each use case** from `pixie_qa/02-eval-criteria.md` — at least one entry per use case, with meaningfully diverse inputs across entries -**If the user specified a dataset or data source in the prompt** (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `entry_kwargs` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data. +**If the user specified a dataset or data source in the prompt** (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `input_data` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data. + +### Entry quality checklist + +Before finalizing the dataset, verify each entry against these criteria: + +**Input realism**: + +- Does `eval_input` contain world data that respects the synthesization boundary (see Step 2c)? User-authored parameters are fine; world data should be sourced, not fabricated from scratch. +- Does the world data in `eval_input` match the scale and complexity described in `00-project-analysis.md` "Realistic input characteristics"? If the analysis says inputs are typically 5KB–500KB, a 200-char input is not realistic. +- Is the answer to the prompt non-trivial to extract from the input? A test where the answer is in a clearly labeled HTML tag or the first sentence doesn't test extraction quality. 
+ +**Scenario diversity**: + +- Do entries cover meaningfully different difficulty levels — not just different topics with the same difficulty? +- Does at least one entry target a failure mode from `00-project-analysis.md` that you expect might actually cause degraded scores (not a guaranteed pass)? +- Do entries use different structural patterns in the input data (not just different content poured into the same template)? + +**Difficulty calibration**: + +- Is there at least one entry you are genuinely uncertain whether the app will handle correctly? If you're confident every entry will pass trivially, the dataset is too easy. +- Consider including one intentionally challenging entry that probes a known limitation — a "stress test" entry. If it passes, great. If it fails, the eval has demonstrated it can catch real issues. + +### Anti-patterns for dataset entries + +- **Fabricating world data**: Hand-authoring content the app would normally fetch from external sources (e.g., writing HTML for a web scraper, writing "retrieved documents" for a RAG system). This removes real-world complexity. +- **Uniform difficulty**: All entries have the same complexity level. Real workloads have a distribution — some easy, some hard, some edge cases. +- **Obvious answers**: Every entry has the target information cleanly labeled and unambiguous. Real data often has the answer scattered, partially present, duplicated with variations, or embedded in noise. +- **Round-trip authorship**: You wrote both the input and the expected output, so you know exactly what's there. A real evaluator tests whether the app can find information it hasn't seen before. +- **Only happy paths**: No entry tests error conditions, edge cases, or known failure modes. +- **Building all entries from the same toy trace with minor rephrasing**: If all entries have similar `input_data` and similar `eval_input` data, the dataset tests nothing meaningful. 
Each entry should represent a meaningfully different scenario. +- **Reusing the project's own test fixtures as eval data**: The project's `tests/`, `fixtures/`, `examples/`, and `mock_server/` directories contain data designed for unit/integration tests — small, clean, deterministic, and trivially easy. Using them as `eval_input` data guarantees 100% pass rates and zero quality signal. Even if these fixtures look convenient, they bypass every real-world difficulty that makes the app's job hard. **Run the production code to capture realistic data instead**, or generate synthetic data that matches the scale/complexity from `00-project-analysis.md`. +- **Using a project's mock/fake implementations**: If the project includes mock LLMs, fake HTTP servers, or stub services in its test infrastructure, do NOT use them in your eval pipeline. Your eval must exercise the app's real code paths with realistically complex data — not the project's own test shortcuts. + +## 4c′. Verify coverage against project analysis + +Before writing the final dataset JSON, open `pixie_qa/00-project-analysis.md` and check: + +1. **Realistic input characteristics**: For each characteristic listed (size, complexity, noise, variety), confirm at least one dataset entry reflects it. If the analysis says "messy inputs with navigation and ads," at least one entry's `eval_input` should contain messy data with navigation and ads. +2. **Failure modes**: For each failure mode listed, confirm at least one dataset entry is designed to exercise it. The entry doesn't need to guarantee failure — but it should create conditions where that failure mode _could_ manifest. If a failure mode cannot be exercised with the current instrumentation setup, add a note in `02-eval-criteria.md` explaining why. +3. **Capability coverage**: Confirm the dataset covers the capabilities listed in the eval criteria (Step 1c). Each covered capability should have at least one entry. 
+ +If any gap is found, add entries to close it before proceeding to 4d. + +## 4c″. STOP CHECK — Dataset realism audit (hard gate) + +**This is a hard gate.** Do NOT proceed to 4d until every check passes. If any check fails, revise the dataset and re-audit. + +Before writing the final dataset JSON, perform this self-audit: + +1. **Cross-reference `00-project-analysis.md`**: Open the "Realistic input characteristics" section. For each characteristic (size, complexity, noise, structure), verify at least one dataset entry's `eval_input` reflects it. If the analysis says "5KB–500KB HTML pages with navigation chrome and ads" and your largest `eval_input` is 1KB of clean HTML, **the dataset is not realistic — add harder entries.** + +2. **Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing. + +3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode). + - **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine. + - **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one. + +4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry. + - **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3. + - If coverage is below threshold, add entries targeting the uncovered capabilities. 
+ +5. **Project fixture contamination check**: Scan every `eval_input` value. Did any data originate from the project's `tests/`, `fixtures/`, `examples/`, or mock server directories? If yes, **replace it with real-world data.** These fixtures are designed for development convenience, not evaluation realism. + +6. **Tautology check**: Will the test pipeline produce meaningful scores, or is it a closed loop? If you authored both the input data and the evaluator logic such that passing is guaranteed by construction (e.g., regex extractor + exact-match evaluator on hand-authored HTML), **the pipeline is tautological** and cannot catch real issues. The app's real LLM should produce the output, and evaluators should assess quality dimensions that can genuinely fail. + +7. **`eval_input` completeness check**: For every `wrap(purpose="input", name="X")` call in the instrumented app code, verify that EVERY dataset entry provides a corresponding `eval_input` item with `"name": "X"` and a non-empty `"value"`. If any entry has `eval_input: []` while the app has input wraps, **the dataset is incomplete — captured content is missing.** Go back to step 4b′ and capture the content. ## 4d. Build the dataset JSON file @@ -97,11 +233,11 @@ Create the dataset at `pixie_qa/datasets/.json`: ```json { "name": "qa-golden-set", - "runnable": "pixie_qa/scripts/run_app.py:AppRunnable", - "evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"], + "runnable": "pixie_qa/run_app.py:AppRunnable", + "evaluators": ["Factuality", "pixie_qa/evaluators.py:ConciseVoiceStyle"], "entries": [ { - "entry_kwargs": { + "input_data": { "user_message": "What are your business hours?" 
}, "description": "Customer asks about business hours with gold tier account", @@ -114,7 +250,7 @@ Create the dataset at `pixie_qa/datasets/.json`: "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm" }, { - "entry_kwargs": { + "input_data": { "user_message": "I want to change something" }, "description": "Ambiguous change request from basic tier customer", @@ -128,7 +264,7 @@ Create the dataset at `pixie_qa/datasets/.json`: "evaluators": ["...", "ClosedQA"] }, { - "entry_kwargs": { + "input_data": { "user_message": "I want to end this call" }, "description": "User requests call end after failed verification", @@ -155,8 +291,8 @@ Create the dataset at `pixie_qa/datasets/.json`: ``` entry: - ├── entry_kwargs (required) — args for Runnable.run() - ├── eval_input (required) — list of {"name": ..., "value": ...} objects + ├── input_data (required) — args for Runnable.run() + ├── eval_input (optional) — list of {"name": ..., "value": ...} objects (default: []) ├── description (required) — human-readable label for the test case ├── expectation (optional) — reference for comparison-based evaluators ├── eval_metadata (optional) — extra per-entry data for custom evaluators @@ -165,13 +301,13 @@ entry: **Top-level fields:** -- **`runnable`** (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/scripts/run_app.py:AppRunnable"`). Path is relative to the project root. +- **`runnable`** (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/run_app.py:AppRunnable"`). Path is relative to the project root. - **`evaluators`** (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases. **Per-entry fields (all top-level on each entry):** -- **`entry_kwargs`** (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's entry-point inputs. 
-- **`eval_input`** (required): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app. +- **`input_data`** (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's input data. +- **`eval_input`** (optional, default `[]`): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app. The runner automatically prepends `input_data` when building the `Evaluable`. - **`description`** (required): Use case one-liner from `pixie_qa/02-eval-criteria.md`. - **`expectation`** (optional): Case-specific expectation text for evaluators that need a reference. - **`eval_metadata`** (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as `evaluable.eval_metadata`. @@ -214,12 +350,14 @@ The `eval_input` values are `{"name": ..., "value": ...}` objects. Use the refer ### Crafting diverse eval scenarios -Cover different aspects of each use case: +Cover different aspects of each use case. 
Refer to **`pixie_qa/00-project-analysis.md`** for the capability inventory and failure modes: +- **Cover each capability** — at least one entry per capability from the capability inventory, not just the primary capability +- **Target failure modes** — include entries that exercise the hard problems / failure modes listed in the project analysis (e.g., malformed input, edge cases, complex scenarios) - Different user phrasings of the same request - Edge cases (ambiguous input, missing information, error conditions) - Entries that stress-test specific eval criteria -- At least one entry per use case from Step 1b +- At least one entry per use case from Step 1c --- diff --git a/skills/eval-driven-dev/references/5-run-tests.md b/skills/eval-driven-dev/references/5-run-tests.md index a8172a6f0..8124a83a6 100644 --- a/skills/eval-driven-dev/references/5-run-tests.md +++ b/skills/eval-driven-dev/references/5-run-tests.md @@ -1,32 +1,32 @@ -# Step 5: Run Evaluation-Based Tests +# Step 5: Run `pixie test` and Fix Mechanical Issues -**Why this step**: Run `pixie test` and fix any dataset quality issues — `WrapRegistryMissError`, `WrapTypeMismatchError`, bad `eval_input` data, or import failures — until real evaluator scores are produced for every entry. +**Why this step**: Run `pixie test` and fix mechanical issues in your QA components — dataset format problems, runnable implementation bugs, and custom evaluator errors — until every entry produces real scores. This step is NOT about assessing result quality or fixing the application itself. --- ## 5a. Run tests ```bash -pixie test +uv run pixie test ``` For verbose output with per-case scores and evaluator reasoning: ```bash -pixie test -v +uv run pixie test -v ``` `pixie test` automatically loads the `.env` file before running tests. -The test runner now: +The evaluation harness: 1. Resolves the `Runnable` class from the dataset's `runnable` field 2. Calls `Runnable.create()` to construct an instance, then `setup()` once 3. 
Runs all dataset entries **concurrently** (up to 4 in parallel): - a. Reads `entry_kwargs` and `eval_input` from the entry + a. Reads `input_data` and `eval_input` from the entry b. Populates the wrap input registry with `eval_input` data c. Initialises the capture registry - d. Validates `entry_kwargs` into the Pydantic model and calls `Runnable.run(args)` + d. Validates `input_data` into the Pydantic model and calls `Runnable.run(args)` e. `wrap(purpose="input")` calls in the app return registry values instead of calling external services f. `wrap(purpose="output"/"state")` calls capture data for evaluation g. Builds `Evaluable` from captured data @@ -35,9 +35,11 @@ The test runner now: Because entries run concurrently, the Runnable's `run()` method must be concurrency-safe. If you see `sqlite3.OperationalError`, `"database is locked"`, or similar errors, add a `Semaphore(1)` to your Runnable (see the concurrency section in Step 2 reference). -## 5b. Fix dataset/harness issues +## 5b. Fix mechanical issues only -**Data validation errors** (registry miss, type mismatch, deserialization failure) are reported per-entry with clear messages pointing to the specific `wrap` name and dataset field. This step is about fixing **what you did wrong in Step 4** — bad data, wrong format, missing fields — not about evaluating the app's quality. +This step is strictly about fixing what you built in previous steps — the dataset, the runnable, and any custom evaluators. You are fixing mechanical problems that prevent the pipeline from running, NOT assessing or improving the application's output quality. 
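+One common mechanical fix is the concurrency guard from 5a: if concurrent `run()` calls collide on shared resources, serialize them. The skeleton below is hypothetical, assuming the `create()`/`setup()`/`run()` shape described in Step 2; `RunArgs` and `_call_app` are stand-ins for your actual Pydantic args model and the app's real entry point.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class RunArgs:
    # Stand-in for the Pydantic model used by Runnable.run(args: T)
    user_message: str

class AppRunnable:
    # Hypothetical skeleton; adapt to your pixie_qa/run_app.py Runnable.
    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def setup(self) -> None:
        # Serialize run() calls that share non-concurrent resources
        # (e.g. a single SQLite connection).
        self._lock = asyncio.Semaphore(1)

    async def run(self, args: RunArgs) -> str:
        async with self._lock:  # entries run up to 4 in parallel; guard here
            return await self._call_app(args.user_message)

    async def _call_app(self, message: str) -> str:
        # Placeholder for the app's real entry point.
        return f"echo: {message}"
```

The semaphore trades throughput for correctness: only the critical section needs guarding, so if the shared resource is touched in a narrower scope, move `async with self._lock` down to that scope instead.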
+ +**What counts as a mechanical issue** (fix these): | Error | Cause | Fix | | ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | @@ -46,33 +48,37 @@ Because entries run concurrently, the Runnable's `run()` method must be concurre | Runnable resolution failure | `runnable` path or class name is wrong, or the class doesn't implement the `Runnable` protocol | Fix `filepath:ClassName` in the dataset; ensure the class has `create()` and `run()` methods | | Import error | Module path or syntax error in runnable/evaluator | Fix the referenced file | | `ModuleNotFoundError: pixie_qa` | `pixie_qa/` directory missing `__init__.py` | Run `pixie init` to recreate it | -| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances | +| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances | | `sqlite3.OperationalError` | Concurrent `run()` calls sharing a SQLite connection | Add `asyncio.Semaphore(1)` to the Runnable (see Step 2 concurrency section) | +| Custom evaluator crashes | Bug in your custom evaluator implementation | Fix the evaluator code | -Iterate — fix errors, re-run, fix the next error — until `pixie test` runs cleanly with real evaluator scores for all entries. 
- -### When to stop iterating on evaluator results +**What is NOT a mechanical issue** (do NOT fix these here): -Once the dataset runs without errors and produces real scores, assess the results: +- Application produces wrong/low-quality output → that's the application's behavior, analyzed in Step 6 +- Evaluator scores are low → that's a quality signal, analyzed in Step 6 +- LLM calls fail inside the application → report in Step 6, do not mock or work around +- Evaluator scores fluctuate between runs → normal LLM non-determinism, not a bug -- **Custom function evaluators** (deterministic checks): If they fail, the issue is in the dataset data or evaluator logic. Fix and re-run — these should converge quickly. -- **LLM-as-judge evaluators** (e.g., `Factuality`, `ClosedQA`, custom LLM evaluators): These have inherent variance across runs. If scores fluctuate between runs without code changes, the issue is evaluator prompt quality, not app behavior. **Do not spend more than one revision cycle on LLM evaluator prompts.** Run 2–3 times, assess variance, and accept the results if they are directionally correct. -- **General rule**: Stop iterating when all custom function evaluators pass consistently and LLM evaluators produce reasonable scores (most passing). Perfect LLM evaluator scores are not the goal — the goal is a working QA pipeline that catches real regressions. +Iterate — fix errors, re-run, fix the next error — until `pixie test` runs to completion with real evaluator scores for all entries. -## 5c. 
Run analysis +## Output -Once tests complete without setup errors and produce real scores, run analysis: +After `pixie test` completes successfully, results are stored in the per-entry directory structure: -```bash -pixie analyze +``` +{PIXIE_ROOT}/results// + meta.json # test run metadata + dataset-{idx}/ + metadata.json # dataset name, path, runnable + entry-{idx}/ + config.json # evaluators, description, expectation + eval-input.jsonl # input data fed to evaluators + eval-output.jsonl # output data captured from app + evaluations.jsonl # evaluation results (scored + pending) + trace.jsonl # LLM call traces (if captured) ``` -Where `` is the test run identifier printed by `pixie test` (e.g., `20250615-120000`). This generates LLM-powered markdown analysis for each dataset, identifying patterns in successes and failures. - -## Output - -- Test results at `{PIXIE_ROOT}/results//result.json` -- Analysis files at `{PIXIE_ROOT}/results//dataset-.md` (after `pixie analyze`) +The `` is printed in console output. You will reference this directory in Step 6. --- diff --git a/skills/eval-driven-dev/references/6-analyze-outcomes.md b/skills/eval-driven-dev/references/6-analyze-outcomes.md new file mode 100644 index 000000000..2287601d2 --- /dev/null +++ b/skills/eval-driven-dev/references/6-analyze-outcomes.md @@ -0,0 +1,332 @@ +# Step 6: Analyze Outcomes + +**Why this step**: `pixie test` produced raw scores. Now you analyze those results to understand what they mean — completing pending evaluations, identifying patterns, validating hypotheses, and producing an actionable improvement plan. The analysis is structured in three phases that build on each other: entry-level → dataset-level → action plan. 
+
+---
+
+## Result directory structure
+
+After `pixie test`, the result directory looks like:
+
+```
+{PIXIE_ROOT}/results//
+  meta.json
+  dataset-{idx}/
+    metadata.json
+    entry-{idx}/
+      config.json        # evaluators, description, expectation
+      eval-input.jsonl   # input data fed to evaluators
+      eval-output.jsonl  # output data captured from app
+      evaluations.jsonl  # scored + pending evaluations
+      trace.jsonl        # LLM call traces
+```
+
+Read `meta.json` to find the ``. All the data you need for analysis is in this directory.
+
+---
+
+## Writing principles
+
+Every **detailed** analysis artifact you produce must follow these principles:
+
+- **Data-driven**: Every opinion or statement must be backed by concrete data from the evaluation run. Quote scores, cite entry indices, reference specific eval input/output content. No hand-waving. It is better to write nothing than to write something unsubstantiated.
+- **Evidence-first**: Present the raw data and evidence before drawing conclusions. The reader (another coding agent) should be able to independently verify your conclusions from the evidence you cite.
+- **Traceable**: For every conclusion, provide the chain: data source → observation → reasoning → conclusion. Another agent should be able to follow this chain backward to verify or challenge any claim.
+- **No selling**: Do not advocate, promote, or use value-laden language ("excellent", "robust", "impressive", "well-designed"). State what the data shows and what actions it implies. Let the reader form quality judgments.
+- **Action-oriented**: Every analysis should contribute to the end goal of concrete improvements to the evaluation pipeline or application. Do not write observations that don't lead somewhere.
+
+Every **summary** analysis artifact must follow these principles:
+
+- **Concise**: The human reader should be able to understand the key findings and actions in under 2 minutes for any single artifact.
+- **Conclusions-first**: Lead with what the reader needs to know (results, findings, actions), not with methodology or background. +- **Plain language**: Avoid jargon. A non-technical stakeholder should be able to follow the summary. +- **Consistent**: Summary conclusions must match the detailed version's evidence. Never add claims in the summary that aren't supported in the detailed version. + +### Dual-variant pattern + +Every analysis artifact in this step has two files: + +| Artifact | Detailed file (for agent) | Summary file (for human) | +| ---------------- | --------------------------- | ----------------------------------- | +| Entry analysis | `entry-{idx}/analysis.md` | `entry-{idx}/analysis-summary.md` | +| Dataset analysis | `dataset-{idx}/analysis.md` | `dataset-{idx}/analysis-summary.md` | +| Action plan | `action-plan.md` | `action-plan-summary.md` | + +**Always write the detailed version first**, then derive the summary from it. The summary is a strict subset of the detailed version's content — it should never contain claims or conclusions not present in the detailed version. + +--- + +## Phase 1: Entry-level analysis + +Process each dataset entry individually. For each `dataset-{idx}/entry-{idx}/`: + +### 1a. Read the entry data + +Read these files for the entry: + +- `config.json` — what evaluators were configured, the description, the expectation +- `eval-input.jsonl` — what data was fed to the app/evaluators +- `eval-output.jsonl` — what the app produced +- `evaluations.jsonl` — current evaluation results (scored and pending) +- `trace.jsonl` — what LLM calls the app made (if available) + +### 1b. Complete pending evaluations + +If `evaluations.jsonl` contains entries with `"status": "pending"`, you must grade them: + +1. Read the `criteria` field of the pending evaluation +2. Apply the criteria to the entry's eval input, eval output, and trace data +3. 
Assign a **score** between 0.0 and 1.0: + - `1.0` — fully meets the criteria + - `0.5`–`0.9` — partially meets criteria (explain what's missing) + - `0.0`–`0.4` — does not meet criteria +4. Write a **reasoning** string (1–3 sentences citing specific evidence from the output or trace) +5. Replace the pending entry in `evaluations.jsonl` with the scored result: + +**Before** (pending): + +```json +{ + "evaluator": "ResponseQuality", + "status": "pending", + "criteria": "The response should..." +} +``` + +**After** (scored): + +```json +{ + "evaluator": "ResponseQuality", + "score": 0.85, + "reasoning": "Response addresses the main question but omits..." +} +``` + +**Grading guidelines**: + +- Be evidence-based — every score must reference specific output or trace content +- Use the criteria literally — do not expand or reinterpret beyond what's written +- Consider the trace — distinguish between app logic problems and LLM quality issues +- Be calibrated — reserve 1.0 for outputs that genuinely satisfy criteria fully +- Do not penalize LLM non-determinism — different phrasing of a correct answer is not a failure + +### 1c. Write entry-level analysis (two files per entry) + +Produce **two files** per entry. Write the detailed version first, then derive the summary from it. + +#### Detailed version: `dataset-{idx}/entry-{idx}/analysis.md` + +This file is for **agent consumption** — it will be read by the coding agent to further verify conclusions, investigate issues, and take corrective actions. Focus on data points, evidence trails, and the reasoning chain that connects observations to conclusions. + +**Writing principles:** + +- **Present data first, then conclusions.** Start each section with the raw data (scores, output excerpts, trace excerpts), then state what you conclude from it. The reader should be able to verify your conclusion from the data you presented. 
+- **Quote specific evidence.** When discussing output quality, quote the relevant part of `eval-output.jsonl` or `trace.jsonl`. When discussing evaluator behavior, cite the exact score and reasoning string. +- **Trace issues to root causes.** If an evaluator score is low, trace backward: what did the output look like → what did the LLM produce → what input did the LLM receive → was the input correct? This chain helps the next agent decide where to intervene. +- **Do not make ungrounded claims.** If you can't cite evidence for a statement, don't make it. "The evaluator may be too strict" requires evidence (e.g., "the output contains the correct information but phrased differently, scoring 0.5 instead of 1.0"). +- **Do not sell.** Avoid "excellent", "robust", "impressive". State what happened and what it means. + +**Content for each entry:** + +1. **What this entry tested** — one sentence from the description/input +2. **Raw evaluation data** — table of all evaluator scores with reasoning strings +3. **Output analysis** — key excerpts from `eval-output.jsonl` with observations about quality, correctness, completeness. Quote specific fields/values. +4. **Trace analysis** — relevant excerpts from `trace.jsonl` (LLM calls, token counts, latency) that inform quality assessment +5. **Test case quality assessment** — does this test case effectively exercise the intended capability? Evidence for/against: Is the expectation clear? Are inputs realistic? Would this catch a regression? +6. **Evaluator quality assessment** — for each evaluator: is the score reasonable given the output data? Evidence: compare what the evaluator scored vs what the output actually contains. Would a different input produce a different score (discriminative power)? +7. **Application issues** — problems surfaced, with evidence chain: output excerpt → what went wrong → root cause hypothesis → suggested investigation +8. 
**Open questions** — anything that couldn't be conclusively determined from this data alone + +#### Summary version: `dataset-{idx}/entry-{idx}/analysis-summary.md` + +This file is for **human review** — a quick-scan view of what happened with this entry. + +**Template:** + +```markdown +# Entry {idx}: + +**Result**: PASS / FAIL + +| Evaluator | Score | Verdict | +| --------- | ----- | ----------------------- | +| ... | ... | OK / Issue: | + +**Key finding**: <1-2 sentences: what worked, what didn't, what action is needed> +``` + +Maximum ~15 lines per entry summary. + +--- + +## Phase 2: Dataset-level analysis + +After all entries in a dataset are analyzed, produce the dataset-level analysis. Write `analysis.md` in the dataset directory (`dataset-{idx}/analysis.md`). + +### 2a. Aggregate the data + +Summarize across all entries in the dataset: + +- Pass/fail counts and overall pass rate +- Per-evaluator statistics (pass rate, min/max/mean scores) +- Which entries failed which evaluators (failure clusters) + +### 2b. Form and validate hypotheses + +Come up with **exactly 3 high-confidence hypotheses** across these three dimensions: + +1. **Test cases quality** — Does the set of test cases sufficiently and efficiently verify the application's capabilities? Does it cover the important failure modes? Are there blind spots? + +2. **Evaluation criteria/evaluator quality** — Do the evaluators have proper granularity and grading to catch real issues? Are there rubber-stamp evaluators (all 1.0)? Are there flaky evaluators (high variance without code changes)? Are criteria too vague or too strict? + +3. **Application quality** — Based on the evaluation results, what are the application's strengths and weaknesses? Where does it produce high-quality output? Where does it fail? 
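+The aggregation described in 2a can be sketched as a small helper. This is hypothetical code: `aggregate` and the 1.0 pass threshold are assumptions (pixie does not prescribe a threshold); only the `evaluations.jsonl` per-entry layout from Phase 1 is taken as given.

```python
import glob
import json
import statistics
from collections import defaultdict

# Hypothetical helper for step 2a: aggregates evaluations.jsonl files
# across all entry directories of one dataset. The pass threshold (1.0)
# is an assumption, not a pixie convention.
def aggregate(dataset_dir: str, pass_threshold: float = 1.0) -> dict:
    scores: dict[str, list[float]] = defaultdict(list)
    for path in sorted(glob.glob(f"{dataset_dir}/entry-*/evaluations.jsonl")):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                if "score" in rec:  # skip any still-pending evaluations
                    scores[rec["evaluator"]].append(rec["score"])
    return {
        name: {
            "n": len(vals),
            "pass_rate": sum(s >= pass_threshold for s in vals) / len(vals),
            "mean": round(statistics.mean(vals), 3),
            "min": min(vals),
            "max": max(vals),
        }
        for name, vals in scores.items()
    }
```

The per-evaluator `pass_rate`, `mean`, `min`, and `max` feed directly into the statistics table and failure matrix of the dataset analysis.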
+ +For each hypothesis: + +- **State the hypothesis** clearly in one sentence +- **Cite the evidence** — entry indices, evaluator names, scores, reasoning quotes, trace data +- **Validate or invalidate** — look at the actual eval input/output data and code to confirm or refute +- **Conclusion** — what action does this hypothesis imply? + +It is always possible to produce 3 hypotheses even when the data is limited. If the evaluation data doesn't give a conclusive answer on application quality, that itself is a signal about test case or evaluator gaps. + +### 2c. Write the dataset analysis (two files) + +Produce **two files** for the dataset analysis. Write the detailed version first, then derive the summary. + +#### Detailed version: `dataset-{idx}/analysis.md` + +This file is for **agent consumption** — it provides the complete data aggregation, hypothesis formation with evidence chains, and validated conclusions that a coding agent can act on directly. + +**Writing principles:** + +- **Show all the data before interpreting it.** Start with the raw aggregation (pass/fail, per-evaluator stats, failure clusters) before any hypotheses. The data should stand on its own. +- **For each hypothesis, present: data → reasoning → conclusion.** The reader should be able to follow your logic step by step and arrive at the same conclusion independently. +- **Cross-reference entry analyses.** When citing evidence, reference the specific entry analysis file and the data points within it (e.g., "Entry 3 analysis shows FactualGrounding=0.5, caused by hallucinated author field — see `entry-3/analysis.md` §Output analysis"). +- **Distinguish correlation from causation.** If two entries fail the same evaluator, that's a pattern. But the root cause might differ — verify by checking the actual output data, don't assume. +- **Do not speculate without marking it.** If a conclusion is uncertain, say "Hypothesis (unvalidated): ..." and explain what additional data would confirm or refute it. 
+ +**Content:** + +1. **Overview** — dataset name, entry count, overall pass rate +2. **Raw aggregation data** + - Per-evaluator statistics table (pass rate, score range, mean, standard deviation) + - Failure matrix: entries × evaluators showing scores, highlighting failures + - Failure clusters: entries grouped by shared failed evaluators +3. **Hypothesis 1: Test cases** — hypothesis statement, evidence with entry/evaluator references, validation steps taken, conclusion with specific action +4. **Hypothesis 2: Evaluators** — same structure +5. **Hypothesis 3: Application** — same structure +6. **Open questions** — anything the data doesn't conclusively answer, with suggestions for what additional data would help + +#### Summary version: `dataset-{idx}/analysis-summary.md` + +This file is for **human review** — a scannable overview of the dataset results, key findings, and recommended actions. + +**Template:** + +```markdown +# Dataset Analysis — Summary + +**Dataset**: <name> | **Entries**: <count> | **Pass rate**: <X%> + +## Results at a glance + +| Evaluator | Pass rate | Avg score | Notes | +| --------- | --------- | --------- | ---------------------- | +| ... | ... | ... | | + +## Key findings + +1. <finding>: <1-2 sentences with the conclusion and its implication> +2. ... +3. ... + +## Recommended actions (priority order) + +1. <action>: <one-line rationale> +2. ... +3. ... +``` + +Maximum ~40 lines for the summary. + +--- + +## Phase 3: Action plan (two files) + +After all datasets are analyzed, produce the action plan. Write **two files** at the test run root. Write the detailed version first, then derive the summary. + +### Detailed version: `{PIXIE_ROOT}/results//action-plan.md` + +This file is for **agent consumption** — it provides specific, implementable improvement items with full evidence trails, so a coding agent can pick up any item and execute it without additional context-gathering.
+ +**Writing principles:** + +- **Each item must be self-contained.** A coding agent reading just one priority item should have enough context (evidence references, file paths, expected changes) to implement it. +- **Trace every item back to evidence.** Each priority must reference: which hypothesis (from which dataset analysis), which entries/evaluators provided the evidence, and what the specific data showed. +- **Be concrete about "How".** Don't say "improve the prompt" — say "In `scrapegraphai/prompts/generate_answer.py` line 45, add instruction: '...'". The more specific, the more actionable. +- **Do not include speculative items.** Every item must have validated evidence. If an item is based on an unvalidated hypothesis, either validate it first or exclude it. + +**Structure:** + +```markdown +# Action Plan (Detailed) + +## Summary + +- X datasets analyzed, Y total entries, Z% overall pass rate +- [1-2 sentence high-level assessment] + +## Priority 1: [Most impactful improvement] + +- **What**: [specific change to make] +- **Why**: [which hypothesis from which dataset analysis, with entry/evaluator references] +- **Evidence**: [specific scores, output excerpts, trace data that support this] +- **Expected impact**: [which entries/evaluators this will improve, and predicted score change] +- **How**: [concrete implementation steps with file paths and line numbers] +- **Verification**: [how to verify the fix worked — which entries to re-run, what scores to expect] + +## Priority 2: ... + +... +``` + +### Summary version: `{PIXIE_ROOT}/results//action-plan-summary.md` + +This file is for **human review** — a prioritized list of improvements that a human can understand and approve in under 2 minutes. + +**Template:** + +```markdown +# Action Plan — Summary + +**Overall**: <1-2 sentence assessment> + +## Actions (priority order) + +1. **<action>**: <what and why> +2. **<action>**: <what and why> +3. ... +``` + +Maximum ~30 lines for the summary.
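The aggregate figures quoted in these summaries (per-evaluator pass rates, overall pass rate) can be computed mechanically rather than estimated. A minimal sketch, assuming results have been exported as a list of entries each carrying named evaluator verdicts — the entry/evaluation shape here is illustrative, not the actual pixie results format:

```python
# Illustrative only — the {"evaluator": ..., "passed": ...} shape is an
# assumption about an exported results list, not the pixie export format.
from collections import defaultdict


def aggregate(entries: list[dict]) -> tuple[dict[str, float], float]:
    """Return (per-evaluator pass rate, overall entry pass rate)."""
    per_eval: dict[str, list[bool]] = defaultdict(list)
    for entry in entries:
        for ev in entry["evaluations"]:
            per_eval[ev["evaluator"]].append(ev["passed"])
    eval_pass_rates = {
        name: sum(flags) / len(flags) for name, flags in per_eval.items()
    }
    # Count an entry as passing only if every evaluator on it passed.
    overall = sum(
        all(ev["passed"] for ev in e["evaluations"]) for e in entries
    ) / len(entries)
    return eval_pass_rates, overall
```

Treating an entry as passing only when all of its evaluators pass is one reasonable convention; adjust the rule if the harness reports entry-level pass/fail differently.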
+ +**Prioritization criteria**: + +- Systemic issues (affecting multiple entries/datasets) before isolated ones +- Issues with clear, validated evidence before speculative ones +- Application quality gaps before evaluator refinements before test case additions +- Quick fixes before large refactors + +The action plan should have 3–5 items. Each must trace back to a validated hypothesis from Phase 2. Do not include items that are speculative or lack evidence. + +--- + +## Process summary + +1. **Phase 1** (per entry): Read data → grade pending evaluations → write `entry-{idx}/analysis.md` + `entry-{idx}/analysis-summary.md` +2. **Phase 2** (per dataset): Aggregate → form 3 hypotheses → validate → write `dataset-{idx}/analysis.md` + `dataset-{idx}/analysis-summary.md` +3. **Phase 3** (per test run): Synthesize → prioritize → write `action-plan.md` + `action-plan-summary.md` + +Process entries within a dataset concurrently (using subagents if available). Process phases sequentially — Phase 2 depends on Phase 1 outputs, Phase 3 depends on Phase 2 outputs. diff --git a/skills/eval-driven-dev/references/6-investigate.md b/skills/eval-driven-dev/references/6-investigate.md deleted file mode 100644 index 15e6a2cb6..000000000 --- a/skills/eval-driven-dev/references/6-investigate.md +++ /dev/null @@ -1,164 +0,0 @@ -# Investigation and Iteration - -This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes. - ---- - -## STOP — check before proceeding - -**Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.** - -**Continue immediately** if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below. - -**Otherwise, STOP here.** Report the test results to the user: - -> "QA setup is complete. 
Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?" - -**Do not proceed with investigation until the user confirms.** This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration. - ---- - -## Step-by-step investigation - -When the user has confirmed (or their original prompt was explicitly about iteration), proceed: - -### 1. Read the analysis - -Start by reading the analysis generated in Step 5. The analysis files are at `{PIXIE_ROOT}/results//dataset-.md`. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues. - -### 2. Get detailed test output - -```bash -pixie test -v # shows score and reasoning per case -``` - -Capture the full verbose output. For each failing case, note: - -- The `entry_kwargs` (what was sent) -- The `the captured output` (what the app produced) -- The `expected_output` (what was expected, if applicable) -- The evaluator score and reasoning - -### 3. Inspect the trace data - -For each failing case, look up the full trace to see what happened inside the app: - -```python -from pixie import DatasetStore - -store = DatasetStore() -ds = store.get("") -for i, item in enumerate(ds.items): - print(i, item.eval_metadata) # trace_id is here -``` - -Then inspect the full span tree: - -```python -import asyncio -from pixie import ObservationStore - -async def inspect(trace_id: str): - store = ObservationStore() - roots = await store.get_trace(trace_id) - for root in roots: - print(root.to_text()) # full span tree: inputs, outputs, LLM messages - -asyncio.run(inspect("the-trace-id-here")) -``` - -### 4. Root-cause analysis - -Walk through the trace and identify exactly where the failure originates. 
Common patterns: - -**LLM-related failures** (fix with prompt/model/eval changes): - -| Symptom | Likely cause | -| ------------------------------------------------------ | ------------------------------------------------------------- | -| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully | -| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous | -| Output format is wrong | Missing format instructions in prompt | -| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage | - -**Non-LLM failures** (fix with traditional code changes, out of eval scope): - -| Symptom | Likely cause | -| ------------------------------------------------- | ------------------------------------------------------- | -| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval | -| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code | -| Database returned stale/wrong records | Data issue — fix independently | -| API call failed with error | Infrastructure issue | - -For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly. - -### 5. Document findings - -**Every failure investigation should be documented** alongside the fix. Include: - -````markdown -### — failure investigation - -**Dataset**: `qa-golden-set` -**Result**: 3/5 cases passed (60%) - -#### Failing case 1: "What rows have extra legroom?" - -- **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}` -- **the captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..." 
-- **expected_output**: "rows 5-8 Economy Plus with extra legroom" -- **Evaluator score**: 0.1 (Factuality) -- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..." - -**Trace analysis**: -Inspected trace `abc123`. The span tree shows: - -1. Triage Agent routed to FAQ Agent ✓ -2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓ -3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause** - -**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching. -The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`. -The question "What rows have extra legroom?" contains none of these keywords, so it -falls through to the default "I don't know" response. - -**Classification**: Non-LLM failure — the keyword-matching tool is broken. -The LLM agent correctly routed to the FAQ agent and used the tool; the tool -itself returned wrong data. - -**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in -`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix, -not an eval/prompt change. - -**Verification**: After fix, re-run: - -```bash -pixie test -v # verify -``` -```` - -### 6. Fix and re-run - -Make the targeted change, update the dataset if needed, and re-run: - -```bash -pixie test -v -``` - -After fixes stabilize, run analysis again to see if the patterns have changed: - -```bash -pixie analyze -``` - ---- - -## The iteration cycle - -1. Read analysis from Step 6 → prioritize failures -2. Run tests verbose → identify specific failures -3. Investigate each failure → classify as LLM vs. non-LLM -4. For LLM failures: adjust prompts, model, or eval criteria -5. For non-LLM failures: recommend or apply code fix -6. Update dataset if the fix changed app behavior -7. Re-run tests and analysis -8. 
Repeat until passing or user is satisfied diff --git a/skills/eval-driven-dev/references/evaluators.md b/skills/eval-driven-dev/references/evaluators.md index 4e9cce89a..4982c4e72 100644 --- a/skills/eval-driven-dev/references/evaluators.md +++ b/skills/eval-driven-dev/references/evaluators.md @@ -1,7 +1,7 @@ # Built-in Evaluators > Auto-generated from pixie source code docstrings. -> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository. +> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`. Autoevals adapters — pre-made evaluators wrapping `autoevals` scorers. @@ -30,13 +30,14 @@ Public API (all are also re-exported from `pixie.evals`): Choose evaluators based on the **output type** and eval criteria: -| Output type | Evaluator category | Examples | -| -------------------------------------------- | ----------------------------------------------------------- | ------------------------------------- | -| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction | -| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries | -| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines | -| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks | -| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone | +| Output type | Evaluator category | Examples | +| -------------------------------------------- | ----------------------------------------------------------- | -------------------------------------- | +| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction | +| Open-ended text with a reference answer | LLM-as-judge: 
`Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries | +| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines | +| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks | +| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone | +| Trace-dependent quality (tool use, routing) | Agent evaluator via `create_agent_evaluator` | Tool correctness, multi-step reasoning | Critical rules: @@ -529,3 +530,56 @@ An evaluator callable satisfying the `Evaluator` protocol. Raises: ValueError: If the template uses nested field access like `{eval_input[key]}` (only top-level placeholders are supported). + +### `create_agent_evaluator` + +```python +create_agent_evaluator(name: 'str', criteria: 'str') -> '_AgentEvaluator' +``` + +Create an evaluator whose grading is deferred to a coding agent. + +During `pixie test`, agent evaluators are not scored automatically. +Instead, they raise `AgentEvaluationPending` and record a +`PendingEvaluation` with the evaluation criteria. The coding agent +(guided by Step 5d) reviews each entry's trace and output, then +grades the pending evaluations. + +**When to use**: Quality dimensions that require holistic review of +the LLM trace — tool call correctness, multi-step reasoning quality, +routing decisions — where an automated LLM-as-judge prompt can't +capture the nuance. + +**When NOT to use**: Simple text quality checks (use +`create_llm_evaluator` instead), deterministic checks (use heuristic +evaluators), or any criterion that can be scored from input + output +alone without trace context. + +Args: +name: Display name for the evaluator (shown in scorecard as ⏳ pending). +criteria: What to evaluate — the grading instructions the agent +will follow when reviewing results. Be specific and actionable. + +Returns: +An evaluator callable satisfying the `Evaluator` protocol. 
Its +`__call__` raises `AgentEvaluationPending` instead of returning an +`Evaluation`. + +Example: + +```python +from pixie import create_agent_evaluator + +ResponseQuality = create_agent_evaluator( + name="ResponseQuality", + criteria="The response directly addresses the user's question with " + "accurate, well-structured information. No hallucinations " + "or off-topic content.", +) + +ToolUsageCorrectness = create_agent_evaluator( + name="ToolUsageCorrectness", + criteria="The app called the correct tools in the right order based " + "on the user's intent. No unnecessary or missed tool calls.", +) +``` diff --git a/skills/eval-driven-dev/references/runnable-examples/cli-app.md b/skills/eval-driven-dev/references/runnable-examples/cli-app.md new file mode 100644 index 000000000..df2a66682 --- /dev/null +++ b/skills/eval-driven-dev/references/runnable-examples/cli-app.md @@ -0,0 +1,64 @@ +# Runnable Example: CLI Application + +**When the app is invoked from the command line** (e.g., `python -m myapp`, a CLI tool with argparse/click). + +**Approach**: Use `asyncio.create_subprocess_exec` to invoke the CLI and capture output. 
+ +```python +# pixie_qa/run_app.py +import asyncio +import sys + +from pydantic import BaseModel +import pixie + + +class AppArgs(BaseModel): + query: str + + +class AppRunnable(pixie.Runnable[AppArgs]): + """Drives a CLI application via subprocess.""" + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def run(self, args: AppArgs) -> None: + proc = await asyncio.create_subprocess_exec( + sys.executable, "-m", "myapp", "--query", args.query, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120) + if proc.returncode != 0: + raise RuntimeError(f"App failed (exit {proc.returncode}): {stderr.decode()}") +``` + +## When the CLI needs patched dependencies + +If the CLI reads from external services, create a wrapper entry point that patches dependencies before running the real CLI: + +```python +# pixie_qa/patched_app.py +"""Entry point that patches external deps before running the real CLI.""" +import myapp.config as config +config.redis_url = "mock://localhost" + +from myapp.main import main +main() +``` + +Then point your Runnable at the wrapper: + +```python +async def run(self, args: AppArgs) -> None: + proc = await asyncio.create_subprocess_exec( + sys.executable, "-m", "pixie_qa.patched_app", "--query", args.query, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120) +``` + +**Note**: For CLI apps, `wrap(purpose="input")` injection only works when the app runs in the same process. If using subprocess, you may need to pass test data via environment variables or config files instead. 
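One way to pass test data across the process boundary is an environment variable. A minimal sketch — the `PIXIE_QA_INPUT` variable name and the `load_test_input` helper are illustrative, not part of the pixie API: the Runnable serialises the entry's test data to JSON in the subprocess environment, and the wrapper entry point reads it back before invoking the real CLI.

```python
# pixie_qa/patched_app.py — illustrative sketch; PIXIE_QA_INPUT and
# load_test_input are hypothetical names, not part of the pixie API.
import json
import os


def load_test_input() -> dict:
    """Read JSON-encoded test data the Runnable placed in the environment."""
    return json.loads(os.environ.get("PIXIE_QA_INPUT", "{}"))


# On the Runnable side, pass the data when spawning the subprocess, e.g.:
#   env = {**os.environ, "PIXIE_QA_INPUT": json.dumps({"docs": ["..."]})}
#   proc = await asyncio.create_subprocess_exec(..., env=env)
```

The wrapper then patches the relevant dependency with the loaded data before calling the CLI's `main()`, mirroring the `patched_app.py` pattern above.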
diff --git a/skills/eval-driven-dev/references/runnable-examples/fastapi-web-server.md b/skills/eval-driven-dev/references/runnable-examples/fastapi-web-server.md new file mode 100644 index 000000000..2dba0801e --- /dev/null +++ b/skills/eval-driven-dev/references/runnable-examples/fastapi-web-server.md @@ -0,0 +1,126 @@ +# Runnable Example: FastAPI / Web Server + +**When the app is a web server** (FastAPI, Flask, Starlette) and you need to exercise the full HTTP request pipeline. + +**Approach**: Use `httpx.AsyncClient` with `ASGITransport` to run the ASGI app in-process. This is the fastest and most reliable approach — no subprocess, no port management. + +```python +# pixie_qa/run_app.py +import httpx +from pydantic import BaseModel +import pixie + + +class AppArgs(BaseModel): + user_message: str + + +class AppRunnable(pixie.Runnable[AppArgs]): + """Drives a FastAPI app via in-process ASGI transport.""" + + _client: httpx.AsyncClient + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def setup(self) -> None: + from myapp.main import app # your FastAPI/Starlette app instance + + transport = httpx.ASGITransport(app=app) + self._client = httpx.AsyncClient(transport=transport, base_url="http://test") + + async def run(self, args: AppArgs) -> None: + await self._client.post("/chat", json={"message": args.user_message}) + + async def teardown(self) -> None: + await self._client.aclose() +``` + +## ASGITransport skips lifespan events + +`httpx.ASGITransport` does **not** trigger ASGI lifespan events (`startup` / `shutdown`). 
If the app initializes resources in its lifespan (database connections, caches, service clients), you must replicate that initialization manually in `setup()`: + +```python +async def setup(self) -> None: + # Manually replicate what the app's lifespan does + from myapp.db import get_connection, init_db, seed_data + import myapp.main as app_module + + conn = get_connection() + init_db(conn) + seed_data(conn) + app_module.db_conn = conn # set the module-level global the app expects + + transport = httpx.ASGITransport(app=app_module.app) + self._client = httpx.AsyncClient(transport=transport, base_url="http://test") + +async def teardown(self) -> None: + await self._client.aclose() + # Clean up the manually-initialized resources + import myapp.main as app_module + if hasattr(app_module, "db_conn") and app_module.db_conn: + app_module.db_conn.close() +``` + +## Concurrency with shared mutable state + +If the app uses shared mutable state (in-memory SQLite, file-based DB, global caches), add a semaphore to serialise access: + +```python +import asyncio + +class AppRunnable(pixie.Runnable[AppArgs]): + _client: httpx.AsyncClient + _sem: asyncio.Semaphore + + @classmethod + def create(cls) -> "AppRunnable": + inst = cls() + inst._sem = asyncio.Semaphore(1) + return inst + + async def setup(self) -> None: + from myapp.main import app + transport = httpx.ASGITransport(app=app) + self._client = httpx.AsyncClient(transport=transport, base_url="http://test") + + async def run(self, args: AppArgs) -> None: + async with self._sem: + await self._client.post("/chat", json={"message": args.user_message}) + + async def teardown(self) -> None: + await self._client.aclose() +``` + +Only use the semaphore when needed — if the app uses per-session state keyed by unique IDs (call_sid, session_id), concurrent calls are naturally isolated and no lock is needed. 
+ +## Alternative: External server with httpx + +When the app can't be imported directly (complex startup, `uvicorn.run()` in `__main__`), start it as a subprocess and hit it with HTTP: + +```python +class AppRunnable(pixie.Runnable[AppArgs]): + _client: httpx.AsyncClient + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def setup(self) -> None: + # Assumes the server is already running (started via run-with-timeout.sh) + self._client = httpx.AsyncClient(base_url="http://localhost:8000") + + async def run(self, args: AppArgs) -> None: + await self._client.post("/chat", json={"message": args.user_message}) + + async def teardown(self) -> None: + await self._client.aclose() +``` + +Start the server before running `pixie trace` or `pixie test`: + +```bash +bash resources/run-with-timeout.sh 120 uv run python -m myapp.server +sleep 3 # wait for readiness +``` diff --git a/skills/eval-driven-dev/references/runnable-examples/standalone-function.md b/skills/eval-driven-dev/references/runnable-examples/standalone-function.md new file mode 100644 index 000000000..a9061656f --- /dev/null +++ b/skills/eval-driven-dev/references/runnable-examples/standalone-function.md @@ -0,0 +1,60 @@ +# Runnable Example: Standalone Function (No Server) + +**When the app is a plain Python function or module** — no web framework, no server, no infrastructure. + +**Approach**: Import and call the function directly from `run()`. This is the simplest case. 
+ +```python +# pixie_qa/run_app.py +from pydantic import BaseModel +import pixie + + +class AppArgs(BaseModel): + question: str + + +class AppRunnable(pixie.Runnable[AppArgs]): + """Drives a standalone function for tracing and evaluation.""" + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def run(self, args: AppArgs) -> None: + from myapp.agent import answer_question + await answer_question(args.question) +``` + +If the function is synchronous, wrap it with `asyncio.to_thread`: + +```python +import asyncio + +async def run(self, args: AppArgs) -> None: + from myapp.agent import answer_question + await asyncio.to_thread(answer_question, args.question) +``` + +If the function depends on an external service (e.g., a vector store), the `wrap(purpose="input")` calls you added in Step 2a handle it automatically — the registry injects test data in eval mode. + +### When to use `setup()` / `teardown()` + +Most standalone functions don't need lifecycle methods. Use them only when the function requires a shared resource (e.g., a pre-loaded embedding model, a database connection): + +```python +class AppRunnable(pixie.Runnable[AppArgs]): + _model: SomeModel + + @classmethod + def create(cls) -> "AppRunnable": + return cls() + + async def setup(self) -> None: + from myapp.models import load_model + self._model = load_model() + + async def run(self, args: AppArgs) -> None: + from myapp.agent import answer_question + await answer_question(args.question, model=self._model) +``` diff --git a/skills/eval-driven-dev/references/testing-api.md b/skills/eval-driven-dev/references/testing-api.md index 29a091d76..d6ac5e392 100644 --- a/skills/eval-driven-dev/references/testing-api.md +++ b/skills/eval-driven-dev/references/testing-api.md @@ -1,7 +1,7 @@ # Testing API Reference > Auto-generated from pixie source code docstrings. -> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository. 
+> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`. pixie.evals — evaluation harness for LLM applications. @@ -16,11 +16,11 @@ The dataset is a JSON object with these top-level fields: ```json { "name": "customer-faq", - "runnable": "pixie_qa/scripts/run_app.py:AppRunnable", + "runnable": "pixie_qa/run_app.py:AppRunnable", "evaluators": ["Factuality"], "entries": [ { - "entry_kwargs": { "question": "Hello" }, + "input_data": { "question": "Hello" }, "description": "Basic greeting", "eval_input": [{ "name": "input", "value": "Hello" }], "expectation": "A friendly greeting that offers to help", @@ -36,8 +36,8 @@ All fields are top-level on each entry (flat structure — no nesting): ``` entry: - ├── entry_kwargs (required) — args for Runnable.run() - ├── eval_input (required) — list of {"name": ..., "value": ...} objects + ├── input_data (required) — args for Runnable.run() + ├── eval_input (optional) — list of {"name": ..., "value": ...} objects (default: []) ├── description (required) — human-readable label for the test case ├── expectation (optional) — reference for comparison-based evaluators ├── eval_metadata (optional) — extra per-entry data for custom evaluators @@ -50,13 +50,14 @@ entry: subclass that drives the app during evaluation. - `evaluators` (dataset-level, optional): Default evaluator names — applied to every entry that does not declare its own `evaluators`. -- `entries[].entry_kwargs` (required): Kwargs passed to `Runnable.run()` as a +- `entries[].input_data` (required): Kwargs passed to `Runnable.run()` as a Pydantic model. Keys must match the fields of the Pydantic model used in `run(args: T)`. - `entries[].description` (required): Human-readable label for the test case. -- `entries[].eval_input` (required): List of `{"name": ..., "value": ...}` +- `entries[].eval_input` (optional, default `[]`): List of `{"name": ..., "value": ...}` objects. 
Used to populate the wrap input registry — `wrap(purpose="input")` - calls in the app return registry values keyed by `name`. + calls in the app return registry values keyed by `name`. The runner + automatically prepends `input_data` when building the `Evaluable`. - `entries[].expectation` (optional): Concise expectation description for comparison-based evaluators. Should describe what a correct output looks like, **not** copy the verbatim output. Use `pixie format` on the trace to @@ -111,13 +112,13 @@ class Evaluable(TestCase): Data carrier for evaluators. Extends `TestCase` with actual output. -- `eval_input` — `list[NamedData]` populated from the entry's `eval_input` field. **Must have at least one item** (`min_length=1`). +- `eval_input` — `list[NamedData]` populated from the entry's `eval_input` field plus `input_data` (prepended by the runner). Always has at least one item. - `eval_output` — `list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values captured during the run. Each item has `.name` (str) and `.value` (JsonValue). Use `_get_output(evaluable, "name")` to look up by name. - `eval_metadata` — `dict[str, JsonValue] | None` from the entry's `eval_metadata` field - `expected_output` — expectation text from dataset (or `UNSET` if not provided) Attributes: -eval_input: Named input data items (from dataset). Must be non-empty. +eval_input: Named input data items (from dataset + input_data prepended by runner). Always non-empty. eval_output: Named output data items (from wrap calls during run). Each item has `.name` (str) and `.value` (JsonValue). Contains ALL `wrap(purpose="output")` and `wrap(purpose="state")` values. @@ -149,7 +150,7 @@ def _get_output(evaluable: Evaluable, name: str) -> Any: return None ``` -**`eval_metadata`** is for passing extra per-entry data to evaluators that isn't an app input or output — e.g., expected tool names, boolean flags, thresholds. 
Defined as a top-level field on the entry, accessed as `evaluable.eval_metadata`. +**`eval_metadata`** is for passing extra per-entry data to evaluators that isn't an input data or output — e.g., expected tool names, boolean flags, thresholds. Defined as a top-level field on the entry, accessed as `evaluable.eval_metadata`. **Complete custom evaluator example** (tool call check + dataset entry): @@ -179,7 +180,7 @@ Corresponding dataset entry: ```json { - "entry_kwargs": { "user_message": "I want to end this call" }, + "input_data": { "user_message": "I want to end this call" }, "description": "User requests call end after failed verification", "eval_input": [{ "name": "user_input", "value": "I want to end this call" }], "expectation": "Agent should call endCall tool", diff --git a/skills/eval-driven-dev/references/wrap-api.md b/skills/eval-driven-dev/references/wrap-api.md index 574ffadd1..e65c5eb1d 100644 --- a/skills/eval-driven-dev/references/wrap-api.md +++ b/skills/eval-driven-dev/references/wrap-api.md @@ -1,7 +1,7 @@ # Wrap API Reference > Auto-generated from pixie source code docstrings. -> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository. +> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`. `pixie.wrap` — data-oriented observation API. @@ -24,7 +24,7 @@ processing pipeline. Its behavior depends on the active mode: | Command | Description | | ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | | `pixie trace --runnable --input --output ` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). | -| `pixie format ` | Convert a trace file to a formatted dataset entry template. 
Shows `entry_kwargs`, `eval_input`, and `eval_output` (the real captured output). | +| `pixie format ` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). | | `pixie trace filter --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. | --- @@ -42,8 +42,8 @@ class pixie.Runnable(Protocol[T]): async def teardown(self) -> None: ... ``` -Protocol for structured runnables used by the dataset runner. `T` is a -`pydantic.BaseModel` subclass whose fields match the `entry_kwargs` keys +Protocol for structured runnables used by the evaluation harness. `T` is a +`pydantic.BaseModel` subclass whose fields match the `input_data` keys in the dataset JSON. Lifecycle: @@ -54,7 +54,7 @@ Lifecycle: Optional — has a default no-op implementation. 3. `run(args)` — **async**, called **concurrently for each dataset entry** (up to 4 entries in parallel). `args` is a validated Pydantic model - built from `entry_kwargs`. Invoke the application's real entry point. + built from `input_data`. Invoke the application's real entry point. 4. `teardown()` — **async**, called **once** after the last `run()` call. Release any resources acquired in `setup()`. Optional — has a default no-op implementation. @@ -72,7 +72,7 @@ class AppRunnable(pixie.Runnable[AppArgs]): _sem: asyncio.Semaphore @classmethod - def create(cls) -> AppRunnable: + def create(cls) -> "AppRunnable": inst = cls() inst._sem = asyncio.Semaphore(1) # serialise DB access return inst @@ -96,8 +96,7 @@ reference project modules (e.g., `from app import service`). 
**Example**: ```python -# pixie_qa/scripts/run_app.py -from __future__ import annotations +# pixie_qa/run_app.py from pydantic import BaseModel import pixie @@ -106,7 +105,7 @@ class AppArgs(BaseModel): class AppRunnable(pixie.Runnable[AppArgs]): @classmethod - def create(cls) -> AppRunnable: + def create(cls) -> "AppRunnable": return cls() async def run(self, args: AppArgs) -> None: @@ -128,7 +127,7 @@ class AppRunnable(pixie.Runnable[AppArgs]): _client: httpx.AsyncClient @classmethod - def create(cls) -> AppRunnable: + def create(cls) -> "AppRunnable": return cls() async def setup(self) -> None: diff --git a/skills/eval-driven-dev/resources/setup.sh b/skills/eval-driven-dev/resources/setup.sh index 572366787..e4de415e2 100755 --- a/skills/eval-driven-dev/resources/setup.sh +++ b/skills/eval-driven-dev/resources/setup.sh @@ -2,21 +2,74 @@ # Setup script for eval-driven-dev skill. # Updates the skill, installs/upgrades pixie-qa[all], initializes the # pixie working directory, and starts the web UI server in the background. -# Failures are non-fatal — the workflow continues even if a step here is -# blocked by the environment. 
+# +# Error handling: +# - Skill update failure → non-fatal (continue with existing version) +# - pixie-qa upgrade failure when already installed → non-fatal +# - pixie-qa NOT installed and install fails → FATAL (exit 1) +# - pixie init failure → FATAL (exit 1) +# - pixie start failure → FATAL (exit 1) set -u echo "=== Updating skill ===" -npx skills update || echo "(skill update skipped)" +npx skills update yiouli/pixie-qa -g -y && npx skills update yiouli/pixie-qa -p -y || { + echo "(skill update failed — proceeding with existing version)" +} echo "" echo "=== Installing / upgrading pixie-qa[all] ===" + +# Helper: check if pixie CLI is importable +_pixie_available() { + if [ -f uv.lock ]; then + uv run python -c "import pixie" 2>/dev/null + elif [ -f poetry.lock ]; then + poetry run python -c "import pixie" 2>/dev/null + else + python -c "import pixie" 2>/dev/null + fi +} + +# Check if pixie is already installed before attempting upgrade +PIXIE_WAS_INSTALLED=false +if _pixie_available; then + PIXIE_WAS_INSTALLED=true +fi + +INSTALL_OK=false if [ -f uv.lock ]; then - uv add "pixie-qa[all]>=0.6.1,<0.7.0" --upgrade + # uv add does universal resolution across all Python versions in + # requires-python. If the host project supports a Python version + # where pixie-qa is unavailable (e.g. 3.10), uv add fails. + # Fall back to uv pip install which only targets the active interpreter. 
+ if uv add "pixie-qa[all]>=0.8.1,<0.9.0" --upgrade 2>&1; then + INSTALL_OK=true + else + echo "(uv add failed — falling back to uv pip install)" + if uv pip install "pixie-qa[all]>=0.8.1,<0.9.0" 2>&1; then + INSTALL_OK=true + fi + fi elif [ -f poetry.lock ]; then - poetry add "pixie-qa[all]>=0.6.1,<0.7.0" + if poetry add "pixie-qa[all]>=0.8.1,<0.9.0"; then + INSTALL_OK=true + fi else - pip install --upgrade "pixie-qa[all]>=0.6.1,<0.7.0" + if pip install --upgrade "pixie-qa[all]>=0.8.1,<0.9.0"; then + INSTALL_OK=true + fi +fi + +if [ "$INSTALL_OK" = false ]; then + if [ "$PIXIE_WAS_INSTALLED" = true ]; then + echo "(pixie-qa upgrade failed — proceeding with existing version)" + else + echo "" + echo "ERROR: pixie-qa is not installed and installation failed." + echo "The eval-driven-dev workflow requires the pixie-qa package." + echo "Please install it manually and re-run this script." + exit 1 + fi fi echo "" @@ -29,6 +82,13 @@ else pixie init fi +if [ $? -ne 0 ]; then + echo "" + echo "ERROR: Failed to initialize pixie working directory." + echo "Please check the error above and fix it before continuing." + exit 1 +fi + echo "" echo "=== Starting web UI server (background) ===" if [ -f uv.lock ]; then @@ -39,5 +99,12 @@ else pixie start fi +if [ $? -ne 0 ]; then + echo "" + echo "ERROR: Failed to start the web UI server." + echo "Please check the error above and fix it before continuing." 
+ exit 1 +fi + echo "" echo "=== Setup complete ===" From 32f91ac190850d1010e1bbd5d81cde00f0d47947 Mon Sep 17 00:00:00 2001 From: yiouli Date: Fri, 17 Apr 2026 15:40:32 -0700 Subject: [PATCH 2/3] fix: update skill update command to use correct repository path --- skills/eval-driven-dev/resources/setup.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/skills/eval-driven-dev/resources/setup.sh b/skills/eval-driven-dev/resources/setup.sh index e4de415e2..9fd04863b 100755 --- a/skills/eval-driven-dev/resources/setup.sh +++ b/skills/eval-driven-dev/resources/setup.sh @@ -12,7 +12,7 @@ set -u echo "=== Updating skill ===" -npx skills update yiouli/pixie-qa -g -y && npx skills update yiouli/pixie-qa -p -y || { +npx skills update github/awesome-copilot --skill eval-driven-dev -g -y && npx skills update github/awesome-copilot --skill eval-driven-dev -p -y || { echo "(skill update failed — proceeding with existing version)" } From fca9a9c3f3381c30646664fe5ebf201881634ba5 Mon Sep 17 00:00:00 2001 From: yiouli Date: Fri, 17 Apr 2026 15:55:38 -0700 Subject: [PATCH 3/3] address comments. 
--- .../references/2c-capture-and-verify-trace.md | 2 +- .../eval-driven-dev/references/3-define-evaluators.md | 8 ++++---- skills/eval-driven-dev/references/4-build-dataset.md | 4 +++- skills/eval-driven-dev/references/evaluators.md | 2 +- skills/eval-driven-dev/references/wrap-api.md | 10 +++++----- 5 files changed, 14 insertions(+), 12 deletions(-) diff --git a/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md b/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md index 93816dd26..acb4b33ff 100644 --- a/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md +++ b/skills/eval-driven-dev/references/2c-capture-and-verify-trace.md @@ -87,7 +87,7 @@ Check that: Run `pixie format` to see the data in dataset-entry format: ```bash -uv run pixie format +pixie format --input trace.jsonl --output dataset_entry.json ``` The output shows: diff --git a/skills/eval-driven-dev/references/3-define-evaluators.md b/skills/eval-driven-dev/references/3-define-evaluators.md index 4212d80ec..1a79df79c 100644 --- a/skills/eval-driven-dev/references/3-define-evaluators.md +++ b/skills/eval-driven-dev/references/3-define-evaluators.md @@ -11,7 +11,7 @@ For each eval criterion, choose an evaluator using this decision order: 1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog. -2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?" +2. 
**Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 6, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?" 3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead. **Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural. @@ -26,7 +26,7 @@ If any criterion requires a custom evaluator, implement it now. Place custom eva ### Agent evaluators (`create_agent_evaluator`) — the default -Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling. +Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. 
These are graded by you (the coding agent) in Step 6, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling. ```python from pixie import create_agent_evaluator @@ -56,9 +56,9 @@ schema_compliance = create_agent_evaluator( Reference agent evaluators in the dataset via `filepath:callable_name` (e.g., `"pixie_qa/evaluators.py:extraction_accuracy"`). -During `pixie test`, agent evaluators show as `⏳` in the console. They are graded in Step 5d. +During `pixie test`, agent evaluators show as `⏳` in the console. They are graded in Step 6. -**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 5d. Make it specific and actionable: +**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 6. Make it specific and actionable: - **Bad**: "Check if the output is good" — too vague to grade consistently - **Bad**: "The response should be accurate" — doesn't say what to compare against diff --git a/skills/eval-driven-dev/references/4-build-dataset.md b/skills/eval-driven-dev/references/4-build-dataset.md index c6549db3d..c1a656604 100644 --- a/skills/eval-driven-dev/references/4-build-dataset.md +++ b/skills/eval-driven-dev/references/4-build-dataset.md @@ -135,7 +135,7 @@ Then include the captured content in the entry's `eval_input`: For each set of `input_data`, run `pixie trace` to execute the app with real dependencies and capture all values: ```bash -uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input '{"prompt": "...", "source": "..."}' +pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json ``` Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`. @@ -213,10 +213,12 @@ Before writing the final dataset JSON, perform this self-audit: 2. 
**Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing. 3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode). + - **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine. - **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one. 4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry. + - **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3. - If coverage is below threshold, add entries targeting the uncovered capabilities. diff --git a/skills/eval-driven-dev/references/evaluators.md b/skills/eval-driven-dev/references/evaluators.md index 4982c4e72..b7a409d2d 100644 --- a/skills/eval-driven-dev/references/evaluators.md +++ b/skills/eval-driven-dev/references/evaluators.md @@ -542,7 +542,7 @@ Create an evaluator whose grading is deferred to a coding agent. During `pixie test`, agent evaluators are not scored automatically. Instead, they raise `AgentEvaluationPending` and record a `PendingEvaluation` with the evaluation criteria. The coding agent -(guided by Step 5d) reviews each entry's trace and output, then +(guided by Step 6) reviews each entry's trace and output, then grades the pending evaluations. 
**When to use**: Quality dimensions that require holistic review of diff --git a/skills/eval-driven-dev/references/wrap-api.md b/skills/eval-driven-dev/references/wrap-api.md index e65c5eb1d..bda76d79b 100644 --- a/skills/eval-driven-dev/references/wrap-api.md +++ b/skills/eval-driven-dev/references/wrap-api.md @@ -21,11 +21,11 @@ processing pipeline. Its behavior depends on the active mode: ## CLI Commands -| Command | Description | -| ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | -| `pixie trace --runnable --input --output ` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). | -| `pixie format ` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). | -| `pixie trace filter --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. | +| Command | Description | +| ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `pixie trace --runnable --input --output ` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). | +| `pixie format --input --output ` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). | +| `pixie trace filter --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. | ---
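Two renames in this patch interact: `pixie trace --input` takes a **file path** rather than inline JSON, and the dataset field is now `input_data` instead of `entry_kwargs`. A stdlib-only sketch of preparing both, reusing the example values and field names that appear in this patch (the real dataset schema may carry additional fields such as `eval_metadata` and the evaluator references):

```python
import json
import os
import tempfile

# `pixie trace --input` expects a file path, not inline JSON, so the
# Runnable kwargs go into a small JSON file first (filename matches the
# `trace-input.json` example used in 4-build-dataset.md).
trace_input = {"user_message": "I want to end this call"}

input_path = os.path.join(tempfile.mkdtemp(), "trace-input.json")
with open(input_path, "w") as f:
    json.dump(trace_input, f)

# Dataset-entry shape after the `entry_kwargs` -> `input_data` rename,
# mirroring the entry shown in 4-build-dataset.md above.
entry = {
    "input_data": trace_input,
    "description": "User requests call end after failed verification",
    "eval_input": [{"name": "user_input", "value": trace_input["user_message"]}],
    "expectation": "Agent should call endCall tool",
}

print(json.dumps(entry, indent=2))
```

The written file is what a command like `pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json` would consume, and the printed entry approximates the template `pixie format` emits from the resulting trace.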