MarcKarbowiak/ai-evaluation-harness

AI Evaluation Harness (Production‑Minded, Minimal)

A lightweight evaluation harness for LLM features that produce structured JSON outputs.

It provides:

  • Schema validation
  • Regression scoring (Exact Match + F1)
  • Deterministic mock adapter for CI
  • OpenAI / Azure OpenAI support
  • Quality gates (threshold-based failure)
  • Baseline regression protection

The goal is simple: treat AI features like production software — measurable, testable, and safe to evolve.


Why This Exists

LLM-powered features regress easily when:

  • prompts change
  • models are upgraded
  • decoding parameters change
  • schemas evolve
  • retrieval / RAG logic changes

This harness ensures changes are:

  • measurable
  • reproducible
  • enforceable in CI

Architecture Overview

flowchart LR
  D["Dataset (JSONL)"] --> R["Runner"]
  P["Prompt (vN)"] --> R
  R --> A["Adapter\nmock | openai | azure"]
  A --> O["Structured Output (JSON)"]

  S["JSON Schema"] --> V["Schema Validation"]
  O --> V

  O --> M["Scoring\nExact Match + F1"]
  V --> M

  B["Baseline"] --> G["Regression Gates"]
  M --> G

  G --> RPT["Report (JSON)"]
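The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: `MockAdapter`, `is_schema_valid`, `run_cases`, and the `input` / `expected` dataset fields are hypothetical names, and the schema check only tests required keys and basic types rather than full JSON Schema.

```python
import hashlib
import json

# Hypothetical, simplified schema: required keys mapped to expected Python types.
# A real run would use a full JSON Schema validator instead.
REQUIRED_FIELDS = {"title": str, "priority": str}


class MockAdapter:
    """Deterministic stand-in for a model call: same input, same output."""

    def complete(self, prompt: str, text: str) -> dict:
        # Derive the output from a hash of the input so runs are repeatable.
        digest = hashlib.sha256((prompt + text).encode("utf-8")).hexdigest()
        priority = "high" if int(digest, 16) % 2 == 0 else "low"
        return {"title": text.strip()[:40], "priority": priority}


def is_schema_valid(output: dict) -> bool:
    """Check required keys and types (a stand-in for JSON Schema validation)."""
    return all(
        isinstance(output.get(key), expected_type)
        for key, expected_type in REQUIRED_FIELDS.items()
    )


def run_cases(jsonl_text: str, prompt: str, adapter: MockAdapter) -> list[dict]:
    """Run every JSONL case through the adapter and record per-case results."""
    results = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the dataset
        case = json.loads(line)
        output = adapter.complete(prompt, case["input"])
        results.append(
            {
                "id": case["id"],
                "output": output,
                "schema_valid": is_schema_valid(output),
                "exact_match": output == case["expected"],
            }
        )
    return results
```

Because the mock output depends only on the prompt and the input text, two runs over the same dataset produce identical results, which is what makes the mock adapter usable as a CI gate.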

Repository Structure

ai-evaluation-harness/
  .github/workflows/ci.yml
  baselines/
  datasets/
  prompts/
  reports/
  schemas/
  src/eval_harness/
  tests/
  run.ps1
  test.ps1

Quickstart (Windows / PowerShell)

This repository includes helper scripts to ensure a consistent local setup.

Run the harness (mock adapter)

From repo root:

.\run.ps1

This will:

  • Create .venv if missing
  • Activate it
  • Install dependencies
  • Run the evaluation harness

run.ps1 Options

| Parameter | Default | Description |
| --- | --- | --- |
| -Adapter | mock | Adapter to use: mock, openai, or azure |
| -Dataset | datasets\sample_tasks.jsonl | JSONL evaluation dataset |
| -Prompt | prompts\task_extraction\v1.md | Prompt file |
| -Schema | schemas\task_extraction.schema.json | JSON schema file |
| -MinSchemaValidRate | 1.0 | Fail if the schema-valid rate is below this threshold |
| -MinAvgF1 | 0.8 | Fail if the average F1 is below this threshold |
| -Baseline | baselines\task_extraction.mock.baseline.json | Baseline reference file |
| -MaxAvgF1Drop | 0.02 | Maximum allowed drop in average F1 relative to the baseline |
| -WriteBaseline | switch | Write the current summary as the baseline |
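The two Min* thresholds act as pass/fail gates over the run summary. A rough sketch of that check, assuming the summary uses the `schema_valid_rate` and `avg_f1` keys shown in the example report; `check_thresholds` is a hypothetical name, not necessarily what the harness calls it.

```python
def check_thresholds(
    summary: dict,
    min_schema_valid_rate: float = 1.0,
    min_avg_f1: float = 0.8,
) -> list[str]:
    """Return a list of gate failures; an empty list means the run passes."""
    failures = []
    if summary["schema_valid_rate"] < min_schema_valid_rate:
        failures.append(
            f"schema_valid_rate {summary['schema_valid_rate']:.2f} "
            f"is below {min_schema_valid_rate:.2f}"
        )
    if summary["avg_f1"] < min_avg_f1:
        failures.append(
            f"avg_f1 {summary['avg_f1']:.2f} is below {min_avg_f1:.2f}"
        )
    return failures
```

A CI wrapper could then exit with a non-zero status whenever the returned list is non-empty, which is what turns a metric into an enforceable gate.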

Baselines

What is a baseline?

A baseline is a committed reference summary representing known-good evaluation performance.

It prevents silent degradation.

Example:

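If the baseline average F1 is 0.92 and a change drops it to 0.83, the run still clears the 0.80 minimum threshold. It is nonetheless a regression: the 0.09 drop is well above the default allowed drop of 0.02 (-MaxAvgF1Drop).

The baseline regression check catches this.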
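The comparison behind this example is a single subtraction. The sketch below assumes both summaries carry an `avg_f1` key, as in the example report; `check_baseline` is a hypothetical helper name.

```python
def check_baseline(
    current: dict, baseline: dict, max_avg_f1_drop: float = 0.02
) -> tuple[bool, float]:
    """Return (passed, drop), where drop = baseline avg_f1 - current avg_f1."""
    drop = baseline["avg_f1"] - current["avg_f1"]
    # The small epsilon keeps a drop exactly at the limit from failing
    # purely because of floating-point rounding.
    passed = drop <= max_avg_f1_drop + 1e-9
    return passed, drop


# Baseline 0.92 vs. current 0.83: passes the 0.80 floor, fails the baseline gate.
passed, drop = check_baseline({"avg_f1": 0.83}, {"avg_f1": 0.92})
```

An improvement produces a negative drop and always passes, so the gate only blocks changes that make results worse.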


Creating or Updating the Baseline

Explicitly run:

.\run.ps1 -WriteBaseline

This writes:

baselines/task_extraction.mock.baseline.json

Commit it.

Baselines are never updated automatically.
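One way to keep baseline writes strictly opt-in is to guard them behind an explicit flag. The following is an illustrative sketch only: `maybe_write_baseline` and its parameters are hypothetical and are not taken from the harness source.

```python
import json
from pathlib import Path


def maybe_write_baseline(summary: dict, path: Path, write_baseline: bool) -> bool:
    """Write the summary as the new baseline only when explicitly requested."""
    if not write_baseline:
        return False  # default: never touch the committed baseline
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    text = json.dumps(summary, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
    return True
```

Writing with sorted keys means a baseline update shows up in review as a small, readable diff of the changed metrics.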


When to Update the Baseline

Update only when:

  • Prompt improvements intentionally change output
  • Model upgrades are accepted
  • Dataset changes are deliberate

Do NOT update the baseline just to “make CI green.”

Treat baseline updates like snapshot test updates: deliberate and reviewed.


Python Virtual Environment (.venv)

What is it?

A Python virtual environment is an isolated Python runtime per project.

It ensures:

  • No global dependency pollution
  • Reproducible installs
  • Alignment with CI

The environment lives in:

.venv/

It is not committed to Git.


Activating manually

.\.venv\Scripts\Activate.ps1

Deactivate:

deactivate

Optional PowerShell helpers

Add to your $PROFILE:

function venv {
    if (!(Test-Path ".\pyproject.toml")) {
        Write-Host "Not repo root" -ForegroundColor Yellow
        return
    }
    if (!(Test-Path ".\.venv\Scripts\Activate.ps1")) {
        python -m venv .venv
    }
    . .\.venv\Scripts\Activate.ps1
}

function devenv {
    if (Get-Command deactivate -ErrorAction SilentlyContinue) {
        deactivate
    }
}

Usage:

cd ai-evaluation-harness
venv

Do I need to activate manually?

No.

Both run.ps1 and test.ps1 automatically create and activate .venv.

Manual activation is only needed for interactive debugging.


Testing

Run tests via:

.\test.ps1

This ensures:

  • .venv exists
  • pytest is installed
  • tests run via python -m pytest

CI Behavior

CI runs:

  • Unit tests
  • Mock adapter quality gate
  • Baseline regression check (if a baseline is committed)
  • Optional Azure gate (if secrets are configured)

Reports are uploaded as artifacts.
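The optional Azure gate implies a check along these lines before any live call is made. The environment variable names below are assumptions chosen for illustration; the names the CI workflow actually uses may differ, and `azure_gate_enabled` is a hypothetical helper.

```python
import os

# Assumed variable names for illustration; the real CI secrets may differ.
REQUIRED_AZURE_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def azure_gate_enabled(env=None) -> bool:
    """Return True only if every required variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in REQUIRED_AZURE_VARS)
```

Skipping the gate when secrets are absent lets forks and pull requests from outside contributors still run the deterministic mock gate.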


Example Report Output

Below is a simplified example of a generated report file (reports/run-abc123.json):

{
  "summary": {
    "total": 12,
    "schema_valid_rate": 1.0,
    "exact_match_rate": 0.42,
    "avg_f1": 0.78,
    "avg_latency_ms": 12.4,
    "total_cost_usd": 0.00
  },
  "cases": [
    {
      "id": "case-001",
      "schema_valid": true,
      "exact_match": false,
      "f1": 0.83,
      "latency_ms": 9,
      "cost_usd": 0.0
    },
    {
      "id": "case-002",
      "schema_valid": true,
      "exact_match": true,
      "f1": 1.0,
      "latency_ms": 8,
      "cost_usd": 0.0
    }
  ]
}

The summary section is what quality gates and baseline regression checks compare against.

  • schema_valid_rate is the fraction of outputs that validate against the JSON schema.
  • exact_match_rate is the fraction of outputs strictly equal to the expected output.
  • avg_f1 is the mean per-case F1, which gives partial credit for partially correct outputs.
  • avg_latency_ms enables performance tracking.
  • total_cost_usd enables future cost gating.

These metrics make LLM behavior measurable and enforceable in CI.
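To make the difference between exact match and F1 concrete, here is one possible way to score a structured output. It assumes F1 is computed over flattened (path, value) pairs; this README does not state how the harness computes F1, so treat `flatten`, `f1_score`, and `exact_match` as a hypothetical sketch.

```python
import json


def flatten(obj, prefix: str = "") -> set:
    """Flatten nested JSON into a set of (path, value) pairs."""
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for value in obj:
            items |= flatten(value, f"{prefix}[]")  # list order is ignored
    else:
        # json.dumps keeps 1, "1", and true distinct from one another.
        items.add((prefix, json.dumps(obj)))
    return items


def exact_match(expected, predicted) -> bool:
    """Strict equality between expected and predicted output."""
    return expected == predicted


def f1_score(expected, predicted) -> float:
    """Harmonic mean of precision and recall over flattened pairs."""
    exp, pred = flatten(expected), flatten(predicted)
    if not exp and not pred:
        return 1.0  # both empty: treat as a perfect match
    overlap = len(exp & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)
```

Under this assumption, an output that gets the title right but misses one of two expected tags fails exact match yet still earns partial credit (precision 1, recall 2/3).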


Extending This Harness

Recommended next steps (intentionally not over-engineered):

  • Per-tag thresholds (edge-case vs happy-path)
  • Cost tracking gates
  • Prompt A/B comparison
  • Per-case regression diff reporting

Keep it small, measurable, and production-aligned.


Philosophy

This harness favors:

  • Determinism in CI
  • Explicit baselines
  • Controlled regression
  • Clear failure modes

AI systems drift.

Production systems measure that drift.
