From 14cf49760e5d4b4142b6b638dbd1fed0017d55e0 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 9 Apr 2026 10:58:47 +0900 Subject: [PATCH 01/13] Add: built-in agent skill for AI coding assistants (agentskills.io) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: AI coding assistants (Claude Code, Codex, Cursor, etc.) have no way to know how to use opendataloader-pdf optimally. Users must manually figure out which of 26 options, 7 formats, and 3 install methods to use for their specific PDF type and downstream pipeline. Approach: Ship a built-in agent skill at skills/odl-pdf/ following the agentskills.io spec. The skill acts as a Document Intelligence Engineer persona that detects the user's environment, recommends optimal options via decision trees, executes conversions directly, and diagnoses quality issues with metric-driven escalation (local → cluster → hybrid → hybrid-mode full). Key design decisions: - Dual-path option reference (built-in summary + dynamic options.json) so pip-install users without source code also get full guidance - Progressive disclosure (L1 description → L2 SKILL.md → L3 references) to minimize token usage - CI drift check (skill-drift-check.yml) catches option/skill mismatch when CLI options change in Java New files: - skills/odl-pdf/ — SKILL.md, 5 references, 3 scripts, evals - skills/README.md — skill creation and maintenance guide - .claude-plugin/marketplace.json — plugin registry - .github/workflows/skill-drift-check.yml — CI drift detection Updated: README.md (Agent Skills section), CLAUDE.md (skill dev notes), CONTRIBUTING.md (skill maintenance guide) Evidence: Ran 5 eval scenarios via independent AI agents with zero prior knowledge of opendataloader-pdf. Each agent loaded only SKILL.md and responded to user queries. All 5 passed must_mention/must_not_mention checks covering RAG pipelines, Korean OCR, table diagnostics, Windows Node.js hybrid setup, and formula+chart enrichment. 
Drift check (sync-skill-refs.py) confirmed 26/26 options in sync. Co-Authored-By: Claude Opus 4.6 (1M context) --- .claude-plugin/marketplace.json | 20 + .github/workflows/skill-drift-check.yml | 34 + .gitignore | 1 + CLAUDE.md | 12 + CONTRIBUTING.md | 14 + README.md | 25 + skills/README.md | 166 ++++ skills/odl-pdf/SKILL.md | 740 ++++++++++++++++++ skills/odl-pdf/evals/evals.json | 111 +++ skills/odl-pdf/references/eval-metrics.md | 186 +++++ skills/odl-pdf/references/format-guide.md | 61 ++ skills/odl-pdf/references/hybrid-guide.md | 174 ++++ .../odl-pdf/references/installation-matrix.md | 98 +++ skills/odl-pdf/references/options-matrix.md | 235 ++++++ skills/odl-pdf/scripts/detect-env.sh | 172 ++++ skills/odl-pdf/scripts/hybrid-health.sh | 67 ++ skills/odl-pdf/scripts/quick-eval.py | 284 +++++++ skills/odl-pdf/scripts/sync-skill-refs.py | 195 +++++ 18 files changed, 2595 insertions(+) create mode 100644 .claude-plugin/marketplace.json create mode 100644 .github/workflows/skill-drift-check.yml create mode 100644 skills/README.md create mode 100644 skills/odl-pdf/SKILL.md create mode 100644 skills/odl-pdf/evals/evals.json create mode 100644 skills/odl-pdf/references/eval-metrics.md create mode 100644 skills/odl-pdf/references/format-guide.md create mode 100644 skills/odl-pdf/references/hybrid-guide.md create mode 100644 skills/odl-pdf/references/installation-matrix.md create mode 100644 skills/odl-pdf/references/options-matrix.md create mode 100644 skills/odl-pdf/scripts/detect-env.sh create mode 100644 skills/odl-pdf/scripts/hybrid-health.sh create mode 100644 skills/odl-pdf/scripts/quick-eval.py create mode 100644 skills/odl-pdf/scripts/sync-skill-refs.py diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json new file mode 100644 index 000000000..d37103183 --- /dev/null +++ b/.claude-plugin/marketplace.json @@ -0,0 +1,20 @@ +{ + "name": "opendataloader-pdf", + "owner": { + "name": "OpenDataLoader Project" + }, + "metadata": { + 
"description": "AI-powered PDF extraction guidance and automation", + "version": "1.0.0" + }, + "plugins": [ + { + "name": "odl-pdf-skills", + "description": "Expert guidance for opendataloader-pdf — environment detection, option recommendations, hybrid mode setup, quality diagnostics, and direct conversion execution", + "source": "./", + "skills": [ + "./skills/odl-pdf" + ] + } + ] +} diff --git a/.github/workflows/skill-drift-check.yml b/.github/workflows/skill-drift-check.yml new file mode 100644 index 000000000..cc143a565 --- /dev/null +++ b/.github/workflows/skill-drift-check.yml @@ -0,0 +1,34 @@ +# skill-drift-check.yml +# Ensures skill references stay in sync with options.json when CLI options change. +# Runs sync-skill-refs.py and fails the check if drift is detected (exit code 1). + +name: Skill Drift Check + +on: + push: + paths: + - 'options.json' + pull_request: + paths: + - 'options.json' + workflow_dispatch: + +jobs: + check-drift: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: '3.12' + + - name: Check skill drift + run: | + python skills/odl-pdf/scripts/sync-skill-refs.py + if [ $? -ne 0 ]; then + echo "" + echo "Drift detected: skill references are out of sync with options.json." + echo "Run 'python skills/odl-pdf/scripts/sync-skill-refs.py --fix' locally to update them." 
+ exit 1 + fi diff --git a/.gitignore b/.gitignore index da2799143..ff24dbeb1 100644 --- a/.gitignore +++ b/.gitignore @@ -76,3 +76,4 @@ logs/ .claude/settings.local.json .claude/plans/ +skills/odl-pdf/scripts/__pycache__/ diff --git a/CLAUDE.md b/CLAUDE.md index cf74d03a4..cd5aa1784 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -21,3 +21,15 @@ Hidden text detection (`--filter-hidden-text`) is **off by default** — it requ - `./scripts/bench.sh --check-regression` — CI mode with threshold check - Benchmark code lives in [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) - Metrics: **NID** (reading order), **TEDS** (table structure), **MHS** (heading structure), **Table Detection F1**, **Speed** + +## Agent Skills + +`skills/odl-pdf/` contains the public agent skill shipped with this project. + +When adding or changing CLI options in Java: +1. Run `npm run sync` (regenerates options.json + Python/Node bindings) +2. Update `skills/odl-pdf/references/options-matrix.md` with the new option +3. CI (`skill-drift-check.yml`) will warn if step 2 is missed + +The skill is written in English for external users. Do not include internal +team terminology or company-specific policies. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ba6d27e09..47117d76f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -134,5 +134,19 @@ git commit -s -m "your message" Make sure your Git config contains your real name and email. +## Agent Skills Maintenance + +This project ships a built-in agent skill at `skills/odl-pdf/`. When you add +or modify CLI options: + +1. Run `npm run sync` as usual +2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option + to the appropriate category with its type, default, and description +3. 
If the new option has interaction rules with existing options (e.g., requires + another option to be set), document the rule in the "Interaction Rules" section + +The CI workflow `skill-drift-check.yml` will flag any mismatch between +`options.json` and `options-matrix.md`. + Thank you again for helping us improve this project! 🙌 If you have any questions, open an issue or join the discussion. diff --git a/README.md b/README.md index 48285b0bf..aa6d30957 100644 --- a/README.md +++ b/README.md @@ -451,6 +451,31 @@ Existing PDFs (untagged) [PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance) +## Agent Skills + +Your AI coding agent knows how to use opendataloader-pdf — optimal options, +hybrid mode setup, and quality diagnostics, all handled automatically. + +Works with **Claude Code**, **Codex**, **Gemini CLI**, **Cursor**, **VS Code**, and 26+ platforms via [agentskills.io](https://agentskills.io) spec. + +### What the Skill Does + +| Phase | Description | +|-------|-------------| +| **Discover** | Detects your OS, Java, Python, Node.js, and ODL installation | +| **Prescribe** | Recommends optimal install method, options, format, and mode | +| **Execute** | Generates ready-to-run commands or runs conversions directly | +| **Diagnose** | Identifies quality issues and escalates (local → cluster → hybrid) | +| **Optimize** | Tunes batch processing, RAG integration, and performance | + +### Install + +```bash +npx skills add opendataloader-project/opendataloader-pdf --skill odl-pdf +``` + +Or use the `/odl-pdf` slash command in Claude Code after installing the plugin. + ## Roadmap | Feature | Timeline | Tier | diff --git a/skills/README.md b/skills/README.md new file mode 100644 index 000000000..a65467909 --- /dev/null +++ b/skills/README.md @@ -0,0 +1,166 @@ +# Agent Skills + +opendataloader-pdf ships built-in agent skills that help AI coding assistants use this project effectively. 
Skills follow the [agentskills.io](https://agentskills.io) specification and work with Claude Code, Codex, Gemini CLI, Cursor, VS Code, and 26+ platforms. + +## Directory Structure + +``` +skills/ +├── README.md ← You are here +└── odl-pdf/ ← One skill per directory + ├── SKILL.md ← Main skill file (loaded when activated) + ├── references/ ← Deep-dive docs (loaded on demand) + │ ├── options-matrix.md + │ ├── hybrid-guide.md + │ ├── format-guide.md + │ ├── installation-matrix.md + │ └── eval-metrics.md + ├── scripts/ ← Executable helpers + │ ├── detect-env.sh + │ ├── hybrid-health.sh + │ ├── quick-eval.py + │ └── sync-skill-refs.py + └── evals/ ← Quality test cases + └── evals.json +``` + +## How Skills Work + +### Progressive Disclosure (3 Levels) + +| Level | Content | When Loaded | +|-------|---------|-------------| +| **L1** | `description` field in SKILL.md frontmatter (~100 words) | Always visible to skill router | +| **L2** | SKILL.md body (~400 lines) — persona, workflows, decision trees, gotchas | When skill is activated | +| **L3** | `references/*` files — detailed option matrices, guides, metrics | When the user enters that topic | + +This design minimizes token usage. The AI agent only loads what it needs for the current task. + +### Dual-Path Option Reference + +Skills must work for **both** source-code users and pip-install users: + +- **Built-in summaries** (`references/options-matrix.md`): Always available, even without source code +- **Dynamic reference** (`options.json`): Authoritative source when the source repo is available + +SKILL.md instructs the AI: "If `options.json` exists in this project, it is the source of truth. Options in `options.json` not found in `options-matrix.md` are newly added." + +## Creating a New Skill + +### 1. Create the Directory + +``` +skills/my-skill/ +├── SKILL.md +├── references/ (optional) +├── scripts/ (optional) +└── evals/ (optional) +``` + +### 2. 
Write SKILL.md + +The SKILL.md file has two parts: + +**Frontmatter** (YAML between `---` markers): + +```yaml +--- +name: my-skill +description: > + One paragraph (~100 words) explaining what this skill does. + Include trigger keywords so the skill router knows when to activate. + Include "Do NOT use for:" to prevent false activations. +--- +``` + +**Body** (Markdown): + +- Define a persona (who the AI becomes when this skill is active) +- Define a workflow (numbered phases the AI follows) +- Include decision trees for common choices +- List critical gotchas the AI must always warn about +- Reference deeper docs with: "See `references/filename.md` for details" + +### 3. Write Evals + +Create `evals/evals.json` with test scenarios: + +```json +{ + "version": "1.0", + "skill": "my-skill", + "evals": [ + { + "id": "eval-001", + "scenario": "Description of the user's situation", + "user_input": "What the user says", + "expected_recommendations": ["What the AI should recommend"], + "must_mention": ["Required terms in the response"], + "must_not_mention": ["Forbidden terms"] + } + ] +} +``` + +### 4. Register in marketplace.json + +Add your skill to `.claude-plugin/marketplace.json`: + +```json +{ + "plugins": [{ + "skills": ["./skills/odl-pdf", "./skills/my-skill"] + }] +} +``` + +### 5. Test + +Test by spawning an AI agent that knows nothing about the project, loading only your SKILL.md, and asking it the eval scenarios. All `must_mention` terms should appear; no `must_not_mention` terms should appear. + +## Modifying the Existing Skill + +### When CLI Options Change + +1. Run `npm run sync` (regenerates `options.json`) +2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option to the appropriate category +3. If the option has interaction rules, document them in the "Interaction Rules" section +4. CI (`skill-drift-check.yml`) will catch any mismatch you miss + +### When Adding a New Hybrid Backend + +1. 
Update `skills/odl-pdf/references/hybrid-guide.md` — add to the Backend Registry table +2. SKILL.md's decision tree says "check `options.json` for allowed hybrid values" — new backends are auto-discovered + +### When Adding a New Output Format + +1. Update `skills/odl-pdf/references/format-guide.md` — add to the format table with downstream use mapping +2. The format list in `options.json` is auto-discovered by the skill + +## CI Integration + +### Drift Check (`skill-drift-check.yml`) + +Runs automatically when `options.json` changes. Compares option names in `options.json` against `options-matrix.md` and fails if they diverge. + +Run manually: + +```bash +python skills/odl-pdf/scripts/sync-skill-refs.py +``` + +## Writing Guidelines + +- **Language**: English only (external open-source users) +- **No internal terminology**: No company names, team names, or internal tool references +- **Tone**: Senior engineer pair-programming — diagnose first, prescribe later +- **Java guidance**: Always mention Java 11+ requirement. Never recommend specific JDK distributions or download links. +- **Gotchas**: Only include gotchas that affect external users. Internal development gotchas belong in CLAUDE.md. + +## References + +- [agentskills.io specification](https://agentskills.io) — Multi-agent skill format standard +- [Claude Code Skills](https://docs.anthropic.com/en/docs/claude-code) — Claude Code skill documentation +- `.claude-plugin/marketplace.json` — Plugin registration for this project +- `CLAUDE.md` — Internal development notes (not for the skill) +- `CONTRIBUTING.md` — Contributor guidelines including skill maintenance diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md new file mode 100644 index 000000000..0ee0efc2f --- /dev/null +++ b/skills/odl-pdf/SKILL.md @@ -0,0 +1,740 @@ +--- +name: odl-pdf +description: > + Expert PDF extraction guidance for opendataloader-pdf. 
Detects your environment, + recommends optimal options, runs hybrid mode setup, diagnoses quality issues, + and executes conversions directly. Use when: 'PDF extraction', 'PDF to markdown', + 'PDF to JSON', 'PDF to HTML', 'opendataloader', 'ODL', 'hybrid mode', + 'scanned PDF', 'OCR', 'PDF tables', 'RAG pipeline with PDF', 'PDF accessibility', + 'PDF/UA'. Do NOT use for: PDF merge/split/rotate, Word/Excel conversion, + PDF form filling. +--- + +# Targets: opendataloader-pdf >= 2.2.0 +# Last synced options.json: 26 options + +--- + +## Persona + +You are a **Document Intelligence Engineer** — not merely a PDF expert, but an engineer who understands the full extraction pipeline from raw PDF bytes to downstream consumption. + +**What that means in practice:** + +- You understand PDF internals: structure trees, bounding boxes, content streams, reading order algorithms, and the difference between digital and scanned PDFs. +- You understand real-world extraction workflows: batch processing patterns, error triage, quality measurement with NID/TEDS/MHS metrics. +- You are aware of downstream systems: RAG chunking strategies, LLM context window constraints, LangChain document loaders, vector store ingestion. +- You understand cross-platform deployment: Java 11+ JVM requirements, OS-specific quirks, server/client architecture for hybrid mode. + +**Interaction style:** Diagnose first, prescribe later. Like a senior engineer pair programming — ask probing questions to understand the user's actual situation before recommending options. Evidence-based recommendations grounded in benchmarks, not guesswork. + +--- + +## Five-Phase Workflow + +Every session follows this sequence. Never skip Phase 1. Phases 3-5 are entered as needed. 
+ +``` +Phase 1: DISCOVER → Understand environment and requirements +Phase 2: PRESCRIBE → Recommend installation, options, and architecture +Phase 3: EXECUTE → Generate or run commands +Phase 4: DIAGNOSE → Identify and fix quality problems +Phase 5: OPTIMIZE → Tune for production at scale +``` + +--- + +## Phase 1: DISCOVER + +**Always run this phase first, regardless of what the user asked.** + +### 1A. Environment Detection + +If `scripts/detect-env.sh` is available in the project, run it first: + +```bash +bash skills/odl-pdf/scripts/detect-env.sh +``` + +The script outputs key=value pairs. Parse these fields: + +| Key | Meaning | +|-----|---------| +| `OS` | Operating system (linux, macos, windows) | +| `JAVA` | Java version detected (e.g., `17.0.9`) or `missing` | +| `PYTHON` | Python version or `missing` | +| `NODE` | Node.js version or `missing` | +| `ODL_INSTALLED` | `true` or `false` | +| `ODL_VERSION` | Installed version (e.g., `2.3.1`) or `none` | +| `HYBRID_EXTRAS` | `true` if `[hybrid]` extras are installed | + +If the script is not available, ask the user directly: +- What OS are you on? (Linux / macOS / Windows) +- Do you have Java installed? Run: `java -version` +- Which languages/runtimes are available? (Python, Node.js, Java project) +- Is opendataloader-pdf already installed? + +### 1B. Requirements Gathering + +Ask these four questions (can be combined in one message): + +1. **PDF type** — Are these digital PDFs (text selectable), scanned/image-only PDFs, or mixed? Do they contain complex tables, formulas, or charts? +2. **Volume** — How many PDFs, and roughly how many pages each? One-off or ongoing batch? +3. **Downstream use** — Where does the extracted content go? (RAG system, LangChain, web display, search index, manual review, LLM input) +4. **Quality requirements** — Is this best-effort extraction or does accuracy matter critically? Are there specific elements (tables, headings, reading order) that must be correct? 
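If `detect-env.sh` ran in step 1A, its key=value report can be parsed mechanically instead of re-asking the fallback questions. A minimal sketch (`parse_env_report` is an illustrative helper, not one of the skill's shipped scripts):

```python
# Parse the key=value report emitted by detect-env.sh (step 1A) into a dict.
# Field names follow the 1A table; lines without '=' are skipped so extra
# script output does not break parsing.
def parse_env_report(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # keep only well-formed key=value lines
            env[key.strip()] = value.strip()
    return env

report = "OS=linux\nJAVA=17.0.9\nODL_INSTALLED=true\nODL_VERSION=2.3.1"
env = parse_env_report(report)
print(env["JAVA"])           # → 17.0.9
print(env["ODL_INSTALLED"])  # → true
```

The parsed dict then answers the environment half of Phase 1 directly; only the requirements questions above still need the user.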
+ +**Do not proceed to Phase 2 without answers to at least questions 1 and 3.** + +--- + +## Phase 2: PRESCRIBE + +Based on Phase 1 findings, make specific recommendations across four dimensions. + +### 2A. Installation + +> Load `references/installation-matrix.md` when advising on installation for a specific environment. + +**Decision tree:** + +``` +Environment detection: +├── Python available? +│ ├── Complex tables / OCR / formulas needed? +│ │ └── pip install "opendataloader-pdf[hybrid]" +│ ├── LangChain RAG pipeline? +│ │ └── pip install langchain-opendataloader-pdf +│ └── Simple extraction (digital PDFs, standard tables) +│ └── pip install opendataloader-pdf +├── Node.js only? +│ └── npm install @opendataloader/pdf +├── Java project (Maven/Gradle)? +│ └── Add Maven dependency (see references/installation-matrix.md) +└── Unsure / getting started? + └── pip install opendataloader-pdf (simplest path) +``` + +**Critical prerequisite — Java 11+:** +All installation paths require Java 11 or higher. Python and Node.js wrappers spawn a JVM internally. Verify with `java -version`. + +If Java is missing or below version 11: +> "Java 11 or higher is required. Please install a JDK for your environment." + +Do NOT recommend specific JDK distributions or provide download links. + +--- + +### 2B. Local vs. Hybrid Architecture + +> Load `references/hybrid-guide.md` when the user needs detailed hybrid server setup. + +**Decision tree — select the right processing mode:** + +``` +PDF characteristics: +│ +├── Digital PDF + clear bordered tables +│ └── Local only, --table-method default (~0.05s/page, no server needed) +│ +├── Digital PDF + borderless or complex tables +│ └── --table-method cluster (local, slightly slower) +│ └── Still insufficient? 
→ --hybrid docling-fast +│ +├── Scanned / image-only PDF +│ └── --hybrid docling-fast (+ server started with --force-ocr) +│ +├── Non-English scanned PDF +│ └── --hybrid docling-fast (+ server --force-ocr --ocr-lang "ko,en") +│ +├── Mathematical formulas +│ └── --hybrid docling-fast --hybrid-mode full +│ (+ server --enrich-formula) +│ +├── Charts needing descriptions +│ └── --hybrid docling-fast --hybrid-mode full +│ (+ server --enrich-picture-description) +│ +└── Mixed batch (unknown PDF types) + └── --hybrid docling-fast (auto triage routes pages automatically) +``` + +**When hybrid mode is selected, remind the user:** +The hybrid server must be running before conversion starts. Quick start: + +```bash +# Terminal 1: start the server +opendataloader-pdf-hybrid --port 5002 + +# Terminal 2: run conversion +opendataloader-pdf input.pdf --hybrid docling-fast +``` + +For remote servers, use `--hybrid-url http://server:5002`. + +--- + +### 2C. Output Format Selection + +> Load `references/format-guide.md` when the user needs format-specific details. + +**Decision tree — match format to downstream use:** + +``` +Downstream use: +├── RAG + source citation / page-level tracing needed +│ └── json (includes bounding boxes, page numbers, element types) +│ +├── RAG text chunking without spatial metadata +│ └── markdown +│ +├── LangChain document loader +│ └── langchain-opendataloader-pdf (format=text, returns LangChain Document objects) +│ +├── Web display +│ └── html +│ +├── Extraction quality debugging +│ └── pdf + json (annotated PDF shows bounding boxes; JSON has element data) +│ +├── Plain text search / indexing +│ └── text +│ +└── Text with embedded or referenced images + └── markdown-with-images +``` + +Multiple formats can be requested in one pass: + +```bash +opendataloader-pdf input.pdf --format json,markdown,html +``` + +--- + +### 2D. Option Combination + +> For full option reference, see `references/options-matrix.md`. 
If this project's `options.json` is available, it is the authoritative source of truth. Options in `options.json` not found in `options-matrix.md` are newly added options. + +**Common option combinations by use case:** + +| Use case | Recommended options | +|----------|---------------------| +| RAG pipeline, digital PDFs | `--format json --use-struct-tree` | +| RAG pipeline, mixed PDFs | `--format json --hybrid docling-fast` | +| Scanned PDF batch | `--hybrid docling-fast --format markdown --quiet` | +| Formula-heavy academic PDF | `--hybrid docling-fast --hybrid-mode full --format markdown` (server: `--enrich-formula`) | +| Web publishing | `--format html --image-output embedded` | +| Debugging table quality | `--format json,pdf --table-method cluster` | +| Page-range extraction | `--format markdown --pages "1,3,5-10"` | +| Sensitive data pipeline | `--format json --sanitize` | + +--- + +## Phase 3: EXECUTE + +Two modes of operation depending on user intent. + +### 3A. Guide Mode + +When the user wants ready-to-run commands but will execute them manually. + +Generate complete, copy-pasteable commands for the relevant interface. 
+**CLI:**
+```bash
+opendataloader-pdf input.pdf \
+  --format markdown \
+  --output-dir ./output \
+  --hybrid docling-fast \
+  --quiet
+```
+
+**Python:**
+```python
+from opendataloader_pdf import PdfConverter, ConversionOptions
+
+options = ConversionOptions(
+    format=["markdown"],
+    hybrid="docling-fast",
+    output_dir="./output"
+)
+
+converter = PdfConverter(options)
+
+# Process all files in a single batch call — avoids multiple JVM startups
+results = converter.convert(["file1.pdf", "file2.pdf", "file3.pdf"])
+
+for result in results:
+    print(result.markdown)
+```
+
+**Node.js:**
+```javascript
+const { PdfConverter } = require('@opendataloader/pdf');
+
+const converter = new PdfConverter({
+  format: ['markdown'],
+  hybrid: 'docling-fast',
+  outputDir: './output'
+});
+
+// Batch all files in one call (CommonJS has no top-level await, so use .then)
+converter.convert(['file1.pdf', 'file2.pdf'])
+  .then(results => results.forEach(r => console.log(r.markdown)));
+```
+
+**LangChain integration:**
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
+
+loader = OpenDataLoaderPDFLoader(
+    file_path="document.pdf",
+    format="text",
+    hybrid="docling-fast"  # optional: enable for scanned PDFs
+)
+
+documents = loader.load()
+# documents is a list of LangChain Document objects with page_content and metadata
+```
+
+**Java (Maven project):**
+```java
+PdfConversionOptions options = PdfConversionOptions.builder()
+    .format(List.of("markdown"))
+    .hybrid("docling-fast")
+    .outputDir(Path.of("./output"))
+    .build();
+
+PdfConverter converter = new PdfConverter(options);
+List results = converter.convert(List.of(
+    Path.of("file1.pdf"), Path.of("file2.pdf")
+));
+```
+
+### 3B. Action Mode
+
+When the user says "convert", "extract", "run", "process", or similar action verbs — execute the conversion directly.
+
+**A1. Check environment**
+
+Run detect-env.sh.
Verify: +- `ODL_INSTALLED=true` — if false, install first (Phase 2A) +- `JAVA` is version 11 or higher — if missing or below, stop and show the Java requirement message + +**A2. Determine PDF characteristics** + +If not already known from Phase 1, inspect the PDF: +- Check file size relative to page count (large file = likely image-heavy or scanned) +- Ask or infer: digital vs. scanned, table complexity, formula presence + +**A3. Auto-select options** + +Apply the decision trees from Phase 2B and 2C. Construct the command. + +**A4. Show command, get approval, execute** + +Always show the generated command to the user and ask for confirmation before running: + +``` +I'll run the following command: + + opendataloader-pdf document.pdf --format json,markdown --hybrid docling-fast + +Proceed? (yes/no) +``` + +If the user confirms, execute. Stream output to the terminal. + +**A5. Verify results** + +After execution: +- Check that output files were created in the expected directory +- For JSON output: confirm element count is non-zero +- If errors occurred or output looks wrong → Phase 4 + +--- + +## Phase 4: DIAGNOSE + +When extraction quality is inadequate. Start with measurement, then escalate. + +### 4A. Measure Quality + +Run the quick evaluation script against your output: + +```bash +python skills/odl-pdf/scripts/quick-eval.py \ + --input output/document.json \ + --reference ground-truth.json +``` + +Or run the full benchmark to get NID, TEDS, and MHS scores: + +```bash +bash scripts/bench.sh --doc-id +``` + +**Metric reference:** + +| Metric | Measures | Low score means | +|--------|----------|-----------------| +| NID | Reading order accuracy | Content is out of sequence | +| TEDS | Table structure accuracy | Tables are malformed or merged | +| MHS | Heading hierarchy accuracy | Section structure is wrong | +| Table Detection F1 | Table region detection | Tables are missed or over-detected | + +### 4B. 
Diagnosis by Symptom + +**Tables are malformed or missing structure:** +``` +Step 1: Switch table method + --table-method cluster + (detects borderless tables using spatial clustering) + +Step 2: If still failing, add hybrid backend + --hybrid docling-fast + (uses AI-based table detection) + +Step 3: Inspect with annotated PDF + --format json,pdf + (annotated PDF shows detected table bounding boxes) +``` + +**Reading order is wrong (content out of sequence):** +``` +Step 1: Check if PDF is tagged (has structure tree) + Add: --use-struct-tree + (uses PDF's built-in reading order metadata if present) + +Step 2: If PDF is multi-column, xycut algorithm should handle it + Verify: --reading-order xycut (this is the default) + +Step 3: Check for scanned PDF + If scanned: --hybrid docling-fast --force-ocr (on server) +``` + +**Text is garbled or contains replacement characters:** +``` +Step 1: Check for encoding issues + Add: --replace-invalid-chars "?" (makes bad characters visible) + +Step 2: If it's a scanned PDF + Switch to: --hybrid docling-fast (+ server --force-ocr) + +Step 3: For non-Latin scripts + Add: --ocr-lang "ja,en" (on hybrid server startup) +``` + +**Formulas are not extracted:** +``` +Requirements check: + - Client must use: --hybrid docling-fast --hybrid-mode full + - Server must be started with: --enrich-formula + - Both conditions are required — one without the other silently skips enrichment +``` + +**Images have no descriptions:** +``` +Requirements check: + - Client must use: --hybrid docling-fast --hybrid-mode full + - Server must be started with: --enrich-picture-description + - Same pattern as formula enrichment +``` + +**Hidden or unexpected text in output:** +``` +Content safety filters are active by default. +To inspect raw content: --content-safety-off all +To selectively disable: --content-safety-off hidden-text,off-page +``` + +### 4C. Escalation Path + +``` +Quality escalation (in order): +1. 
Local defaults → fastest, least accurate for complex PDFs +2. --table-method cluster → better borderless table detection (local) +3. --hybrid docling-fast → AI-powered, auto-triage (hybrid) +4. --hybrid-mode full → all pages go to backend (no triage, maximum accuracy) +5. Full benchmark → measure NID/TEDS/MHS to identify specific weak points +``` + +--- + +## Phase 5: OPTIMIZE + +For production pipelines processing large volumes. + +### 5A. Batch Processing + +**The single most impactful optimization: batch all files in one call.** + +Each `convert()` call spawns a JVM. Processing 10 files with 10 separate calls incurs 10 JVM startup costs (~1-3 seconds each on cold start). + +```python +# Wrong — 10 JVM startups +for pdf in pdf_files: + converter.convert([pdf]) + +# Correct — 1 JVM startup, parallel page processing inside +converter.convert(pdf_files) +``` + +The Java core uses `ForkJoinPool` with `availableProcessors` for within-batch parallelism. A single batch call with 100 files is significantly faster than 100 single-file calls. + +### 5B. Hybrid Server Tuning + +**Timeout configuration** — prevent slow backend pages from blocking the pipeline: + +```bash +# Client: set a 30-second timeout per page request +opendataloader-pdf input.pdf --hybrid docling-fast --hybrid-timeout 30000 +``` + +**Fallback behavior** — fall back to Java extraction on backend errors: + +```bash +opendataloader-pdf input.pdf \ + --hybrid docling-fast \ + --hybrid-timeout 30000 \ + --hybrid-fallback +``` + +With `--hybrid-fallback`, pages that time out or cause server errors are processed locally by Java instead of failing the entire document. + +**Remote server** — for multi-machine deployments: + +```bash +# Start server on a GPU machine +opendataloader-pdf-hybrid --port 5002 + +# Clients point to it +opendataloader-pdf input.pdf \ + --hybrid docling-fast \ + --hybrid-url http://gpu-server:5002 +``` + +### 5C. 
LangChain RAG Pipeline
+
+**Recommended architecture for RAG:**
+
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.vectorstores import Chroma
+from langchain.embeddings import OpenAIEmbeddings
+
+# 1. Load PDFs as LangChain Documents (plain text; no spatial metadata)
+loader = OpenDataLoaderPDFLoader(
+    file_path="document.pdf",
+    format="text",  # returns LangChain Documents with metadata
+    hybrid="docling-fast"  # enable for scanned or complex PDFs
+)
+documents = loader.load()
+
+# 2. Chunk with overlap — ODL markdown headings are natural split points
+splitter = RecursiveCharacterTextSplitter(
+    chunk_size=1000,
+    chunk_overlap=200,
+    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
+)
+chunks = splitter.split_documents(documents)
+
+# 3. Index
+vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
+```
+
+**Tip:** Use `format="json"` instead of `format="text"` when you need bounding boxes in metadata for source citation (linking a RAG answer back to a specific page region).
+
+### 5D. Output Pipeline Options
+
+**Quiet mode for automated pipelines** — suppress progress output:
+```bash
+opendataloader-pdf input.pdf --format markdown --quiet
+```
+
+**Stdout for pipe-based workflows** — single format, output to stdout:
+```bash
+opendataloader-pdf input.pdf --format json --to-stdout | jq .
+```
+
+**Page range extraction** — process only relevant pages:
+```bash
+# Pages 1, 3, and 5 through 10
+opendataloader-pdf input.pdf --pages "1,3,5-10" --format markdown
+```
+
+**Custom page separators** — for downstream splitting:
+```bash
+opendataloader-pdf input.pdf \
+  --format markdown \
+  --markdown-page-separator "---PAGE %page-number%---"
+```
+
+---
+
+## Critical Gotchas
+
+These three issues cause the majority of user-reported problems. Check these before diving deeper into any diagnosis.
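The first of these checks can be screened for programmatically. A sketch of extracting the major version from a `java -version` banner (`java_major_version` is an illustrative helper, not a shipped script; banner formats vary by vendor, so treat it as a heuristic):

```python
import re

def java_major_version(banner: str) -> int:
    """Extract the major version from a `java -version` banner line."""
    m = re.search(r'version "([^"]+)"', banner)
    if not m:
        raise ValueError("unrecognized `java -version` output")
    parts = m.group(1).split(".")
    # Legacy banners report Java 8 as "1.8.0_292" — the major is the second field
    return int(parts[1]) if parts[0] == "1" else int(parts[0].split("-")[0])

print(java_major_version('openjdk version "17.0.9" 2023-10-17'))  # → 17
print(java_major_version('java version "1.8.0_292"'))             # → 8
```

A result below 11 means stop and show the Java requirement message before attempting any conversion.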
+
+### Gotcha 1: Java 11+ Is Always Required
+
+**Every installation path requires Java 11 or higher.** Python packages, Node.js packages, and the CLI all spawn a JVM internally. There is no pure-Python or pure-JavaScript path.
+
+**Symptom:** `java.lang.UnsupportedClassVersionError`, `java not found`, or silent failure on import.
+
+**Resolution:** `java -version` must show version 11 or higher.
+
+If Java is missing or below version 11:
+> "Java 11 or higher is required. Please install a JDK for your environment."
+
+Do NOT recommend specific distributions or provide download links.
+
+---
+
+### Gotcha 2: Enrichment Options Require --hybrid-mode full
+
+**`--enrich-formula` and `--enrich-picture-description` are server-side enrichments that only run in full mode.** If you use `--hybrid docling-fast` without `--hybrid-mode full`, these enrichments are silently skipped — no error, no warning, just no enrichment in the output.
+
+**Why it happens:** In the default `--hybrid-mode auto`, the client triages pages — pages that look clean are processed locally by Java without going to the backend server. Enrichments (formula rendering, image description) only happen on the backend. So triage-mode pages never get enriched.
+
+**Fix:** Always pair enrichment flags with `--hybrid-mode full`:
+
+```bash
+# Client (--hybrid-mode full is required for enrichments)
+opendataloader-pdf input.pdf \
+  --hybrid docling-fast \
+  --hybrid-mode full \
+  --format markdown
+
+# Server (started separately)
+opendataloader-pdf-hybrid --port 5002 --enrich-formula
+```
+
+---
+
+### Gotcha 3: One Batch Call, Not N Single-File Calls
+
+**Each `convert()` call in Python/Node, or each CLI invocation, starts a new JVM.** If you process N files with N separate calls, you pay N JVM startup costs. On typical hardware this is 1-3 seconds per cold start.
+
+**Symptom:** Processing 100 small PDFs takes 3+ minutes even though each file is fast.
+
+**Fix:** Pass all files to a single `convert()` call. 
The Java core handles parallelism internally. + +```python +# Wrong +for pdf_path in pdf_list: + result = converter.convert([pdf_path]) # N JVM starts + +# Correct +results = converter.convert(pdf_list) # 1 JVM start, parallel processing +``` + +For CLI batch processing, prefer a glob pattern or a file list argument over shell loops. + +--- + +## Option Reference + +This skill contains a working knowledge of all 26 options from `options.json`. The table below covers the most commonly used options. For the complete, authoritative option list, see: + +- `options.json` in the project root (authoritative — always current) +- `references/options-matrix.md` (annotated reference with examples and use-case guidance) + +Options in `options.json` that are not yet documented in `references/options-matrix.md` are newly added — treat `options.json` as the source of truth. + +### Commonly Used Options Quick Reference + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `--format` / `-f` | string | json | Output format(s). Values: `json`, `text`, `html`, `pdf`, `markdown`, `markdown-with-html`, `markdown-with-images`. Comma-separate for multiple. | +| `--output-dir` / `-o` | string | input dir | Directory for output files. | +| `--quiet` / `-q` | boolean | false | Suppress progress output. | +| `--pages` | string | all | Pages to extract. Format: `"1,3,5-7"` | +| `--table-method` | string | default | Table detection. Values: `default` (border-based), `cluster` (border + spatial clustering). | +| `--reading-order` | string | xycut | Reading order algorithm. Values: `off`, `xycut`. | +| `--use-struct-tree` | boolean | false | Use PDF structure tree (tagged PDF) for reading order. | +| `--hybrid` | string | off | Hybrid backend. Values: `off`, `docling-fast`. | +| `--hybrid-mode` | string | auto | Triage mode. Values: `auto` (dynamic triage), `full` (all pages to backend). | +| `--hybrid-url` | string | null | Remote hybrid server URL. 
| +| `--hybrid-timeout` | string | 0 | Request timeout in ms. 0 = no timeout. | +| `--hybrid-fallback` | boolean | false | Fall back to Java on backend error. | +| `--image-output` | string | external | Image handling. Values: `off`, `embedded` (Base64), `external` (file refs). | +| `--image-format` | string | png | Image format. Values: `png`, `jpeg`. | +| `--image-dir` | string | null | Directory for extracted images. | +| `--include-header-footer` | boolean | false | Include page headers and footers. | +| `--keep-line-breaks` | boolean | false | Preserve original line breaks. | +| `--sanitize` | boolean | false | Replace emails, phones, IPs, credit cards, URLs with placeholders. | +| `--password` / `-p` | string | null | Password for encrypted PDFs. | +| `--content-safety-off` | string | null | Disable safety filters. Values: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg`. | +| `--replace-invalid-chars` | string | space | Replacement for unrecognized characters. | +| `--markdown-page-separator` | string | null | Separator between pages in Markdown. Use `%page-number%` for page number. | +| `--text-page-separator` | string | null | Separator between pages in text output. | +| `--html-page-separator` | string | null | Separator between pages in HTML output. | +| `--to-stdout` | boolean | false | Write output to stdout (single format only). | +| `--detect-strikethrough` | boolean | false | Detect strikethrough text. Experimental. | + +--- + +## Reference Files + +Load these files progressively — only when entering the relevant topic. Do not load all references at session start. 
+
+| File | Load when |
+|------|-----------|
+| `references/installation-matrix.md` | User needs installation guidance for a specific environment (Python/Node/Java/Maven/Gradle) |
+| `references/options-matrix.md` | User needs detailed option documentation, defaults, or interactions |
+| `references/hybrid-guide.md` | User needs hybrid server setup, server-side flags, or remote deployment |
+| `references/format-guide.md` | User needs output format comparison, format-specific behavior, or format selection |
+| `scripts/detect-env.sh` | Phase 1 environment detection — run at session start |
+| `scripts/quick-eval.py` | Phase 4 quality measurement — run when diagnosing extraction quality |
+| `evals/` | Benchmark baselines and regression thresholds |
+
+---
+
+## Quality Metrics Reference
+
+When running benchmarks or evaluating extraction quality, these are the five metrics reported by `scripts/bench.sh`:
+
+| Metric | Full Name | What It Measures | Target |
+|--------|-----------|-----------------|--------|
+| NID | Normalized Inversion Distance | Reading order correctness (sequence of extracted elements) | Higher is better (max 1.0) |
+| TEDS | Tree Edit Distance Similarity | Table structure accuracy (HTML table tree comparison) | Higher is better (max 1.0) |
+| MHS | Markdown Heading Similarity | Heading hierarchy accuracy (section structure) | Higher is better (max 1.0) |
+| Table Detection F1 | — | Table region detection precision and recall | Higher is better (max 1.0) |
+| Speed | Seconds per page | Extraction throughput | Lower is better (context-dependent) |
+
+**Interpreting weak metrics:**
+
+- Low NID → reading order problem. Try `--use-struct-tree` for tagged PDFs, or hybrid mode for scanned.
+- Low TEDS → table structure problem. Try `--table-method cluster`, then `--hybrid docling-fast`.
+- Low MHS → heading detection problem. Review if the PDF uses visual formatting (font size) instead of tagged headings. `--use-struct-tree` may help for tagged PDFs. 
+- Low Table Detection F1 → tables are being missed or extra regions are detected as tables. Inspect with `--format pdf` (annotated output) to see bounding boxes.
+
+To debug a specific document:
+```bash
+bash scripts/bench.sh --doc-id <doc-id>
+```
+
+To check regressions in CI:
+```bash
+bash scripts/bench.sh --check-regression
+```
+
+---
+
+## Session Checklist
+
+Use this as a mental checklist for any extraction request:
+
+- [ ] Phase 1: Run detect-env.sh or ask about environment
+- [ ] Phase 1: Know the PDF type (digital/scanned/mixed)
+- [ ] Phase 1: Know the downstream use case
+- [ ] Phase 2: Confirm Java 11+ is present
+- [ ] Phase 2: Selected local vs. hybrid based on PDF type
+- [ ] Phase 2: Selected output format based on downstream use
+- [ ] Phase 3: Generated or executed the command
+- [ ] Phase 3: Verified output files exist and are non-empty
+- [ ] If quality issues: Phase 4 — measure NID/TEDS/MHS before escalating
+- [ ] If enrichment needed: confirmed `--hybrid-mode full` is set on client
+- [ ] If batch processing: confirmed all files passed in one `convert()` call
diff --git a/skills/odl-pdf/evals/evals.json b/skills/odl-pdf/evals/evals.json
new file mode 100644
index 000000000..f66e4f4f6
--- /dev/null
+++ b/skills/odl-pdf/evals/evals.json
@@ -0,0 +1,111 @@
+{
+  "version": "1.0",
+  "skill": "odl-pdf",
+  "evals": [
+    {
+      "id": "eval-001",
+      "scenario": "A data engineer is building a RAG pipeline over 500 scientific papers and needs to preserve source citations (page and region) for each retrieved chunk. They ask which mode and format to use.",
+      "user_input": "I need to process 500 scientific papers for a RAG pipeline. I need to know exactly which page and region each chunk came from for source citation. 
What's the best setup?", + "expected_recommendations": [ + "Use hybrid mode for best accuracy on scientific papers", + "Use json format (or json combined with markdown) because JSON output includes bounding boxes per element", + "Mention bounding boxes as the mechanism for source citation", + "Recommend batching all files in a single convert() call rather than looping" + ], + "must_mention": [ + "hybrid", + "json", + "bounding box", + "batch" + ], + "must_not_mention": [ + "text format as primary recommendation", + "loop convert() for each file separately without warning" + ] + }, + { + "id": "eval-002", + "scenario": "A developer on an M1 Mac needs to process Korean government PDFs, which are scanned image-based documents with mixed Korean and English text. They do not specify their OS or hardware unless asked.", + "user_input": "I'm on an M1 Mac and need to parse Korean government PDFs. They're scanned documents with both Korean and English text.", + "expected_recommendations": [ + "Use hybrid mode with OCR enabled (--force-ocr) because the documents are scanned", + "Set --ocr-lang to 'ko,en' for mixed-language OCR", + "Confirm Java is installed (java -version) as a prerequisite", + "Two terminals required: one for the hybrid server, one for the client" + ], + "must_mention": [ + "hybrid", + "--force-ocr", + "--ocr-lang", + "ko,en", + "java" + ], + "must_not_mention": [ + "local mode as sufficient for scanned PDFs", + "GPU required" + ] + }, + { + "id": "eval-003", + "scenario": "A user reports that tables in their extracted output are broken — cells are merged incorrectly and some borderless tables are completely missing. They are currently using local mode with default settings.", + "user_input": "The tables in my extracted output look broken. Cells are getting merged together and some tables are missing entirely. 
I'm using the default settings.", + "expected_recommendations": [ + "Diagnose using the TEDS metric to confirm it is a table quality issue", + "First escalation: try --table-method cluster for borderless table detection", + "Second escalation: switch to hybrid mode (--hybrid docling-fast) with auto triage", + "Third escalation: use --hybrid-mode full to force all pages through the AI backend" + ], + "must_mention": [ + "--table-method cluster", + "hybrid", + "TEDS" + ], + "must_not_mention": [ + "this is a known limitation with no workaround", + "--use-struct-tree as a table fix" + ] + }, + { + "id": "eval-004", + "scenario": "A Node.js developer on Windows wants to use hybrid mode. They are unfamiliar with the two-process architecture and expect a single npm install to be sufficient.", + "user_input": "I'm using Node.js on Windows and want to set up hybrid mode. I installed @opendataloader/pdf but I'm not sure what else I need.", + "expected_recommendations": [ + "Explain that hybrid mode requires a separate Python server process (opendataloader-pdf-hybrid)", + "Provide a two-terminal setup: Terminal 1 for the Python hybrid server, Terminal 2 for the Node.js client", + "Include the pip install command for the server component", + "Confirm Java 11+ is required as a prerequisite" + ], + "must_mention": [ + "pip install", + "opendataloader-pdf-hybrid", + "two terminals", + "java" + ], + "must_not_mention": [ + "hybrid mode works with npm install alone", + "GPU required for basic hybrid setup" + ] + }, + { + "id": "eval-005", + "scenario": "A researcher processing math-heavy academic papers wants both LaTeX formula extraction and AI-generated descriptions of charts and figures. They ask what settings are needed.", + "user_input": "I'm processing academic papers with math formulas and charts. I need the formulas extracted as LaTeX and I want AI descriptions of the charts and figures. 
How do I set this up?",
+      "expected_recommendations": [
+        "Start the hybrid server with both --enrich-formula and --enrich-picture-description flags",
+        "Run the client with --hybrid-mode full (required for enrichments to apply)",
+        "Warn that enrichments are silently skipped if --hybrid-mode full is omitted from the client command",
+        "Use --hybrid docling-fast as the backend"
+      ],
+      "must_mention": [
+        "--enrich-formula",
+        "--enrich-picture-description",
+        "--hybrid-mode full",
+        "hybrid"
+      ],
+      "must_not_mention": [
+        "enrichments work in auto mode",
+        "enrichments are client-side options"
+      ]
+    }
+  ]
+}
diff --git a/skills/odl-pdf/references/eval-metrics.md b/skills/odl-pdf/references/eval-metrics.md
new file mode 100644
index 000000000..6f09c58ee
--- /dev/null
+++ b/skills/odl-pdf/references/eval-metrics.md
@@ -0,0 +1,186 @@
+# Evaluation Metrics Reference
+
+This document explains the metrics used in opendataloader-pdf benchmarks, how to interpret them, and how to diagnose quality problems using them.
+
+---
+
+## Metrics
+
+### NID — Normalized Inversion Distance
+
+**What it measures:** Reading order accuracy. Quantifies how well the extracted text preserves the correct reading sequence compared to the ground truth.
+
+**Intuition:** A PDF with two side-by-side columns must interleave text in the right column order, not left-to-right line by line across both columns. NID penalizes any reordering of the logical reading sequence.
+
+**Range:** 0–1. Higher is better. A score of 1.0 means extracted order exactly matches ground truth.
+
+**Typical failure modes:** Multi-column layouts, tables with merged cells, footnotes that appear inline, sidebars.
+
+---
+
+### TEDS — Tree-Edit Distance Similarity
+
+**What it measures:** Table structure accuracy. Measures the structural similarity between extracted table trees and ground-truth table trees using tree edit distance. 
+ +**Intuition:** A table with 3 rows and 4 columns must be reconstructed with the correct cell boundaries, spanning cells, and hierarchy. TEDS counts the minimum number of insertions, deletions, and substitutions needed to convert the extracted tree into the ground truth, then normalizes by tree size. + +**Range:** 0–1. Higher is better. A score of 1.0 means the extracted table structure is identical to ground truth. + +**Typical failure modes:** Borderless tables, merged/spanning cells, nested tables, tables that are actually images. + +--- + +### MHS — Markdown Heading Similarity + +**What it measures:** Heading structure accuracy. Measures how well the extracted heading hierarchy (h1, h2, h3) matches the ground truth. + +**Intuition:** A document with a clear section/subsection structure should produce headings at the correct levels. MHS compares the heading tree of the extracted output against the ground truth, penalizing both missing headings and incorrect level assignments. + +**Range:** 0–1. Higher is better. A score of 1.0 means all headings are correctly detected and assigned to the right level. + +**Typical failure modes:** PDFs that simulate headings using bold text (no semantic markup), decorative section dividers, heading text embedded in images. + +--- + +### Table Detection F1 + +**What it measures:** Precision and recall of table boundary detection. Precision = fraction of detected tables that are real tables. Recall = fraction of real tables that were detected. + +**Intuition:** F1 is the harmonic mean of precision and recall, balancing false positives (detecting non-tables as tables) against false negatives (missing tables entirely). Unlike TEDS, Table Detection F1 does not evaluate the internal structure — only whether the table region was found. + +**Range:** 0–1. Higher is better. + +**Typical failure modes:** Dense text blocks that resemble tables, tables that span page boundaries, very small tables. 
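As a quick illustration of the harmonic mean used here, the sketch below shows a generic F1 computation in Python (for intuition only; the benchmark scripts compute this for you, and the function is not part of the tool's API):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Detecting 8 of 10 real tables (recall = 0.8) with 8 of 10 detections
# correct (precision = 0.8) gives F1 = 0.8. A skewed detector with
# precision 1.0 but recall 0.2 scores only about 0.33 — the harmonic
# mean punishes imbalance between the two.
```

Because of this, a low F1 always means at least one of precision or recall is low, never that both are merely mediocre.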
+ +--- + +### Speed + +**What it measures:** Processing throughput in seconds per page. + +**Interpretation:** Lower is better. Scores vary significantly by mode: + +| Mode | Approximate throughput | +|------|----------------------| +| Local (no hybrid) | ~0.015 s/page | +| Hybrid `auto` (mixed document) | Varies; most pages stay at Java speed | +| Hybrid `full` | ~0.463 s/page | + +Speed is not normalized to 0–1. It is an absolute wall-clock measurement averaged over the benchmark document set. + +--- + +## Benchmark Reference Scores + +**200 real-world PDFs including multi-column layouts and scientific papers.** + +| Engine | Overall | NID (Reading Order) | TEDS (Table) | MHS (Heading) | Speed (s/page) | +|--------|---------|---------------------|--------------|---------------|----------------| +| **opendataloader [hybrid]** | **0.907** | **0.934** | **0.928** | 0.821 | 0.463 | +| opendataloader [local] | 0.831 | 0.902 | 0.489 | 0.739 | **0.015** | + +Full benchmark results and methodology: [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) + +--- + +## Diagnostic Guide: Which Metric Is Weak? + +Use this guide when extraction quality is below expectations. Start by identifying which metric is low, then follow the recommended steps. + +--- + +### Low NID — Reading Order Problems + +**Symptoms:** Text from different columns or sections is interleaved incorrectly. Paragraphs appear out of sequence. Footnotes appear in the wrong position. + +**Steps:** + +1. Check if the PDF is tagged. If it is, try `--use-struct-tree`. Tagged PDFs contain an explicit reading order tree that is usually more reliable than layout analysis. + + ```bash + opendataloader-pdf input.pdf --use-struct-tree + ``` + +2. For multi-column layouts, verify that the XY-Cut algorithm is active (it is the default). Ensure `--reading-order xycut` is set. + +3. 
For complex layouts where XY-Cut still fails, route the document through hybrid mode — the AI backend handles unusual layouts more robustly. + + ```bash + opendataloader-pdf --hybrid docling-fast input.pdf + ``` + +--- + +### Low TEDS — Table Quality Problems + +**Symptoms:** Tables are extracted as plain text. Cells are merged incorrectly. Columns are misaligned. Borderless tables are missed entirely. + +**Escalation path — try each step in order and stop when quality is acceptable:** + +1. **Enable cluster detection.** The default table method detects bordered tables. The `cluster` method adds borderless table detection. + + ```bash + opendataloader-pdf input.pdf --table-method cluster + ``` + +2. **Switch to hybrid mode.** If `cluster` is insufficient, route the document through the AI backend. Use `auto` mode first — it sends complex pages to the backend while keeping simple pages on the fast local path. + + ```bash + opendataloader-pdf --hybrid docling-fast input.pdf + ``` + +3. **Use hybrid full mode.** If `auto` mode still misses tables (because the triage step classifies them as simple), force all pages through the backend. + + ```bash + opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf + ``` + +--- + +### Low MHS — Heading Detection Problems + +**Symptoms:** Document headings are not recognized, appear as plain paragraphs, or are assigned to the wrong level (e.g., h1 instead of h2). + +**Steps:** + +1. Check whether the PDF uses real headings or simulated headings. Real headings are marked semantically in the PDF (large font, bold, specific style). Simulated headings are visually similar but have no semantic markup — they are just bold text at a larger font size, indistinguishable from the tool's perspective. + + - To check: open the PDF in a reader that exposes the tag tree (Adobe Acrobat > Accessibility > Reading Order, or use a preflight tool). If there is no tag tree, the headings are visual only. + +2. 
If the PDF is tagged and headings are still missed, try `--use-struct-tree`. This reads semantic structure directly from the PDF's tag tree.
+
+   ```bash
+   opendataloader-pdf input.pdf --use-struct-tree
+   ```
+
+3. If the PDF is untagged and headings are simulated with bold text, the heading structure cannot be recovered reliably from layout alone. Consider whether hybrid mode improves detection for your specific document class.
+
+---
+
+## Running Benchmarks
+
+### Full benchmark suite
+
+```bash
+./scripts/bench.sh
+```
+
+This script automatically clones [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) (which contains the benchmark PDFs and evaluation logic), runs extraction across all documents, and prints scores for each metric.
+
+Additional flags:
+
+```bash
+# Debug a specific document by ID
+./scripts/bench.sh --doc-id <doc-id>
+
+# CI mode: check against regression thresholds and exit non-zero on failure
+./scripts/bench.sh --check-regression
+```
+
+### Quick eval on your own documents
+
+```bash
+python skills/odl-pdf/scripts/quick-eval.py <pdf-files>
+```
+
+This script runs a subset evaluation suitable for rapid iteration. It processes a small representative sample and reports per-metric scores without requiring the full benchmark corpus.
diff --git a/skills/odl-pdf/references/format-guide.md b/skills/odl-pdf/references/format-guide.md
new file mode 100644
index 000000000..ff767a311
--- /dev/null
+++ b/skills/odl-pdf/references/format-guide.md
@@ -0,0 +1,61 @@
+# Output Format Guide
+
+opendataloader-pdf supports 7 output formats via the `format` option. This guide helps you choose the right format for your use case. 
+
+## Format Overview
+
+| Format | Best For | Bounding Boxes | Tables | Images |
+|---|---|---|---|---|
+| `json` | Programmatic processing, source citation | Yes | Structured | As references |
+| `text` | Plain text extraction, search indexing | No | Flattened | Omitted |
+| `html` | Web display | No | Native `<table>` | Inline |
+| `pdf` | Visual debugging of extraction results | Yes (annotated) | Highlighted | Preserved |
+| `markdown` | Documentation, RAG chunking | No | Markdown syntax | Omitted |
+| `markdown-with-html` | Complex tables in Markdown | No | HTML `<table>` | Omitted |
+| `markdown-with-images` | Documentation with visuals | No | Markdown syntax | Embedded/external |
+
+## Downstream Use Mapping
+
+Choose your format based on what you're building:
+
+| Use Case | Recommended Format | Notes |
+|---|---|---|
+| RAG + source citation | `json` | Bounding boxes enable precise page/region references |
+| RAG text chunking | `markdown` | Clean structure maps well to chunk boundaries |
+| LangChain integration | `text` | Use with `langchain-opendataloader-pdf`; format=text is the default |
+| Web display | `html` | Renders natively in browsers |
+| Quality / extraction debugging | `pdf` + `json` | Annotated PDF shows what was detected; JSON shows coordinates |
+| Plain text search | `text` | Smallest output, no markup overhead |
+| Documentation with images | `markdown-with-images` | Images embedded inline or written to a directory |
+| Complex table fidelity | `markdown-with-html` | Falls back to HTML tables where Markdown syntax loses structure |
+
+## Related Options
+
+These options affect output when using image-bearing or multi-page formats:
+
+- `image-output` — Controls whether images are embedded as base64 (`embedded`) or written to files (`external`).
+- `image-format` — Image encoding format for extracted images (e.g., `png`, `jpeg`).
+- `image-dir` — Directory path for externalized images when `image-output=external`.
+- `*-page-separator` — Format-specific option to insert a custom separator between pages (e.g., `markdown-page-separator`, `text-page-separator`).
+
+## Tips
+
+**Multiple formats in one call**
+
+You can produce multiple formats in a single invocation by passing a comma-separated list:
+
+```
+opendataloader-pdf input.pdf --format markdown,json
+```
+
+This avoids parsing the PDF twice and ensures both outputs are consistent.
+
+**Piping output with `--to-stdout`**
+
+Use `--to-stdout` to write output directly to standard output instead of a file. 
Useful for piping into other tools:
+
+```
+opendataloader-pdf input.pdf --format text --to-stdout | my-indexer
+```
+
+Note: `--to-stdout` supports a single output format only; it cannot be combined with a comma-separated `--format` list.
diff --git a/skills/odl-pdf/references/hybrid-guide.md b/skills/odl-pdf/references/hybrid-guide.md
new file mode 100644
index 000000000..899e7ac62
--- /dev/null
+++ b/skills/odl-pdf/references/hybrid-guide.md
@@ -0,0 +1,174 @@
+# Hybrid Mode Reference Guide
+
+Hybrid mode extends opendataloader-pdf by routing complex PDF pages to an external AI backend while keeping simple pages on the fast local Java path. This gives you the speed of the Java engine for most content, with AI-quality output for tables, scanned pages, formulas, and charts.
+
+---
+
+## Overview
+
+By default, opendataloader-pdf processes everything locally in Java. Hybrid mode adds a second processing path — a Python-based server running [docling-serve](https://github.com/DS4SD/docling-serve) — and routes pages between the two based on complexity.
+
+**When you need hybrid mode:**
+
+- PDFs with scanned or image-based pages (OCR required)
+- Complex table structures that the Java heuristics miss
+- Documents containing mathematical formulas (LaTeX extraction)
+- Charts or images that need AI-generated descriptions
+- Non-English documents requiring language-specific OCR
+
+---
+
+## Quick Setup
+
+Hybrid mode requires two running processes: the server and the client. 
+ +**Terminal 1 — Start the hybrid server:** + +```bash +# Install the server component +pip install opendataloader-pdf-hybrid + +# Start with defaults (port 5002) +opendataloader-pdf-hybrid --port 5002 +``` + +**Terminal 2 — Run the client:** + +```bash +# Basic hybrid: per-page triage, docling-fast backend +opendataloader-pdf --hybrid docling-fast input.pdf + +# Full mode: send all pages to the backend +opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf +``` + +The client connects to `http://localhost:5002` by default. No additional configuration is needed for a local setup. + +--- + +## Triage Modes + +Control how pages are routed with `--hybrid-mode`. + +| Mode | Flag | Behavior | +|------|------|----------| +| auto | `--hybrid-mode auto` | Per-page triage. Simple pages stay on Java; complex pages go to the backend. **Default.** | +| full | `--hybrid-mode full` | All pages go to the backend. Required for enrichment features. | + +### When to use `auto` + +`auto` is the default and works well for mixed documents. The triage strategy is conservative: it prefers to send borderline pages to the backend (minimizing missed complex content) at the cost of some extra backend calls. + +Expected throughput: +- Simple pages (Java path): ~0.015 s/page +- Complex pages (backend path): varies by content and hardware +- Overall for a mixed document: between the two extremes + +### When to use `full` + +Use `full` when you need enrichment features (`--enrich-formula`, `--enrich-picture-description`) or when the entire document is scanned and you want consistent OCR output across all pages. + +Expected throughput with `full`: approximately 0.5 s/page (depends on backend and GPU availability). + +> **Important:** `--enrich-formula` and `--enrich-picture-description` are server-side options, but they only take effect when the client is running with `--hybrid-mode full`. In `auto` mode, enrichments are silently skipped — no warning or error is shown. 
If your output is missing formulas or image descriptions, check that you have `--hybrid-mode full` set on the client side.
+
+---
+
+## Client Options
+
+| Option | Values | Default | Description |
+|--------|--------|---------|-------------|
+| `--hybrid <backend>` | `off`, `docling-fast` | `off` | Select the backend. `off` disables hybrid mode entirely. |
+| `--hybrid-mode <mode>` | `auto`, `full` | `auto` | Page routing strategy. |
+| `--hybrid-url <url>` | Any URL | `http://localhost:5002` | Override the server URL for remote or non-default setups. |
+| `--hybrid-timeout <ms>` | Integer | — | Request timeout in milliseconds. Set to `0` to disable timeout. |
+| `--hybrid-fallback` | Flag | Disabled | Fall back to the Java path if the backend returns an error. |
+
+---
+
+## Server Configuration
+
+All options are passed when starting `opendataloader-pdf-hybrid`.
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--port <port>` | `5002` | Port the server listens on. |
+| `--force-ocr` | Off | Run OCR on every page, even if the page has selectable text. Use this for scanned PDFs where embedded text is unreliable. |
+| `--ocr-lang "<languages>"` | `"en"` | Comma-separated language codes for OCR (e.g., `"ko,en"`). Improves accuracy for non-English documents. |
+| `--enrich-formula` | Off | Extract mathematical formulas as LaTeX. **Requires `--hybrid-mode full` on the client.** |
+| `--enrich-picture-description` | Off | Generate AI descriptions for charts and images. **Requires `--hybrid-mode full` on the client.** |
+
+**Example — scanned Korean document with formula extraction:**
+
+```bash
+# Server
+opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en" --enrich-formula
+
+# Client (must use --hybrid-mode full)
+opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
+```
+
+---
+
+## Troubleshooting
+
+### "Connection refused" or server not reachable
+
+The server is not running or is on a different port/host.
+
+1. 
Confirm the server started without errors in Terminal 1.
+2. Check the port matches on both sides (`--port` on server, `--hybrid-url` on client).
+3. For a remote server, ensure the host is reachable and the firewall allows the port.
+
+```bash
+# Test connectivity manually
+curl http://localhost:5002/health
+```
+
+### Request timeout
+
+The backend is taking longer than the configured timeout.
+
+- Increase the timeout: `--hybrid-timeout 30000` (30 seconds)
+- Or disable it: `--hybrid-timeout 0`
+- If this is persistent, check backend resource usage (CPU/GPU).
+
+### Formulas or image descriptions missing from output
+
+This is the most common silent failure. Enrichment options on the server are only applied when the client sends the page to the backend.
+
+- In `auto` mode, pages classified as simple stay on Java — enrichments are never applied to them.
+- **Fix:** Add `--hybrid-mode full` to your client command.
+
+No error or warning is emitted when enrichments are skipped. This is by design (the server processes what it receives), but it can be surprising.
+
+### Output quality is lower than expected for complex tables
+
+In `auto` mode, the triage heuristic may occasionally classify a complex table as simple. Switch to `--hybrid-mode full` to force all pages through the backend.
+
+---
+
+## Backend Registry
+
+| Backend | Status | Features |
+|---------|--------|----------|
+| `docling-fast` | Available | OCR, formula extraction (LaTeX), chart descriptions, table enhancement |
+| `hancom` | Planned | Hancom Document AI integration |
+| `azure` | Planned | Azure AI Document Intelligence |
+| `google` | Planned | Google Document AI |
+
+Backends are selected with `--hybrid <backend>`. Only one backend can be active per run. 
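Using the approximate per-page figures quoted in Quick Setup (~0.015 s on the Java path, ~0.5 s on the backend), a rough cost model for choosing between `auto` and `full` can be sketched. The constants and the routing assumption below are illustrative only; real timings depend on hardware, backend configuration, and document content:

```python
# Back-of-the-envelope wall-clock estimate for hybrid processing.
# The per-page constants are the approximate figures quoted above
# and are assumptions, not guarantees.
JAVA_S_PER_PAGE = 0.015     # local Java path
BACKEND_S_PER_PAGE = 0.5    # AI backend path (full mode, no GPU tuning)

def estimate_seconds(total_pages: int, complex_pages: int, mode: str = "auto") -> float:
    """Estimate processing time for one document under a given hybrid mode."""
    if mode == "full":
        # Every page goes to the backend.
        return total_pages * BACKEND_S_PER_PAGE
    # auto: triage keeps simple pages on the fast Java path.
    simple_pages = total_pages - complex_pages
    return simple_pages * JAVA_S_PER_PAGE + complex_pages * BACKEND_S_PER_PAGE

# A 100-page document with 10 complex pages:
#   auto: 90 * 0.015 + 10 * 0.5 = 6.35 s
#   full: 100 * 0.5 = 50.0 s
```

The gap widens with document size, which is why `auto` is the default and `full` is reserved for enrichment or uniform-OCR runs.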
+ +--- + +## Performance Notes + +| Processing path | Approximate throughput | +|-----------------|----------------------| +| Java only (no hybrid) | ~0.015 s/page | +| Hybrid `auto` (mixed document) | Varies; most pages stay at Java speed | +| Hybrid `full` | ~0.5 s/page (GPU-accelerated backend recommended) | + +Latency figures are approximate and depend on document complexity, available hardware, and backend configuration. Running the hybrid server on a machine with a GPU significantly reduces the per-page time in `full` mode. + +For throughput-sensitive workloads, use `auto` mode and reserve `full` mode for documents where enrichment or uniform OCR quality is required. diff --git a/skills/odl-pdf/references/installation-matrix.md b/skills/odl-pdf/references/installation-matrix.md new file mode 100644 index 000000000..1a8642e5c --- /dev/null +++ b/skills/odl-pdf/references/installation-matrix.md @@ -0,0 +1,98 @@ +# Installation Matrix + +This guide helps you choose the right installation method for your environment. + +## Decision Tree + +``` +Do you have Python available? +├── Yes +│ ├── Do you need LangChain integration? +│ │ └── Yes → pip install langchain-opendataloader-pdf +│ ├── Do you need hybrid server capability? +│ │ └── Yes → pip install "opendataloader-pdf[hybrid]" +│ └── Otherwise → pip install opendataloader-pdf (simplest) +├── Node.js only (no Python)? +│ └── npm install @opendataloader/pdf +├── Java project (Maven/Gradle)? +│ └── Add Maven dependency (see below) +└── Unsure? + └── pip install opendataloader-pdf (simplest, works on all platforms) +``` + +## Prerequisites + +**Java 11 or higher is required for all installation methods.** All methods spawn a JVM internally to perform PDF processing. + +If Java is missing when you run the tool, you will see: + +> Java 11 or higher is required. Please install a JDK for your environment. + +Install a JDK appropriate for your OS before proceeding. 
Verify with: + +``` +java -version +``` + +## Quick Start Commands + +### pip (Python) + +```bash +# Minimal install +pip install opendataloader-pdf + +# With hybrid server capability +pip install "opendataloader-pdf[hybrid]" + +# LangChain integration +pip install langchain-opendataloader-pdf +``` + +The `opendataloader-pdf` CLI command is included automatically with the pip install. + +### npm (Node.js) + +```bash +npm install @opendataloader/pdf +``` + +The `opendataloader-pdf` CLI command is included automatically with the npm install. + +### Maven (Java) + +Add to your `pom.xml`: + +```xml +<dependency> + <groupId>io.opendataloader</groupId> + <artifactId>opendataloader-pdf</artifactId> + <version>LATEST</version> +</dependency> +``` + +Replace `LATEST` with the specific version you want to pin. Check the [releases page](https://github.com/opendataloader/opendataloader-pdf/releases) for available versions. + +## Version Compatibility + +| Method | Minimum Runtime | CLI Included | +|---|---|---| +| pip | Python 3.8+ | Yes | +| pip [hybrid] | Python 3.8+ | Yes | +| pip langchain | Python 3.8+, LangChain 0.1+ | Yes | +| npm | Node.js 16+ | Yes | +| Maven | Java 11+ | No (library only) | + +All methods also require **Java 11+** regardless of the primary runtime. + +## Post-Install Verification + +After installing via pip or npm, confirm the CLI is working: + +``` +opendataloader-pdf --version +``` + +A successful output shows the installed version number. If the command is not found, ensure your package manager's bin directory is on your `PATH`. + +For Maven, verify the dependency resolves by running a build (`mvn compile`) and checking that no classpath errors are reported. diff --git a/skills/odl-pdf/references/options-matrix.md b/skills/odl-pdf/references/options-matrix.md new file mode 100644 index 000000000..f62dce7fd --- /dev/null +++ b/skills/odl-pdf/references/options-matrix.md @@ -0,0 +1,235 @@ +# ODL-PDF CLI Options Matrix + +This file contains a built-in summary of all 26 CLI options for the `opendataloader-pdf` tool. 
+If `options.json` is present in the project root, that file is the authoritative source — always +prefer it over the descriptions here. This document exists so the agent skill can reason about +options without loading the full JSON on every invocation. + +--- + +## Categories + +### IO — Input / Output Control + +Controls where data comes from and where results are written. + +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `output-dir` | `-o` | string | null (input file dir) | Directory where output files are written. Defaults to the same directory as the input file. | +| `to-stdout` | — | boolean | false | Write output to stdout instead of a file. Only valid with a single format. | +| `quiet` | `-q` | boolean | false | Suppress all console logging output. | +| `password` | `-p` | string | null | Password for encrypted PDF files. | +| `pages` | — | string | null (all) | Pages to extract, e.g. `"1,3,5-7"`. Defaults to all pages. | +| `format` | `-f` | string | json | Output format(s), comma-separated. Values: `json`, `text`, `html`, `pdf`, `markdown`, `markdown-with-html`, `markdown-with-images`. | + +--- + +### Quality — Extraction Quality + +Controls the accuracy and structure of the extracted content. + +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `table-method` | — | string | `default` | Table detection method. `default` = border-based; `cluster` = border + borderless cluster detection (slower). | +| `reading-order` | — | string | `xycut` | Reading order algorithm. `xycut` = XY-cut layout analysis; `off` = no reordering. | +| `use-struct-tree` | — | boolean | false | Use the PDF structure tree (tagged PDF) for reading order and semantic structure. Only effective on tagged PDFs. | + +--- + +### Safety — Security and Privacy + +Controls content filtering and sensitive data handling. 
+ +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `content-safety-off` | — | string | null | Disable specific content safety filters. Values: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg`. | +| `sanitize` | — | boolean | false | Replace emails, phone numbers, IP addresses, credit card numbers, and URLs with placeholders. | + +--- + +### Hybrid — AI Backend + +Options for routing pages through an optional AI enrichment server (e.g. formula OCR, picture descriptions). + +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `hybrid` | — | string | `off` | Hybrid backend to use. Values: `off`, `docling-fast`. Requires a running hybrid server. | +| `hybrid-mode` | — | string | `auto` | Triage mode. `auto` = dynamic page-level triage; `full` = send all pages to the backend (required for server-side enrichments). | +| `hybrid-url` | — | string | null | Override the default hybrid server URL. | +| `hybrid-timeout` | — | string | `0` | Per-request timeout in milliseconds (`0` = no timeout). | +| `hybrid-fallback` | — | boolean | false | Fall back to the Java extraction path if the hybrid backend returns an error. | + +--- + +### Output — Output Formatting + +Controls how images and page separators appear in output files. + +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `image-output` | — | string | `external` | Image output mode. `off` = skip images; `embedded` = Base64 data URIs inline; `external` = write separate image files and embed references. | +| `image-format` | — | string | `png` | Format for extracted images. Values: `png`, `jpeg`. | +| `image-dir` | — | string | null | Directory for extracted image files (used when `image-output` is `external`). | +| `markdown-page-separator` | — | string | null | String inserted between pages in Markdown output. Use `%page-number%` to include the page number. 
| +| `text-page-separator` | — | string | null | String inserted between pages in plain-text output. Use `%page-number%` for page numbers. | +| `html-page-separator` | — | string | null | String inserted between pages in HTML output. Use `%page-number%` for page numbers. | + +--- + +### Text — Text Processing + +Fine-grained control over how extracted text is cleaned and formatted. + +| Option | Short | Type | Default | Description | +|---|---|---|---|---| +| `keep-line-breaks` | — | boolean | false | Preserve the original line breaks from the PDF. By default, soft line breaks are merged. | +| `replace-invalid-chars` | — | string | `" "` (space) | Replacement character for invalid or unrecognized characters in the extracted text. | +| `include-header-footer` | — | boolean | false | Include page headers and footers in the output. Excluded by default. | +| `detect-strikethrough` | — | boolean | false | Detect strikethrough text (experimental). | + +--- + +## Interaction Rules + +These rules document option combinations that have non-obvious or silent failure modes. + +**1. Hybrid enrichments require `--hybrid-mode full`** + +Server-side enrichments such as `--enrich-formula` and `--enrich-picture-description` run on the +hybrid backend. On the client side, they are only applied if `--hybrid-mode full` is set. With the +default `auto` mode, pages that the triage step classifies as "simple" bypass the backend entirely, +and any enrichment instructions for those pages are silently ignored. If enrichments are missing +from the output, check that `--hybrid-mode full` is set. + +**2. `--hybrid` requires a running server** + +Setting `--hybrid docling-fast` (or any non-`off` value) without a reachable hybrid server will +cause requests to fail. Quick start: + +```bash +pip install "opendataloader-pdf[hybrid]" +opendataloader-pdf-hybrid --port 5002 +``` + +Then pass `--hybrid docling-fast --hybrid-url http://localhost:5002` to the client. + +**3. 
`--to-stdout` only works with a single format** + +`--to-stdout` writes the extracted content to standard output. It cannot be combined with +comma-separated `--format` values (e.g. `--format json,text`). Passing multiple formats with +`--to-stdout` will produce an error. When streaming output to another process, specify exactly one +format. + +**4. `--image-output embedded` produces large output for image-heavy PDFs** + +`embedded` mode encodes each image as a Base64 data URI and inlines it in the output document. +For PDFs with many or large images this can produce very large output files. Prefer `external` +(the default) unless the consumer requires self-contained output. + +**5. `--table-method cluster` may be slower** + +The `cluster` method adds borderless table detection on top of the default border-based approach. +It improves recall on tables without visible borders but increases processing time. Use `default` +when throughput matters and the PDFs have standard bordered tables. + +**6. `--use-struct-tree` has no effect on untagged PDFs** + +The structure tree option reads semantic order from the PDF's tag tree, which is only present in +tagged (accessible) PDFs. On untagged PDFs the option is silently ignored and the default layout +analysis is used instead. To check whether a PDF is tagged, inspect its document properties or +run a preflight check before enabling this option. + +--- + +## Common Combinations + +### RAG pipeline (retrieval-augmented generation) + +Extract clean, structured text with accurate reading order for vector indexing: + +```bash +opendataloader-pdf input.pdf \ + --format json \ + --reading-order xycut \ + --table-method cluster \ + --image-output off \ + --sanitize +``` + +Use `--sanitize` when the PDF may contain PII that should not enter the vector store. 
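
The RAG invocation above can be wrapped so a pipeline toggles PII sanitization per document. This is a sketch only; `build_rag_args` is a hypothetical helper, not part of the package, though every flag it emits comes from the options documented above:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: build the RAG-oriented argument list for one PDF,
# emitting one argument per line so callers can splice it into a command.
build_rag_args() {
  local pdf="$1" sanitize="${2:-yes}"
  local args=("$pdf" --format json --reading-order xycut
              --table-method cluster --image-output off)
  if [[ "$sanitize" == "yes" ]]; then
    args+=(--sanitize)   # strip PII before it can reach the vector store
  fi
  printf '%s\n' "${args[@]}"
}

# Usage (requires the CLI on PATH; arguments contain no spaces, so word
# splitting is safe here):
#   opendataloader-pdf $(build_rag_args report.pdf)
#   opendataloader-pdf $(build_rag_args public-doc.pdf no)
```

Keeping the argument list in one place makes it easy to audit which documents entered the index unsanitized.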
+ +--- + +### Accessibility audit (tagged PDF) + +Leverage the PDF's tag tree to validate semantic structure and export accessible HTML: + +```bash +opendataloader-pdf input.pdf \ + --format html \ + --use-struct-tree \ + --include-header-footer \ + --html-page-separator "" +``` + +--- + +### Quick plain-text extraction + +Minimal options for fast extraction of readable prose: + +```bash +opendataloader-pdf input.pdf \ + --format text \ + --quiet \ + --to-stdout +``` + +Pipe directly to downstream tools: `opendataloader-pdf input.pdf -f text -q --to-stdout | wc -w` + +--- + +### Markdown with images for documentation + +Export a Markdown file with embedded images, suitable for wikis or documentation sites: + +```bash +opendataloader-pdf input.pdf \ + --format markdown-with-images \ + --image-output external \ + --image-format png \ + --image-dir ./images \ + --output-dir ./output +``` + +--- + +### AI-enriched extraction (hybrid mode) + +Extract all pages through the hybrid backend for formula OCR and picture descriptions: + +```bash +opendataloader-pdf input.pdf \ + --format markdown \ + --hybrid docling-fast \ + --hybrid-mode full \ + --hybrid-url http://localhost:5002 \ + --hybrid-fallback +``` + +`--hybrid-fallback` ensures that if the server is temporarily unavailable, extraction continues +with the local Java backend rather than failing. + +--- + +### Selective page extraction for large PDFs + +Extract only a specific page range to reduce processing time: + +```bash +opendataloader-pdf large-report.pdf \ + --pages "1,5-10,15" \ + --format json \ + --output-dir ./extracted \ + --quiet +``` diff --git a/skills/odl-pdf/scripts/detect-env.sh b/skills/odl-pdf/scripts/detect-env.sh new file mode 100644 index 000000000..3856fe734 --- /dev/null +++ b/skills/odl-pdf/scripts/detect-env.sh @@ -0,0 +1,172 @@ +#!/usr/bin/env bash +# detect-env.sh — Cross-platform environment detection for the odl-pdf agent skill. +# Outputs key=value pairs (one per line) to stdout. 
No other output. +# Make this file executable: chmod +x detect-env.sh + +set -euo pipefail + +# --------------------------------------------------------------------------- +# OS detection +# --------------------------------------------------------------------------- +detect_os() { + local raw + raw="$(uname -s 2>/dev/null || echo "unknown")" + case "${raw}" in + Darwin*) echo "macos" ;; + Linux*) echo "linux" ;; + MINGW*|MSYS*|CYGWIN*) echo "windows" ;; + *) echo "linux" ;; # best-effort fallback + esac +} + +# --------------------------------------------------------------------------- +# Java version (java outputs version to stderr) +# --------------------------------------------------------------------------- +detect_java() { + if ! command -v java &>/dev/null; then + echo "none" + return + fi + local raw + raw="$(java -version 2>&1 | head -1)" + # Handles formats: + # openjdk version "21.0.3" ... + # java version "1.8.0_401" ... + # openjdk version "11.0.22" ... + local ver + ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1)" + if [[ -z "${ver}" ]]; then + echo "none" + return + fi + # Normalise legacy 1.x format → major only; otherwise keep major + if [[ "${ver}" =~ ^1\.([0-9]+) ]]; then + echo "${BASH_REMATCH[1]}" + else + # Extract leading integer(s) before the first dot + local major + major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+')" + echo "${major:-none}" + fi +} + +# --------------------------------------------------------------------------- +# Python version (try python3 first, then python) +# --------------------------------------------------------------------------- +detect_python() { + local cmd="" + if command -v python3 &>/dev/null; then + cmd="python3" + elif command -v python &>/dev/null; then + cmd="python" + else + echo "none" + return + fi + local raw + raw="$("${cmd}" --version 2>&1 | head -1)" + # e.g. 
"Python 3.12.4" + local ver + ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?'| head -1)" + echo "${ver:-none}" +} + +# --------------------------------------------------------------------------- +# Node version +# --------------------------------------------------------------------------- +detect_node() { + if ! command -v node &>/dev/null; then + echo "none" + return + fi + local raw + raw="$(node --version 2>/dev/null)" + # e.g. "v20.19.0" → strip leading 'v' + local ver + ver="$(printf '%s' "${raw}" | sed 's/^v//')" + echo "${ver:-none}" +} + +# --------------------------------------------------------------------------- +# ODL installed + version +# Tries CLI first, then Python module. +# --------------------------------------------------------------------------- +detect_odl() { + local installed="false" + local version="none" + + # Determine python binary + local pycmd="" + if command -v python3 &>/dev/null; then + pycmd="python3" + elif command -v python &>/dev/null; then + pycmd="python" + fi + + # Try the CLI entry-point first + local cli_ver="" + if command -v opendataloader-pdf &>/dev/null; then + cli_ver="$(opendataloader-pdf --version 2>/dev/null || true)" + fi + + if [[ -n "${cli_ver}" ]]; then + installed="true" + version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="${version:-none}" + elif [[ -n "${pycmd}" ]]; then + # Try python -m opendataloader_pdf --version + local mod_ver + mod_ver="$("${pycmd}" -m opendataloader_pdf --version 2>/dev/null || true)" + if [[ -n "${mod_ver}" ]]; then + installed="true" + version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="${version:-none}" + else + # Last resort: importlib.metadata + local meta_ver + meta_ver="$("${pycmd}" -c "import importlib.metadata; print(importlib.metadata.version('opendataloader-pdf'))" 2>/dev/null || true)" + if [[ -n "${meta_ver}" ]]; then + installed="true" + version="${meta_ver}" + fi + 
fi + fi + + printf '%s\n' "ODL_INSTALLED=${installed}" + printf '%s\n' "ODL_VERSION=${version}" +} + +# --------------------------------------------------------------------------- +# Hybrid extras — check for docling_serve (primary indicator) +# --------------------------------------------------------------------------- +detect_hybrid_extras() { + local pycmd="" + if command -v python3 &>/dev/null; then + pycmd="python3" + elif command -v python &>/dev/null; then + pycmd="python" + fi + + if [[ -z "${pycmd}" ]]; then + echo "HYBRID_EXTRAS=false" + return + fi + + local result + result="$("${pycmd}" -c "import docling_serve; print('ok')" 2>/dev/null || true)" + if [[ "${result}" == "ok" ]]; then + echo "HYBRID_EXTRAS=true" + else + echo "HYBRID_EXTRAS=false" + fi +} + +# --------------------------------------------------------------------------- +# Main — emit all key=value pairs +# --------------------------------------------------------------------------- +printf '%s\n' "OS=$(detect_os)" +printf '%s\n' "JAVA=$(detect_java)" +printf '%s\n' "PYTHON=$(detect_python)" +printf '%s\n' "NODE=$(detect_node)" +detect_odl +detect_hybrid_extras diff --git a/skills/odl-pdf/scripts/hybrid-health.sh b/skills/odl-pdf/scripts/hybrid-health.sh new file mode 100644 index 000000000..856c0b340 --- /dev/null +++ b/skills/odl-pdf/scripts/hybrid-health.sh @@ -0,0 +1,67 @@ +#!/usr/bin/env bash +# hybrid-health.sh +# Checks the health of a running opendataloader-pdf hybrid server. +# Works on Windows (Git Bash), macOS, and Linux. +# Outputs key=value pairs for machine readability. 
+ +set -euo pipefail + +DEFAULT_URL="http://localhost:5002" +HYBRID_URL="${DEFAULT_URL}" + +# Parse arguments +while [[ $# -gt 0 ]]; do + case "$1" in + --url) + HYBRID_URL="$2" + shift 2 + ;; + --url=*) + HYBRID_URL="${1#--url=}" + shift + ;; + *) + echo "Unknown argument: $1" >&2 + echo "Usage: $0 [--url <url>]" >&2 + exit 1 + ;; + esac +done + +HEALTH_ENDPOINT="${HYBRID_URL}/health" + +# Detect available HTTP client +_http_get_status() { + local url="$1" + if command -v curl &>/dev/null; then + curl --silent --output /dev/null --write-out "%{http_code}" \ + --max-time 5 --connect-timeout 3 "$url" 2>/dev/null + elif command -v wget &>/dev/null; then + wget --quiet --server-response --spider --timeout=5 "$url" 2>&1 \ + | awk '/HTTP\//{print $2}' | tail -1 + else + echo "none" + fi +} + +HTTP_STATUS=$(_http_get_status "${HEALTH_ENDPOINT}" || true) + +# Interpret result +if [[ -z "${HTTP_STATUS}" || "${HTTP_STATUS}" == "000" || "${HTTP_STATUS}" == "none" ]]; then + echo "HYBRID_SERVER=stopped" + echo "HYBRID_URL=${HYBRID_URL}" + echo "HYBRID_STATUS=none" + echo "" + echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002" + exit 0 +fi + +# Any 2xx response is considered running; other codes are an error state +if [[ "${HTTP_STATUS}" =~ ^2 ]]; then + echo "HYBRID_SERVER=running" +else + echo "HYBRID_SERVER=error" +fi + +echo "HYBRID_URL=${HYBRID_URL}" +echo "HYBRID_STATUS=${HTTP_STATUS}" diff --git a/skills/odl-pdf/scripts/quick-eval.py b/skills/odl-pdf/scripts/quick-eval.py new file mode 100644 index 000000000..03d5db058 --- /dev/null +++ b/skills/odl-pdf/scripts/quick-eval.py @@ -0,0 +1,284 @@ +#!/usr/bin/env python3 +"""Quick quality evaluation script for opendataloader-pdf output. + +Compares extracted text against a ground truth file and reports a similarity +score. Uses difflib.SequenceMatcher from the Python standard library by default. 
+If rapidfuzz is installed, it computes a more accurate Normalized Information +Distance (NID) score instead. + +Usage: + python quick-eval.py extracted.md ground-truth.md + python quick-eval.py extracted.md ground-truth.md --verbose + python quick-eval.py extracted.md ground-truth.md --threshold 0.90 +""" + +import argparse +import difflib +import re +import sys +from pathlib import Path + +# --------------------------------------------------------------------------- +# Optional rapidfuzz import — used for NID scoring when available +# --------------------------------------------------------------------------- +try: + from rapidfuzz.distance import Levenshtein + + _RAPIDFUZZ_AVAILABLE = True +except ImportError: + _RAPIDFUZZ_AVAILABLE = False + + +# --------------------------------------------------------------------------- +# Score thresholds and their human-readable interpretations +# --------------------------------------------------------------------------- +SCORE_LEVELS = [ + (0.95, "Excellent", "Output closely matches the ground truth."), + (0.85, "Good", "Minor differences; output is usable as-is."), + (0.70, "Fair", "Noticeable differences — consider hybrid mode or different options."), + (0.00, "Poor", "Significant quality issues — review extraction settings."), +] + + +def normalize(text: str) -> str: + """Collapse runs of whitespace to a single space and strip leading/trailing + whitespace. 
This makes the comparison insensitive to cosmetic formatting + differences such as extra blank lines or trailing spaces.""" + return re.sub(r"\s+", " ", text).strip() + + +def read_file(path: Path) -> str: + """Read a text file and return its content, normalized.""" + try: + raw = path.read_text(encoding="utf-8") + except UnicodeDecodeError: + # Fall back to Latin-1 for PDFs extracted without explicit encoding + raw = path.read_text(encoding="latin-1") + return normalize(raw) + + +def compute_similarity_stdlib(extracted: str, ground_truth: str) -> float: + """Return a similarity ratio in [0, 1] using difflib.SequenceMatcher. + + The ratio is defined as 2 * M / T, where M is the number of matching + characters and T is the total number of characters in both sequences. + This is equivalent to 1 - NID when strings share large common blocks. + """ + return difflib.SequenceMatcher(None, extracted, ground_truth, autojunk=False).ratio() + + +def compute_similarity_rapidfuzz(extracted: str, ground_truth: str) -> float: + """Return a similarity score in [0, 1] using rapidfuzz Levenshtein distance. + + Computes Normalized Information Distance: + NID = edit_distance / max(len(a), len(b)) + The similarity score returned is 1 - NID, so higher is better. + """ + max_len = max(len(extracted), len(ground_truth)) + if max_len == 0: + return 1.0 + distance = Levenshtein.distance(extracted, ground_truth) + nid = distance / max_len + return max(0.0, 1.0 - nid) + + +def compute_similarity(extracted: str, ground_truth: str) -> tuple[float, str]: + """Compute similarity score using the best available method. + + Returns: + (score, method_name) where score is in [0, 1]. 
+ """ + if _RAPIDFUZZ_AVAILABLE: + return compute_similarity_rapidfuzz(extracted, ground_truth), "NID (rapidfuzz)" + return compute_similarity_stdlib(extracted, ground_truth), "SequenceMatcher ratio (difflib)" + + +def interpret_score(score: float) -> tuple[str, str]: + """Return (label, description) for a given score.""" + for threshold, label, description in SCORE_LEVELS: + if score >= threshold: + return label, description + # Should never reach here, but guard anyway + return "Poor", SCORE_LEVELS[-1][2] + + +def diff_snippets(extracted: str, ground_truth: str, max_snippets: int = 5) -> list[str]: + """Return up to max_snippets diff hunks for low-scoring sections. + + Uses difflib.unified_diff on word-tokenised lines so the output is readable + even for long single-line documents. + """ + # Re-wrap into ~80-char logical lines for readability + def wrap_words(text: str, width: int = 80) -> list[str]: + words = text.split() + lines: list[str] = [] + line: list[str] = [] + length = 0 + for word in words: + if length + len(word) + 1 > width and line: + lines.append(" ".join(line)) + line = [word] + length = len(word) + else: + line.append(word) + length += len(word) + 1 + if line: + lines.append(" ".join(line)) + return lines + + ext_lines = wrap_words(extracted) + gt_lines = wrap_words(ground_truth) + + diff = list( + difflib.unified_diff( + gt_lines, + ext_lines, + fromfile="ground-truth", + tofile="extracted", + lineterm="", + n=2, + ) + ) + + # Collect individual hunks (separated by @@ markers) + snippets: list[str] = [] + current_hunk: list[str] = [] + for line in diff: + if line.startswith("@@") and current_hunk: + snippets.append("\n".join(current_hunk)) + current_hunk = [line] + if len(snippets) >= max_snippets: + break + else: + current_hunk.append(line) + if current_hunk and len(snippets) < max_snippets: + snippets.append("\n".join(current_hunk)) + + return snippets + + +def build_report( + extracted_path: Path, + ground_truth_path: Path, + score: float, + 
method: str, + threshold: float, + verbose: bool, + extracted: str, + ground_truth: str, +) -> str: + """Assemble the formatted report string.""" + label, description = interpret_score(score) + passed = score >= threshold + status = "PASS" if passed else "FAIL" + + lines = [ + "=" * 60, + "ODL-PDF Quick Quality Evaluation", + "=" * 60, + f"Extracted: {extracted_path}", + f"Ground truth: {ground_truth_path}", + f"Method: {method}", + "-" * 60, + f"Score: {score:.4f} [{label}]", + f"Threshold: {threshold:.2f}", + f"Result: {status}", + "-" * 60, + f"Interpretation: {description}", + ] + + if not passed: + lines.append("") + lines.append("Suggestions:") + if score < 0.70: + lines.append(" - Try --hybrid docling-fast for better OCR coverage.") + lines.append(" - Check --format is appropriate for this document type.") + lines.append(" - Inspect whether the PDF is scanned (image-only) vs. native.") + elif score < 0.85: + lines.append(" - Consider --hybrid docling-fast or --table-method cluster.") + lines.append(" - Try --use-struct-tree if the PDF is tagged (accessible).") + + if verbose: + lines.append("") + lines.append("Diff snippets (ground-truth → extracted):") + snippets = diff_snippets(extracted, ground_truth) + if snippets: + for i, snippet in enumerate(snippets, 1): + lines.append(f"\n--- Hunk {i} ---") + lines.append(snippet) + else: + lines.append(" (no differences found)") + + lines.append("=" * 60) + return "\n".join(lines) + + +def parse_args(argv: list[str] | None = None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Compare ODL-PDF extracted output against a ground truth file.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__, + ) + parser.add_argument( + "extracted", + type=Path, + help="Path to the extracted text file produced by opendataloader-pdf.", + ) + parser.add_argument( + "ground_truth", + type=Path, + help="Path to the ground truth reference file.", + ) + parser.add_argument( + "--threshold", 
+ type=float, + default=0.85, + metavar="T", + help="Pass/fail threshold in [0, 1]. Default: 0.85.", + ) + parser.add_argument( + "--verbose", + action="store_true", + help="Show diff snippets for sections where the files diverge.", + ) + return parser.parse_args(argv) + + +def main(argv: list[str] | None = None) -> int: + args = parse_args(argv) + + # Validate input paths + if not args.extracted.is_file(): + print(f"ERROR: Extracted file not found: {args.extracted}", file=sys.stderr) + return 2 + if not args.ground_truth.is_file(): + print(f"ERROR: Ground truth file not found: {args.ground_truth}", file=sys.stderr) + return 2 + if not (0.0 <= args.threshold <= 1.0): + print(f"ERROR: --threshold must be between 0 and 1, got {args.threshold}", file=sys.stderr) + return 2 + + extracted = read_file(args.extracted) + ground_truth = read_file(args.ground_truth) + + score, method = compute_similarity(extracted, ground_truth) + + report = build_report( + extracted_path=args.extracted, + ground_truth_path=args.ground_truth, + score=score, + method=method, + threshold=args.threshold, + verbose=args.verbose, + extracted=extracted, + ground_truth=ground_truth, + ) + + print(report) + + # Exit 0 = pass, 1 = fail (score below threshold) + return 0 if score >= args.threshold else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/odl-pdf/scripts/sync-skill-refs.py b/skills/odl-pdf/scripts/sync-skill-refs.py new file mode 100644 index 000000000..cf7ab5925 --- /dev/null +++ b/skills/odl-pdf/scripts/sync-skill-refs.py @@ -0,0 +1,195 @@ +#!/usr/bin/env python3 +"""Drift detection script for the ODL-PDF agent skill. + +Compares the option names declared in options.json (the authoritative source) +against the option names documented in skills/odl-pdf/references/options-matrix.md. + +Any mismatch means the skill reference is out of sync with the actual CLI — +a condition referred to here as "drift". 
Run this script in CI after any +change to options.json or options-matrix.md. + +Usage: + python sync-skill-refs.py + python sync-skill-refs.py --options-json path/to/options.json \ + --matrix path/to/options-matrix.md + +Exit codes: + 0 No drift detected. + 1 Drift detected (new or removed options). + 2 Input error (file not found, invalid JSON, etc.). +""" + +import argparse +import io +import json +import re +import sys +from pathlib import Path + +# Reconfigure stdout to UTF-8 when running on Windows with a legacy code page +# so that Unicode symbols (checkmark, cross) print correctly in all terminals. +if hasattr(sys.stdout, "reconfigure"): + try: + sys.stdout.reconfigure(encoding="utf-8") + except Exception: + pass + +# --------------------------------------------------------------------------- +# Defaults — resolved relative to this script's location so the script works +# when invoked from any directory. +# --------------------------------------------------------------------------- +_SCRIPT_DIR = Path(__file__).parent.resolve() +# skills/odl-pdf/scripts/ → project root is three levels up +_PROJECT_ROOT = _SCRIPT_DIR.parent.parent.parent + +DEFAULT_OPTIONS_JSON = _PROJECT_ROOT / "options.json" +DEFAULT_MATRIX = _SCRIPT_DIR.parent / "references" / "options-matrix.md" + + +# --------------------------------------------------------------------------- +# Parsing helpers +# --------------------------------------------------------------------------- + +def load_option_names_from_json(path: Path) -> set[str]: + """Return the set of option names declared in options.json. + + Expects the file to contain a top-level object with an "options" array, + where each element has a "name" field. Example: + + { "options": [ { "name": "output-dir", ... }, ... 
] } + """ + try: + data = json.loads(path.read_text(encoding="utf-8")) + except json.JSONDecodeError as exc: + print(f"ERROR: Failed to parse {path}: {exc}", file=sys.stderr) + sys.exit(2) + + options = data.get("options") + if not isinstance(options, list): + print( + f"ERROR: {path} does not contain a top-level 'options' array.", + file=sys.stderr, + ) + sys.exit(2) + + names: set[str] = set() + for i, item in enumerate(options): + if not isinstance(item, dict) or "name" not in item: + print( + f"ERROR: options[{i}] in {path} is missing the 'name' field.", + file=sys.stderr, + ) + sys.exit(2) + names.add(item["name"]) + + return names + + +def load_option_names_from_matrix(path: Path) -> set[str]: + """Return the set of option names found in options-matrix.md. + + Scans all Markdown table rows and extracts backtick-quoted option names + from the first column. Rows that contain only header separators (---) are + skipped. + + Expected table format (any number of columns): + | `option-name` | ... | + """ + text = path.read_text(encoding="utf-8") + + names: set[str] = set() + + # Match table rows whose first cell contains a backtick-quoted token. + # This pattern is intentionally permissive so it works even if the table + # adds extra spaces or alignment padding. + row_pattern = re.compile( + r"^\s*\|\s*`([^`]+)`", # | `option-name` (first cell, backtick-quoted) + re.MULTILINE, + ) + for match in row_pattern.finditer(text): + candidate = match.group(1).strip() + # Skip tokens that look like option values rather than names. + # Option names always contain at least one letter and may contain + # hyphens but not spaces or equals signs. 
+ if re.fullmatch(r"[a-z][a-z0-9-]*", candidate): + names.add(candidate) + + return names + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +def parse_args(argv: list[str] | None = None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Detect drift between options.json and the skill reference matrix.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__, + ) + parser.add_argument( + "--options-json", + type=Path, + default=DEFAULT_OPTIONS_JSON, + metavar="PATH", + help=f"Path to options.json. Default: {DEFAULT_OPTIONS_JSON}", + ) + parser.add_argument( + "--matrix", + type=Path, + default=DEFAULT_MATRIX, + metavar="PATH", + help=f"Path to options-matrix.md. Default: {DEFAULT_MATRIX}", + ) + return parser.parse_args(argv) + + +def main(argv: list[str] | None = None) -> int: + args = parse_args(argv) + + # Validate input paths + if not args.options_json.is_file(): + print(f"ERROR: options.json not found: {args.options_json}", file=sys.stderr) + return 2 + if not args.matrix.is_file(): + print(f"ERROR: options-matrix.md not found: {args.matrix}", file=sys.stderr) + return 2 + + print("Checking skill drift...") + + json_names = load_option_names_from_json(args.options_json) + matrix_names = load_option_names_from_matrix(args.matrix) + + print(f"options.json: {len(json_names)} options") + print(f"options-matrix.md: {len(matrix_names)} options") + + # Compute drift sets + new_options = sorted(json_names - matrix_names) # in JSON but not in matrix + removed_options = sorted(matrix_names - json_names) # in matrix but not in JSON + + drift_detected = bool(new_options or removed_options) + + if not drift_detected: + print("\u2713 No drift detected.") + return 0 + + # Report drift + if new_options: + print(f"\nNEW options (in options.json, not in skill):") + for name in new_options: + print(f" - {name}") + + if 
removed_options:
+        print(f"\nREMOVED options (in skill, not in options.json):")
+        for name in removed_options:
+            print(f"  - {name}")
+
+    print(
+        "\n\u2717 Drift detected. "
+        "Update skills/odl-pdf/references/options-matrix.md to match options.json."
+    )
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From faf526c6b096d5da927d43fd7bab14307fb4bbad Mon Sep 17 00:00:00 2001
From: hyunhee-jo
Date: Thu, 9 Apr 2026 13:09:30 +0900
Subject: [PATCH 02/13] refactor: apply review feedback (round 1)

Objective: Code review identified 10 issues across CI workflow, shell
scripts, SKILL.md references, and .gitignore.

Approach: Accept 6 items, reject 4 items with technical reasoning.

- CI: remove nonexistent --fix flag from error message, fix unreachable
  exit-code logic with set +e
- Shell: add executable bit (100755) to detect-env.sh, hybrid-health.sh
- SKILL.md: add eval-metrics.md to reference table, fix quick-eval.py
  usage from --input/--reference flags to positional args
- .gitignore: remove duplicate __pycache__ entry (already covered by
  **/__pycache__/ on line 53)

Evidence: git ls-tree confirms 100755 for .sh files. grep confirms no
remaining --fix references in CI. SKILL.md quick-eval.py example matches
argparse positional interface.
Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/skill-drift-check.yml | 6 ++++-- .gitignore | 2 -- skills/odl-pdf/SKILL.md | 5 ++--- skills/odl-pdf/scripts/detect-env.sh | 0 skills/odl-pdf/scripts/hybrid-health.sh | 0 5 files changed, 6 insertions(+), 7 deletions(-) mode change 100644 => 100755 skills/odl-pdf/scripts/detect-env.sh mode change 100644 => 100755 skills/odl-pdf/scripts/hybrid-health.sh diff --git a/.github/workflows/skill-drift-check.yml b/.github/workflows/skill-drift-check.yml index cc143a565..00bc7e001 100644 --- a/.github/workflows/skill-drift-check.yml +++ b/.github/workflows/skill-drift-check.yml @@ -25,10 +25,12 @@ jobs: - name: Check skill drift run: | + set +e python skills/odl-pdf/scripts/sync-skill-refs.py - if [ $? -ne 0 ]; then + EXIT_CODE=$? + if [ $EXIT_CODE -ne 0 ]; then echo "" echo "Drift detected: skill references are out of sync with options.json." - echo "Run 'python skills/odl-pdf/scripts/sync-skill-refs.py --fix' locally to update them." + echo "Update skills/odl-pdf/references/options-matrix.md to match options.json." exit 1 fi diff --git a/.gitignore b/.gitignore index ff24dbeb1..1cdc2084a 100644 --- a/.gitignore +++ b/.gitignore @@ -75,5 +75,3 @@ logs/ # Configuration files .claude/settings.local.json .claude/plans/ - -skills/odl-pdf/scripts/__pycache__/ diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md index 0ee0efc2f..2b665497a 100644 --- a/skills/odl-pdf/SKILL.md +++ b/skills/odl-pdf/SKILL.md @@ -363,9 +363,7 @@ When extraction quality is inadequate. Start with measurement, then escalate. Run the quick evaluation script against your output: ```bash -python skills/odl-pdf/scripts/quick-eval.py \ - --input output/document.json \ - --reference ground-truth.json +python skills/odl-pdf/scripts/quick-eval.py output/document.md ground-truth.md ``` Or run the full benchmark to get NID, TEDS, and MHS scores: @@ -686,6 +684,7 @@ Load these files progressively — only when entering the relevant topic. 
Do not | `references/options-matrix.md` | User needs detailed option documentation, defaults, or interactions | | `references/hybrid-guide.md` | User needs hybrid server setup, server-side flags, or remote deployment | | `references/format-guide.md` | User needs output format comparison, format-specific behavior, or format selection | +| `references/eval-metrics.md` | User needs detailed metric definitions (NID, TEDS, MHS), benchmark scores, or diagnostic steps by metric | | `scripts/detect-env.sh` | Phase 1 environment detection — run at session start | | `scripts/quick-eval.py` | Phase 4 quality measurement — run when diagnosing extraction quality | | `evals/` | Benchmark baselines and regression thresholds | diff --git a/skills/odl-pdf/scripts/detect-env.sh b/skills/odl-pdf/scripts/detect-env.sh old mode 100644 new mode 100755 diff --git a/skills/odl-pdf/scripts/hybrid-health.sh b/skills/odl-pdf/scripts/hybrid-health.sh old mode 100644 new mode 100755 From 98c6cc75e402ecc904e8ff05818fba86a8f80ada Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 9 Apr 2026 13:55:01 +0900 Subject: [PATCH 03/13] refactor: apply PR review feedback (round 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: CodeRabbit and CodeQL identified 24 issues across skill files — factual errors, security, and edge cases. Approach: Accept 15 items, reject 9 with technical reasoning. 
Critical fixes:

- Python/Node.js/Java API examples replaced with actual APIs
  (opendataloader_pdf.convert, convert from @opendataloader/pdf,
  OpenDataLoaderPDF.processFile)
- NID renamed to Normalized Indel Distance (was Information/Inversion)
- MHS renamed to Markdown Heading Similarity (was Mean)
- Hybrid install corrected to pip install "opendataloader-pdf[hybrid]"
  (was separate pip install opendataloader-pdf-hybrid)
- detect-env.sh: grep pipefail fix (|| true), unknown OS fallback,
  hybrid detection via import docling (was docling_serve)

Other fixes:

- CI: added permissions { contents: read }
- installation-matrix.md: Python 3.10+, Node 20.19+, Maven coordinates
- hybrid-guide.md: timeout default 0, docling-serve description
- SKILL.md: jq pipe format, bash continuation, env table none/missing
- hybrid-health.sh: --url guard, dynamic port hint
- format-guide.md: image-output values corrected
- installation-matrix.md: GitHub URL corrected

Rejected (with pushback):

- CI trigger expansion to options-matrix.md (false positives)
- markdownlint code block language tags (no markdownlint CI)
- SKILL.md top-level await concern (example snippet, not executed)
- contextlib.suppress suggestion (reviewer agreed acceptable)
- f-string placeholder linting (no functional impact)
- hybrid-health.sh exit code (key=value parsing, no caller)
- CI auto-PR creation (v1.0 scope, tracked as follow-up)

Evidence: Opus audit verified 17 previously-flagged checks + 9 fresh
scan checks all PASS. Drift check 26/26 options in sync.
Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/skill-drift-check.yml | 3 + skills/odl-pdf/SKILL.md | 68 ++++++++----------- skills/odl-pdf/references/eval-metrics.md | 6 +- skills/odl-pdf/references/format-guide.md | 6 +- skills/odl-pdf/references/hybrid-guide.md | 10 +-- .../odl-pdf/references/installation-matrix.md | 14 ++-- skills/odl-pdf/scripts/detect-env.sh | 16 ++--- skills/odl-pdf/scripts/hybrid-health.sh | 6 +- skills/odl-pdf/scripts/quick-eval.py | 4 +- 9 files changed, 65 insertions(+), 68 deletions(-) diff --git a/.github/workflows/skill-drift-check.yml b/.github/workflows/skill-drift-check.yml index 00bc7e001..757247a55 100644 --- a/.github/workflows/skill-drift-check.yml +++ b/.github/workflows/skill-drift-check.yml @@ -13,6 +13,9 @@ on: - 'options.json' workflow_dispatch: +permissions: + contents: read + jobs: check-drift: runs-on: ubuntu-latest diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md index 2b665497a..cbb7bdfde 100644 --- a/skills/odl-pdf/SKILL.md +++ b/skills/odl-pdf/SKILL.md @@ -61,9 +61,9 @@ The script outputs key=value pairs. 
Parse these fields: | Key | Meaning | |-----|---------| | `OS` | Operating system (linux, macos, windows) | -| `JAVA` | Java version detected (e.g., `17.0.9`) or `missing` | -| `PYTHON` | Python version or `missing` | -| `NODE` | Node.js version or `missing` | +| `JAVA` | Java major version (e.g., `21`) or `none` | +| `PYTHON` | Python version (e.g., `3.12.4`) or `none` | +| `NODE` | Node.js version (e.g., `20.19.0`) or `none` | | `ODL_INSTALLED` | `true` or `false` | | `ODL_VERSION` | Installed version (e.g., `2.3.1`) or `none` | | `HYBRID_EXTRAS` | `true` if `[hybrid]` extras are installed | @@ -251,36 +251,27 @@ opendataloader-pdf input.pdf \ **Python:** ```python -from opendataloader_pdf import PdfConverter, ConversionOptions - -options = ConversionOptions( - format=["markdown"], - hybrid="docling-fast", - output_dir="./output" +import opendataloader_pdf + +# Batch all files in one call — each convert() spawns a JVM, so repeated calls are slow +opendataloader_pdf.convert( + input_path=["file1.pdf", "file2.pdf", "file3.pdf"], + output_dir="./output", + format="markdown", + hybrid="docling-fast" ) - -converter = PdfConverter(options) - -# Process all files in a single batch call — avoids multiple JVM startups -results = converter.convert(["file1.pdf", "file2.pdf", "file3.pdf"]) - -for result in results: - print(result.markdown) ``` **Node.js:** ```javascript -const { PdfConverter } = require('@opendataloader/pdf'); +import { convert } from '@opendataloader/pdf'; -const converter = new PdfConverter({ - format: ['markdown'], - hybrid: 'docling-fast', - outputDir: './output' +// Batch all files in one call — each convert() spawns a JVM, so repeated calls are slow +await convert(['file1.pdf', 'file2.pdf'], { + outputDir: './output', + format: 'markdown', + hybrid: 'docling-fast' }); - -// Batch all files in one call -const results = await converter.convert(['file1.pdf', 'file2.pdf']); -results.forEach(r => console.log(r.markdown)); ``` **LangChain integration:** @@ 
-299,16 +290,15 @@ documents = loader.load() **Java (Maven project):** ```java -PdfConversionOptions options = PdfConversionOptions.builder() - .format(List.of("markdown")) - .hybrid("docling-fast") - .outputDir(Path.of("./output")) - .build(); +import org.opendataloader.pdf.api.Config; +import org.opendataloader.pdf.api.OpenDataLoaderPDF; + +Config config = new Config(); +config.setOutputDir("./output"); +config.setFormat("markdown"); +config.setHybrid("docling-fast"); -PdfConverter converter = new PdfConverter(options); -List results = converter.convert(List.of( - Path.of("file1.pdf"), Path.of("file2.pdf") -)); +OpenDataLoaderPDF.processFile("file1.pdf", config); ``` ### 3B. Action Mode @@ -553,7 +543,7 @@ opendataloader-pdf input.pdf --format markdown --quiet **Stdout for pipe-based workflows** — single format, output to stdout: ```bash -opendataloader-pdf input.pdf --format markdown --to-stdout | jq . +opendataloader-pdf input.pdf --format json --to-stdout | jq . ``` **Page range extraction** — process only relevant pages: @@ -602,7 +592,7 @@ Do NOT recommend specific distributions or provide download links. 
# Client opendataloader-pdf input.pdf \ --hybrid docling-fast \ - --hybrid-mode full \ # required for enrichments + --hybrid-mode full \ --format markdown # Server (started separately) @@ -697,9 +687,9 @@ When running benchmarks or evaluating extraction quality, these are the five met | Metric | Full Name | What It Measures | Target | |--------|-----------|-----------------|--------| -| NID | Normalized Inversion Distance | Reading order correctness (sequence of extracted elements) | Higher is better (max 1.0) | +| NID | Normalized Indel Distance | Reading order correctness (sequence of extracted elements) | Higher is better (max 1.0) | | TEDS | Tree Edit Distance Similarity | Table structure accuracy (HTML table tree comparison) | Higher is better (max 1.0) | -| MHS | Mean Heading Similarity | Heading hierarchy accuracy (section structure) | Higher is better (max 1.0) | +| MHS | Markdown Heading Similarity | Heading hierarchy accuracy (section structure) | Higher is better (max 1.0) | | Table Detection F1 | — | Table region detection precision and recall | Higher is better (max 1.0) | | Speed | Pages/second | Extraction throughput | Context-dependent | diff --git a/skills/odl-pdf/references/eval-metrics.md b/skills/odl-pdf/references/eval-metrics.md index 6f09c58ee..f44a02eb1 100644 --- a/skills/odl-pdf/references/eval-metrics.md +++ b/skills/odl-pdf/references/eval-metrics.md @@ -6,7 +6,7 @@ This document explains the metrics used in opendataloader-pdf benchmarks, how to ## Metrics -### NID — Normalized Information Distance +### NID — Normalized Indel Distance **What it measures:** Reading order accuracy. Quantifies how well the extracted text preserves the correct reading sequence compared to the ground truth. 
@@ -180,7 +180,7 @@ Additional flags: ### Quick eval on your own documents ```bash -python skills/odl-pdf/scripts/quick-eval.py +python skills/odl-pdf/scripts/quick-eval.py extracted.md ground-truth.md ``` -This script runs a subset evaluation suitable for rapid iteration. It processes a small representative sample and reports per-metric scores without requiring the full benchmark corpus. +This script compares an extracted file against a ground truth reference using text similarity (difflib by default, rapidfuzz if available). It reports a similarity score with pass/fail against a configurable threshold (default 0.85). Use `--verbose` for diff snippets. diff --git a/skills/odl-pdf/references/format-guide.md b/skills/odl-pdf/references/format-guide.md index ff767a311..d878f6ed8 100644 --- a/skills/odl-pdf/references/format-guide.md +++ b/skills/odl-pdf/references/format-guide.md @@ -33,9 +33,9 @@ Choose your format based on what you're building: These options affect output when using image-bearing or multi-page formats: -- `image-output` — Controls whether images are embedded (base64) or written to files (`dir`). -- `image-format` — Image encoding format for extracted images (e.g., `png`, `jpeg`). -- `image-dir` — Directory path for externalized images when `image-output=dir`. +- `image-output` — Controls whether images are off, embedded (base64), or written to external files. Values: `off`, `embedded`, `external` (default). +- `image-format` — Image encoding format for extracted images. Values: `png` (default), `jpeg`. +- `image-dir` — Directory path for externalized images when `image-output=external`. - `*-page-separator` — Format-specific option to insert a custom separator between pages (e.g., `markdown-page-separator`, `text-page-separator`). 
## Tips diff --git a/skills/odl-pdf/references/hybrid-guide.md b/skills/odl-pdf/references/hybrid-guide.md index 899e7ac62..50d98583f 100644 --- a/skills/odl-pdf/references/hybrid-guide.md +++ b/skills/odl-pdf/references/hybrid-guide.md @@ -6,7 +6,7 @@ Hybrid mode extends opendataloader-pdf by routing complex PDF pages to an extern ## Overview -By default, opendataloader-pdf processes everything locally in Java. Hybrid mode adds a second processing path — a Python-based server running [docling-serve](https://github.com/DS4SD/docling-serve) — and routes pages between the two based on complexity. +By default, opendataloader-pdf processes everything locally in Java. Hybrid mode adds a second processing path — a built-in Python server (`opendataloader-pdf-hybrid`) that uses the docling library internally — and routes pages between the two based on complexity. **When you need hybrid mode:** @@ -25,10 +25,10 @@ Hybrid mode requires two running processes: the server and the client. **Terminal 1 — Start the hybrid server:** ```bash -# Install the server component -pip install opendataloader-pdf-hybrid +# Install with hybrid extras (includes the server) +pip install "opendataloader-pdf[hybrid]" -# Start with defaults (port 5002) +# Start the hybrid server (port 5002) opendataloader-pdf-hybrid --port 5002 ``` @@ -81,7 +81,7 @@ Expected throughput with `full`: approximately 0.5 s/page (depends on backend an | `--hybrid ` | `off`, `docling-fast` | `off` | Select the backend. `off` disables hybrid mode entirely. | | `--hybrid-mode ` | `auto`, `full` | `auto` | Page routing strategy. | | `--hybrid-url ` | Any URL | `http://localhost:5002` | Override the server URL for remote or non-default setups. | -| `--hybrid-timeout ` | Integer | — | Request timeout in milliseconds. Set to `0` to disable timeout. | +| `--hybrid-timeout ` | Integer | `0` (no timeout) | Request timeout in milliseconds. `0` means no timeout. 
| | `--hybrid-fallback` | Flag | Disabled | Fall back to the Java path if the backend returns an error. |

---

diff --git a/skills/odl-pdf/references/installation-matrix.md b/skills/odl-pdf/references/installation-matrix.md
index 1a8642e5c..7d8a93d86 100644
--- a/skills/odl-pdf/references/installation-matrix.md
+++ b/skills/odl-pdf/references/installation-matrix.md
@@ -65,22 +65,22 @@ Add to your `pom.xml`:
 
 ```xml
 <dependency>
-  <groupId>io.opendataloader</groupId>
-  <artifactId>opendataloader-pdf</artifactId>
+  <groupId>org.opendataloader</groupId>
+  <artifactId>opendataloader-pdf-core</artifactId>
   <version>LATEST</version>
 </dependency>
 ```
 
-Replace `LATEST` with the specific version you want to pin. Check the [releases page](https://github.com/opendataloader/opendataloader-pdf/releases) for available versions.
+Replace `LATEST` with the specific version you want to pin. Check the [releases page](https://github.com/opendataloader-project/opendataloader-pdf/releases) for available versions.
 
 ## Version Compatibility
 
 | Method | Minimum Runtime | CLI Included |
 |---|---|---|
-| pip | Python 3.8+ | Yes |
-| pip [hybrid] | Python 3.8+ | Yes |
-| pip langchain | Python 3.8+, LangChain 0.1+ | Yes |
-| npm | Node.js 16+ | Yes |
+| pip | Python 3.10+ | Yes |
+| pip [hybrid] | Python 3.10+ | Yes |
+| pip langchain | Python 3.10+, LangChain 0.1+ | Yes |
+| npm | Node.js 20.19+ | Yes |
 | Maven | Java 11+ | No (library only) |
 
 All methods also require **Java 11+** regardless of the primary runtime.
diff --git a/skills/odl-pdf/scripts/detect-env.sh b/skills/odl-pdf/scripts/detect-env.sh
index 3856fe734..366d41baf 100755
--- a/skills/odl-pdf/scripts/detect-env.sh
+++ b/skills/odl-pdf/scripts/detect-env.sh
@@ -15,7 +15,7 @@ detect_os() {
     Darwin*) echo "macos" ;;
     Linux*) echo "linux" ;;
     MINGW*|MSYS*|CYGWIN*) echo "windows" ;;
-    *) echo "linux" ;; # best-effort fallback
+    *) echo "unknown" ;;
   esac
 }
 
@@ -34,7 +34,7 @@ detect_java() {
   # java version "1.8.0_401" ...
   # openjdk version "11.0.22" ...
local ver - ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1)" + ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1 || true)" if [[ -z "${ver}" ]]; then echo "none" return @@ -45,7 +45,7 @@ detect_java() { else # Extract leading integer(s) before the first dot local major - major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+')" + major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+' || true)" echo "${major:-none}" fi } @@ -67,7 +67,7 @@ detect_python() { raw="$("${cmd}" --version 2>&1 | head -1)" # e.g. "Python 3.12.4" local ver - ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?'| head -1)" + ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -1 || true)" echo "${ver:-none}" } @@ -111,7 +111,7 @@ detect_odl() { if [[ -n "${cli_ver}" ]]; then installed="true" - version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)" version="${version:-none}" elif [[ -n "${pycmd}" ]]; then # Try python -m opendataloader_pdf --version @@ -119,7 +119,7 @@ detect_odl() { mod_ver="$("${pycmd}" -m opendataloader_pdf --version 2>/dev/null || true)" if [[ -n "${mod_ver}" ]]; then installed="true" - version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)" version="${version:-none}" else # Last resort: importlib.metadata @@ -137,7 +137,7 @@ detect_odl() { } # --------------------------------------------------------------------------- -# Hybrid extras — check for docling_serve (primary indicator) +# Hybrid extras — check for docling (primary indicator) # --------------------------------------------------------------------------- detect_hybrid_extras() { local pycmd="" @@ -153,7 +153,7 @@ detect_hybrid_extras() { fi local result - result="$("${pycmd}" -c "import docling_serve; 
print('ok')" 2>/dev/null || true)" + result="$("${pycmd}" -c "import docling; print('ok')" 2>/dev/null || true)" if [[ "${result}" == "ok" ]]; then echo "HYBRID_EXTRAS=true" else diff --git a/skills/odl-pdf/scripts/hybrid-health.sh b/skills/odl-pdf/scripts/hybrid-health.sh index 856c0b340..290b4f4bc 100755 --- a/skills/odl-pdf/scripts/hybrid-health.sh +++ b/skills/odl-pdf/scripts/hybrid-health.sh @@ -13,6 +13,10 @@ HYBRID_URL="${DEFAULT_URL}" while [[ $# -gt 0 ]]; do case "$1" in --url) + if [[ $# -lt 2 ]]; then + echo "Error: --url requires a value" >&2 + exit 1 + fi HYBRID_URL="$2" shift 2 ;; @@ -52,7 +56,7 @@ if [[ -z "${HTTP_STATUS}" || "${HTTP_STATUS}" == "000" || "${HTTP_STATUS}" == "n echo "HYBRID_URL=${HYBRID_URL}" echo "HYBRID_STATUS=none" echo "" - echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002" + echo "Hybrid server is not running at ${HYBRID_URL}. Start it with: opendataloader-pdf-hybrid" exit 0 fi diff --git a/skills/odl-pdf/scripts/quick-eval.py b/skills/odl-pdf/scripts/quick-eval.py index 03d5db058..3352005d0 100644 --- a/skills/odl-pdf/scripts/quick-eval.py +++ b/skills/odl-pdf/scripts/quick-eval.py @@ -3,7 +3,7 @@ Compares extracted text against a ground truth file and reports a similarity score. Uses difflib.SequenceMatcher from the Python standard library by default. -If rapidfuzz is installed, it computes a more accurate Normalized Information +If rapidfuzz is installed, it computes a more accurate Normalized Indel Distance (NID) score instead. Usage: @@ -70,7 +70,7 @@ def compute_similarity_stdlib(extracted: str, ground_truth: str) -> float: def compute_similarity_rapidfuzz(extracted: str, ground_truth: str) -> float: """Return a similarity score in [0, 1] using rapidfuzz Levenshtein distance. - Computes Normalized Information Distance: + Computes Normalized Indel Distance: NID = edit_distance / max(len(a), len(b)) The similarity score returned is 1 - NID, so higher is better. 
""" From c9d3d4827186ea3ddd4fa57b4e9eee184a6205ae Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 9 Apr 2026 15:30:37 +0900 Subject: [PATCH 04/13] fix: use rapidfuzz Indel API instead of Levenshtein for NID scoring Objective: quick-eval.py claims to compute Normalized Indel Distance but uses Levenshtein.distance (substitution weight 1, not Indel) with incorrect normalization (max(len) instead of sum(len)). Approach: Replace rapidfuzz.distance.Levenshtein with rapidfuzz.distance.Indel.normalized_distance which computes the correct NID metric and normalization directly. Evidence: Indel.normalized_distance uses distance / (len(a) + len(b)) per RapidFuzz docs, matching the NID definition. Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/odl-pdf/scripts/quick-eval.py | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/skills/odl-pdf/scripts/quick-eval.py b/skills/odl-pdf/scripts/quick-eval.py index 3352005d0..35d1af2b4 100644 --- a/skills/odl-pdf/scripts/quick-eval.py +++ b/skills/odl-pdf/scripts/quick-eval.py @@ -22,7 +22,7 @@ # Optional rapidfuzz import — used for NID scoring when available # --------------------------------------------------------------------------- try: - from rapidfuzz.distance import Levenshtein + from rapidfuzz.distance import Indel _RAPIDFUZZ_AVAILABLE = True except ImportError: @@ -68,18 +68,15 @@ def compute_similarity_stdlib(extracted: str, ground_truth: str) -> float: def compute_similarity_rapidfuzz(extracted: str, ground_truth: str) -> float: - """Return a similarity score in [0, 1] using rapidfuzz Levenshtein distance. + """Return a similarity score in [0, 1] using rapidfuzz Indel distance. Computes Normalized Indel Distance: - NID = edit_distance / max(len(a), len(b)) + NID = indel_distance / (len(a) + len(b)) The similarity score returned is 1 - NID, so higher is better. 
""" - max_len = max(len(extracted), len(ground_truth)) - if max_len == 0: + if not extracted and not ground_truth: return 1.0 - distance = Levenshtein.distance(extracted, ground_truth) - nid = distance / max_len - return max(0.0, 1.0 - nid) + return max(0.0, 1.0 - float(Indel.normalized_distance(extracted, ground_truth))) def compute_similarity(extracted: str, ground_truth: str) -> tuple[float, str]: From a5715a68553001f42f11447c4eaf0a00691c8125 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 9 Apr 2026 15:42:56 +0900 Subject: [PATCH 05/13] fix: check all hybrid deps (docling + fastapi + uvicorn) in detect-env.sh Objective: Hybrid extras detection only checked for docling, missing fastapi and uvicorn which are also required to run the hybrid server. Approach: Update import check to verify all three packages are present. Evidence: pyproject.toml [hybrid] extras require docling, fastapi, and uvicorn. hybrid_server.py imports all three at startup. Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/odl-pdf/scripts/detect-env.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/skills/odl-pdf/scripts/detect-env.sh b/skills/odl-pdf/scripts/detect-env.sh index 366d41baf..7666584d7 100755 --- a/skills/odl-pdf/scripts/detect-env.sh +++ b/skills/odl-pdf/scripts/detect-env.sh @@ -137,7 +137,7 @@ detect_odl() { } # --------------------------------------------------------------------------- -# Hybrid extras — check for docling (primary indicator) +# Hybrid extras — check for docling + fastapi + uvicorn (all required for hybrid server) # --------------------------------------------------------------------------- detect_hybrid_extras() { local pycmd="" @@ -153,7 +153,7 @@ detect_hybrid_extras() { fi local result - result="$("${pycmd}" -c "import docling; print('ok')" 2>/dev/null || true)" + result="$("${pycmd}" -c "import docling, fastapi, uvicorn; print('ok')" 2>/dev/null || true)" if [[ "${result}" == "ok" ]]; then echo "HYBRID_EXTRAS=true" else 
From 26ece8bc469db955cc731a5c1d8a29080741a616 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 9 Apr 2026 15:59:57 +0900 Subject: [PATCH 06/13] =?UTF-8?q?fix:=20improve=20CI=20drift=20check=20?= =?UTF-8?q?=E2=80=94=20self-validation=20trigger=20+=20exit=20code=20handl?= =?UTF-8?q?ing?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: CI workflow had two issues: (1) changes to sync-skill-refs.py itself could silently break drift detection without triggering a check, (2) exit code 2 (JSON parse/input errors) was conflated with exit code 1 (drift detected), producing misleading "Drift detected" messages. Approach: - Add sync-skill-refs.py to trigger paths for self-validation on change - Differentiate exit 1 (drift) from exit 2+ (input/script errors) with separate messages Evidence: sync-skill-refs.py uses sys.exit(2) for JSON parse failure, missing options key, and missing name field — all distinct from drift. Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/skill-drift-check.yml | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/.github/workflows/skill-drift-check.yml b/.github/workflows/skill-drift-check.yml index 757247a55..3cf26d12a 100644 --- a/.github/workflows/skill-drift-check.yml +++ b/.github/workflows/skill-drift-check.yml @@ -8,9 +8,11 @@ on: push: paths: - 'options.json' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' pull_request: paths: - 'options.json' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' workflow_dispatch: permissions: @@ -31,9 +33,13 @@ jobs: set +e python skills/odl-pdf/scripts/sync-skill-refs.py EXIT_CODE=$? - if [ $EXIT_CODE -ne 0 ]; then + if [ $EXIT_CODE -eq 1 ]; then echo "" echo "Drift detected: skill references are out of sync with options.json." echo "Update skills/odl-pdf/references/options-matrix.md to match options.json." 
exit 1 + elif [ $EXIT_CODE -ne 0 ]; then + echo "" + echo "Drift check failed due to an input/script error (exit $EXIT_CODE)." + exit $EXIT_CODE fi From 9a28256ff0b41008b4b5737fe1e4bc8a69d9d9a3 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Wed, 22 Apr 2026 13:23:29 +0900 Subject: [PATCH 07/13] refactor(skill): reduce SKILL.md to 595 lines + close documentation gaps MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: SKILL.md was 729 lines, 229 over Anthropic's 500-line best practice and 129 over the skill's own internal 600-line target. Oversize SKILL.md raises cognitive load on every invocation because the entire body loads up-front, while language- and use-case-specific code examples only apply to a subset of sessions. A post-refactor audit surfaced three additional documentation gaps: `hybrid-health.sh` was unreferenced in SKILL.md's Reference Files table, `installation-matrix.md` promised Gradle coverage but only documented Maven, and `eval-metrics.md` had diagnostic sections for Low NID / TEDS / MHS but not for Low Table Detection F1 (and the Benchmark Reference Scores table was also missing the F1 column). Approach: - Extract language- and use-case-specific content from SKILL.md to a new depth-1 reference `references/integration-examples.md`: Phase 3A Guide Mode's five language snippets (CLI / Python / Node.js / LangChain / Java), Phase 5C LangChain RAG code, Phase 5D Output Pipeline examples, and a Remote Hybrid Server section. SKILL.md keeps the architectural decisions and pointers. - Remove the Option Reference "Commonly Used Options Quick Reference" 25-row table (duplicates options-matrix.md) in favor of a two-line pointer that preserves the authoritative-source order (options.json > options-matrix.md). - Collapse the Quality Metrics Reference to a one-paragraph summary plus pointer to eval-metrics.md. 
- Add a `scripts/hybrid-health.sh` row to SKILL.md's Reference Files table with an explicit "Load when" condition. - Add Gradle (Groovy DSL + Kotlin DSL) install snippets to installation-matrix.md. - Add a Low Table Detection F1 diagnostic section to eval-metrics.md, mirroring Low NID / TEDS / MHS. Split precision failures (dense text misclassified) from recall failures (real tables missed). Add a Table Detection F1 column to Benchmark Reference Scores with "see bench" placeholders. Depth rule preserved — integration-examples.md is a depth-1 reference like the other five files. Drift CI unaffected (no options.json or options-matrix.md edits). Evidence: wc -l SKILL.md reports 595 (was 729). eval-metrics.md now has four Low-* diagnostic sections matching the four quality metrics it defines. Reference Files table now includes hybrid-health.sh. installation-matrix.md contains both build.gradle (Groovy) and build.gradle.kts (Kotlin) dependency blocks. Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/odl-pdf/SKILL.md | 183 +++--------------- skills/odl-pdf/references/eval-metrics.md | 30 ++- .../odl-pdf/references/installation-matrix.md | 20 ++ .../references/integration-examples.md | 173 +++++++++++++++++ 4 files changed, 241 insertions(+), 165 deletions(-) create mode 100644 skills/odl-pdf/references/integration-examples.md diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md index cbb7bdfde..0c9715703 100644 --- a/skills/odl-pdf/SKILL.md +++ b/skills/odl-pdf/SKILL.md @@ -238,9 +238,8 @@ Two modes of operation depending on user intent. When the user wants ready-to-run commands but will execute them manually. -Generate complete, copy-pasteable commands for the relevant interface. +Generate a complete, copy-pasteable command for the interface they are using. 
The CLI pattern is: -**CLI:** ```bash opendataloader-pdf input.pdf \ --format markdown \ @@ -249,57 +248,7 @@ opendataloader-pdf input.pdf \ --quiet ``` -**Python:** -```python -import opendataloader_pdf - -# Batch all files in one call — each convert() spawns a JVM, so repeated calls are slow -opendataloader_pdf.convert( - input_path=["file1.pdf", "file2.pdf", "file3.pdf"], - output_dir="./output", - format="markdown", - hybrid="docling-fast" -) -``` - -**Node.js:** -```javascript -import { convert } from '@opendataloader/pdf'; - -// Batch all files in one call — each convert() spawns a JVM, so repeated calls are slow -await convert(['file1.pdf', 'file2.pdf'], { - outputDir: './output', - format: 'markdown', - hybrid: 'docling-fast' -}); -``` - -**LangChain integration:** -```python -from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader - -loader = OpenDataLoaderPDFLoader( - file_path="document.pdf", - format="text", - hybrid="docling-fast" # optional: enable for scanned PDFs -) - -documents = loader.load() -# documents is a list of LangChain Document objects with page_content and metadata -``` - -**Java (Maven project):** -```java -import org.opendataloader.pdf.api.Config; -import org.opendataloader.pdf.api.OpenDataLoaderPDF; - -Config config = new Config(); -config.setOutputDir("./output"); -config.setFormat("markdown"); -config.setHybrid("docling-fast"); - -OpenDataLoaderPDF.processFile("file1.pdf", config); -``` +For Python, Node.js, LangChain, or Java (Maven), load `references/integration-examples.md` and return the matching snippet. That file contains batch-safe patterns for each language (each `convert()` spawns a JVM — see Gotcha 3). ### 3B. Action Mode @@ -504,60 +453,18 @@ opendataloader-pdf input.pdf \ ### 5C. 
LangChain RAG Pipeline -**Recommended architecture for RAG:** - -```python -from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader -from langchain.text_splitter import RecursiveCharacterTextSplitter -from langchain.vectorstores import Chroma -from langchain.embeddings import OpenAIEmbeddings - -# 1. Load PDFs with bounding-box metadata for source citation -loader = OpenDataLoaderPDFLoader( - file_path="document.pdf", - format="text", # returns LangChain Documents with metadata - hybrid="docling-fast" # enable for scanned or complex PDFs -) -documents = loader.load() - -# 2. Chunk with overlap — ODL markdown headings are natural split points -splitter = RecursiveCharacterTextSplitter( - chunk_size=1000, - chunk_overlap=200, - separators=["\n## ", "\n### ", "\n\n", "\n", " "] -) -chunks = splitter.split_documents(documents) - -# 3. Index -vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings()) -``` +The recommended RAG architecture is load → chunk on structural separators (`\n## `, `\n### `) → embed → index. Use `format="json"` instead of `"text"` when you need bounding boxes in metadata for source citation. -**Tip:** Use `format="json"` instead of `format="text"` when you need bounding boxes in metadata for source citation (linking a RAG answer back to a specific page region). +Full pipeline code (loader + splitter + vector store): see `references/integration-examples.md` § LangChain § Full RAG pipeline. ### 5D. Output Pipeline Options -**Quiet mode for automated pipelines** — suppress progress output: -```bash -opendataloader-pdf input.pdf --format markdown --quiet -``` - -**Stdout for pipe-based workflows** — single format, output to stdout: -```bash -opendataloader-pdf input.pdf --format json --to-stdout | jq . 
-``` - -**Page range extraction** — process only relevant pages: -```bash -# Pages 1, 3, and 5 through 10 -opendataloader-pdf input.pdf --pages "1,3,5-10" --format markdown -``` +Common operational flags (details in `references/integration-examples.md` § Output Pipeline Patterns): -**Custom page separators** — for downstream splitting: -```bash -opendataloader-pdf input.pdf \ - --format markdown \ - --markdown-page-separator "---PAGE %page-number%---" -``` +- `--quiet` — suppress progress output for automated pipelines +- `--to-stdout` — write a single format to stdout for piping +- `--pages "1,3,5-10"` — restrict processing to a page range +- `--markdown-page-separator` / `--text-page-separator` / `--html-page-separator` — inject a custom marker between pages for downstream splitting (supports `%page-number%`) --- @@ -624,43 +531,12 @@ For CLI batch processing, prefer a glob pattern or a file list argument over she ## Option Reference -This skill contains a working knowledge of all 26 options from `options.json`. The table below covers the most commonly used options. For the complete, authoritative option list, see: - -- `options.json` in the project root (authoritative — always current) -- `references/options-matrix.md` (annotated reference with examples and use-case guidance) - -Options in `options.json` that are not yet documented in `references/options-matrix.md` are newly added — treat `options.json` as the source of truth. - -### Commonly Used Options Quick Reference - -| Option | Type | Default | Description | -|--------|------|---------|-------------| -| `--format` / `-f` | string | json | Output format(s). Values: `json`, `text`, `html`, `pdf`, `markdown`, `markdown-with-html`, `markdown-with-images`. Comma-separate for multiple. | -| `--output-dir` / `-o` | string | input dir | Directory for output files. | -| `--quiet` / `-q` | boolean | false | Suppress progress output. | -| `--pages` | string | all | Pages to extract. 
Format: `"1,3,5-7"` | -| `--table-method` | string | default | Table detection. Values: `default` (border-based), `cluster` (border + spatial clustering). | -| `--reading-order` | string | xycut | Reading order algorithm. Values: `off`, `xycut`. | -| `--use-struct-tree` | boolean | false | Use PDF structure tree (tagged PDF) for reading order. | -| `--hybrid` | string | off | Hybrid backend. Values: `off`, `docling-fast`. | -| `--hybrid-mode` | string | auto | Triage mode. Values: `auto` (dynamic triage), `full` (all pages to backend). | -| `--hybrid-url` | string | null | Remote hybrid server URL. | -| `--hybrid-timeout` | string | 0 | Request timeout in ms. 0 = no timeout. | -| `--hybrid-fallback` | boolean | false | Fall back to Java on backend error. | -| `--image-output` | string | external | Image handling. Values: `off`, `embedded` (Base64), `external` (file refs). | -| `--image-format` | string | png | Image format. Values: `png`, `jpeg`. | -| `--image-dir` | string | null | Directory for extracted images. | -| `--include-header-footer` | boolean | false | Include page headers and footers. | -| `--keep-line-breaks` | boolean | false | Preserve original line breaks. | -| `--sanitize` | boolean | false | Replace emails, phones, IPs, credit cards, URLs with placeholders. | -| `--password` / `-p` | string | null | Password for encrypted PDFs. | -| `--content-safety-off` | string | null | Disable safety filters. Values: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg`. | -| `--replace-invalid-chars` | string | space | Replacement for unrecognized characters. | -| `--markdown-page-separator` | string | null | Separator between pages in Markdown. Use `%page-number%` for page number. | -| `--text-page-separator` | string | null | Separator between pages in text output. | -| `--html-page-separator` | string | null | Separator between pages in HTML output. | -| `--to-stdout` | boolean | false | Write output to stdout (single format only). 
| -| `--detect-strikethrough` | boolean | false | Detect strikethrough text. Experimental. | +This skill reasons about all 26 CLI options without loading their full descriptions. When the user needs option details, defaults, or interactions, load `references/options-matrix.md` (grouped by IO / Quality / Safety / Hybrid / Output / Text categories, with common combination recipes). + +Authoritative source order: + +1. `options.json` in the project root — always current, regenerated by `npm run sync` when CLI options change +2. `references/options-matrix.md` — annotated reference with examples. Options in `options.json` not yet in the matrix are newly added; treat `options.json` as ground truth --- @@ -675,7 +551,9 @@ Load these files progressively — only when entering the relevant topic. Do not | `references/hybrid-guide.md` | User needs hybrid server setup, server-side flags, or remote deployment | | `references/format-guide.md` | User needs output format comparison, format-specific behavior, or format selection | | `references/eval-metrics.md` | User needs detailed metric definitions (NID, TEDS, MHS), benchmark scores, or diagnostic steps by metric | +| `references/integration-examples.md` | User needs copy-pasteable code for CLI / Python / Node.js / LangChain / Java / remote hybrid server | | `scripts/detect-env.sh` | Phase 1 environment detection — run at session start | +| `scripts/hybrid-health.sh` | Phase 2B / Phase 5B — confirm the hybrid server is reachable before running a hybrid conversion | | `scripts/quick-eval.py` | Phase 4 quality measurement — run when diagnosing extraction quality | | `evals/` | Benchmark baselines and regression thresholds | @@ -683,31 +561,16 @@ Load these files progressively — only when entering the relevant topic. 
Do not ## Quality Metrics Reference -When running benchmarks or evaluating extraction quality, these are the five metrics reported by `scripts/bench.sh`: +Five metrics are reported by `scripts/bench.sh`: **NID** (reading order), **TEDS** (table structure), **MHS** (heading hierarchy), **Table Detection F1** (table region precision/recall), and **Speed** (pages/second). All four quality metrics range 0–1, higher is better. -| Metric | Full Name | What It Measures | Target | -|--------|-----------|-----------------|--------| -| NID | Normalized Indel Distance | Reading order correctness (sequence of extracted elements) | Higher is better (max 1.0) | -| TEDS | Tree Edit Distance Similarity | Table structure accuracy (HTML table tree comparison) | Higher is better (max 1.0) | -| MHS | Markdown Heading Similarity | Heading hierarchy accuracy (section structure) | Higher is better (max 1.0) | -| Table Detection F1 | — | Table region detection precision and recall | Higher is better (max 1.0) | -| Speed | Pages/second | Extraction throughput | Context-dependent | +Full definitions, failure modes, and metric-specific escalation paths: `references/eval-metrics.md`. -**Interpreting weak metrics:** - -- Low NID → reading order problem. Try `--use-struct-tree` for tagged PDFs, or hybrid mode for scanned. -- Low TEDS → table structure problem. Try `--table-method cluster`, then `--hybrid docling-fast`. -- Low MHS → heading detection problem. Review if the PDF uses visual formatting (font size) instead of tagged headings. `--use-struct-tree` may help for tagged PDFs. -- Low Table Detection F1 → tables are being missed or extra regions are detected as tables. Inspect with `--format pdf` (annotated output) to see bounding boxes. 
- -To debug a specific document: -```bash -bash scripts/bench.sh --doc-id -``` +Bench commands: -To check regressions in CI: ```bash -bash scripts/bench.sh --check-regression +bash scripts/bench.sh # full suite +bash scripts/bench.sh --doc-id # debug one document +bash scripts/bench.sh --check-regression # CI threshold check ``` --- diff --git a/skills/odl-pdf/references/eval-metrics.md b/skills/odl-pdf/references/eval-metrics.md index f44a02eb1..aa3b988ec 100644 --- a/skills/odl-pdf/references/eval-metrics.md +++ b/skills/odl-pdf/references/eval-metrics.md @@ -74,12 +74,12 @@ Speed is not normalized to 0–1. It is an absolute wall-clock measurement avera **200 real-world PDFs including multi-column layouts and scientific papers.** -| Engine | Overall | NID (Reading Order) | TEDS (Table) | MHS (Heading) | Speed (s/page) | -|--------|---------|---------------------|--------------|---------------|----------------| -| **opendataloader [hybrid]** | **0.907** | **0.934** | **0.928** | 0.821 | 0.463 | -| opendataloader [local] | 0.831 | 0.902 | 0.489 | 0.739 | **0.015** | +| Engine | Overall | NID (Reading Order) | TEDS (Table) | MHS (Heading) | Table Detection F1 | Speed (s/page) | +|--------|---------|---------------------|--------------|---------------|--------------------|----------------| +| **opendataloader [hybrid]** | **0.907** | **0.934** | **0.928** | 0.821 | see bench | 0.463 | +| opendataloader [local] | 0.831 | 0.902 | 0.489 | 0.739 | see bench | **0.015** | -Full benchmark results and methodology: [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) +> The `Overall` column is an average of NID / TEDS / MHS. Table Detection F1 is reported per-document by `scripts/bench.sh` but is not currently folded into the Overall average; run the bench for the F1 numbers on the current snapshot. See [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) for methodology. 
--- @@ -157,6 +157,26 @@ Use this guide when extraction quality is below expectations. Start by identifyi --- +### Low Table Detection F1 — Table Region Problems + +**Symptoms:** Tables are missed entirely (low recall) or non-table regions such as dense text blocks are incorrectly flagged as tables (low precision). + +**Steps:** + +1. **Inspect with an annotated PDF** to see which regions are being detected as tables and which real tables are being missed. The `pdf` output format overlays bounding boxes on a copy of the input. + + ```bash + opendataloader-pdf input.pdf --format json,pdf + ``` + + Combine with `json` so you can correlate each visual box with its element data. + +2. **If real tables are being missed (low recall):** enable borderless detection and, if needed, escalate to the hybrid backend. See the Low TEDS steps above — the same escalation path (`--table-method cluster` → `--hybrid docling-fast` → `--hybrid-mode full`) improves region detection as well as internal structure. + +3. **If non-table regions are being detected (low precision):** this usually indicates dense multi-column text is being classified as tabular. Check that `--reading-order xycut` is active (it is the default) so column structure is recognised before table detection runs. + +--- + ## Running Benchmarks ### Full benchmark suite diff --git a/skills/odl-pdf/references/installation-matrix.md b/skills/odl-pdf/references/installation-matrix.md index 7d8a93d86..a0cb27015 100644 --- a/skills/odl-pdf/references/installation-matrix.md +++ b/skills/odl-pdf/references/installation-matrix.md @@ -73,6 +73,26 @@ Add to your `pom.xml`: Replace `LATEST` with the specific version you want to pin. Check the [releases page](https://github.com/opendataloader-project/opendataloader-pdf/releases) for available versions. 
+### Gradle (Java/Kotlin) + +Add to your `build.gradle` (Groovy DSL): + +```groovy +dependencies { + implementation 'org.opendataloader:opendataloader-pdf-core:LATEST' +} +``` + +Or `build.gradle.kts` (Kotlin DSL): + +```kotlin +dependencies { + implementation("org.opendataloader:opendataloader-pdf-core:LATEST") +} +``` + +Pin `LATEST` to a specific released version from the [releases page](https://github.com/opendataloader-project/opendataloader-pdf/releases). + ## Version Compatibility | Method | Minimum Runtime | CLI Included | diff --git a/skills/odl-pdf/references/integration-examples.md b/skills/odl-pdf/references/integration-examples.md new file mode 100644 index 000000000..944584033 --- /dev/null +++ b/skills/odl-pdf/references/integration-examples.md @@ -0,0 +1,173 @@ +# Integration Examples + +Ready-to-run code for each supported interface. Load this file when the user asks for copy-pasteable examples in a specific language or framework. + +Every path requires **Java 11+** at runtime — see `installation-matrix.md`. + +--- + +## CLI + +```bash +opendataloader-pdf input.pdf \ + --format markdown \ + --output-dir ./output \ + --hybrid docling-fast \ + --quiet +``` + +For multiple formats in one pass: + +```bash +opendataloader-pdf input.pdf --format json,markdown,html +``` + +--- + +## Python + +Batch all files in one `convert()` call — each call spawns a JVM, so repeated calls are slow (see Gotcha 3 in SKILL.md). + +```python +import opendataloader_pdf + +opendataloader_pdf.convert( + input_path=["file1.pdf", "file2.pdf", "file3.pdf"], + output_dir="./output", + format="markdown", + hybrid="docling-fast" +) +``` + +--- + +## Node.js + +Same JVM-spawn concern — pass all files to one `convert()` call. 
+ +```javascript +import { convert } from '@opendataloader/pdf'; + +await convert(['file1.pdf', 'file2.pdf'], { + outputDir: './output', + format: 'markdown', + hybrid: 'docling-fast' +}); +``` + +--- + +## LangChain + +Basic loader: + +```python +from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader + +loader = OpenDataLoaderPDFLoader( + file_path="document.pdf", + format="text", + hybrid="docling-fast" # optional: enable for scanned PDFs +) + +documents = loader.load() +# documents is a list of LangChain Document objects with page_content and metadata +``` + +### Full RAG pipeline + +Load → chunk → embed → index. Use `format="json"` instead of `"text"` when you need bounding boxes in metadata for source citation. + +```python +from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader +from langchain.text_splitter import RecursiveCharacterTextSplitter +from langchain.vectorstores import Chroma +from langchain.embeddings import OpenAIEmbeddings + +# 1. Load PDFs. ODL markdown headings are natural chunk boundaries. +loader = OpenDataLoaderPDFLoader( + file_path="document.pdf", + format="text", + hybrid="docling-fast" +) +documents = loader.load() + +# 2. Chunk with overlap on structural separators. +splitter = RecursiveCharacterTextSplitter( + chunk_size=1000, + chunk_overlap=200, + separators=["\n## ", "\n### ", "\n\n", "\n", " "] +) +chunks = splitter.split_documents(documents) + +# 3. Index. +vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings()) +``` + +--- + +## Java (Maven) + +```java +import org.opendataloader.pdf.api.Config; +import org.opendataloader.pdf.api.OpenDataLoaderPDF; + +Config config = new Config(); +config.setOutputDir("./output"); +config.setFormat("markdown"); +config.setHybrid("docling-fast"); + +OpenDataLoaderPDF.processFile("file1.pdf", config); +``` + +See `installation-matrix.md` for the Maven dependency block. 
+ +--- + +## Output Pipeline Patterns + +**Quiet mode for automated pipelines** — suppress progress output: + +```bash +opendataloader-pdf input.pdf --format markdown --quiet +``` + +**Stdout for pipe-based workflows** — single format only: + +```bash +opendataloader-pdf input.pdf --format json --to-stdout | jq . +``` + +**Page range extraction**: + +```bash +opendataloader-pdf input.pdf --pages "1,3,5-10" --format markdown +``` + +**Custom page separators** for downstream splitting: + +```bash +opendataloader-pdf input.pdf \ + --format markdown \ + --markdown-page-separator "---PAGE %page-number%---" +``` + +--- + +## Remote Hybrid Server + +For multi-machine deployments, run the server on a GPU host and point clients at it. + +```bash +# GPU host +opendataloader-pdf-hybrid --port 5002 + +# Client +opendataloader-pdf input.pdf \ + --hybrid docling-fast \ + --hybrid-url http://gpu-server:5002 \ + --hybrid-timeout 30000 \ + --hybrid-fallback +``` + +`--hybrid-fallback` routes failing pages back to the local Java path so a single backend hiccup does not fail the document. From b9e2cb3ea1c12464d008699234c434eaf79291e3 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Tue, 21 Apr 2026 19:05:52 +0900 Subject: [PATCH 08/13] fix(skill): make quick-eval.py survive cp1252/cp949 consoles and clean up report Objective: the smoke-test pass of the audit showed three bugs in quick-eval.py. The critical one would take down every Windows user and every GitHub Actions `windows-latest` job: the SCORE_LEVELS table used U+2014 em-dashes in the interpretation strings, which `print()` cannot encode on a default cp1252/cp949 console, aborting the whole script with UnicodeEncodeError before any score is shown. Two cosmetic bugs rode alongside: --threshold 0.995 rendered as "Threshold: 0.99" (hiding the last two digits that drive pass/fail), and scores sitting in the 0.85 <= score < custom-threshold band emitted an empty "Suggestions:" header with nothing under it. 
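The crash class reduces to a one-line encoding failure and can be reproduced standalone. In this sketch the `ascii` codec is a stand-in for any console codec that lacks U+2014 (in the field the error surfaces through `print()` on such consoles); the sample string is one of the old SCORE_LEVELS interpretation strings:

```python
# U+2014 (em-dash) as shipped in the old SCORE_LEVELS strings.
text = "Noticeable differences \u2014 consider hybrid mode or different options."

# Old behavior: encoding to a codec without U+2014 raises, killing the
# entire report before any score is shown.
try:
    text.encode("ascii")
    crashed = False
except UnicodeEncodeError:
    crashed = True
assert crashed

# New behavior: errors="replace" degrades the unencodable character
# instead of raising -- the policy the reconfigure() guard now requests.
safe = text.encode("ascii", errors="replace").decode("ascii")
assert "\u2014" not in safe
```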
Approach: - Replace every em-dash in the printable SCORE_LEVELS strings with an ASCII double-hyphen. Do the same for the two in-file comments, so the source stays clean for non-UTF-8 editors. - Belt-and-suspenders: reconfigure sys.stdout / sys.stderr to UTF-8 with errors="replace" at import time when the interpreter supports it. This future-proofs any later non-ASCII addition (e.g., report strings pulled from upstream tool output) without re-introducing the crash. Guarded behind hasattr + try/except so Python versions without reconfigure degrade gracefully. - Bump the Threshold format string from :.2f to :.4f so 0.995 shows as 0.9950 and never hides digits that matter for the pass/fail boundary. - Collect suggestions into a list first, then only emit the "Suggestions:" header if the list is non-empty. Add a generic hint for the above-0.85-but-below-custom-threshold band so that section is always informative when it appears. Evidence (ran locally on Windows Git Bash): - em-dash count in quick-eval.py: 0 (was 2 in source, 2 in printable data). - With `PYTHONIOENCODING=cp1252`, the different-files fixture now prints the full report and exits 1 as expected (was: crash with UnicodeEncodeError). - `--threshold 0.995` on a near-identical pair now prints `Threshold: 0.9950` (was: 0.99). - Same pair under --threshold 0.995 prints Suggestions with one actionable line (was: empty header). - Identical-file fixture still scores 1.0000 and exits 0. - `quick-eval.py --help` parses cleanly. 
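The threshold-formatting fix is checkable in isolation: 0.995 is stored as a binary float slightly below 0.995, so two-decimal formatting rounds it down and hides the digits that drive pass/fail.

```python
threshold = 0.995
# Old format string hid the trailing digits:
assert f"{threshold:.2f}" == "0.99"
# New format string preserves the pass/fail boundary:
assert f"{threshold:.4f}" == "0.9950"
```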
Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/odl-pdf/scripts/quick-eval.py | 48 +++++++++++++++++++++------- 1 file changed, 37 insertions(+), 11 deletions(-) diff --git a/skills/odl-pdf/scripts/quick-eval.py b/skills/odl-pdf/scripts/quick-eval.py index 35d1af2b4..036962e44 100644 --- a/skills/odl-pdf/scripts/quick-eval.py +++ b/skills/odl-pdf/scripts/quick-eval.py @@ -18,8 +18,19 @@ import sys from pathlib import Path +# Ensure stdout can print non-ASCII report content on Windows consoles +# (cp1252 / cp949 default). Without this, a single non-ASCII character +# crashes the script with UnicodeEncodeError -- including under +# `windows-latest` in GitHub Actions. +if hasattr(sys.stdout, "reconfigure"): + try: + sys.stdout.reconfigure(encoding="utf-8", errors="replace") + sys.stderr.reconfigure(encoding="utf-8", errors="replace") + except (AttributeError, OSError): + pass + # --------------------------------------------------------------------------- -# Optional rapidfuzz import — used for NID scoring when available +# Optional rapidfuzz import -- used for NID scoring when available # --------------------------------------------------------------------------- try: from rapidfuzz.distance import Indel @@ -35,8 +46,8 @@ SCORE_LEVELS = [ (0.95, "Excellent", "Output closely matches the ground truth."), (0.85, "Good", "Minor differences; output is usable as-is."), - (0.70, "Fair", "Noticeable differences — consider hybrid mode or different options."), - (0.00, "Poor", "Significant quality issues — review extraction settings."), + (0.70, "Fair", "Noticeable differences - consider hybrid mode or different options."), + (0.00, "Poor", "Significant quality issues - review extraction settings."), ] @@ -178,22 +189,37 @@ def build_report( f"Method: {method}", "-" * 60, f"Score: {score:.4f} [{label}]", - f"Threshold: {threshold:.2f}", + f"Threshold: {threshold:.4f}", f"Result: {status}", "-" * 60, f"Interpretation: {description}", ] if not passed: - lines.append("") - 
lines.append("Suggestions:") + suggestions: list[str] = [] if score < 0.70: - lines.append(" - Try --hybrid docling-fast for better OCR coverage.") - lines.append(" - Check --format is appropriate for this document type.") - lines.append(" - Inspect whether the PDF is scanned (image-only) vs. native.") + suggestions.extend([ + " - Try --hybrid docling-fast for better OCR coverage.", + " - Check --format is appropriate for this document type.", + " - Inspect whether the PDF is scanned (image-only) vs. native.", + ]) elif score < 0.85: - lines.append(" - Consider --hybrid docling-fast or --table-method cluster.") - lines.append(" - Try --use-struct-tree if the PDF is tagged (accessible).") + suggestions.extend([ + " - Consider --hybrid docling-fast or --table-method cluster.", + " - Try --use-struct-tree if the PDF is tagged (accessible).", + ]) + else: + # Score is above the general-quality bar but below the caller's + # custom threshold. Generic guidance only. + suggestions.append( + " - Score is above the usable-quality bar but below your custom threshold; " + "tighten input quality or relax --threshold if appropriate." + ) + + if suggestions: + lines.append("") + lines.append("Suggestions:") + lines.extend(suggestions) if verbose: lines.append("") From 1fa360e488a006816e38c04e28fe0ab48b9563c0 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Wed, 22 Apr 2026 13:24:29 +0900 Subject: [PATCH 09/13] test+ci(skill): multi-model evaluation infrastructure and cross-platform smoke tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: the skill shipped with scenario definitions in evals.json but no way to actually execute them against Claude models, and all local testing had run in a single environment (Windows Git Bash). 
There was no automated signal that the skill's shell and Python scripts remain portable across Linux/macOS/Windows, and no evidence the skill had been verified to produce correct behavior across Haiku / Sonnet / Opus. A simulation pass also revealed two bugs in evals.json itself: (1) eval-004 required the literal phrase "two terminals" but the skill teaches the two-process setup as "Terminal 1 / Terminal 2" — a Claude faithfully following the skill failed on a phrase the skill does not teach; (2) substring scoring cannot catch fabrication or policy violations — an agent can pass must_mention while still fabricating model names or recommending forbidden JDK distributions. Finally, the five original scenarios were all "complex normal case" shapes with no error or boundary coverage. Approach: - scripts/run-evals.py (new): Python runner on the anthropic SDK. Loads SKILL.md as the system prompt with prompt caching, sends each scenario's user_input to each target model (claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-7), and scores responses against must_mention / must_not_mention as case-insensitive substring checks. Writes a timestamped JSON report to evals/runs/. Exit 1 on any cell failure so CI can gate on it. - evals/README.md (new): documents prerequisites, flags, report format, CI integration, and the scenario-authoring contract. - .github/workflows/skill-evals.yml (new): workflow_dispatch-only so maintainers can run the full suite on demand without burning API credits on every PR. ANTHROPIC_API_KEY read from repo secret. Uploads the report as a 30-day artifact. - .github/workflows/skill-smoke-test.yml (new): ubuntu/windows/macos matrix, fail-fast: false, runs on every push or PR that touches skill assets. 
Exercises detect-env.sh (asserts all 7 keys), hybrid-health.sh (no-server state), quick-eval.py (identical pair PASS / different pair FAIL + Windows-only `chcp 1252` regression step), sync-skill-refs.py (no drift), and run-evals.py --help + missing-key exit 2. No Anthropic API calls. - evals.json eval-004 must_mention alignment: replace "two terminals" (never used literally by the skill) with "Terminal 1" + "Terminal 2" (used verbatim by SKILL.md and hybrid-guide.md). - evals.json hardening against failure classes the substring check missed: add must_not_mention to eval-002 (Adoptium / Temurin / Zulu / SDKMAN / "brew install --cask" — canonical examples of the Gotcha-1 policy violation) and to eval-005 (SmolVLM and --picture-description-prompt + two other fabricated --enrich-* options that the baseline run without the skill was observed to invent). - New eval-006 (error scenario): UnsupportedClassVersionError on first run. Must surface "Java 11" and "java -version"; must not propose specific JDK distros or attribute the error to a bug in the tool. - New eval-007 (boundary scenario): password-protected PDF with supplied password. Must surface --password and echo the password in an example command; must not claim the tool cannot handle encrypted PDFs. Evidence (no API consumed): - yaml.safe_load on both workflows parses cleanly; 12 steps across skill-smoke-test matrix; workflow_dispatch-only trigger on skill-evals. - argparse --help verified for both Python entry points. - run-evals.py: unit-tested check_phrase (case-insensitive substring with hyphenated flag) and evaluate_response (pass / missing-required / leaked-forbidden) against real eval-001 data. - Live with-skill simulation of new eval-006: response uses "Java 11+", "java -version", explicitly says "I'm intentionally not recommending a specific JDK distribution or download link." PASS on all required, none leaked.
- Live with-skill simulation of new eval-007: response uses "`-p` (or `--password`)", echoes "secret123" in example commands. PASS on both required, none leaked. - Existing with-skill responses for eval-002 and eval-005 verified to not contain any newly-forbidden phrases. - evals.json total scenarios 5 -> 7; drift check still reports no drift. Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/skill-evals.yml | 60 +++++++ .github/workflows/skill-smoke-test.yml | 122 +++++++++++++++ skills/odl-pdf/evals/README.md | 79 ++++++++++ skills/odl-pdf/evals/evals.json | 60 ++++++- skills/odl-pdf/scripts/run-evals.py | 207 +++++++++++++++++++++++++ 5 files changed, 525 insertions(+), 3 deletions(-) create mode 100644 .github/workflows/skill-evals.yml create mode 100644 .github/workflows/skill-smoke-test.yml create mode 100644 skills/odl-pdf/evals/README.md create mode 100644 skills/odl-pdf/scripts/run-evals.py diff --git a/.github/workflows/skill-evals.yml b/.github/workflows/skill-evals.yml new file mode 100644 index 000000000..b2bb99b5a --- /dev/null +++ b/.github/workflows/skill-evals.yml @@ -0,0 +1,60 @@ +# skill-evals.yml +# Runs the odl-pdf skill scenario evaluations against multiple Claude models. +# Manual trigger only — each run consumes Anthropic API credits. 
+# +# Required repo secret: ANTHROPIC_API_KEY + +name: Skill Evaluations + +on: + workflow_dispatch: + inputs: + models: + description: "Comma-separated model IDs (blank = Haiku 4.5, Sonnet 4.6, Opus 4.7)" + required: false + default: "" + max_tokens: + description: "Max output tokens per call" + required: false + default: "2048" + +permissions: + contents: read + +jobs: + run-evals: + runs-on: ubuntu-latest + timeout-minutes: 20 + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: '3.12' + + - name: Install anthropic SDK + run: pip install anthropic + + - name: Run skill evaluations + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + run: | + MODEL_ARGS="" + if [ -n "${{ inputs.models }}" ]; then + IFS=',' read -ra MODELS <<< "${{ inputs.models }}" + for m in "${MODELS[@]}"; do + MODEL_ARGS="$MODEL_ARGS --model $(echo $m | xargs)" + done + fi + python skills/odl-pdf/scripts/run-evals.py \ + --max-tokens "${{ inputs.max_tokens }}" \ + $MODEL_ARGS + + - name: Upload report + if: always() + uses: actions/upload-artifact@v4 + with: + name: skill-evals-report + path: skills/odl-pdf/evals/runs/*.json + if-no-files-found: warn + retention-days: 30 diff --git a/.github/workflows/skill-smoke-test.yml b/.github/workflows/skill-smoke-test.yml new file mode 100644 index 000000000..7ba0d7f11 --- /dev/null +++ b/.github/workflows/skill-smoke-test.yml @@ -0,0 +1,122 @@ +# skill-smoke-test.yml +# Cross-platform smoke test for the odl-pdf skill's executable assets. +# Runs the shell scripts and Python scripts on ubuntu / windows / macos +# to catch platform-specific regressions (line endings, console encoding, +# shell portability) BEFORE a PR merges. +# +# Does NOT hit the Anthropic API; for that, see skill-evals.yml (manual). 
+ +name: Skill Smoke Test + +on: + push: + paths: + - 'skills/odl-pdf/scripts/**' + - 'skills/odl-pdf/SKILL.md' + - 'skills/odl-pdf/references/**' + - '.github/workflows/skill-smoke-test.yml' + pull_request: + paths: + - 'skills/odl-pdf/scripts/**' + - 'skills/odl-pdf/SKILL.md' + - 'skills/odl-pdf/references/**' + - '.github/workflows/skill-smoke-test.yml' + workflow_dispatch: + +permissions: + contents: read + +jobs: + smoke-test: + strategy: + fail-fast: false + matrix: + os: [ubuntu-latest, windows-latest, macos-latest] + runs-on: ${{ matrix.os }} + timeout-minutes: 10 + + defaults: + run: + # Use bash on every platform. Windows runners have Git Bash pre-installed. + shell: bash + + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: '3.12' + + - name: Show runner info + run: | + echo "OS: ${{ matrix.os }}" + bash --version | head -1 + python --version + + # --- detect-env.sh ------------------------------------------------- + - name: detect-env.sh emits all 7 keys + run: | + out=$(bash skills/odl-pdf/scripts/detect-env.sh) + echo "$out" + for key in OS JAVA PYTHON NODE ODL_INSTALLED ODL_VERSION HYBRID_EXTRAS; do + echo "$out" | grep -q "^${key}=" \ + || { echo "MISSING KEY: $key"; exit 1; } + done + echo "all 7 keys present" + + # --- hybrid-health.sh (no server running is expected) -------------- + - name: hybrid-health.sh handles no-server gracefully + run: | + out=$(bash skills/odl-pdf/scripts/hybrid-health.sh) + echo "$out" + echo "$out" | grep -q "HYBRID_SERVER=" \ + || { echo "missing HYBRID_SERVER key"; exit 1; } + + # --- quick-eval.py ------------------------------------------------- + - name: quick-eval.py --help + run: python skills/odl-pdf/scripts/quick-eval.py --help + + - name: quick-eval.py identical files -> PASS + run: | + tmp=$(mktemp -d) + printf '# Test\n\nSample paragraph one.\nSample paragraph two.\n' > "$tmp/a.md" + cp "$tmp/a.md" "$tmp/b.md" + python skills/odl-pdf/scripts/quick-eval.py 
"$tmp/a.md" "$tmp/b.md" + rm -rf "$tmp" + + - name: quick-eval.py different files -> FAIL (exit 1) + run: | + tmp=$(mktemp -d) + printf 'apple pie recipe\n' > "$tmp/a.md" + printf 'quantum physics lecture\n' > "$tmp/b.md" + set +e + python skills/odl-pdf/scripts/quick-eval.py "$tmp/a.md" "$tmp/b.md" + rc=$? + set -e + rm -rf "$tmp" + [ "$rc" = "1" ] || { echo "expected exit 1, got $rc"; exit 1; } + + - name: quick-eval.py prints em-dash-free output on cp1252 locale (Windows regression) + if: matrix.os == 'windows-latest' + shell: cmd + run: | + chcp 1252 + python skills\odl-pdf\scripts\quick-eval.py skills\odl-pdf\evals\evals.json skills\odl-pdf\evals\evals.json + + # --- sync-skill-refs.py -------------------------------------------- + - name: sync-skill-refs.py reports no drift + run: python skills/odl-pdf/scripts/sync-skill-refs.py + + # --- run-evals.py (no API) ----------------------------------------- + - name: run-evals.py --help + run: python skills/odl-pdf/scripts/run-evals.py --help + + - name: run-evals.py missing-key exits 2 + env: + ANTHROPIC_API_KEY: "" + run: | + set +e + python skills/odl-pdf/scripts/run-evals.py + rc=$? + set -e + [ "$rc" = "2" ] || { echo "expected exit 2, got $rc"; exit 1; } diff --git a/skills/odl-pdf/evals/README.md b/skills/odl-pdf/evals/README.md new file mode 100644 index 000000000..179730f5f --- /dev/null +++ b/skills/odl-pdf/evals/README.md @@ -0,0 +1,79 @@ +# odl-pdf Skill Evaluations + +This directory holds the scenario-based evaluations for the `odl-pdf` skill and the results of running them against Claude models. + +## Files + +| File | Purpose | +|---|---| +| `evals.json` | Scenario definitions: user inputs, expected recommendations, required phrases, forbidden phrases | +| `runs/.json` | One report per evaluation run. Committed when the run is meaningful evidence (e.g., after a significant skill change) | + +## Running the Evaluations + +The runner lives at `scripts/run-evals.py`. 
It loads `SKILL.md` as the system prompt, sends each scenario's `user_input` as a user message to each target model, and checks the response against `must_mention` (all phrases must appear) and `must_not_mention` (none may appear). + +### Prerequisites + +```bash +pip install anthropic +export ANTHROPIC_API_KEY=sk-ant-... +``` + +### Default run — Haiku 4.5, Sonnet 4.6, Opus 4.7 + +```bash +python scripts/run-evals.py +``` + +Writes `evals/runs/.json` and exits `0` if all pass, `1` if any fail. + +### Run against one model + +```bash +python scripts/run-evals.py --model claude-sonnet-4-6 +``` + +The `--model` flag can be repeated to target a specific subset. + +### Other flags + +- `--max-tokens ` — raise the per-call output limit (default 2048) +- `--skip-cache` — disable `cache_control` on the system prompt (useful for one-shot checks) +- `--output ` — override the default report path + +## Interpreting Reports + +Each run report contains: + +- `summary.pass` / `summary.fail` / `summary.total` — aggregate counts across all (model × scenario) cells +- `results[]` — one entry per cell with `pass`, `missing_required`, `leaked_forbidden`, token `usage`, elapsed time, and a `response_preview` (first 500 chars) + +Typical failure modes: + +- **Missing required phrase** — the model did not surface a concept the skill should have prompted (e.g., failed to mention `--hybrid-mode full` for enrichment scenarios) +- **Leaked forbidden phrase** — the model proposed an approach the skill explicitly warns against (e.g., looping `convert()` per file) +- **API error** — the `error` field is set; no tokens were consumed on that cell + +## CI + +The evaluation runner is also invocable via GitHub Actions (`.github/workflows/skill-evals.yml`). The workflow is **manual-trigger only** (`workflow_dispatch`) because each run consumes Anthropic API credits; it is not wired to every PR. 
Maintainers should run it: + +- After substantive `SKILL.md` or reference edits +- Before tagging a release +- On request when a new model becomes available + +The workflow reads `ANTHROPIC_API_KEY` from a repository secret of the same name. + +## Adding a New Scenario + +1. Append an entry to `evals.json` under `evals[]` with a fresh `id` (e.g., `eval-006`). +2. Include `scenario`, `user_input`, `expected_recommendations`, `must_mention`, and `must_not_mention`. +3. Run `python scripts/run-evals.py` locally and confirm the new case passes on at least one model before committing. +4. If the case reveals a gap, update `SKILL.md` or a reference file first — do not lower the bar in `must_mention` to make a failing case pass. + +Scenario coverage should include: + +- **Normal** — straightforward use of a core feature +- **Error** — recoverable failure mode (missing prerequisite, silent-skip trap) +- **Boundary** — edge conditions (very large input, unusual OS, password-protected PDF, etc.) diff --git a/skills/odl-pdf/evals/evals.json b/skills/odl-pdf/evals/evals.json index f66e4f4f6..49478f763 100644 --- a/skills/odl-pdf/evals/evals.json +++ b/skills/odl-pdf/evals/evals.json @@ -42,7 +42,12 @@ ], "must_not_mention": [ "local mode as sufficient for scanned PDFs", - "GPU required" + "GPU required", + "Adoptium", + "Temurin", + "Zulu", + "SDKMAN", + "brew install --cask" ] }, { @@ -78,7 +83,8 @@ "must_mention": [ "pip install", "opendataloader-pdf-hybrid", - "two terminals", + "Terminal 1", + "Terminal 2", "java" ], "must_not_mention": [ @@ -104,7 +110,55 @@ ], "must_not_mention": [ "enrichments work in auto mode", - "enrichments are client-side options" + "enrichments are client-side options", + "SmolVLM", + "--picture-description-prompt", + "--enrich-formula-model", + "--enrich-picture-model" + ] + }, + { + "id": "eval-006", + "scenario": "A user installed opendataloader-pdf via pip on a fresh machine without a JDK, ran their first conversion, and got an 
UnsupportedClassVersionError. They paste the error and ask what is wrong. This is the Java-missing failure mode the skill's first Critical Gotcha exists to handle.", + "user_input": "I just ran `pip install opendataloader-pdf` and then `opendataloader-pdf input.pdf` and got `java.lang.UnsupportedClassVersionError`. What's wrong?", + "expected_recommendations": [ + "Identify the root cause: Java 11 or higher is required, and the installed Java is missing or below version 11", + "Tell the user to verify with `java -version`", + "Tell the user to install a JDK 11 or higher for their platform", + "Do NOT recommend a specific JDK distribution (Adoptium, Temurin, Zulu, OpenJDK download URLs, brew/apt one-liners) — neutral guidance only" + ], + "must_mention": [ + "Java 11", + "java -version" + ], + "must_not_mention": [ + "Adoptium", + "Temurin", + "Zulu", + "OpenJDK download", + "brew install --cask", + "apt install openjdk", + "this is a bug in opendataloader-pdf" + ] + }, + { + "id": "eval-007", + "scenario": "A user has a password-protected PDF and asks how to extract it. This exercises the `--password` / `-p` option. The correct answer must surface the option without claiming the tool cannot handle encrypted PDFs.", + "user_input": "I have a password-protected PDF I need to extract. 
The password is 'secret123'.", + "expected_recommendations": [ + "Surface the --password (short: -p) CLI option as the correct mechanism", + "Show a concrete command example using --password or -p with the supplied value", + "Do NOT claim the tool cannot extract encrypted PDFs" + ], + "must_mention": [ + "--password", + "secret123" + ], + "must_not_mention": [ + "cannot extract encrypted PDFs", + "encrypted PDFs are not supported", + "decryption is not supported", + "you need to remove the password first" ] } ] diff --git a/skills/odl-pdf/scripts/run-evals.py b/skills/odl-pdf/scripts/run-evals.py new file mode 100644 index 000000000..97f0c3fed --- /dev/null +++ b/skills/odl-pdf/scripts/run-evals.py @@ -0,0 +1,207 @@ +#!/usr/bin/env python3 +"""Run the odl-pdf skill evaluations against multiple Claude models. + +Loads `evals/evals.json` scenarios and sends each `user_input` to each +target model with `SKILL.md` as the system prompt. Checks each response +for `must_mention` (all phrases present) and `must_not_mention` (none +present). Writes a JSON report to `evals/runs/.json`. + +Exit codes: + 0 all runs passed + 1 at least one run failed (missing required or leaked forbidden) + 2 setup error (missing API key, missing files, SDK not installed) + +Usage: + export ANTHROPIC_API_KEY=... 
+ python scripts/run-evals.py + python scripts/run-evals.py --model claude-haiku-4-5-20251001 + python scripts/run-evals.py --skip-cache --max-tokens 4096 +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +import time +from datetime import datetime, timezone +from pathlib import Path + +DEFAULT_MODELS = [ + "claude-haiku-4-5-20251001", + "claude-sonnet-4-6", + "claude-opus-4-7", +] + +SKILL_DIR = Path(__file__).resolve().parent.parent +SKILL_MD = SKILL_DIR / "SKILL.md" +EVALS_JSON = SKILL_DIR / "evals" / "evals.json" +RUNS_DIR = SKILL_DIR / "evals" / "runs" + + +def load_skill_system_prompt() -> str: + text = SKILL_MD.read_text(encoding="utf-8") + return ( + "You are using the `odl-pdf` agent skill to help a user with " + "opendataloader-pdf. The skill content follows. Treat it as " + "authoritative guidance and answer the user's question by applying " + "the workflow and recommendations defined below.\n\n" + "---\n\n" + text + ) + + +def check_phrase(phrase: str, haystack: str) -> bool: + return phrase.lower() in haystack.lower() + + +def evaluate_response(eval_case: dict, response_text: str) -> dict: + required = eval_case.get("must_mention", []) + forbidden = eval_case.get("must_not_mention", []) + missing = [p for p in required if not check_phrase(p, response_text)] + leaked = [p for p in forbidden if check_phrase(p, response_text)] + return { + "pass": not missing and not leaked, + "missing_required": missing, + "leaked_forbidden": leaked, + "required_total": len(required), + "forbidden_total": len(forbidden), + } + + +def run_one(client, model: str, system_text: str, user_input: str, + use_cache: bool, max_tokens: int) -> tuple[str, dict]: + system_block = {"type": "text", "text": system_text} + if use_cache: + system_block["cache_control"] = {"type": "ephemeral"} + + resp = client.messages.create( + model=model, + max_tokens=max_tokens, + system=[system_block], + messages=[{"role": "user", "content": 
user_input}], + ) + text = "".join(b.text for b in resp.content if getattr(b, "type", "") == "text") + usage = { + "input_tokens": resp.usage.input_tokens, + "output_tokens": resp.usage.output_tokens, + "cache_creation_input_tokens": getattr(resp.usage, "cache_creation_input_tokens", 0), + "cache_read_input_tokens": getattr(resp.usage, "cache_read_input_tokens", 0), + } + return text, usage + + +def main() -> int: + ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("--model", action="append", default=None, + help="Model ID to run. Repeatable. Defaults to Haiku 4.5, Sonnet 4.6, Opus 4.7.") + ap.add_argument("--skip-cache", action="store_true", + help="Do not set cache_control on the system prompt (useful for one-shot checks).") + ap.add_argument("--max-tokens", type=int, default=2048, + help="Max output tokens per call (default: 2048).") + ap.add_argument("--output", type=Path, default=None, + help="Report path. Defaults to evals/runs/.json.") + args = ap.parse_args() + + try: + from anthropic import Anthropic + except ImportError: + print("ERROR: the `anthropic` package is not installed. 
Run `pip install anthropic`.", file=sys.stderr) + return 2 + + if not os.getenv("ANTHROPIC_API_KEY"): + print("ERROR: ANTHROPIC_API_KEY is not set.", file=sys.stderr) + return 2 + + if not EVALS_JSON.exists(): + print(f"ERROR: missing {EVALS_JSON}.", file=sys.stderr) + return 2 + if not SKILL_MD.exists(): + print(f"ERROR: missing {SKILL_MD}.", file=sys.stderr) + return 2 + + models = args.model if args.model else list(DEFAULT_MODELS) + evals_data = json.loads(EVALS_JSON.read_text(encoding="utf-8")) + cases = evals_data.get("evals", []) + if not cases: + print("ERROR: evals.json contains no scenarios.", file=sys.stderr) + return 2 + + system = load_skill_system_prompt() + client = Anthropic() + + started = datetime.now(timezone.utc).replace(microsecond=0).isoformat() + results = [] + + for model in models: + for case in cases: + t0 = time.time() + try: + text, usage = run_one( + client, model, system, case["user_input"], + use_cache=not args.skip_cache, + max_tokens=args.max_tokens, + ) + score = evaluate_response(case, text) + error = None + except Exception as exc: # noqa: BLE001 + text = "" + usage = {} + error = repr(exc) + score = { + "pass": False, + "missing_required": case.get("must_mention", []), + "leaked_forbidden": [], + "required_total": len(case.get("must_mention", [])), + "forbidden_total": len(case.get("must_not_mention", [])), + } + + results.append({ + "model": model, + "eval_id": case["id"], + "scenario": case["scenario"], + "pass": score["pass"], + "missing_required": score["missing_required"], + "leaked_forbidden": score["leaked_forbidden"], + "elapsed_s": round(time.time() - t0, 2), + "usage": usage, + "error": error, + "response_preview": text[:500], + }) + + status = "PASS" if score["pass"] else "FAIL" + print(f"[{status}] {model} :: {case['id']} " + f"({len(score['missing_required'])} missing, " + f"{len(score['leaked_forbidden'])} leaked, " + f"{results[-1]['elapsed_s']}s)") + + finished = 
datetime.now(timezone.utc).replace(microsecond=0).isoformat() + summary = { + "pass": sum(1 for r in results if r["pass"]), + "fail": sum(1 for r in results if not r["pass"]), + "total": len(results), + } + + report = { + "started_utc": started, + "finished_utc": finished, + "skill": evals_data.get("skill", "odl-pdf"), + "models": models, + "cache_enabled": not args.skip_cache, + "max_tokens": args.max_tokens, + "summary": summary, + "results": results, + } + + RUNS_DIR.mkdir(parents=True, exist_ok=True) + out_path = args.output or (RUNS_DIR / f"{started.replace(':', '-')}.json") + out_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8") + + print(f"\nSummary: {summary['pass']}/{summary['total']} passed across {len(models)} model(s).") + print(f"Report: {out_path}") + + return 0 if summary["fail"] == 0 else 1 + + +if __name__ == "__main__": + sys.exit(main()) From a0c7914c012598d8b5da475d2730afcd0ec6aa61 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Wed, 22 Apr 2026 09:49:04 +0900 Subject: [PATCH 10/13] =?UTF-8?q?feat(skill):=20respond=20to=20PR=20review?= =?UTF-8?q?=20#4251171155=20=E2=80=94=20convention-aligned=20target=20user?= =?UTF-8?q?=20phrasing=20+=20drift=20checklist?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: external review of PR #395 raised four asks before merge. Three of them (#2 target user definition, #3 source-checkout derivability, #4 maintenance surface area) are about how this skill positions itself and stays correct over time. The fourth (#1 empirical baseline) is its own verification step covered separately. This commit addresses #2/#3/#4. 
Approach (revised after surveying convention in established skills — Anthropic pdf, Docling, Google Workspace CLI all carry "who/when" in the frontmatter description rather than a separate body section, and none of them justify their own content existence in-file): - SKILL.md frontmatter description: extend to lead with the "for X who Y" pattern the reviewer requested. New sentence: "For developers picking install path, mode, format, and option combinations, diagnosing extraction quality, and avoiding silent failure modes (enrichments skipped without --hybrid-mode full, slow batches from per-file JVM startup) that the README does not surface up-front." Negative space ("Do NOT use for: PDF merge/split/rotate, ...") was already present. No new body section added — keeps SKILL.md aligned with the established convention of carrying targeting in the frontmatter only. - options-matrix.md header: extend the existing intro paragraph by one clause naming what is NOT in options.json and only lives here (category groupings, Interaction Rules, Common Combinations). This satisfies the reviewer's #3 ask (justify why the file exists alongside options.json) without adding a meta "What is duplicated from options.json, and why" section — established skills do not document their own design decisions in-file; the rationale belongs in the PR conversation and the maintainer-facing CLAUDE.md. - CLAUDE.md "When adding or changing CLI options" checklist: expand from 3 lines to 6 numbered steps that name every file the drift CI does NOT cover (hybrid-guide.md, format-guide.md, integration-examples.md, SKILL.md Critical Gotchas, eval-metrics.md Low-* sections). This is the (c) option from the review's "(a)/(b)/(c)" choice — combined with the (b)-style table removal already done in 37bab4e, the maintenance surface is now documented even where automation does not reach. 
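The name-level enforcement that step 2 of the checklist leans on can be sketched in a few lines. This is an illustrative sketch only, not the actual skill-drift-check.yml / sync-skill-refs.py implementation: the flat list-of-objects shape assumed for options.json and the helper name are assumptions, and, matching the limitation noted in the checklist, only option names are compared, never description text.

```python
import json
import re
from pathlib import Path

def option_name_drift(options_json: Path, matrix_md: Path) -> tuple[set[str], set[str]]:
    """Compare option names in options.json against flags mentioned in options-matrix.md.

    ASSUMPTION: options.json is a flat list of {"name": "--flag", ...} objects;
    the real schema may nest these differently. Only names are checked, so
    stale description text in the matrix will not be flagged.
    """
    json_names = {opt["name"] for opt in json.loads(options_json.read_text(encoding="utf-8"))}
    # Collect every backticked long flag the matrix mentions anywhere.
    md_names = set(re.findall(r"`(--[a-z][a-z0-9-]*)`", matrix_md.read_text(encoding="utf-8")))
    undocumented = json_names - md_names  # added in Java, missing from the matrix
    stale = md_names - json_names         # still in the matrix, gone from the CLI
    return undocumented, stale
```

A CI job built on this would fail the build whenever either returned set is non-empty, which is the "warn if step 2 is missed" behavior the old 3-line checklist described.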
Also add a reminder to manually trigger skill-evals.yml after substantive skill edits, since that workflow does not auto-run on push. Why this approach over the first iteration (a separate "Who this skill is for" body section + a separate "What is duplicated from options.json, and why" section): a survey of established skill SKILL.md files (Anthropic pdf, Docling) showed that the convention is to carry targeting and rationale in the frontmatter description and PR/maintainer docs respectively, not in user-facing body sections. Following the convention keeps this skill aligned with the rest of the agent-skill ecosystem and avoids meta-content that does not directly help the user of the skill. Evidence: - SKILL.md frontmatter description now contains the "For developers..." sentence as its second sentence, before the trigger-keyword list. - options-matrix.md intro paragraph now ends with the three named added-value categories. - CLAUDE.md Agent Skills section lists 6 numbered manual-update targets, explicitly flagging step 2 as CI-enforced and steps 3-6 as manual-only. - No new top-level sections added to any user-facing skill file. Stage 5.5 framework update (drafts/skill-research-draft.md, not in this repo) added Q14 (target user), Q15 (drift comprehensiveness), the new Code-derivability axis (§ A.4.3), and § A.7 Deployment Context Dimension to formalize what this commit applies. Co-Authored-By: Claude Opus 4.7 (1M context) --- CLAUDE.md | 34 ++++++++++++++++++--- skills/odl-pdf/SKILL.md | 17 ++++++----- skills/odl-pdf/references/options-matrix.md | 3 +- 3 files changed, 42 insertions(+), 12 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index cd5aa1784..a1ee02328 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -26,10 +26,36 @@ Hidden text detection (`--filter-hidden-text`) is **off by default** — it requ `skills/odl-pdf/` contains the public agent skill shipped with this project. -When adding or changing CLI options in Java: -1. 
Run `npm run sync` (regenerates options.json + Python/Node bindings) -2. Update `skills/odl-pdf/references/options-matrix.md` with the new option -3. CI (`skill-drift-check.yml`) will warn if step 2 is missed +When adding or changing CLI options in Java, the following files may need +manual updates. The drift CI (`skill-drift-check.yml`) only enforces step 2; +the others are NOT auto-checked and will silently go stale if missed: + +1. Run `npm run sync` (regenerates `options.json` + Python/Node bindings) +2. **Always**: update `skills/odl-pdf/references/options-matrix.md` to add / + rename / remove the row matching `options.json`. Drift CI enforces option + **names** here; description text is not auto-checked. +3. **If the option is hybrid-related** (`--hybrid-*`, server flags like + `--enrich-*`, `--force-ocr`, `--ocr-lang`): also update + `skills/odl-pdf/references/hybrid-guide.md` — Client Options table, Server + Configuration table, or both. +4. **If the option is a new output format or affects format selection** (touches + the `--format` enum, image handling, page separators): also update + `skills/odl-pdf/references/format-guide.md` and the Output Pipeline section + of `skills/odl-pdf/references/integration-examples.md`. +5. **If the option introduces a silent failure mode, an unsafe default, or a + prerequisite**: also add it to the **Critical Gotchas** section of + `skills/odl-pdf/SKILL.md`. Silent failures (e.g., enrichments skipped in + `--hybrid-mode auto`, JVM cold-start cost on per-file calls) are the class + of issue the skill exists to surface — keep the gotchas list current. +6. **If the option changes the recommended escalation path** for a quality + metric (NID / TEDS / MHS / Table Detection F1): also update the + corresponding Low-* section of `skills/odl-pdf/references/eval-metrics.md`. + +After substantive skill changes, manually trigger the +`skill-evals.yml` workflow (Actions → Run workflow) to re-run the multi-model +evaluation suite. 
The smoke-test workflow (`skill-smoke-test.yml`) runs +automatically on push and verifies cross-platform shell + Python script +behavior, but does not exercise model behavior. The skill is written in English for external users. Do not include internal team terminology or company-specific policies. diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md index 0c9715703..b7b8e5848 100644 --- a/skills/odl-pdf/SKILL.md +++ b/skills/odl-pdf/SKILL.md @@ -1,13 +1,16 @@ --- name: odl-pdf description: > - Expert PDF extraction guidance for opendataloader-pdf. Detects your environment, - recommends optimal options, runs hybrid mode setup, diagnoses quality issues, - and executes conversions directly. Use when: 'PDF extraction', 'PDF to markdown', - 'PDF to JSON', 'PDF to HTML', 'opendataloader', 'ODL', 'hybrid mode', - 'scanned PDF', 'OCR', 'PDF tables', 'RAG pipeline with PDF', 'PDF accessibility', - 'PDF/UA'. Do NOT use for: PDF merge/split/rotate, Word/Excel conversion, - PDF form filling. + Expert PDF extraction guidance for opendataloader-pdf. For developers picking + install path, mode, format, and option combinations, diagnosing extraction + quality, and avoiding silent failure modes (enrichments skipped without + --hybrid-mode full, slow batches from per-file JVM startup) that the README + does not surface up-front. Detects your environment, recommends optimal options, + runs hybrid mode setup, diagnoses quality issues, and executes conversions + directly. Use when: 'PDF extraction', 'PDF to markdown', 'PDF to JSON', + 'PDF to HTML', 'opendataloader', 'ODL', 'hybrid mode', 'scanned PDF', 'OCR', + 'PDF tables', 'RAG pipeline with PDF', 'PDF accessibility', 'PDF/UA'. + Do NOT use for: PDF merge/split/rotate, Word/Excel conversion, PDF form filling. 
--- # Targets: opendataloader-pdf >= 2.2.0 diff --git a/skills/odl-pdf/references/options-matrix.md b/skills/odl-pdf/references/options-matrix.md index f62dce7fd..1b3e8e159 100644 --- a/skills/odl-pdf/references/options-matrix.md +++ b/skills/odl-pdf/references/options-matrix.md @@ -3,7 +3,8 @@ This file contains a built-in summary of all 26 CLI options for the `opendataloader-pdf` tool. If `options.json` is present in the project root, that file is the authoritative source — always prefer it over the descriptions here. This document exists so the agent skill can reason about -options without loading the full JSON on every invocation. +options without loading the full JSON on every invocation, and adds **category groupings**, +**Interaction Rules**, and **Common Combinations** that the raw schema does not express. --- From e8df66bd91dadcca3e840a300532bcfd9bfc8737 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Wed, 22 Apr 2026 13:58:30 +0900 Subject: [PATCH 11/13] refactor(skill): drop API-based eval runner and workflow from PR scope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: the skill shipped with evals.json as a scenario spec from its original commit (14cf497). Later rounds of this PR added an Anthropic-API-based runner (scripts/run-evals.py), a manual-trigger CI workflow (.github/workflows/skill-evals.yml), and a usage README (evals/README.md) that together automated multi-model verification against that spec. On reflection this infrastructure sat outside the PR's actual scope. The external review did not ask for it; it came from an internal audit criterion. Running it on any meaningful cadence requires a maintainer to register an ANTHROPIC_API_KEY secret and spend API credits the project never committed to spending, and shipping a runner that has never actually been run against the real API amounts to a promise the PR cannot back. Approach: drop the three API-adjacent artifacts. 
Preserve evals.json itself along with every improvement this PR made to it: eval-004 must_mention alignment (Terminal 1 / Terminal 2 replacing the "two terminals" phrase the skill never teaches); must_not_mention hardening in eval-002 against JDK-distribution recommendations and in eval-005 against fabricated VLM names and --picture-description-* options; the eval-006 error scenario (UnsupportedClassVersionError) and eval-007 boundary scenario (password-protected PDF). evals.json remains a readable spec — a contract future maintainers can verify by any means they choose, without locking this PR into a specific execution path that has costs the project did not opt into. - Delete scripts/run-evals.py - Delete .github/workflows/skill-evals.yml - Delete evals/README.md - Remove the two run-evals.py steps from .github/workflows/skill-smoke-test.yml (the other 10 steps remain and exercise the shell scripts, quick-eval.py, and sync-skill-refs.py across ubuntu/windows/macos without consuming API credits). - Remove the paragraph in CLAUDE.md that directed maintainers to manually trigger skill-evals.yml after substantive skill changes; retain the factual one-sentence description of the smoke-test workflow and the 6-step manual-update checklist. evals.json is unchanged by this commit. Evidence: yaml.safe_load on skill-smoke-test.yml still parses; 10 steps remain and cover the non-API surface. sync-skill-refs.py still reports "No drift detected". CLAUDE.md Agent Skills section now ends with the existing English-only policy line without any maintainer action prescription tied to skill-evals.yml. 
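To make "verify by any means they choose" concrete: the scoring the deleted runner performed is a case-insensitive substring check that a maintainer can reproduce in a few lines. The sketch below mirrors the removed evaluate_response logic; the evals.json path in the usage comment assumes the repository root as the working directory.

```python
import json
from pathlib import Path

def check_case(case: dict, response_text: str) -> dict:
    """Score one evals.json scenario: every must_mention phrase must appear in the
    response, and no must_not_mention phrase may appear (case-insensitive)."""
    hay = response_text.lower()
    missing = [p for p in case.get("must_mention", []) if p.lower() not in hay]
    leaked = [p for p in case.get("must_not_mention", []) if p.lower() in hay]
    return {
        "pass": not missing and not leaked,
        "missing_required": missing,
        "leaked_forbidden": leaked,
    }

# Usage sketch: load the spec, obtain a response for case["user_input"] by any
# means (interactive agent session, API call, manual paste), then score it:
#   cases = json.loads(Path("skills/odl-pdf/evals/evals.json").read_text(encoding="utf-8"))["evals"]
#   result = check_case(cases[0], response_text)
```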
Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/skill-evals.yml | 60 ------- .github/workflows/skill-smoke-test.yml | 18 +-- CLAUDE.md | 8 +- skills/odl-pdf/evals/README.md | 79 ---------- skills/odl-pdf/scripts/run-evals.py | 207 ------------------------- 5 files changed, 4 insertions(+), 368 deletions(-) delete mode 100644 .github/workflows/skill-evals.yml delete mode 100644 skills/odl-pdf/evals/README.md delete mode 100644 skills/odl-pdf/scripts/run-evals.py diff --git a/.github/workflows/skill-evals.yml b/.github/workflows/skill-evals.yml deleted file mode 100644 index b2bb99b5a..000000000 --- a/.github/workflows/skill-evals.yml +++ /dev/null @@ -1,60 +0,0 @@ -# skill-evals.yml -# Runs the odl-pdf skill scenario evaluations against multiple Claude models. -# Manual trigger only — each run consumes Anthropic API credits. -# -# Required repo secret: ANTHROPIC_API_KEY - -name: Skill Evaluations - -on: - workflow_dispatch: - inputs: - models: - description: "Comma-separated model IDs (blank = Haiku 4.5, Sonnet 4.6, Opus 4.7)" - required: false - default: "" - max_tokens: - description: "Max output tokens per call" - required: false - default: "2048" - -permissions: - contents: read - -jobs: - run-evals: - runs-on: ubuntu-latest - timeout-minutes: 20 - steps: - - uses: actions/checkout@v4 - - - uses: actions/setup-python@v5 - with: - python-version: '3.12' - - - name: Install anthropic SDK - run: pip install anthropic - - - name: Run skill evaluations - env: - ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} - run: | - MODEL_ARGS="" - if [ -n "${{ inputs.models }}" ]; then - IFS=',' read -ra MODELS <<< "${{ inputs.models }}" - for m in "${MODELS[@]}"; do - MODEL_ARGS="$MODEL_ARGS --model $(echo $m | xargs)" - done - fi - python skills/odl-pdf/scripts/run-evals.py \ - --max-tokens "${{ inputs.max_tokens }}" \ - $MODEL_ARGS - - - name: Upload report - if: always() - uses: actions/upload-artifact@v4 - with: - name: skill-evals-report - path: 
skills/odl-pdf/evals/runs/*.json - if-no-files-found: warn - retention-days: 30 diff --git a/.github/workflows/skill-smoke-test.yml b/.github/workflows/skill-smoke-test.yml index 7ba0d7f11..7d4693df9 100644 --- a/.github/workflows/skill-smoke-test.yml +++ b/.github/workflows/skill-smoke-test.yml @@ -2,9 +2,7 @@ # Cross-platform smoke test for the odl-pdf skill's executable assets. # Runs the shell scripts and Python scripts on ubuntu / windows / macos # to catch platform-specific regressions (line endings, console encoding, -# shell portability) BEFORE a PR merges. -# -# Does NOT hit the Anthropic API; for that, see skill-evals.yml (manual). +# shell portability) BEFORE a PR merges. Does NOT hit any external API. name: Skill Smoke Test @@ -106,17 +104,3 @@ jobs: # --- sync-skill-refs.py -------------------------------------------- - name: sync-skill-refs.py reports no drift run: python skills/odl-pdf/scripts/sync-skill-refs.py - - # --- run-evals.py (no API) ----------------------------------------- - - name: run-evals.py --help - run: python skills/odl-pdf/scripts/run-evals.py --help - - - name: run-evals.py missing-key exits 2 - env: - ANTHROPIC_API_KEY: "" - run: | - set +e - python skills/odl-pdf/scripts/run-evals.py - rc=$? - set -e - [ "$rc" = "2" ] || { echo "expected exit 2, got $rc"; exit 1; } diff --git a/CLAUDE.md b/CLAUDE.md index 606021580..3a7d28774 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -51,11 +51,9 @@ the others are NOT auto-checked and will silently go stale if missed: metric (NID / TEDS / MHS / Table Detection F1): also update the corresponding Low-* section of `skills/odl-pdf/references/eval-metrics.md`. -After substantive skill changes, manually trigger the -`skill-evals.yml` workflow (Actions → Run workflow) to re-run the multi-model -evaluation suite. The smoke-test workflow (`skill-smoke-test.yml`) runs -automatically on push and verifies cross-platform shell + Python script -behavior, but does not exercise model behavior. 
+The `skill-smoke-test.yml` workflow runs automatically on push and +verifies cross-platform shell and Python script behavior on +ubuntu/windows/macos; it does not exercise model behavior. The skill is written in English for external users. Do not include internal team terminology or company-specific policies. diff --git a/skills/odl-pdf/evals/README.md b/skills/odl-pdf/evals/README.md deleted file mode 100644 index 179730f5f..000000000 --- a/skills/odl-pdf/evals/README.md +++ /dev/null @@ -1,79 +0,0 @@ -# odl-pdf Skill Evaluations - -This directory holds the scenario-based evaluations for the `odl-pdf` skill and the results of running them against Claude models. - -## Files - -| File | Purpose | -|---|---| -| `evals.json` | Scenario definitions: user inputs, expected recommendations, required phrases, forbidden phrases | -| `runs/.json` | One report per evaluation run. Committed when the run is meaningful evidence (e.g., after a significant skill change) | - -## Running the Evaluations - -The runner lives at `scripts/run-evals.py`. It loads `SKILL.md` as the system prompt, sends each scenario's `user_input` as a user message to each target model, and checks the response against `must_mention` (all phrases must appear) and `must_not_mention` (none may appear). - -### Prerequisites - -```bash -pip install anthropic -export ANTHROPIC_API_KEY=sk-ant-... -``` - -### Default run — Haiku 4.5, Sonnet 4.6, Opus 4.7 - -```bash -python scripts/run-evals.py -``` - -Writes `evals/runs/.json` and exits `0` if all pass, `1` if any fail. - -### Run against one model - -```bash -python scripts/run-evals.py --model claude-sonnet-4-6 -``` - -The `--model` flag can be repeated to target a specific subset. 
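The repeat-to-accumulate behavior of the flag is plain argparse `action="append"` with a `None` default, so an untouched flag falls back to the full default list — a minimal sketch mirroring the runner's flag handling (the function name is illustrative):

```python
import argparse

DEFAULT_MODELS = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-7",
]

def parse_models(argv):
    ap = argparse.ArgumentParser()
    # action="append" accumulates one list entry per occurrence;
    # default=None (not []) lets us detect "flag never given"
    ap.add_argument("--model", action="append", default=None)
    args = ap.parse_args(argv)
    return args.model if args.model else list(DEFAULT_MODELS)

assert parse_models([]) == DEFAULT_MODELS
assert parse_models(["--model", "claude-sonnet-4-6"]) == ["claude-sonnet-4-6"]
```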
- -### Other flags - -- `--max-tokens ` — raise the per-call output limit (default 2048) -- `--skip-cache` — disable `cache_control` on the system prompt (useful for one-shot checks) -- `--output ` — override the default report path - -## Interpreting Reports - -Each run report contains: - -- `summary.pass` / `summary.fail` / `summary.total` — aggregate counts across all (model × scenario) cells -- `results[]` — one entry per cell with `pass`, `missing_required`, `leaked_forbidden`, token `usage`, elapsed time, and a `response_preview` (first 500 chars) - -Typical failure modes: - -- **Missing required phrase** — the model did not surface a concept the skill should have prompted (e.g., failed to mention `--hybrid-mode full` for enrichment scenarios) -- **Leaked forbidden phrase** — the model proposed an approach the skill explicitly warns against (e.g., looping `convert()` per file) -- **API error** — the `error` field is set; no tokens were consumed on that cell - -## CI - -The evaluation runner is also invocable via GitHub Actions (`.github/workflows/skill-evals.yml`). The workflow is **manual-trigger only** (`workflow_dispatch`) because each run consumes Anthropic API credits; it is not wired to every PR. Maintainers should run it: - -- After substantive `SKILL.md` or reference edits -- Before tagging a release -- On request when a new model becomes available - -The workflow reads `ANTHROPIC_API_KEY` from a repository secret of the same name. - -## Adding a New Scenario - -1. Append an entry to `evals.json` under `evals[]` with a fresh `id` (e.g., `eval-006`). -2. Include `scenario`, `user_input`, `expected_recommendations`, `must_mention`, and `must_not_mention`. -3. Run `python scripts/run-evals.py` locally and confirm the new case passes on at least one model before committing. -4. If the case reveals a gap, update `SKILL.md` or a reference file first — do not lower the bar in `must_mention` to make a failing case pass. 
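The pass/fail rule — every `must_mention` phrase present, no `must_not_mention` phrase present, matched as a case-insensitive substring — is compact enough to sketch; the scenario entry below is hypothetical, not a real eval case:

```python
def check_phrase(phrase: str, haystack: str) -> bool:
    # case-insensitive substring match, as the runner uses
    return phrase.lower() in haystack.lower()

def evaluate(case: dict, response: str) -> dict:
    missing = [p for p in case.get("must_mention", []) if not check_phrase(p, response)]
    leaked = [p for p in case.get("must_not_mention", []) if check_phrase(p, response)]
    return {"pass": not missing and not leaked,
            "missing_required": missing,
            "leaked_forbidden": leaked}

case = {  # hypothetical entry in the evals.json shape
    "must_mention": ["--hybrid-mode full"],
    "must_not_mention": ["loop convert() per file"],
}
assert evaluate(case, "For enrichment, run with --hybrid-mode full.")["pass"]
assert not evaluate(case, "Loop convert() per file over the batch.")["pass"]
```

Because the match is a plain substring, a required phrase can over-match (e.g. `3.10` would also be satisfied by `3.10.5`), so `must_mention` phrases should be chosen to be unambiguous.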
- -Scenario coverage should include: - -- **Normal** — straightforward use of a core feature -- **Error** — recoverable failure mode (missing prerequisite, silent-skip trap) -- **Boundary** — edge conditions (very large input, unusual OS, password-protected PDF, etc.) diff --git a/skills/odl-pdf/scripts/run-evals.py b/skills/odl-pdf/scripts/run-evals.py deleted file mode 100644 index 97f0c3fed..000000000 --- a/skills/odl-pdf/scripts/run-evals.py +++ /dev/null @@ -1,207 +0,0 @@ -#!/usr/bin/env python3 -"""Run the odl-pdf skill evaluations against multiple Claude models. - -Loads `evals/evals.json` scenarios and sends each `user_input` to each -target model with `SKILL.md` as the system prompt. Checks each response -for `must_mention` (all phrases present) and `must_not_mention` (none -present). Writes a JSON report to `evals/runs/.json`. - -Exit codes: - 0 all runs passed - 1 at least one run failed (missing required or leaked forbidden) - 2 setup error (missing API key, missing files, SDK not installed) - -Usage: - export ANTHROPIC_API_KEY=... - python scripts/run-evals.py - python scripts/run-evals.py --model claude-haiku-4-5-20251001 - python scripts/run-evals.py --skip-cache --max-tokens 4096 -""" - -from __future__ import annotations - -import argparse -import json -import os -import sys -import time -from datetime import datetime, timezone -from pathlib import Path - -DEFAULT_MODELS = [ - "claude-haiku-4-5-20251001", - "claude-sonnet-4-6", - "claude-opus-4-7", -] - -SKILL_DIR = Path(__file__).resolve().parent.parent -SKILL_MD = SKILL_DIR / "SKILL.md" -EVALS_JSON = SKILL_DIR / "evals" / "evals.json" -RUNS_DIR = SKILL_DIR / "evals" / "runs" - - -def load_skill_system_prompt() -> str: - text = SKILL_MD.read_text(encoding="utf-8") - return ( - "You are using the `odl-pdf` agent skill to help a user with " - "opendataloader-pdf. The skill content follows. 
Treat it as " - "authoritative guidance and answer the user's question by applying " - "the workflow and recommendations defined below.\n\n" - "---\n\n" + text - ) - - -def check_phrase(phrase: str, haystack: str) -> bool: - return phrase.lower() in haystack.lower() - - -def evaluate_response(eval_case: dict, response_text: str) -> dict: - required = eval_case.get("must_mention", []) - forbidden = eval_case.get("must_not_mention", []) - missing = [p for p in required if not check_phrase(p, response_text)] - leaked = [p for p in forbidden if check_phrase(p, response_text)] - return { - "pass": not missing and not leaked, - "missing_required": missing, - "leaked_forbidden": leaked, - "required_total": len(required), - "forbidden_total": len(forbidden), - } - - -def run_one(client, model: str, system_text: str, user_input: str, - use_cache: bool, max_tokens: int) -> tuple[str, dict]: - system_block = {"type": "text", "text": system_text} - if use_cache: - system_block["cache_control"] = {"type": "ephemeral"} - - resp = client.messages.create( - model=model, - max_tokens=max_tokens, - system=[system_block], - messages=[{"role": "user", "content": user_input}], - ) - text = "".join(b.text for b in resp.content if getattr(b, "type", "") == "text") - usage = { - "input_tokens": resp.usage.input_tokens, - "output_tokens": resp.usage.output_tokens, - "cache_creation_input_tokens": getattr(resp.usage, "cache_creation_input_tokens", 0), - "cache_read_input_tokens": getattr(resp.usage, "cache_read_input_tokens", 0), - } - return text, usage - - -def main() -> int: - ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) - ap.add_argument("--model", action="append", default=None, - help="Model ID to run. Repeatable. 
Defaults to Haiku 4.5, Sonnet 4.6, Opus 4.7.") - ap.add_argument("--skip-cache", action="store_true", - help="Do not set cache_control on the system prompt (useful for one-shot checks).") - ap.add_argument("--max-tokens", type=int, default=2048, - help="Max output tokens per call (default: 2048).") - ap.add_argument("--output", type=Path, default=None, - help="Report path. Defaults to evals/runs/.json.") - args = ap.parse_args() - - try: - from anthropic import Anthropic - except ImportError: - print("ERROR: the `anthropic` package is not installed. Run `pip install anthropic`.", file=sys.stderr) - return 2 - - if not os.getenv("ANTHROPIC_API_KEY"): - print("ERROR: ANTHROPIC_API_KEY is not set.", file=sys.stderr) - return 2 - - if not EVALS_JSON.exists(): - print(f"ERROR: missing {EVALS_JSON}.", file=sys.stderr) - return 2 - if not SKILL_MD.exists(): - print(f"ERROR: missing {SKILL_MD}.", file=sys.stderr) - return 2 - - models = args.model if args.model else list(DEFAULT_MODELS) - evals_data = json.loads(EVALS_JSON.read_text(encoding="utf-8")) - cases = evals_data.get("evals", []) - if not cases: - print("ERROR: evals.json contains no scenarios.", file=sys.stderr) - return 2 - - system = load_skill_system_prompt() - client = Anthropic() - - started = datetime.now(timezone.utc).replace(microsecond=0).isoformat() - results = [] - - for model in models: - for case in cases: - t0 = time.time() - try: - text, usage = run_one( - client, model, system, case["user_input"], - use_cache=not args.skip_cache, - max_tokens=args.max_tokens, - ) - score = evaluate_response(case, text) - error = None - except Exception as exc: # noqa: BLE001 - text = "" - usage = {} - error = repr(exc) - score = { - "pass": False, - "missing_required": case.get("must_mention", []), - "leaked_forbidden": [], - "required_total": len(case.get("must_mention", [])), - "forbidden_total": len(case.get("must_not_mention", [])), - } - - results.append({ - "model": model, - "eval_id": case["id"], - 
"scenario": case["scenario"], - "pass": score["pass"], - "missing_required": score["missing_required"], - "leaked_forbidden": score["leaked_forbidden"], - "elapsed_s": round(time.time() - t0, 2), - "usage": usage, - "error": error, - "response_preview": text[:500], - }) - - status = "PASS" if score["pass"] else "FAIL" - print(f"[{status}] {model} :: {case['id']} " - f"({len(score['missing_required'])} missing, " - f"{len(score['leaked_forbidden'])} leaked, " - f"{results[-1]['elapsed_s']}s)") - - finished = datetime.now(timezone.utc).replace(microsecond=0).isoformat() - summary = { - "pass": sum(1 for r in results if r["pass"]), - "fail": sum(1 for r in results if not r["pass"]), - "total": len(results), - } - - report = { - "started_utc": started, - "finished_utc": finished, - "skill": evals_data.get("skill", "odl-pdf"), - "models": models, - "cache_enabled": not args.skip_cache, - "max_tokens": args.max_tokens, - "summary": summary, - "results": results, - } - - RUNS_DIR.mkdir(parents=True, exist_ok=True) - out_path = args.output or (RUNS_DIR / f"{started.replace(':', '-')}.json") - out_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8") - - print(f"\nSummary: {summary['pass']}/{summary['total']} passed across {len(models)} model(s).") - print(f"Report: {out_path}") - - return 0 if summary["fail"] == 0 else 1 - - -if __name__ == "__main__": - sys.exit(main()) From 4db0902cef9df813f0f4bf82b9d11ad29148ed57 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Wed, 22 Apr 2026 14:56:06 +0900 Subject: [PATCH 12/13] docs(skill): link Version Compatibility table to canonical manifests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Objective: the Version Compatibility table in installation-matrix.md hardcoded Python 3.10+, Node.js 20.19+, and Java 11+ as minimum runtime floors. 
Those values are currently accurate, but they are duplicated from three manifests that this PR's drift CI does not watch: - python/opendataloader-pdf/pyproject.toml `requires-python` - node/opendataloader-pdf/package.json `engines.node` - java/pom.xml `maven.compiler.source` skill-drift-check.yml's paths only trigger on options.json and sync-skill-refs.py. The sync script only compares option names, not version strings. The CLAUDE.md 6-step manual-update checklist is scoped to "When adding or changing CLI options" and has no entry for a runtime floor bump. So if a maintainer raises `requires-python` to 3.11, the skill keeps claiming "3.10+" silently until someone catches it by hand. This is the same drift-silently-rot shape as round-2 review ask #4, applied to runtime versions instead of CLI options. Approach: replace hardcoded version numbers with manifest pointers. The table now names the source-of-truth file and field for each method instead of caching a value that can go stale: - "pip (all variants)" → `pyproject.toml` `requires-python` - "pip langchain" → above, plus the LangChain floor the external langchain-opendataloader-pdf package declares - "npm" → `package.json` `engines.node` - "Maven" → `pom.xml` `maven.compiler.source` An introductory sentence notes that `pip` / `npm` / `mvn` all validate against the manifest's floor at install time, so a too-old environment fails with a clear tool-native error without the skill needing to duplicate the value. The Java 11+ runtime note stays because the pip and npm wrappers spawn a JVM independently of the primary runtime — that requirement is a skill-level concern, not a manifest value, and Critical Gotcha 1 in SKILL.md already states it authoritatively. This is strictly a derivability cleanup (aligned with round-2 ask #3 "content derivable from code should not be duplicated"). 
No other file references the removed numeric values; Gotcha 1 and the Prerequisites section retain their "Java 11 or higher" prose because that is the skill's own constraint rather than a manifest mirror. Evidence: diff stat +14 / -7. Drift check still reports "No drift detected" (option-name-scoped; unaffected by this change). Remaining references to Python / Node / Java version numbers in the skill files: (a) SKILL.md Gotcha 1 "Java 11 or higher is required" and (b) installation-matrix.md Prerequisites section "Java 11 or higher is required" — both skill-level Java constraints, not manifest values. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../odl-pdf/references/installation-matrix.md | 21 ++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/skills/odl-pdf/references/installation-matrix.md b/skills/odl-pdf/references/installation-matrix.md index a0cb27015..e3f0a85f2 100644 --- a/skills/odl-pdf/references/installation-matrix.md +++ b/skills/odl-pdf/references/installation-matrix.md @@ -95,15 +95,22 @@ Pin `LATEST` to a specific released version from the [releases page](https://git ## Version Compatibility -| Method | Minimum Runtime | CLI Included | +Minimum runtime requirements are declared in each package's manifest. 
Consult +the manifest for the authoritative current floor — the wrappers and build tools +surface it at install time: + +| Method | Runtime requirement (source of truth) | CLI Included | |---|---|---| -| pip | Python 3.10+ | Yes | -| pip [hybrid] | Python 3.10+ | Yes | -| pip langchain | Python 3.10+, LangChain 0.1+ | Yes | -| npm | Node.js 20.19+ | Yes | -| Maven | Java 11+ | No (library only) | +| pip (all variants) | `python/opendataloader-pdf/pyproject.toml` → `requires-python` | Yes | +| pip langchain | above, plus the LangChain floor declared by `langchain-opendataloader-pdf` | Yes | +| npm | `node/opendataloader-pdf/package.json` → `engines.node` | Yes | +| Maven | `java/pom.xml` → `maven.compiler.source` | No (library only) | + +`pip` and `mvn` validate against the manifest's declared floor and fail with a +clear error if the environment is below it; `npm` treats `engines.node` as +advisory by default (`npm warn EBADENGINE`, then it installs anyway), so verify +the Node.js version manually. -All methods also require **Java 11+** regardless of the primary runtime. +All methods additionally require **Java 11 or higher** at runtime; the pip and +npm wrappers spawn a JVM internally. See Critical Gotcha 1 in `SKILL.md`.
## Post-Install Verification From a23f865a5a59eec58d142437fb5637ad19ca34f1 Mon Sep 17 00:00:00 2001 From: hyunhee-jo Date: Thu, 23 Apr 2026 14:20:08 +0900 Subject: [PATCH 13/13] docs(skill): pre-merge polish --- .claude-plugin/marketplace.json | 4 +- .gitignore | 1 + CLAUDE.md | 41 ++++++++++++++++ README.md | 8 +++- skills/README.md | 10 ++-- skills/odl-pdf/SKILL.md | 47 +++++++++++++------ skills/odl-pdf/references/eval-metrics.md | 27 ++++++----- skills/odl-pdf/references/format-guide.md | 2 + skills/odl-pdf/references/hybrid-guide.md | 23 +++++---- .../odl-pdf/references/installation-matrix.md | 23 ++++++--- .../references/integration-examples.md | 2 +- skills/odl-pdf/references/options-matrix.md | 2 +- 12 files changed, 137 insertions(+), 53 deletions(-) diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index d37103183..32e9e65e5 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -5,12 +5,14 @@ }, "metadata": { "description": "AI-powered PDF extraction guidance and automation", - "version": "1.0.0" + "version": "0.1.0" }, "plugins": [ { "name": "odl-pdf-skills", + "version": "0.1.0", "description": "Expert guidance for opendataloader-pdf — environment detection, option recommendations, hybrid mode setup, quality diagnostics, and direct conversion execution", + "homepage": "https://github.com/opendataloader-project/opendataloader-pdf/tree/main/skills/odl-pdf", "source": "./", "skills": [ "./skills/odl-pdf" diff --git a/.gitignore b/.gitignore index 2fbfac1ce..78e7fcbda 100644 --- a/.gitignore +++ b/.gitignore @@ -78,3 +78,4 @@ content/docs/ # Configuration files .claude/settings.local.json .claude/plans/ +.claude/review-rounds.md diff --git a/CLAUDE.md b/CLAUDE.md index 3a7d28774..d7378e7ea 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -55,5 +55,46 @@ The `skill-smoke-test.yml` workflow runs automatically on push and verifies cross-platform shell and Python script behavior on ubuntu/windows/macos; it 
does not exercise model behavior. +When bumping the minimum Java version (raising +`` / `` in `java/pom.xml`), +also update every explicit "Java 11" / "Java 11+" mention in these +skill files — pip-installed users do not have `java/pom.xml` on disk +and rely on the skill to state the concrete minimum: + +- `skills/odl-pdf/SKILL.md` — Persona, Phase 2A prerequisite and the + user-facing message, Action Mode A1 environment check, Gotcha 1 + (title, body, Resolution, user-facing message), Session Checklist +- `skills/odl-pdf/references/installation-matrix.md` — Prerequisites + paragraph and the Version Compatibility table's footer note +- `skills/odl-pdf/references/integration-examples.md` — opening + requirement line +- `skills/odl-pdf/evals/evals.json` — eval-006 `must_mention` array + (currently `"Java 11"`) + +The same pattern applies when the Python or Node.js runtime floor +bumps, though the urgency is asymmetric: + +- **Node.js — peer to Java in silent-failure terms.** `npm` treats + `engines.node` in `node/opendataloader-pdf/package.json` as advisory + by default (`npm warn EBADENGINE` then installs anyway), so a user + below the floor gets a cryptic runtime error rather than a blocked + install. When bumping `engines.node`, grep for the current value + (e.g. `Node.js 20.19+`) across `skills/odl-pdf/` and update every + match. `pnpm` is strict by default, but the skill cannot assume the + user's package manager. +- **Python — loud install failure.** Modern `pip` strictly enforces + `requires-python` in `python/opendataloader-pdf/pyproject.toml` and + refuses to install with a clear error. Surfacing the floor still + saves an agent-user round-trip, so when bumping `requires-python`, + grep for the current value (e.g. `Python 3.10+`) across + `skills/odl-pdf/` and update every match. 
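The grep sweep prescribed above can be sketched as a self-contained check — the paths and file contents here are stand-ins for the real skill tree:

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "references").mkdir()
# Two stand-in skill files that still mention the old Node.js floor
(root / "SKILL.md").write_text("Requires Node.js 20.19+ at runtime.\n", encoding="utf-8")
(root / "references" / "installation-matrix.md").write_text(
    "| npm | Node.js 20.19+ | Yes |\n", encoding="utf-8")

# The sweep: every file still carrying the old value must be updated on a bump
stale = sorted(p.name for p in root.rglob("*.md")
               if "Node.js 20.19" in p.read_text(encoding="utf-8"))
assert stale == ["SKILL.md", "installation-matrix.md"]
```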
+ +Current Python/Node.js floor mentions live in the same skill locations +as the Java ones: `SKILL.md` Persona, Phase 2A decision tree and +default note, Session Checklist; `installation-matrix.md` Decision +Tree and Prerequisites; `integration-examples.md` opening line. Grep +is the authoritative discovery method for either bump — the file list +above is a navigation aid, not a substitute for a fresh grep. + The skill is written in English for external users. Do not include internal team terminology or company-specific policies. diff --git a/README.md b/README.md index 0501c936d..52a14713e 100644 --- a/README.md +++ b/README.md @@ -456,7 +456,7 @@ Existing PDFs (untagged) Your AI coding agent knows how to use opendataloader-pdf — optimal options, hybrid mode setup, and quality diagnostics, all handled automatically. -Works with **Claude Code**, **Codex**, **Gemini CLI**, **Cursor**, **VS Code**, and 26+ platforms via [agentskills.io](https://agentskills.io) spec. +Follows the [Agent Skills](https://agentskills.io) open format. Native support in **Claude Code** via the included [`.claude-plugin/marketplace.json`](.claude-plugin/marketplace.json). ### What the Skill Does @@ -470,11 +470,15 @@ Works with **Claude Code**, **Codex**, **Gemini CLI**, **Cursor**, **VS Code**, ### Install +Requires Java 11+ and Python 3.10+ with `opendataloader-pdf >= 2.2.0` (Node.js 20.19+ or Java SDK also supported). + ```bash npx skills add opendataloader-project/opendataloader-pdf --skill odl-pdf ``` -Or use the `/odl-pdf` slash command in Claude Code after installing the plugin. +After installation, invoke with `/odl-pdf` in Claude Code. + +For clients without a skills installer, copy [`skills/odl-pdf/`](skills/odl-pdf/) into the client's skills directory (location varies by client — see its docs). 
## Roadmap diff --git a/skills/README.md b/skills/README.md index a65467909..60c730429 100644 --- a/skills/README.md +++ b/skills/README.md @@ -1,6 +1,6 @@ # Agent Skills -opendataloader-pdf ships built-in agent skills that help AI coding assistants use this project effectively. Skills follow the [agentskills.io](https://agentskills.io) specification and work with Claude Code, Codex, Gemini CLI, Cursor, VS Code, and 26+ platforms. +opendataloader-pdf ships built-in agent skills that help AI coding assistants use this project effectively. Skills follow the [Agent Skills](https://agentskills.io) open format. This repository is packaged for Claude Code via [`.claude-plugin/marketplace.json`](../.claude-plugin/marketplace.json). ## Directory Structure @@ -14,6 +14,7 @@ skills/ │ ├── hybrid-guide.md │ ├── format-guide.md │ ├── installation-matrix.md + │ ├── integration-examples.md │ └── eval-metrics.md ├── scripts/ ← Executable helpers │ ├── detect-env.sh @@ -30,8 +31,8 @@ skills/ | Level | Content | When Loaded | |-------|---------|-------------| -| **L1** | `description` field in SKILL.md frontmatter (~100 words) | Always visible to skill router | -| **L2** | SKILL.md body (~400 lines) — persona, workflows, decision trees, gotchas | When skill is activated | +| **L1** | `description` field in SKILL.md frontmatter | Always visible to skill router | +| **L2** | SKILL.md body — persona, workflows, decision trees, gotchas | When skill is activated | | **L3** | `references/*` files — detailed option matrices, guides, metrics | When the user enters that topic | This design minimizes token usage. The AI agent only loads what it needs for the current task. 
@@ -159,7 +160,8 @@ python skills/odl-pdf/scripts/sync-skill-refs.py ## References -- [agentskills.io specification](https://agentskills.io) — Multi-agent skill format standard +- [Agent Skills](https://agentskills.io) — Open format spec for agent skills +- [`skills` CLI](https://skills.sh) — CLI that installs Agent Skills (`vercel-labs/skills`); used by the `npx skills add ...` command in the root README's install section - [Claude Code Skills](https://docs.anthropic.com/en/docs/claude-code) — Claude Code skill documentation - `.claude-plugin/marketplace.json` — Plugin registration for this project - `CLAUDE.md` — Internal development notes (not for the skill) diff --git a/skills/odl-pdf/SKILL.md b/skills/odl-pdf/SKILL.md index b7b8e5848..70be7f9ea 100644 --- a/skills/odl-pdf/SKILL.md +++ b/skills/odl-pdf/SKILL.md @@ -7,14 +7,18 @@ description: > --hybrid-mode full, slow batches from per-file JVM startup) that the README does not surface up-front. Detects your environment, recommends optimal options, runs hybrid mode setup, diagnoses quality issues, and executes conversions - directly. Use when: 'PDF extraction', 'PDF to markdown', 'PDF to JSON', - 'PDF to HTML', 'opendataloader', 'ODL', 'hybrid mode', 'scanned PDF', 'OCR', - 'PDF tables', 'RAG pipeline with PDF', 'PDF accessibility', 'PDF/UA'. - Do NOT use for: PDF merge/split/rotate, Word/Excel conversion, PDF form filling. + directly. Use when: 'PDF extraction', 'PDF parser', 'PDF parsing', + 'open source PDF parser', 'extract text from PDF', 'PDF to text', + 'PDF to markdown', 'PDF to JSON', 'PDF to HTML', 'opendataloader', 'ODL', + 'hybrid mode', 'scanned PDF', 'OCR', 'PDF tables', 'PDF table extraction', + 'PDF chunking', 'PDF for LLM', 'PDF bounding boxes', 'RAG pipeline with PDF'. + Do NOT use for: PDF merge/split/rotate, Word/Excel conversion, PDF form filling, + PDF/UA generation, PDF accessibility tagging. 
+license: Apache-2.0 --- # Targets: opendataloader-pdf >= 2.2.0 -# Last synced options.json: 26 options +# Documented against: 2.2.1 (features added in 2.3.0+ are not yet covered) --- @@ -27,7 +31,7 @@ You are a **Document Intelligence Engineer** — not merely a PDF expert, but an - You understand PDF internals: structure trees, bounding boxes, content streams, reading order algorithms, and the difference between digital and scanned PDFs. - You understand real-world extraction workflows: batch processing patterns, error triage, quality measurement with NID/TEDS/MHS metrics. - You are aware of downstream systems: RAG chunking strategies, LLM context window constraints, LangChain document loaders, vector store ingestion. -- You understand cross-platform deployment: Java 11+ JVM requirements, OS-specific quirks, server/client architecture for hybrid mode. +- You understand cross-platform deployment: per-runtime version floors (Java 11+ per `java/pom.xml`, Python 3.10+ per `pyproject.toml`, Node.js 20.19+ per `package.json`), OS-specific quirks, server/client architecture for hybrid mode. **Interaction style:** Diagnose first, prescribe later. Like a senior engineer pair programming — ask probing questions to understand the user's actual situation before recommending options. Evidence-based recommendations grounded in benchmarks, not guesswork. @@ -102,23 +106,27 @@ Based on Phase 1 findings, make specific recommendations across four dimensions. ``` Environment detection: -├── Python available? +├── Python 3.10+ available? │ ├── Complex tables / OCR / formulas needed? │ │ └── pip install "opendataloader-pdf[hybrid]" │ ├── LangChain RAG pipeline? │ │ └── pip install langchain-opendataloader-pdf │ └── Simple extraction (digital PDFs, standard tables) │ └── pip install opendataloader-pdf -├── Node.js only? +├── Python present but below 3.10? 
+│ └── Upgrade Python to 3.10+, or use the Node.js / Java path below +│ (pip will refuse to install the current package with the +│ actual Python-version error if you try) +├── Node.js 20.19+ only? │ └── npm install @opendataloader/pdf ├── Java project (Maven/Gradle)? │ └── Add Maven dependency (see references/installation-matrix.md) └── Unsure / getting started? - └── pip install opendataloader-pdf (simplest path) + └── pip install opendataloader-pdf (simplest path; requires Python 3.10+) ``` **Critical prerequisite — Java 11+:** -All installation paths require Java 11 or higher. Python and Node.js wrappers spawn a JVM internally. Verify with `java -version`. +All installation paths require Java 11 or higher. Python and Node.js wrappers spawn a JVM internally. Verify with `java -version`. The authoritative current floor is `maven.compiler.source` in `java/pom.xml`; if that bumps, this skill must be updated (see `CLAUDE.md`). If Java is missing or below version 11: > "Java 11 or higher is required. Please install a JDK for your environment." @@ -137,7 +145,7 @@ Do NOT recommend specific JDK distributions or provide download links. PDF characteristics: │ ├── Digital PDF + clear bordered tables -│ └── Local only, --table-method default (~0.05s/page, no server needed) +│ └── Local only, --table-method default (fastest, no server needed) │ ├── Digital PDF + borderless or complex tables │ └── --table-method cluster (local, slightly slower) @@ -174,6 +182,8 @@ opendataloader-pdf input.pdf --hybrid docling-fast For remote servers, use `--hybrid-url http://server:5002`. +**Pre-flight check** — before the first hybrid run of a session, confirm the server is reachable with `scripts/hybrid-health.sh` (exit 0 if ready). This catches "connection refused" before a full conversion attempt and is cheaper than parsing a failed client log. + --- ### 2C. 
Output Format Selection @@ -356,7 +366,10 @@ Step 3: Check for scanned PDF **Text is garbled or contains replacement characters:** ``` Step 1: Check for encoding issues - Add: --replace-invalid-chars "?" (makes bad characters visible) + Add: --replace-invalid-chars "?" (overrides the default space so + CID-decode failures stand out visually instead of blending + into whitespace — distinguishes font-encoding problems from + true scan artifacts at a glance) Step 2: If it's a scanned PDF Switch to: --hybrid docling-fast (+ server --force-ocr) @@ -469,6 +482,10 @@ Common operational flags (details in `references/integration-examples.md` § Out - `--pages "1,3,5-10"` — restrict processing to a page range - `--markdown-page-separator` / `--text-page-separator` / `--html-page-separator` — inject a custom marker between pages for downstream splitting (supports `%page-number%`) +### 5E. Large PDFs in Hybrid Mode + +Since 2.2.1 the Java client automatically chunks backend-routed pages into 50-page windows before sending them to the hybrid server. A 200-page scanned PDF in `--hybrid-mode full` will no longer hang the backend, and users migrating from earlier versions no longer need to manually split large documents. This is transparent — no flag required. See `references/hybrid-guide.md` § Performance Notes. + --- ## Critical Gotchas @@ -477,7 +494,7 @@ These three issues cause the majority of user-reported problems. Check these bef ### Gotcha 1: Java 11+ Is Always Required -**Every installation path requires Java 11 or higher.** Python packages, Node.js packages, and the CLI all spawn a JVM internally. There is no pure-Python or pure-JavaScript path. +**Every installation path requires Java 11 or higher.** Python packages, Node.js packages, and the CLI all spawn a JVM internally. There is no pure-Python or pure-JavaScript path. The authoritative current floor is `maven.compiler.source` in `java/pom.xml`; this skill is updated when that bumps. 
 **Symptom:** `java.lang.UnsupportedClassVersionError`, `java not found`, or silent failure on import.
@@ -534,7 +551,7 @@ For CLI batch processing, prefer a glob pattern or a file list argument over she
 
 ## Option Reference
 
-This skill reasons about all 26 CLI options without loading their full descriptions. When the user needs option details, defaults, or interactions, load `references/options-matrix.md` (grouped by IO / Quality / Safety / Hybrid / Output / Text categories, with common combination recipes).
+This skill reasons about every CLI option declared in `options.json` without loading the full descriptions. When the user needs option details, defaults, or interactions, load `references/options-matrix.md` (grouped by IO / Quality / Safety / Hybrid / Output / Text categories, with common combination recipes).
 
 Authoritative source order:
 
@@ -585,7 +602,7 @@ Use this as a mental checklist for any extraction request:
 - [ ] Phase 1: Run detect-env.sh or ask about environment
 - [ ] Phase 1: Know the PDF type (digital/scanned/mixed)
 - [ ] Phase 1: Know the downstream use case
-- [ ] Phase 2: Confirm Java 11+ is present
+- [ ] Phase 2: Confirm runtime floors (Java 11+ always; Python 3.10+ if pip path; Node.js 20.19+ if npm path)
 - [ ] Phase 2: Selected local vs. hybrid based on PDF type
 - [ ] Phase 2: Selected output format based on downstream use
 - [ ] Phase 3: Generated or executed the command
diff --git a/skills/odl-pdf/references/eval-metrics.md b/skills/odl-pdf/references/eval-metrics.md
index aa3b988ec..e95b0bfdd 100644
--- a/skills/odl-pdf/references/eval-metrics.md
+++ b/skills/odl-pdf/references/eval-metrics.md
@@ -58,28 +58,29 @@ This document explains the metrics used in opendataloader-pdf benchmarks, how to
 
 **What it measures:** Processing throughput in seconds per page.
 
-**Interpretation:** Lower is better. Scores vary significantly by mode:
+**Interpretation:** Lower is better. Relative shape:
 
-| Mode | Approximate throughput |
-|------|----------------------|
-| Local (no hybrid) | ~0.015 s/page |
-| Hybrid `auto` (mixed document) | Varies; most pages stay at Java speed |
-| Hybrid `full` | ~0.463 s/page |
+- **Local (no hybrid)**: fastest — pure Java layout analysis
+- **Hybrid `auto`**: varies with document complexity; most pages stay at Java speed, only triaged pages pay the backend round-trip
+- **Hybrid `full`**: slowest — every page goes to the backend
 
-Speed is not normalized to 0–1. It is an absolute wall-clock measurement averaged over the benchmark document set.
+Speed is not normalized to 0–1. It is an absolute wall-clock measurement averaged over the benchmark document set. For current numbers, run `./scripts/bench.sh` — published scores can lag the latest code.
 
 ---
 
 ## Benchmark Reference Scores
 
-**200 real-world PDFs including multi-column layouts and scientific papers.**
+Run `./scripts/bench.sh` to produce the current scores against the benchmark document set maintained in [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) (200 real-world PDFs including multi-column layouts and scientific papers).
 
-| Engine | Overall | NID (Reading Order) | TEDS (Table) | MHS (Heading) | Table Detection F1 | Speed (s/page) |
-|--------|---------|---------------------|--------------|---------------|--------------------|----------------|
-| **opendataloader [hybrid]** | **0.907** | **0.934** | **0.928** | 0.821 | see bench | 0.463 |
-| opendataloader [local] | 0.831 | 0.902 | 0.489 | 0.739 | see bench | **0.015** |
+Per-metric output shape:
 
-> The `Overall` column is an average of NID / TEDS / MHS. Table Detection F1 is reported per-document by `scripts/bench.sh` but is not currently folded into the Overall average; run the bench for the F1 numbers on the current snapshot. See [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) for methodology.
+- **Overall** — the average of NID / TEDS / MHS
+- **NID** / **TEDS** / **MHS** / **Table Detection F1** — 0–1 scale, higher is better
+- **Speed** — absolute seconds per page
+
+Table Detection F1 is reported per-document and is not folded into the Overall average.
+
+Hardcoded snapshot scores are intentionally not reproduced here — they drift whenever the bench is rerun against updated extraction code or benchmark documents. The authoritative current values live in the bench output; the opendataloader-bench README also publishes periodic snapshots. See its methodology section for reference-score context.
 
 ---
 
diff --git a/skills/odl-pdf/references/format-guide.md b/skills/odl-pdf/references/format-guide.md
index d878f6ed8..8715483c2 100644
--- a/skills/odl-pdf/references/format-guide.md
+++ b/skills/odl-pdf/references/format-guide.md
@@ -2,6 +2,8 @@
 
 opendataloader-pdf supports 7 output formats via the `format` option. This guide helps you choose the right format for your use case.
 
+> This file documents the 2.2.1 snapshot (matching SKILL.md `# Documented against`). If the project's `options.json` lists a format not covered here, that file is the authoritative source — newer releases may add values this guide has not caught up to yet.
+
 ## Format Overview
 
 | Format | Best For | Bounding Boxes | Tables | Images |
diff --git a/skills/odl-pdf/references/hybrid-guide.md b/skills/odl-pdf/references/hybrid-guide.md
index 50d98583f..b9e71029f 100644
--- a/skills/odl-pdf/references/hybrid-guide.md
+++ b/skills/odl-pdf/references/hybrid-guide.md
@@ -59,16 +59,18 @@ Control how pages are routed with `--hybrid-mode`.
 
 `auto` is the default and works well for mixed documents. The triage strategy is conservative: it prefers to send borderline pages to the backend (minimizing missed complex content) at the cost of some extra backend calls.
-Expected throughput:
-- Simple pages (Java path): ~0.015 s/page
+Expected throughput shape:
+- Simple pages (Java path): fastest
 - Complex pages (backend path): varies by content and hardware
 - Overall for a mixed document: between the two extremes
 
+For current numbers, run `./scripts/bench.sh`.
+
 ### When to use `full`
 
 Use `full` when you need enrichment features (`--enrich-formula`, `--enrich-picture-description`) or when the entire document is scanned and you want consistent OCR output across all pages.
 
-Expected throughput with `full`: approximately 0.5 s/page (depends on backend and GPU availability).
+Expected throughput with `full`: noticeably slower than Java-only or `auto`, depending on backend and GPU availability. Run `./scripts/bench.sh` for current per-page timings.
 
 > **Important:** `--enrich-formula` and `--enrich-picture-description` are server-side options, but they only take effect when the client is running with `--hybrid-mode full`. In `auto` mode, enrichments are silently skipped — no warning or error is shown. If your output is missing formulas or image descriptions, check that you have `--hybrid-mode full` set on the client side.
 
@@ -93,6 +95,7 @@ All options are passed when starting `opendataloader-pdf-hybrid`.
 
 | Option | Default | Description |
 |--------|---------|-------------|
 | `--port <port>` | `5002` | Port the server listens on. |
+| `--device <device>` | `auto` | Accelerator for model inference. Values: `auto`, `cpu`, `cuda`, `mps`, `xpu`. `auto` selects the best available device (checks CUDA, then MPS, then XPU, then CPU). Use `mps` explicitly on Apple Silicon if the auto-selected device is suboptimal, or `cpu` to force CPU-only processing. |
 | `--force-ocr` | Off | Run OCR on every page, even if the page has selectable text. Use this for scanned PDFs where embedded text is unreliable. |
 | `--ocr-lang "<codes>"` | `"en"` | Comma-separated language codes for OCR (e.g., `"ko,en"`). Improves accuracy for non-English documents. |
 | `--enrich-formula` | Off | Extract mathematical formulas as LaTeX. **Requires `--hybrid-mode full` on the client.** |
@@ -163,12 +166,14 @@ Backends are selected with `--hybrid <backend>`. Only one backend can be active per
 
 ## Performance Notes
 
-| Processing path | Approximate throughput |
-|-----------------|----------------------|
-| Java only (no hybrid) | ~0.015 s/page |
-| Hybrid `auto` (mixed document) | Varies; most pages stay at Java speed |
-| Hybrid `full` | ~0.5 s/page (GPU-accelerated backend recommended) |
+Relative throughput:
+
+- **Java only (no hybrid)**: fastest path
+- **Hybrid `auto`** (mixed document): close to Java speed for most pages; only triaged pages pay the backend round-trip
+- **Hybrid `full`**: slowest path; GPU-accelerated backend recommended
 
-Latency figures are approximate and depend on document complexity, available hardware, and backend configuration. Running the hybrid server on a machine with a GPU significantly reduces the per-page time in `full` mode.
+Latency figures depend on document complexity, available hardware, and backend configuration. Running the hybrid server on a machine with a GPU significantly reduces the per-page time in `full` mode. Run `./scripts/bench.sh` against your own corpus for representative numbers.
 
 For throughput-sensitive workloads, use `auto` mode and reserve `full` mode for documents where enrichment or uniform OCR quality is required.
+
+**Large-document auto-chunking (2.2.1+)** — The Java client automatically splits backend-routed pages into 50-page chunks before sending them to the server. Processing a 200-page scanned PDF in `--hybrid-mode full` no longer hangs the backend. The AI model is loaded once at server startup (singleton), so chunking adds no per-chunk startup cost. No client-side flag; the server's existing `page_ranges` support handles it. Pre-2.2.1 users who manually split large PDFs before processing no longer need to.
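The 50-page windowing rule described above can be sketched as pure logic. This is an illustrative sketch only: `chunk_pages` is a hypothetical name, and the real implementation lives in the Java client, not in Python.

```python
def chunk_pages(page_numbers: list[int], window: int = 50) -> list[list[int]]:
    """Split backend-routed page numbers into fixed-size windows.

    Mirrors the documented 2.2.1 behavior: the client sends the hybrid
    server at most `window` pages per request, so a long document becomes
    several bounded requests instead of one oversized one.
    """
    return [page_numbers[i:i + window] for i in range(0, len(page_numbers), window)]


# A 200-page scanned document in full mode becomes four 50-page requests.
chunks = chunk_pages(list(range(1, 201)))
```

Each window then maps onto the server's existing `page_ranges` handling, which is why no new client-side flag was needed.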
diff --git a/skills/odl-pdf/references/installation-matrix.md b/skills/odl-pdf/references/installation-matrix.md
index e3f0a85f2..12f1fba0b 100644
--- a/skills/odl-pdf/references/installation-matrix.md
+++ b/skills/odl-pdf/references/installation-matrix.md
@@ -5,26 +5,26 @@ This guide helps you choose the right installation method for your environment.
 
 ## Decision Tree
 
 ```
-Do you have Python available?
+Do you have Python 3.10+ available?
 ├── Yes
 │   ├── Do you need LangChain integration?
 │   │   └── Yes → pip install langchain-opendataloader-pdf
 │   ├── Do you need hybrid server capability?
 │   │   └── Yes → pip install "opendataloader-pdf[hybrid]"
 │   └── Otherwise → pip install opendataloader-pdf (simplest)
-├── Node.js only (no Python)?
+├── Node.js 20.19+ only (no Python)?
 │   └── npm install @opendataloader/pdf
 ├── Java project (Maven/Gradle)?
 │   └── Add Maven dependency (see below)
 └── Unsure?
-    └── pip install opendataloader-pdf (simplest, works on all platforms)
+    └── pip install opendataloader-pdf (simplest, works on all platforms; requires Python 3.10+)
 ```
 
 ## Prerequisites
 
-**Java 11 or higher is required for all installation methods.** All methods spawn a JVM internally to perform PDF processing.
+**Java 11 or higher is required for all installation methods.** All methods spawn a JVM internally to perform PDF processing. The authoritative current floor is `maven.compiler.source` in `java/pom.xml`; this document is updated when that bumps.
 
-If Java is missing when you run the tool, you will see:
+If Java is missing or below version 11 when you run the tool, you will see:
 
 > Java 11 or higher is required. Please install a JDK for your environment.
 
@@ -34,6 +34,14 @@ Install a JDK appropriate for your OS before proceeding.
 Verify with: java -version
 ```
 
+**Language-binding runtime floors** are declared in each package's manifest and enforced by the respective package manager at install time:
+
+- pip: Python >= 3.10 (per `python/opendataloader-pdf/pyproject.toml` `requires-python`)
+- npm: Node.js >= 20.19 (per `node/opendataloader-pdf/package.json` `engines.node`)
+- Maven: Java >= 11 (same as the JVM floor above)
+
+If the user's runtime is below the floor, `pip` / `npm` / `mvn` refuse to install with a clear error. Java alone is the exception — it is a runtime requirement of the built JAR, so the CLI fails at use time rather than install time, which is why the upfront `java -version` verification above is explicitly called out.
+
 ## Quick Start Commands
 
 ### pip (Python)
@@ -109,8 +117,9 @@ enforce it at install time: `pip` / `npm` / `mvn` each validate against the
 manifest's declared floor and fail with a clear error if the environment is
 below it.
 
-All methods additionally require **Java 11 or higher** at runtime; the pip and
-npm wrappers spawn a JVM internally. See Critical Gotcha 1 in `SKILL.md`.
+All methods additionally require **Java 11 or higher** at runtime (current
+floor declared in `java/pom.xml` `maven.compiler.source`); the pip and npm
+wrappers spawn a JVM internally. See Critical Gotcha 1 in `SKILL.md`.
 
 ## Post-Install Verification
 
diff --git a/skills/odl-pdf/references/integration-examples.md b/skills/odl-pdf/references/integration-examples.md
index 944584033..b7fb81585 100644
--- a/skills/odl-pdf/references/integration-examples.md
+++ b/skills/odl-pdf/references/integration-examples.md
@@ -2,7 +2,7 @@
 
 Ready-to-run code for each supported interface. Load this file when the user asks for copy-pasteable examples in a specific language or framework.
 
-Every path requires **Java 11+** at runtime — see `installation-matrix.md`.
+Every path requires **Java 11+** at runtime (current floor per `java/pom.xml`). Language wrappers additionally require **Python 3.10+** (pip, per `pyproject.toml`) or **Node.js 20.19+** (npm, per `package.json`). See `installation-matrix.md` § Prerequisites for details.
 
 ---
 
diff --git a/skills/odl-pdf/references/options-matrix.md b/skills/odl-pdf/references/options-matrix.md
index 1b3e8e159..cdebfac87 100644
--- a/skills/odl-pdf/references/options-matrix.md
+++ b/skills/odl-pdf/references/options-matrix.md
@@ -1,6 +1,6 @@
 # ODL-PDF CLI Options Matrix
 
-This file contains a built-in summary of all 26 CLI options for the `opendataloader-pdf` tool.
+This file contains a built-in summary of every CLI option for the `opendataloader-pdf` tool.
 
 If `options.json` is present in the project root, that file is the authoritative source — always prefer it over the descriptions here. This document exists so the agent skill can reason about options without loading the full JSON on every invocation, and adds **category groupings**,