opendataloader-project
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
Lines changed: 20 additions & 0 deletions
diff --git a/.github/workflows/skill-drift-check.yml b/.github/workflows/skill-drift-check.yml
Lines changed: 34 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 12 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 14 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 25 additions & 0 deletions
diff --git a/skills/README.md b/skills/README.md
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,20 @@
+{
+  "name": "opendataloader-pdf",
+  "owner": {
+    "name": "OpenDataLoader Project"
+  },
+  "metadata": {
+    "description": "AI-powered PDF extraction guidance and automation",
+    "version": "1.0.0"
+  },
+  "plugins": [
+    {
+      "name": "odl-pdf-skills",
+      "description": "Expert guidance for opendataloader-pdf — environment detection, option recommendations, hybrid mode setup, quality diagnostics, and direct conversion execution",
+      "source": "./",
+      "skills": [
+        "./skills/odl-pdf"
+      ]
+    }
+  ]
+}
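The manifest above lists each skill as a relative path in a plugin's `skills` array. As a rough illustration of registering another skill, the Python sketch below appends a path to a named plugin without duplicating it. The helper name `register_skill` and the path `./skills/my-skill` are hypothetical and not part of this change; only the `plugins[].name` and `plugins[].skills` fields are taken from the manifest.

```python
import copy
import json


def register_skill(manifest: dict, plugin_name: str, skill_path: str) -> dict:
    """Return a copy of the manifest with skill_path added to the named plugin.

    Hypothetical helper for illustration; the project does not ship it.
    """
    updated = copy.deepcopy(manifest)  # leave the caller's dict untouched
    for plugin in updated.get("plugins", []):
        if plugin.get("name") == plugin_name:
            skills = plugin.setdefault("skills", [])
            if skill_path not in skills:  # keep registration idempotent
                skills.append(skill_path)
            return updated
    raise KeyError(f"no plugin named {plugin_name!r} in manifest")


if __name__ == "__main__":
    manifest = {
        "name": "opendataloader-pdf",
        "plugins": [
            {"name": "odl-pdf-skills", "source": "./", "skills": ["./skills/odl-pdf"]}
        ],
    }
    updated = register_skill(manifest, "odl-pdf-skills", "./skills/my-skill")
    print(json.dumps(updated, indent=2))
```

Returning a copy rather than mutating in place makes it easy to compare the manifest before and after writing it back to disk.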
@@ -0,0 +1,34 @@
+# skill-drift-check.yml
+# Ensures skill references stay in sync with options.json when CLI options change.
+# Runs sync-skill-refs.py and fails the check if drift is detected (exit code 1).
+
+name: Skill Drift Check
+
+on:
+  push:
+    paths:
+      - 'options.json'
+  pull_request:
+    paths:
+      - 'options.json'
+  workflow_dispatch:
+
+jobs:
+  check-drift:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Check skill drift
+        run: |
+          # GitHub Actions runs this with 'bash -e', so test the command in the 'if' itself.
+          if ! python skills/odl-pdf/scripts/sync-skill-refs.py; then
+            echo ""
+            echo "Drift detected: skill references are out of sync with options.json."
+            echo "Run 'python skills/odl-pdf/scripts/sync-skill-refs.py --fix' locally to update them."
+            exit 1
+          fi
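The body of `sync-skill-refs.py` is not shown in this diff. As a minimal sketch of the kind of comparison the workflow comments describe, the Python below computes which option names appear in one file but not the other. It assumes that `options.json` has the shape `{"options": [{"name": "..."}]}` and that `options-matrix.md` mentions each option as a backticked `--name`. Both shapes are guesses, and `find_drift` and `example-option` are hypothetical names.

```python
import json
import re


def find_drift(options_json_text: str, matrix_md_text: str) -> tuple[set[str], set[str]]:
    """Return (missing_from_matrix, unknown_in_matrix) as sets of option names.

    Assumed shapes (not confirmed by the diff): options.json is
    {"options": [{"name": "..."}]} and the matrix writes options as `--name`.
    """
    declared = {opt["name"] for opt in json.loads(options_json_text)["options"]}
    documented = set(re.findall(r"`--([a-z0-9][a-z0-9-]*)`", matrix_md_text))
    return declared - documented, documented - declared


if __name__ == "__main__":
    options = '{"options": [{"name": "filter-hidden-text"}, {"name": "example-option"}]}'
    matrix = "| `--filter-hidden-text` | boolean | off by default |\n"
    missing, unknown = find_drift(options, matrix)
    print("missing from matrix:", sorted(missing))
    print("not in options.json:", sorted(unknown))
```

A CI wrapper around a function like this would exit with a non-zero status whenever either set is non-empty, which is what makes the workflow step fail.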
@@ -76,3 +76,4 @@ logs/
 .claude/settings.local.json
 .claude/plans/
 
+skills/odl-pdf/scripts/__pycache__/
@@ -21,3 +21,15 @@ Hidden text detection (`--filter-hidden-text`) is **off by default** — it requ
 - `./scripts/bench.sh --check-regression` — CI mode with threshold check
 - Benchmark code lives in [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench)
 - Metrics: **NID** (reading order), **TEDS** (table structure), **MHS** (heading structure), **Table Detection F1**, **Speed**
+
+## Agent Skills
+
+`skills/odl-pdf/` contains the public agent skill shipped with this project.
+
+When adding or changing CLI options in Java:
+1. Run `npm run sync` (regenerates options.json + Python/Node bindings)
+2. Update `skills/odl-pdf/references/options-matrix.md` with the new option
+3. CI (`skill-drift-check.yml`) fails if step 2 is missed
+
+The skill is written in English for external users. Do not include internal
+team terminology or company-specific policies.
@@ -134,5 +134,19 @@ git commit -s -m "your message"
 
 Make sure your Git config contains your real name and email.
 
+## Agent Skills Maintenance
+
+This project ships a built-in agent skill at `skills/odl-pdf/`. When you add
+or modify CLI options:
+
+1. Run `npm run sync` as usual
+2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option
+   to the appropriate category with its type, default, and description
+3. If the new option has interaction rules with existing options (e.g., requires
+   another option to be set), document the rule in the "Interaction Rules" section
+
+The CI workflow `skill-drift-check.yml` will flag any mismatch between
+`options.json` and `options-matrix.md`.
+
 Thank you again for helping us improve this project! 🙌
 If you have any questions, open an issue or join the discussion.
@@ -451,6 +451,31 @@ Existing PDFs (untagged)
 
 [PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
 
+## Agent Skills
+
+Your AI coding agent knows how to use opendataloader-pdf — optimal options,
+hybrid mode setup, and quality diagnostics, all handled automatically.
+
+Works with **Claude Code**, **Codex**, **Gemini CLI**, **Cursor**, **VS Code**, and 26+ platforms via the [agentskills.io](https://agentskills.io) spec.
+
+### What the Skill Does
+
+| Phase | Description |
+|-------|-------------|
+| **Discover** | Detects your OS, Java, Python, Node.js, and ODL installation |
+| **Prescribe** | Recommends optimal install method, options, format, and mode |
+| **Execute** | Generates ready-to-run commands or runs conversions directly |
+| **Diagnose** | Identifies quality issues and escalates (local → cluster → hybrid) |
+| **Optimize** | Tunes batch processing, RAG integration, and performance |
+
+### Install
+
+```bash
+npx skills add opendataloader-project/opendataloader-pdf --skill odl-pdf
+```
+
+Or use the `/odl-pdf` slash command in Claude Code after installing the plugin.
+
 ## Roadmap
 
 | Feature | Timeline | Tier |
 
@@ -0,0 +1,166 @@
+# Agent Skills
+
+opendataloader-pdf ships built-in agent skills that help AI coding assistants use this project effectively. Skills follow the [agentskills.io](https://agentskills.io) specification and work with Claude Code, Codex, Gemini CLI, Cursor, VS Code, and 26+ platforms.
+
+## Directory Structure
+
+```
+skills/
+├── README.md                          ← You are here
+└── odl-pdf/                           ← One skill per directory
+    ├── SKILL.md                       ← Main skill file (loaded when activated)
+    ├── references/                    ← Deep-dive docs (loaded on demand)
+    │   ├── options-matrix.md
+    │   ├── hybrid-guide.md
+    │   ├── format-guide.md
+    │   ├── installation-matrix.md
+    │   └── eval-metrics.md
+    ├── scripts/                       ← Executable helpers
+    │   ├── detect-env.sh
+    │   ├── hybrid-health.sh
+    │   ├── quick-eval.py
+    │   └── sync-skill-refs.py
+    └── evals/                         ← Quality test cases
+        └── evals.json
+```
+
+## How Skills Work
+
+### Progressive Disclosure (3 Levels)
+
+| Level | Content | When Loaded |
+|-------|---------|-------------|
+| **L1** | `description` field in SKILL.md frontmatter (~100 words) | Always visible to skill router |
+| **L2** | SKILL.md body (~400 lines) — persona, workflows, decision trees, gotchas | When skill is activated |
+| **L3** | `references/*` files — detailed option matrices, guides, metrics | When the user enters that topic |
+
+This design minimizes token usage. The AI agent only loads what it needs for the current task.
+
+### Dual-Path Option Reference
+
+Skills must work for **both** source-code users and pip-install users:
+
+- **Built-in summaries** (`references/options-matrix.md`): Always available, even without source code
+- **Dynamic reference** (`options.json`): Authoritative source when the source repo is available
+
+SKILL.md instructs the AI: "If `options.json` exists in this project, it is the source of truth. Options in `options.json` not found in `options-matrix.md` are newly added."
+
+## Creating a New Skill
+
+### 1. Create the Directory
+
+```
+skills/my-skill/
+├── SKILL.md
+├── references/       (optional)
+├── scripts/          (optional)
+└── evals/            (optional)
+```
+
+### 2. Write SKILL.md
+
+The SKILL.md file has two parts:
+
+**Frontmatter** (YAML between `---` markers):
+
+```yaml
+---
+name: my-skill
+description: >
+  One paragraph (~100 words) explaining what this skill does.
+  Include trigger keywords so the skill router knows when to activate.
+  Include "Do NOT use for:" to prevent false activations.
+---
+```
+
+**Body** (Markdown):
+
+- Define a persona (who the AI becomes when this skill is active)
+- Define a workflow (numbered phases the AI follows)
+- Include decision trees for common choices
+- List critical gotchas the AI must always warn about
+- Reference deeper docs with: "See `references/filename.md` for details"
+
+### 3. Write Evals
+
+Create `evals/evals.json` with test scenarios:
+
+```json
+{
+  "version": "1.0",
+  "skill": "my-skill",
+  "evals": [
+    {
+      "id": "eval-001",
+      "scenario": "Description of the user's situation",
+      "user_input": "What the user says",
+      "expected_recommendations": ["What the AI should recommend"],
+      "must_mention": ["Required terms in the response"],
+      "must_not_mention": ["Forbidden terms"]
+    }
+  ]
+}
+```
+
+### 4. Register in marketplace.json
+
+Add your skill to `.claude-plugin/marketplace.json`:
+
+```json
+{
+  "plugins": [{
+    "skills": ["./skills/odl-pdf", "./skills/my-skill"]
+  }]
+}
+```
+
+### 5. Test
+
+Test by spawning an AI agent that knows nothing about the project, loading only your SKILL.md, and asking it the eval scenarios. All `must_mention` terms should appear; no `must_not_mention` terms should appear.
+
+## Modifying the Existing Skill
+
+### When CLI Options Change
+
+1. Run `npm run sync` (regenerates `options.json`)
+2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option to the appropriate category
+3. If the option has interaction rules, document them in the "Interaction Rules" section
+4. CI (`skill-drift-check.yml`) will catch any mismatch you miss
+
+### When Adding a New Hybrid Backend
+
+1. Update `skills/odl-pdf/references/hybrid-guide.md` — add to the Backend Registry table
+2. SKILL.md's decision tree says "check `options.json` for allowed hybrid values" — new backends are auto-discovered
+
+### When Adding a New Output Format
+
+1. Update `skills/odl-pdf/references/format-guide.md` — add to the format table with downstream use mapping
+2. The format list in `options.json` is auto-discovered by the skill
+
+## CI Integration
+
+### Drift Check (`skill-drift-check.yml`)
+
+Runs automatically when `options.json` changes. Compares option names in `options.json` against `options-matrix.md` and fails if they diverge.
+
+Run manually:
+
+```bash
+python skills/odl-pdf/scripts/sync-skill-refs.py
+```
+
+## Writing Guidelines
+
+- **Language**: English only (external open-source users)
+- **No internal terminology**: No company names, team names, or internal tool references
+- **Tone**: Senior engineer pair-programming — diagnose first, prescribe later
+- **Java guidance**: Always mention Java 11+ requirement. Never recommend specific JDK distributions or download links.
+- **Gotchas**: Only include gotchas that affect external users. Internal development gotchas belong in CLAUDE.md.
+
+## References
+
+- [agentskills.io specification](https://agentskills.io) — Multi-agent skill format standard
+- [Claude Code Skills](https://docs.anthropic.com/en/docs/claude-code) — Claude Code skill documentation
+- `.claude-plugin/marketplace.json` — Plugin registration for this project
+- `CLAUDE.md` — Internal development notes (not for the skill)
+- `CONTRIBUTING.md` — Contributor guidelines including skill maintenance
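The "Test" step in `skills/README.md` states that all `must_mention` terms should appear in the agent's response and no `must_not_mention` terms should. A minimal Python sketch of that check follows, using the field names from the `evals.json` example. Case-insensitive substring matching is an assumption, and `check_response` and the sample eval case are hypothetical; the project may score responses differently.

```python
def check_response(eval_case: dict, response: str) -> list[str]:
    """Return human-readable failures; an empty list means the case passed."""
    text = response.lower()  # assumption: case-insensitive substring matching
    failures = []
    for term in eval_case.get("must_mention", []):
        if term.lower() not in text:
            failures.append(f"{eval_case['id']}: missing required term {term!r}")
    for term in eval_case.get("must_not_mention", []):
        if term.lower() in text:
            failures.append(f"{eval_case['id']}: contains forbidden term {term!r}")
    return failures


if __name__ == "__main__":
    # Hypothetical eval case, shaped like the evals.json example.
    case = {
        "id": "eval-001",
        "must_mention": ["Java 11"],
        "must_not_mention": ["download link"],
    }
    print(check_response(case, "Install Java 11 or newer first."))  # prints []
```

Collecting failures in a list instead of stopping at the first one lets a single run report every term that a response missed or wrongly included.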