Skip to content

Fix: detect failed pages via error messages and gap detection union#434

Open
i-Veni-Vidi-Vici wants to merge 4 commits intoopendataloader-project:mainfrom
i-Veni-Vidi-Vici:fix/failed-page-detection
Open

Fix: detect failed pages via error messages and gap detection union#434
i-Veni-Vidi-Vici wants to merge 4 commits intoopendataloader-project:mainfrom
i-Veni-Vidi-Vici:fix/failed-page-detection

Conversation

@i-Veni-Vidi-Vici
Copy link
Copy Markdown

@i-Veni-Vidi-Vici i-Veni-Vidi-Vici commented Apr 16, 2026

Summary

  • Add _extract_failed_pages_from_errors() to parse failed page numbers from Docling error messages ("Page N: <error>")
  • Change build_conversion_response() to union both error-message parsing and gap detection strategies instead of using fallback
  • Docling may include failed pages as empty entries in the pages dict, making gap-only detection miss them. Combining both strategies ensures neither failure mode is lost.

Test plan

  • _extract_failed_pages_from_errors() unit tests (std::bad_alloc, mixed formats, no-match)
  • Integration test: both strategies combined — gap + error parsing returns union [2, 4, 5]
  • Integration test: duplicate page errors deduplicated
  • Integration test: no "Page N:" pattern falls back to gap detection
  • Integration test: empty errors list still detects gaps
  • All 48 existing tests pass (no regression)
  • Benchmark: NID 0.902, TEDS 0.489, MHS 0.740 — no regression

Related Issue

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved failed-page detection during PDF conversion by combining error-message parsing of "Page N:" entries with missing-page gap detection; failures are deduplicated, sorted, and correctly reported even when page entries are empty or the top-level pages key is missing.
  • Tests

    • Added tests covering error-string page extraction, gap detection, deduplication, boundary cases, combined failure reporting, and fallback behavior when error messages lack page markers.

Docling may include failed pages as empty entries in the pages dict,
making gap-only detection miss them. Add error message parsing
("Page N: <error>") and union both strategies so neither failure mode
is lost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 16, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Walkthrough

Added a helper to parse Docling-style "Page N:" error lines and changed partial-success response logic to compute failed pages as the deduplicated, sorted union of pages parsed from errors and pages inferred missing by gap-detection (using requested_pages, total_pages, or min/max fallbacks).

Changes

Cohort / File(s) Summary
Failed page detection
python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py
Added _extract_failed_pages_from_errors(errors: list[str]) -> list[int]. In build_conversion_response() (when status_value == "partial_success"), compute failed_pages as the sorted union of page numbers parsed from errors and gap-detected missing pages (inferred from requested_pages, total_pages, or min/max over pages), replacing the prior single-source gap-only computation.
Tests for parsing & union behavior
python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py
Imported the new helper and expanded tests: parsing Page N: lines (including non-matching lines and duplicates), treating all requested pages as failed when top-level pages key is missing, detecting gaps when pages contains empty placeholders, combining error-parsed failures with gap-detected failures (including overlap), and fallback behavior when errors lack the Page N: pattern or when errors is empty.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hyunhee-jo
  • bundolee
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding dual-strategy failed page detection via error messages and gap detection.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py (1)

168-179: ⚠️ Potential issue | 🟡 Minor

Fix contradictory test docstring.

The test expects all requested pages to fail ([1, 2, 3]), but the docstring says it should produce empty failed_pages. Please align the docstring with the assertion.

Suggested docstring fix
-    """json_content without 'pages' key should produce empty failed_pages."""
+    """json_content without 'pages' key should mark all requested pages as failed."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py` around
lines 168 - 179, The test test_partial_success_missing_pages_key currently has a
docstring stating it "should produce empty failed_pages" which contradicts the
assertion expecting failed_pages == [1,2,3]; update the docstring to reflect the
actual expectation (e.g., state that when json_content lacks a 'pages' key, all
requested pages are considered failed and failed_pages should list the requested
pages) so it matches the assertion in the test that uses
build_conversion_response with requested_pages=(1, 3).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 168-179: The test test_partial_success_missing_pages_key currently
has a docstring stating it "should produce empty failed_pages" which contradicts
the assertion expecting failed_pages == [1,2,3]; update the docstring to reflect
the actual expectation (e.g., state that when json_content lacks a 'pages' key,
all requested pages are considered failed and failed_pages should list the
requested pages) so it matches the assertion in the test that uses
build_conversion_response with requested_pages=(1, 3).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 792dbea6-795e-4307-94b8-a8c8aaf64e60

📥 Commits

Reviewing files that changed from the base of the PR and between 261aeea and 7d88b7b.

📒 Files selected for processing (2)
  • python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py
  • python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 241-255: Add a test case to explicitly cover the overlap/dedup
scenario where a page appears both as a gap (missing from the "pages" dict in
build_conversion_response) and as an error-parsed entry so the union logic
deduplicates it; update or add to the test_both_strategies_combined test (or
create a new similar test) to include a page number present in the errors list
(e.g., "Page 3: ...") while also omitting that page key from json_content so
that response["failed_pages"] asserts the page appears only once in the final
list alongside other gap-only and error-only pages, verifying the union
semantics of failed_pages.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ceef4f2-e02e-46ac-b6d1-9e43eed5ef98

📥 Commits

Reviewing files that changed from the base of the PR and between 7d88b7b and 8023c44.

📒 Files selected for processing (1)
  • python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

i-Veni-Vidi-Vici and others added 2 commits April 16, 2026 11:29
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
@bundolee
Copy link
Copy Markdown
Contributor

Here's how the review process will go from here:

  1. CodeRabbit — all review threads resolved with a fix commit
  2. CI — all checks pass (codecov excluded)
  3. Component owner approval

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants