Fix: detect failed pages via error messages and gap detection union by i-Veni-Vidi-Vici · Pull Request #434 · opendataloader-project/opendataloader-pdf

i-Veni-Vidi-Vici · 2026-04-16T02:12:18Z

Summary

Add _extract_failed_pages_from_errors() to parse failed page numbers from Docling error messages ("Page N: <error>")
Change build_conversion_response() to union both error-message parsing and gap detection strategies instead of using fallback
Docling may include failed pages as empty entries in the pages dict, making gap-only detection miss them. Combining both strategies ensures neither failure mode is lost.

Test plan

_extract_failed_pages_from_errors() unit tests (std::bad_alloc, mixed formats, no-match)
Integration test: both strategies combined — gap + error parsing returns union [2, 4, 5]
Integration test: duplicate page errors deduplicated
Integration test: no "Page N:" pattern falls back to gap detection
Integration test: empty errors list still detects gaps
All 48 existing tests pass (no regression)
Benchmark: NID 0.902, TEDS 0.489, MHS 0.740 — no regression

Related Issue

Error: std::bad_alloc persists in v2.2.1 #400

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved failed-page detection during PDF conversion by combining error-message parsing of "Page N:" entries with missing-page gap detection; failures are deduplicated, sorted, and correctly reported even when page entries are empty or the top-level pages key is missing.
Tests
- Added tests covering error-string page extraction, gap detection, deduplication, boundary cases, combined failure reporting, and fallback behavior when error messages lack page markers.

Docling may include failed pages as empty entries in the pages dict, making gap-only detection miss them. Add error message parsing ("Page N: <error>") and union both strategies so neither failure mode is lost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

CLAassistant · 2026-04-16T02:12:24Z

All committers have signed the CLA.

coderabbitai · 2026-04-16T02:12:38Z

Walkthrough

Added a helper to parse Docling-style "Page N:" error lines and changed partial-success response logic to compute failed pages as the deduplicated, sorted union of pages parsed from errors and pages inferred missing by gap-detection (using requested_pages, total_pages, or min/max fallbacks).

Changes

Cohort / File(s)	Summary
Failed page detection `python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py`	Added `_extract_failed_pages_from_errors(errors: list[str]) -> list[int]`. In `build_conversion_response()` (when `status_value == "partial_success"`), compute `failed_pages` as the sorted union of page numbers parsed from `errors` and gap-detected missing pages (inferred from `requested_pages`, `total_pages`, or min/max over `pages`), replacing the prior single-source gap-only computation.
Tests for parsing & union behavior `python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`	Imported the new helper and expanded tests: parsing `Page N:` lines (including non-matching lines and duplicates), treating all requested pages as failed when top-level `pages` key is missing, detecting gaps when `pages` contains empty placeholders, combining error-parsed failures with gap-detected failures (including overlap), and fallback behavior when errors lack the `Page N:` pattern or when `errors` is empty.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

MaximPlusov
LonelyMidoriya
hyunhee-jo
bundolee

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding dual-strategy failed page detection via error messages and gap detection.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py (1)

168-179: ⚠️ Potential issue | 🟡 Minor

Fix contradictory test docstring.

The test expects all requested pages to fail ([1, 2, 3]), but the docstring says it should produce empty failed_pages. Please align the docstring with the assertion.

Suggested docstring fix

-    """json_content without 'pages' key should produce empty failed_pages."""
+    """json_content without 'pages' key should mark all requested pages as failed."""

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py` around
lines 168 - 179, The test test_partial_success_missing_pages_key currently has a
docstring stating it "should produce empty failed_pages" which contradicts the
assertion expecting failed_pages == [1,2,3]; update the docstring to reflect the
actual expectation (e.g., state that when json_content lacks a 'pages' key, all
requested pages are considered failed and failed_pages should list the requested
pages) so it matches the assertion in the test that uses
build_conversion_response with requested_pages=(1, 3).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 168-179: The test test_partial_success_missing_pages_key currently
has a docstring stating it "should produce empty failed_pages" which contradicts
the assertion expecting failed_pages == [1,2,3]; update the docstring to reflect
the actual expectation (e.g., state that when json_content lacks a 'pages' key,
all requested pages are considered failed and failed_pages should list the
requested pages) so it matches the assertion in the test that uses
build_conversion_response with requested_pages=(1, 3).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 792dbea6-795e-4307-94b8-a8c8aaf64e60

📥 Commits

Reviewing files that changed from the base of the PR and between 261aeea and 7d88b7b.

📒 Files selected for processing (2)

python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py
python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 241-255: Add a test case to explicitly cover the overlap/dedup
scenario where a page appears both as a gap (missing from the "pages" dict in
build_conversion_response) and as an error-parsed entry so the union logic
deduplicates it; update or add to the test_both_strategies_combined test (or
create a new similar test) to include a page number present in the errors list
(e.g., "Page 3: ...") while also omitting that page key from json_content so
that response["failed_pages"] asserts the page appears only once in the final
list alongside other gap-only and error-only pages, verifying the union
semantics of failed_pages.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ceef4f2-e02e-46ac-b6d1-9e43eed5ef98

📥 Commits

Reviewing files that changed from the base of the PR and between 7d88b7b and 8023c44.

📒 Files selected for processing (1)

python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

bundolee · 2026-04-17T13:37:05Z

Here's how the review process will go from here:

CodeRabbit — all review threads resolved with a fix commit
CI — all checks pass (codecov excluded)
Component owner approval
- Java (java/) → @MaximPlusov
- Others → @bundolee

i-Veni-Vidi-Vici requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 16, 2026 02:12

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

Fix: correct misleading docstring in missing pages key test

8023c44

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

Comment thread python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

i-Veni-Vidi-Vici and others added 2 commits April 16, 2026 11:29

Add: overlap dedup test for gap and error parsing union

d82dc27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

Merge upstream/main: resolve conflict keeping both functions

9913754

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: detect failed pages via error messages and gap detection union#434

Fix: detect failed pages via error messages and gap detection union#434
i-Veni-Vidi-Vici wants to merge 4 commits intoopendataloader-project:mainfrom
i-Veni-Vidi-Vici:fix/failed-page-detection

i-Veni-Vidi-Vici commented Apr 16, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Apr 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

bundolee commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

i-Veni-Vidi-Vici commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Related Issue

Summary by CodeRabbit

Uh oh!

CLAassistant commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bundolee commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

i-Veni-Vidi-Vici commented Apr 16, 2026 •

edited

Loading

CLAassistant commented Apr 16, 2026 •

edited

Loading

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading