Fix: detect failed pages via error messages and gap detection union#434
Fix: detect failed pages via error messages and gap detection union#434i-Veni-Vidi-Vici wants to merge 4 commits intoopendataloader-project:mainfrom
Conversation
Docling may include failed pages as empty entries in the pages dict,
making gap-only detection miss them. Add error message parsing
("Page N: <error>") and union both strategies so neither failure mode
is lost.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
WalkthroughAdded a helper to parse Docling-style "Page N:" error lines and changed partial-success response logic to compute failed pages as the deduplicated, sorted union of pages parsed from errors and pages inferred missing by gap-detection (using requested_pages, total_pages, or min/max fallbacks). Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py (1)
168-179:⚠️ Potential issue | 🟡 MinorFix contradictory test docstring.
The test expects all requested pages to fail (
[1, 2, 3]), but the docstring says it should produce emptyfailed_pages. Please align the docstring with the assertion.Suggested docstring fix
- """json_content without 'pages' key should produce empty failed_pages.""" + """json_content without 'pages' key should mark all requested pages as failed."""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py` around lines 168 - 179, The test test_partial_success_missing_pages_key currently has a docstring stating it "should produce empty failed_pages" which contradicts the assertion expecting failed_pages == [1,2,3]; update the docstring to reflect the actual expectation (e.g., state that when json_content lacks a 'pages' key, all requested pages are considered failed and failed_pages should list the requested pages) so it matches the assertion in the test that uses build_conversion_response with requested_pages=(1, 3).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 168-179: The test test_partial_success_missing_pages_key currently
has a docstring stating it "should produce empty failed_pages" which contradicts
the assertion expecting failed_pages == [1,2,3]; update the docstring to reflect
the actual expectation (e.g., state that when json_content lacks a 'pages' key,
all requested pages are considered failed and failed_pages should list the
requested pages) so it matches the assertion in the test that uses
build_conversion_response with requested_pages=(1, 3).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 792dbea6-795e-4307-94b8-a8c8aaf64e60
📒 Files selected for processing (2)
python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.pypython/opendataloader-pdf/tests/test_hybrid_server_partial_success.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py`:
- Around line 241-255: Add a test case to explicitly cover the overlap/dedup
scenario where a page appears both as a gap (missing from the "pages" dict in
build_conversion_response) and as an error-parsed entry so the union logic
deduplicates it; update or add to the test_both_strategies_combined test (or
create a new similar test) to include a page number present in the errors list
(e.g., "Page 3: ...") while also omitting that page key from json_content so
that response["failed_pages"] asserts the page appears only once in the final
list alongside other gap-only and error-only pages, verifying the union
semantics of failed_pages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ceef4f2-e02e-46ac-b6d1-9e43eed5ef98
📒 Files selected for processing (1)
python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
|
Here's how the review process will go from here:
|
Summary
_extract_failed_pages_from_errors()to parse failed page numbers from Docling error messages ("Page N: <error>")build_conversion_response()to union both error-message parsing and gap detection strategies instead of using fallbackTest plan
_extract_failed_pages_from_errors()unit tests (std::bad_alloc, mixed formats, no-match)[2, 4, 5]Related Issue
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests