Description
When parsing PDF documents, if specific pages fail to parse (e.g., due to exceptions caught in the pipeline) and are excluded from the doc.pages list, the export functions (export_to_doctags, export_to_html, export_to_markdown) do not generate page break markers for these gaps. This causes page count mismatch issues in downstream processing.
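The gap described above can be detected before exporting. The following is a minimal sketch, assuming `doc.pages` is keyed by 1-based page number and that the true page count of the source PDF is known to the caller. `find_missing_pages` and `total_pages` are hypothetical names used only for illustration; they are not part of docling.

```python
from typing import Iterable, List


def find_missing_pages(parsed_pages: Iterable[int], total_pages: int) -> List[int]:
    """Return the 1-based page numbers in 1..total_pages absent from parsed_pages."""
    present = set(parsed_pages)
    return [n for n in range(1, total_pages + 1) if n not in present]


# Hypothetical usage with a converted document (assumes doc.pages is a dict
# keyed by page number):
#   missing = find_missing_pages(doc.pages.keys(), total_pages=5)
print(find_missing_pages([1, 2, 4, 5], total_pages=5))  # prints [3]
```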
Steps to Reproduce
Attachments
test.pdf
test.pdf - A minimal reproduction file derived from a larger document. The content has been intentionally corrupted for confidentiality, but the parsing error still reproduces.
* Structure: Consists of 4 pages (Pages 78, 79, 83, 84).
* Behavior: Parsing succeeds for pages 78 and 84, but fails for pages 79 and 83.
1. Convert the attached test.pdf using DocumentConverter.
2. Export using export_to_doctags(), export_to_markdown(page_break_placeholder=str), and export_to_html(split_page_view=True).
3. Count the page break markers in each output.
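Reproduction Script:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumption: the original report does not show these two definitions.
source = "test.pdf"  # path to the attached file
pipeline_options = PdfPipelineOptions()  # default options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        ),
    },
)
doc = converter.convert(source).document

# Export to all three formats and inspect the page break markers.
doctags_output = doc.export_to_doctags()
markdown_output = doc.export_to_markdown(page_break_placeholder="===PAGE_BREAK===")
html_output = doc.export_to_html(split_page_view=True)
print(doctags_output)
print(markdown_output)
print(html_output)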
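Step 3 can be automated instead of counting by eye. This is a rough sketch that assumes the marker strings match the outputs shown in this report (`<page_break>` for DocTags, the chosen placeholder for Markdown, and one `<div class="page">` per page in the split-page HTML). `count_page_markers` is a hypothetical helper, not a docling function.

```python
from typing import Dict


def count_page_markers(
    doctags: str,
    markdown: str,
    html: str,
    placeholder: str = "===PAGE_BREAK===",
) -> Dict[str, int]:
    """Count page separators (DocTags, Markdown) and page containers (HTML)."""
    return {
        "doctags_breaks": doctags.count("<page_break>"),
        "markdown_breaks": markdown.count(placeholder),
        "html_pages": html.count('<div class="page">'),
    }


# Tiny stand-in strings shaped like the outputs in this report.
sample = count_page_markers(
    doctags="a<page_break>b",
    markdown="a\n===PAGE_BREAK===\nb",
    html='<td><div class="page">a</div></td><td><div class="page">b</div></td>',
)
print(sample)
```

With the script above, the call would be `count_page_markers(doctags_output, markdown_output, html_output)`. For an N-page document one would expect N - 1 separators in DocTags and Markdown and N page containers in HTML.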
Expected Behavior
- Export: the export functions should generate page break markers for all pages, including failed ones, so that page numbering remains consistent.
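To make the expected behavior concrete, here is a small sketch of how a Markdown-style export could keep page numbering consistent. It is not docling's implementation; `join_pages_with_breaks`, `page_texts`, and `total_pages` are hypothetical names, and a failed page is represented simply as an empty chunk.

```python
from typing import Dict


def join_pages_with_breaks(
    page_texts: Dict[int, str],
    total_pages: int,
    placeholder: str = "===PAGE_BREAK===",
) -> str:
    """Join per-page text, emitting a break for every page, parsed or not."""
    # Missing (failed) pages contribute an empty chunk instead of being skipped.
    parts = [page_texts.get(n, "") for n in range(1, total_pages + 1)]
    return f"\n{placeholder}\n".join(parts)


# Page 3 failed to parse, so it has no entry in the mapping.
out = join_pages_with_breaks({1: "p1", 2: "p2", 4: "p4", 5: "p5"}, total_pages=5)
print(out.count("===PAGE_BREAK==="))  # prints 4
```

Splitting the result on the placeholder then yields five chunks, with the third one empty, so chunk index and page number stay aligned.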
Actual Behavior
Page breaks are missing for failed pages.
For example, if pages 1, 2, 4, and 5 succeed but page 3 fails, the output currently looks like this:
export_to_doctags():
... content ... 1page
<page_break>
... content ... 2page
<page_break>
... content ... 4page
<page_break>
... content ... 5page

export_to_markdown(page_break_placeholder="===PAGE_BREAK==="):
... content ... 1page
===PAGE_BREAK===
... content ... 2page
===PAGE_BREAK===
... content ... 4page
===PAGE_BREAK===
... content ... 5page

export_to_html(split_page_view=True):
<td>
  <div class="page">
    ... content ... 1page
  </div>
</td>
<td>
  <div class="page">
    ... content ... 2page
  </div>
</td>
<td>
  <div class="page">
    ... content ... 4page
  </div>
</td>
<td>
  <div class="page">
    ... content ... 5page
  </div>
</td>
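Until the exporters emit breaks for failed pages, downstream code can realign chunks with their real page numbers. The sketch below is one possible workaround, assuming each successfully parsed page contributes exactly one chunk to the Markdown output and that the parsed page numbers are available (for example from the keys of `doc.pages`). `chunks_by_page` is a hypothetical helper.

```python
from typing import Dict, Iterable


def chunks_by_page(
    markdown: str,
    parsed_pages: Iterable[int],
    placeholder: str = "===PAGE_BREAK===",
) -> Dict[int, str]:
    """Map each Markdown chunk to the page number it actually came from."""
    chunks = [c.strip() for c in markdown.split(placeholder)]
    pages = sorted(parsed_pages)
    if len(chunks) != len(pages):
        # The one-chunk-per-parsed-page assumption does not hold; refuse to guess.
        raise ValueError(f"{len(chunks)} chunks but {len(pages)} parsed pages")
    return dict(zip(pages, chunks))


md = "a\n===PAGE_BREAK===\nb\n===PAGE_BREAK===\nd\n===PAGE_BREAK===\ne"
print(chunks_by_page(md, [1, 2, 4, 5]))  # prints {1: 'a', 2: 'b', 4: 'd', 5: 'e'}
```

The length check is deliberate: if a parsed page produces no chunk, or more than one, the mapping would silently shift, which is the same mismatch this issue describes.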
Environment
| Component | Version / Details |
|---|---|
| docling version | 2.31.1 |
| docling-core version | 2.31.0 |
| Python | 3.11 |
| OS | macOS |