Bug
Description:
When using the hybrid="docling-fast" mode, the generated Markdown loses its hierarchical depth. Sub-sections that should be H3 or H4 are promoted to H1.
Example Comparison:
Standard Mode Output: #### 3.2 Document Processing Service (Correct depth)
Hybrid Mode Output: # 3.2 Document Processing Service (Flattened to Top-Level)
This is a known core limitation of Docling’s layout analysis, which tends to label visually prominent text as a top-level heading regardless of document depth.
Requested Fix:
Please implement a post-processing step to reconstruct the hierarchy. Suggested approaches:
1. Integrate docling-hierarchical-pdf Logic: Leverage the logic from the docling-hierarchical-pdf package, which uses font sizes and numerical indices (e.g., 1.1, 1.1.1) to restore proper depth.
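2. Numbering Heuristic: If a heading starts with a section number such as "X.X", automatically demote it to H2 or lower, with the depth following the number of components (e.g., "1.1" becomes H2, "1.1.1" becomes H3).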
3. Global Level Offset: Provide a parameter (e.g., heading_offset=2) to allow users to manually shift all detected headings down by a fixed amount.
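To make approaches 2 and 3 concrete, here is a minimal sketch of a post-processing pass over the generated Markdown. It is not Docling's or docling-hierarchical-pdf's actual code: the function name fix_heading_levels is hypothetical, heading_offset mirrors the parameter suggested above, and the mapping "number of components in the section number = heading level" is an assumption that may need tuning per document.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
# Leading section number such as "3", "3.2" or "3.2.1" (optional trailing dot).
NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")


def fix_heading_levels(markdown: str, heading_offset: int = 0) -> str:
    """Hypothetical helper: re-derive heading depth from section numbers.

    Assumption: "3" -> level 1, "3.2" -> level 2, "3.2.1" -> level 3.
    Headings without a leading number keep their detected level.
    heading_offset is then added to every heading, clamped to H1..H6.
    """
    out = []
    in_fence = False
    for line in markdown.splitlines():
        # Do not touch '#' lines inside fenced code blocks.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match is None:
            out.append(line)
            continue
        hashes, title = match.groups()
        number = NUMBER_RE.match(title)
        if number:
            level = number.group(1).count(".") + 1
        else:
            level = len(hashes)
        level = max(1, min(6, level + heading_offset))
        out.append("#" * level + " " + title)
    return "\n".join(out)


if __name__ == "__main__":
    flattened = "# 3.2 Document Processing Service"
    print(fix_heading_levels(flattened, heading_offset=2))
```

With heading_offset=2, the flattened heading from the example above comes back as H4. One known weakness of this sketch: a title that merely begins with a number (for example a year) is treated as a section number, so a real implementation would likely combine this with font-size information as in approach 1.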
Why: Flat hierarchies break RAG chunking logic, making it impossible for the LLM to distinguish between a "Section Title" and a "Document Title."