Hybrid mode flattens heading levels (everything becomes H1) #441

@sadhikariSteep

Bug

Description:

When using hybrid="docling-fast" mode, the generated Markdown loses its heading hierarchy: sub-sections that should be H3 or H4 are emitted as H1.

Example Comparison:

Standard Mode Output: `#### 3.2 Document Processing Service` (correct depth)

Hybrid Mode Output: `# 3.2 Document Processing Service` (flattened to top level)

This is a known core limitation of Docling’s layout analysis, which tends to label visually prominent text as a top-level heading regardless of document depth.

Requested Fix:

Please implement a post-processing step to reconstruct the hierarchy. Suggested approaches:

1. Integrate docling-hierarchical-pdf Logic: Leverage the logic from the docling-hierarchical-pdf package, which uses font sizes and numerical indices (e.g., 1.1, 1.1.1) to restore proper depth.

2. Numbering Heuristic: If a heading starts with a dotted section number (e.g., "X.X"), automatically demote it to H2 or lower, with the depth taken from the number of components in that number.

3. Global Level Offset: Provide a parameter (e.g., heading_offset=2) to allow users to manually shift all detected headings down by a fixed amount.
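
To make approaches 2 and 3 concrete, here is a minimal sketch of a post-processing pass over the generated Markdown. It is only an illustration under stated assumptions: `restore_heading_levels` is a hypothetical helper and not part of any existing API, it only handles ATX headings (`#` prefixes), and it assumes the depth of a numbered heading equals the number of components in its section number ("3.2" has two, so H2). The `heading_offset` parameter mirrors the name suggested above.

```python
import re

# Hypothetical helper for illustration; not part of any existing library API.
_HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*)$")
# Leading section number such as "3", "3.2", "3.2.1" or "1." followed by whitespace.
_NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")


def restore_heading_levels(markdown: str, heading_offset: int = 0) -> str:
    """Re-level ATX headings from their section numbers, then apply an offset.

    Assumption: a heading numbered "3.2" belongs at depth 2, "3.2.1" at depth 3.
    Headings without a leading number keep their detected level.
    """
    out = []
    in_fence = False
    for line in markdown.splitlines():
        # Leave fenced code blocks untouched so '#' comments are not rewritten.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            out.append(line)
            continue
        match = None if in_fence else _HEADING_RE.match(line)
        if match is None:
            out.append(line)
            continue
        hashes, title = match.groups()
        number = _NUMBER_RE.match(title)
        if number:
            # Numbering heuristic: depth = count of dot-separated components.
            level = number.group(1).count(".") + 1
        else:
            level = len(hashes)
        # Global offset, clamped to the valid Markdown range H1..H6.
        level = min(6, max(1, level + heading_offset))
        out.append("#" * level + " " + title)
    return "\n".join(out)
```

With `heading_offset=2`, the flattened line `# 3.2 Document Processing Service` comes out as `#### 3.2 Document Processing Service`, which matches the standard-mode output in the example above. A font-size-based approach (as in option 1) would likely be more robust for unnumbered headings, which this sketch leaves at their detected level.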

Why: Flat hierarchies break RAG chunking logic, making it impossible for the LLM to distinguish between a "Section Title" and a "Document Title."
