textsplitter: add MarkdownHeaderTextSplitter for hierarchical markdown parsing #1488

Open
kosuriabhishek767 wants to merge 2 commits into tmc:main from kosuriabhishek767:feat/markdown-header-splitter

PR Checklist (All Items Completed ✓)

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages.
    • Title: textsplitter: add MarkdownHeaderTextSplitter for hierarchical markdown parsing
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
    • Verified: No existing PR implements header-based markdown splitting with hierarchical metadata
  • Provide a description in this PR that addresses what the PR is solving.
    • See detailed description below
  • Describes the source of new concepts.
    • Inspired by LlamaIndex's MarkdownNodeParser
  • References existing implementations as appropriate.
    • References LlamaIndex implementation in code comments and PR description
  • Contains test coverage for new functions.
    • 21 comprehensive tests covering all functionality and edge cases
  • Passes all golangci-lint checks.
    • Code follows project style conventions, minimal comments, proper error handling

PR Description

Summary

Adds MarkdownHeaderTextSplitter for structure-aware markdown parsing with hierarchical metadata. Enables RAG systems to filter search results by document sections, improving retrieval precision for large technical documentation.

Motivation

When working with large markdown documents (technical manuals, API documentation, knowledge bases), users need to retrieve information from specific sections. Current text splitters don't carry the markdown header hierarchy into chunk metadata, so search results cannot be filtered by section.

Use Case:

// User query: "How do I configure authentication?"
// With header metadata, filter to only search within:
// "/Getting Started/Configuration/Authentication/"
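
A minimal sketch of how such a filter might look on the caller's side, assuming each chunk stores its path as a string under the `header_path` metadata key. `Doc` and `filterByHeaderPath` are hypothetical names used only for illustration; `Doc` stands in for a document type with content and a metadata map, and neither name is part of this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Doc is a hypothetical stand-in for a chunk with a metadata map.
type Doc struct {
	PageContent string
	Metadata    map[string]any
}

// filterByHeaderPath keeps only the docs whose "header_path" metadata
// is a string that starts with the given prefix.
func filterByHeaderPath(docs []Doc, prefix string) []Doc {
	var out []Doc
	for _, d := range docs {
		path, ok := d.Metadata["header_path"].(string)
		if ok && strings.HasPrefix(path, prefix) {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	docs := []Doc{
		{PageContent: "Token setup", Metadata: map[string]any{
			"header_path": "/Getting Started/Configuration/Authentication/"}},
		{PageContent: "Install steps", Metadata: map[string]any{
			"header_path": "/Getting Started/Installation/"}},
	}
	prefix := "/Getting Started/Configuration/Authentication/"
	for _, d := range filterByHeaderPath(docs, prefix) {
		fmt.Println(d.PageContent)
	}
}
```

In a real RAG pipeline the same prefix check would more likely be pushed down into the vector store's metadata filter, where the store supports one.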

What This PR Solves

Adds structure-aware markdown parsing with:

  1. Header-based splitting: Documents split at markdown headers (# through ######)
  2. Hierarchical metadata: Each chunk includes header_path showing position in document structure
  3. Parent-only paths: Metadata contains parent headers only (not current header), matching LlamaIndex behavior
  4. Code block awareness: Properly handles # characters inside code fences (both ``` and ~~~)
  5. Configurable separators: Customizable header path separator (default: /)
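
The five points above can be sketched as a single pass over the lines with a header stack and a fence flag. This is an illustrative approximation, not the code in this PR: `Chunk`, `splitByHeaders`, and `buildPath` are hypothetical names, the sketch assumes each chunk keeps its own header line in the text, and fence detection is simplified (any line starting with the opening marker closes the fence, which is looser than CommonMark).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headerRe matches ATX headers from "# Title" through "###### Title".
// Compiled once at package level so repeated calls do not recompile it.
var headerRe = regexp.MustCompile(`^(#{1,6})\s+(.+?)\s*$`)

// Chunk is a hypothetical result type for this sketch.
type Chunk struct {
	HeaderPath string // parent headers only, wrapped in the separator
	Text       string // the section's own header line plus its body
}

type header struct {
	level int
	title string
}

// buildPath joins the titles on the stack, e.g. "/Guide/Setup/".
func buildPath(stack []header, sep string) string {
	var b strings.Builder
	b.WriteString(sep)
	for _, h := range stack {
		b.WriteString(h.title)
		b.WriteString(sep)
	}
	return b.String()
}

// splitByHeaders starts a new chunk at every header found outside a code fence.
func splitByHeaders(md, sep string) []Chunk {
	var (
		chunks []Chunk
		stack  []header // open headers, outermost first
		body   []string // lines of the chunk being built
		fence  string   // "```" or "~~~" while inside a fenced block
	)
	path := sep
	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		if text != "" {
			chunks = append(chunks, Chunk{HeaderPath: path, Text: text})
		}
		body = body[:0]
	}
	for _, line := range strings.Split(md, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case fence != "":
			// Inside a fence: only the same marker closes it.
			if strings.HasPrefix(trimmed, fence) {
				fence = ""
			}
		case strings.HasPrefix(trimmed, "```"):
			fence = "```"
		case strings.HasPrefix(trimmed, "~~~"):
			fence = "~~~"
		default:
			if m := headerRe.FindStringSubmatch(line); m != nil {
				flush()
				level := len(m[1])
				// Pop headers at the same or a deeper level.
				for len(stack) > 0 && stack[len(stack)-1].level >= level {
					stack = stack[:len(stack)-1]
				}
				path = buildPath(stack, sep) // parents only, before pushing
				stack = append(stack, header{level: level, title: m[2]})
			}
		}
		body = append(body, line)
	}
	flush()
	return chunks
}

func main() {
	md := "# Guide\nintro\n## Setup\n~~~\n# not a header\n~~~\nsteps"
	for _, c := range splitByHeaders(md, "/") {
		fmt.Printf("%q -> %q\n", c.HeaderPath, c.Text)
	}
}
```

Because the path is built before the current header is pushed, the chunk for `## Setup` carries `/Guide/` rather than `/Guide/Setup/`, which is the parent-only behavior described in point 3.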

Source of Concepts

Inspired by LlamaIndex's MarkdownNodeParser, adapted to Go and langchaingo's architecture.

Key Design Decisions:

  • Parent-only metadata: Matches LlamaIndex where header_path contains only parent headers
  • CommonMark compliance: Supports both ``` and ~~~ code fence styles

Changes

New Files:

  • textsplitter/markdown_header_splitter.go (~170 lines)

    • MarkdownHeaderTextSplitter type with Options pattern
    • NewMarkdownHeaderTextSplitter() constructor with sensible defaults
    • SplitText() and SplitTextToDocuments() implementations
    • Package-level regex compilation for performance
  • textsplitter/markdown_header_splitter_test.go (~400 lines)

    • 21 comprehensive tests covering all functionality and edge cases
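
For readers unfamiliar with the Options pattern mentioned above, here is a generic sketch of functional options with a default separator of `/`. The names `headerSplitter`, `Option`, `WithSeparator`, and `newHeaderSplitter` are hypothetical and chosen for illustration; the actual option names and fields in this PR may differ.

```go
package main

import "fmt"

// headerSplitter is a hypothetical type holding the configurable state.
type headerSplitter struct {
	separator string
}

// Option mutates a splitter during construction.
type Option func(*headerSplitter)

// WithSeparator overrides the header path separator (hypothetical name).
func WithSeparator(sep string) Option {
	return func(s *headerSplitter) { s.separator = sep }
}

// newHeaderSplitter applies defaults first, then each option in order,
// so later options win over earlier ones.
func newHeaderSplitter(opts ...Option) *headerSplitter {
	s := &headerSplitter{separator: "/"}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	fmt.Printf("%q\n", newHeaderSplitter().separator)
	fmt.Printf("%q\n", newHeaderSplitter(WithSeparator(" > ")).separator)
}
```

The pattern keeps the zero-argument constructor usable while allowing new settings to be added later without changing its signature.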

Example Usage:

splitter := textsplitter.NewMarkdownHeaderTextSplitter()
docs, err := splitter.SplitTextToDocuments(markdown)

// Each document has hierarchical metadata:
// docs[0].Metadata["header_path"] = "/Chapter 1/"
// docs[1].Metadata["header_path"] = "/Chapter 1/Section 1.1/"

Testing

All 21 tests pass, covering:

  • Basic splitting and hierarchy (6 tests)
  • Code block handling - both ``` and ~~~ (3 tests)
  • Edge cases - separator collision, empty sections (3 tests)
  • Configuration options (3 tests)
  • Real-world documentation scenarios (6 tests)

Run tests:

cd textsplitter
go test -v -run TestMarkdownHeaderTextSplitter

Performance

  • Regex patterns compiled once at package level
  • Uses strings.Builder for efficient string concatenation
  • Minimal allocations, reuses header stack

References

Breaking Changes

None. This is a new feature with no impact on existing code.

Abhishek kosuri added 2 commits April 1, 2026 09:09
- Implements header-based markdown splitting with hierarchical metadata