textsplitter: add MarkdownHeaderTextSplitter for hierarchical markdown parsing #1488
Open
kosuriabhishek767 wants to merge 2 commits into tmc:main from
added 2 commits
April 1, 2026 09:09
- Implements header-based markdown splitting with hierarchical metadata
PR Checklist (All Items Completed ✓)
- textsplitter: add MarkdownHeaderTextSplitter for hierarchical markdown parsing
- golangci-lint checks.
PR Description
Summary
Adds `MarkdownHeaderTextSplitter` for structure-aware markdown parsing with hierarchical metadata. Enables RAG systems to filter search results by document sections, improving retrieval precision for large technical documentation.
Motivation
When working with large markdown documents (technical manuals, API documentation, knowledge bases), users need to retrieve information from specific sections. Current text splitters don't preserve the hierarchical structure of markdown headers, making section-based filtering impossible.
Use Case:
What This PR Solves
Adds structure-aware markdown parsing with:
- `header_path` showing position in document structure
- `#` characters inside code fences (both ``` and ~~~)
- … `/`)
Source of Concepts
Inspired by LlamaIndex's MarkdownNodeParser, adapted to Go and langchaingo's architecture.
Key Design Decisions:
- `header_path` contains only parent headers
Changes
New Files:
- `textsplitter/markdown_header_splitter.go` (~170 lines)
  - `MarkdownHeaderTextSplitter` type with Options pattern
  - `NewMarkdownHeaderTextSplitter()` constructor with sensible defaults
  - `SplitText()` and `SplitTextToDocuments()` implementations
- `textsplitter/markdown_header_splitter_test.go` (~400 lines)
Example Usage:
Testing
All 21 tests pass, covering:
Run tests:
Performance
- `strings.Builder` for efficient string concatenation
References
Breaking Changes
None. This is a new feature with no impact on existing code.