fix: MarkdownTextSplitter produces too many chunks and alters markdown structured#1452
Open
majiayu000 wants to merge 2 commits intotmc:mainfrom
Open
fix: MarkdownTextSplitter produces too many chunks and alters markdown structured#1452majiayu000 wants to merge 2 commits intotmc:mainfrom
majiayu000 wants to merge 2 commits intotmc:mainfrom
Conversation
…consecutive headers This fixes issue tmc#1439 where MarkdownTextSplitter was creating too many small chunks when consecutive headers appeared without content between them. The problem was that onMDHeader() unconditionally called applyToChunks() for every header, creating empty chunks for headers that had no content paragraphs under them. The fix checks if the current title has been prepended to any content (hTitlePrepended) before applying chunks. This prevents creating empty chunks while still maintaining the semantic structure of the markdown. Example: Before: # Title\n## Subtitle\n### Sub\nContent creates 3 chunks After: Same input creates 1 chunk with all headers and content Updated test expectations to match the new, more efficient behavior. Signed-off-by: majiayu000 <1835304752@qq.com>
Previously, MarkdownTextSplitter would create empty chunks when encountering consecutive headers without content between them. This led to too many small chunks and incorrect header associations where content would be prepended with the wrong header. The fix changes the logic in onMDHeader() to only call applyToChunks() when there's accumulated content (curSnippet != ""), rather than relying on the hTitlePrepended flag. This ensures: 1. No empty chunks are created for consecutive headers 2. Content is correctly associated with its corresponding header 3. Chunk sizes are respected more accurately Fixes tmc#1439 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: majiayu000 <1835304752@qq.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes #1439
Changes
onMDHeader()inmarkdown_splitter.goto only callapplyToChunks()when there's accumulated content (curSnippet != ""), preventing empty chunks from being created for consecutive headersmarkdown_splitter_test.goto reflect the correct behavior where consecutive headers are properly handledexample_markdown_header_512.mdto match the improved splitting outputProblem Solved
The original code would create empty chunks when encountering consecutive headers (e.g.,
## Header 1followed immediately by### Header 2without content). Additionally, it incorrectly associated content with the wrong headers, leading to:chunkSizeSolution
Changed the condition from checking
hTitlePrependedto checkingcurSnippet != "". This ensures chunks are only created when there's actual content to flush, while still preserving the header hierarchy functionality.Generated with Claude Code