fix: MarkdownTextSplitter produces too many chunks and alters markdown structured #1452

Open
majiayu000 wants to merge 2 commits into tmc:main from majiayu000:fix-1439-markdowntextsplitter-produces--1230-1631

@majiayu000
Contributor

Summary

This PR fixes #1439

Changes

  • Fixed consecutive header handling: Modified onMDHeader() in markdown_splitter.go to only call applyToChunks() when there's accumulated content (curSnippet != ""), preventing empty chunks from being created for consecutive headers
  • Corrected content-header association: Content is now correctly associated with its corresponding header instead of being prepended with the wrong header
  • Updated test expectations: Fixed test cases in markdown_splitter_test.go to reflect the correct behavior where consecutive headers are properly handled
  • Updated test data: Regenerated example_markdown_header_512.md to match the improved splitting output

Problem Solved

The original code would create empty chunks when encountering consecutive headers (e.g., ## Header 1 followed immediately by ### Header 2 without content). Additionally, it incorrectly associated content with the wrong headers, leading to:

  • Too many small, empty chunks
  • Content being prepended with incorrect header titles
  • Chunks that do not respect the configured chunkSize
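
To make the failure mode concrete, here is a minimal sketch of a splitter that flushes a chunk at every header, whether or not any content has accumulated. It is a toy model written for illustration only: naiveSplit and its local variables are hypothetical names, and the code is not taken from markdown_splitter.go.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit is a hypothetical toy, not the langchaingo implementation.
// It flushes a chunk at every header line, even when no content has
// accumulated since the previous header.
func naiveSplit(text string) []string {
	var chunks []string
	title, snippet := "", ""
	seenHeader := false

	// flush emits the current header plus whatever content follows it.
	flush := func() {
		chunks = append(chunks, strings.TrimSpace(title+"\n"+snippet))
		snippet = ""
	}

	for _, line := range strings.Split(text, "\n") {
		switch {
		case strings.HasPrefix(line, "#"):
			if seenHeader || snippet != "" {
				flush() // no content check: may emit a header-only chunk
			}
			title, seenHeader = line, true
		case strings.TrimSpace(line) != "":
			snippet += line + "\n"
		}
	}
	if seenHeader || snippet != "" {
		flush()
	}
	return chunks
}

func main() {
	doc := "# Title\n## Subtitle\n### Sub\nContent"
	for i, c := range naiveSplit(doc) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

For the four-line input in main, this toy returns three chunks, two of which contain only a header line.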

Solution

Changed the condition in onMDHeader() from checking hTitlePrepended to checking curSnippet != "". This ensures chunks are only created when there's actual content to flush, while still preserving the header hierarchy functionality.
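
The guard can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the code in markdown_splitter.go: only the name curSnippet and the idea of flushing from the header handler come from this PR, while toySplitter, headerLevel, flush, and split are hypothetical names, and the header stack is just one plausible way to model the hierarchy.

```go
package main

import (
	"fmt"
	"strings"
)

// toySplitter is a hypothetical stand-in for the splitter state.
type toySplitter struct {
	headers    []string // open header lines, outermost first
	curSnippet string   // content accumulated since the last flush
	chunks     []string // flushed output
}

// headerLevel counts the leading '#' characters of a header line.
func headerLevel(line string) int {
	return len(line) - len(strings.TrimLeft(line, "#"))
}

// flush emits the open headers followed by the accumulated content.
func (s *toySplitter) flush() {
	parts := append(append([]string{}, s.headers...), strings.TrimSpace(s.curSnippet))
	s.chunks = append(s.chunks, strings.Join(parts, "\n"))
	s.curSnippet = ""
}

// onHeader flushes only when there is content to flush, then updates
// the header stack so later content is emitted under the right headers.
func (s *toySplitter) onHeader(line string) {
	if s.curSnippet != "" {
		s.flush()
	}
	level := headerLevel(line)
	// Close any open headers at the same or a deeper level.
	for len(s.headers) > 0 && headerLevel(s.headers[len(s.headers)-1]) >= level {
		s.headers = s.headers[:len(s.headers)-1]
	}
	s.headers = append(s.headers, line)
}

// split feeds the text line by line through the toy splitter.
func split(text string) []string {
	s := &toySplitter{}
	for _, line := range strings.Split(text, "\n") {
		switch {
		case strings.HasPrefix(line, "#"):
			s.onHeader(line)
		case strings.TrimSpace(line) != "":
			s.curSnippet += line + "\n"
		}
	}
	if s.curSnippet != "" {
		s.flush()
	}
	return s.chunks
}

func main() {
	for i, c := range split("# Title\n## Subtitle\n### Sub\nContent") {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

In this toy, three consecutive headers followed by one paragraph produce a single chunk that carries all three headers. One consequence of this design is that a header with no content and no sub-headers under it never reaches the output; whether the real splitter behaves the same way is not something this sketch establishes.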


Generated with Claude Code

majiayu000 and others added 2 commits December 30, 2025 16:40
…consecutive headers

This fixes issue tmc#1439 where MarkdownTextSplitter was creating too many
small chunks when consecutive headers appeared without content between them.

The problem was that onMDHeader() unconditionally called applyToChunks()
for every header, creating empty chunks for headers that had no content
paragraphs under them.

The fix checks if the current title has been prepended to any content
(hTitlePrepended) before applying chunks. This prevents creating empty
chunks while still maintaining the semantic structure of the markdown.

Example:
Before: # Title\n## Subtitle\n### Sub\nContent creates 3 chunks
After: Same input creates 1 chunk with all headers and content

Updated test expectations to match the new, more efficient behavior.

Signed-off-by: majiayu000 <1835304752@qq.com>
Previously, MarkdownTextSplitter would create empty chunks when encountering
consecutive headers without content between them. This led to too many small
chunks and incorrect header associations where content would be prepended with
the wrong header.

The fix changes the logic in onMDHeader() to only call applyToChunks() when
there's accumulated content (curSnippet != ""), rather than relying on the
hTitlePrepended flag. This ensures:

1. No empty chunks are created for consecutive headers
2. Content is correctly associated with its corresponding header
3. Chunk sizes are respected more accurately

Fixes tmc#1439

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: majiayu000 <1835304752@qq.com>

Development

Successfully merging this pull request may close these issues.

MarkdownTextSplitter produces too many chunks and alters markdown structured