fix: MarkdownTextSplitter produces too many chunks and alters markdown structured by majiayu000 · Pull Request #1452 · tmc/langchaingo

majiayu000 · 2025-12-30T08:56:06Z

Summary

This PR fixes #1439

Changes

Fixed consecutive header handling: Modified onMDHeader() in markdown_splitter.go to only call applyToChunks() when there's accumulated content (curSnippet != ""), preventing empty chunks from being created for consecutive headers
Corrected content-header association: Content is now correctly associated with its corresponding header instead of being prepended with the wrong header
Updated test expectations: Fixed test cases in markdown_splitter_test.go to reflect the correct behavior where consecutive headers are properly handled
Updated test data: Regenerated example_markdown_header_512.md to match the improved splitting output

Problem Solved

The original code would create empty chunks when encountering consecutive headers (e.g., ## Header 1 followed immediately by ### Header 2 without content). Additionally, it incorrectly associated content with the wrong headers, leading to:

Too many small, empty chunks
Content being prepended with incorrect header titles
Poor respect for the configured chunkSize

Solution

Changed the condition from checking hTitlePrepended to checking curSnippet != "". This ensures chunks are only created when there's actual content to flush, while still preserving the header hierarchy functionality.
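To make the condition change concrete, here is a minimal sketch of the idea in Go. It is a toy model, not the langchaingo implementation: `toySplitter`, `splitToy`, `onHeader`, and the `flushOnEveryHeader` switch are hypothetical names introduced for illustration, and only `curSnippet` and `applyToChunks` echo identifiers mentioned in this PR. The sketch assumes that any line starting with `#` is a header, and it ignores chunk size and header hierarchy entirely.

```go
package main

import "strings"

// toySplitter is a hypothetical, simplified stand-in for the splitter state
// discussed in this PR. It is NOT the langchaingo MarkdownTextSplitter.
type toySplitter struct {
	flushOnEveryHeader bool     // true mimics the old behaviour described above
	headers            []string // header lines seen since the last flush
	curSnippet         string   // body text accumulated since the last flush
	chunks             []string
}

// applyToChunks emits the pending headers plus any accumulated content
// as one chunk and resets the pending state.
func (s *toySplitter) applyToChunks() {
	chunk := strings.Join(s.headers, "\n")
	if s.curSnippet != "" {
		if chunk != "" {
			chunk += "\n"
		}
		chunk += s.curSnippet
	}
	s.chunks = append(s.chunks, chunk)
	s.headers = nil
	s.curSnippet = ""
}

// onHeader flushes only when there is accumulated content, unless the
// old "flush on every header" behaviour is switched on.
func (s *toySplitter) onHeader(line string) {
	pending := len(s.headers) > 0 || s.curSnippet != ""
	if (s.flushOnEveryHeader && pending) || s.curSnippet != "" {
		s.applyToChunks()
	}
	s.headers = append(s.headers, line)
}

// splitToy walks the input line by line. Lines starting with '#' are
// treated as headers (an assumption of this sketch); blank lines are skipped.
func splitToy(md string, flushOnEveryHeader bool) []string {
	s := &toySplitter{flushOnEveryHeader: flushOnEveryHeader}
	for _, line := range strings.Split(md, "\n") {
		switch {
		case strings.HasPrefix(line, "#"):
			s.onHeader(line)
		case strings.TrimSpace(line) != "":
			if s.curSnippet != "" {
				s.curSnippet += "\n"
			}
			s.curSnippet += line
		}
	}
	// Flush whatever is still pending at the end of the input.
	if len(s.headers) > 0 || s.curSnippet != "" {
		s.applyToChunks()
	}
	return s.chunks
}
```

Calling `splitToy("# Title\n## Subtitle\n### Sub\nContent", true)` yields three chunks in this toy model, two of which contain only a header, while passing `false` yields a single chunk holding all three headers and the content. The real splitter also tracks header hierarchy and chunk size, which this sketch leaves out.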

Generated with Claude Code

…consecutive headers

This fixes issue tmc#1439 where MarkdownTextSplitter was creating too many small chunks when consecutive headers appeared without content between them.

The problem was that onMDHeader() unconditionally called applyToChunks() for every header, creating empty chunks for headers that had no content paragraphs under them.

The fix checks if the current title has been prepended to any content (hTitlePrepended) before applying chunks. This prevents creating empty chunks while still maintaining the semantic structure of the markdown.

Example:
Before: # Title\n## Subtitle\n### Sub\nContent creates 3 chunks
After: Same input creates 1 chunk with all headers and content

Updated test expectations to match the new, more efficient behavior.

Signed-off-by: majiayu000 <1835304752@qq.com>

Previously, MarkdownTextSplitter would create empty chunks when encountering consecutive headers without content between them. This led to too many small chunks and incorrect header associations where content would be prepended with the wrong header.

The fix changes the logic in onMDHeader() to only call applyToChunks() when there's accumulated content (curSnippet != ""), rather than relying on the hTitlePrepended flag. This ensures:

1. No empty chunks are created for consecutive headers
2. Content is correctly associated with its corresponding header
3. Chunk sizes are respected more accurately

Fixes tmc#1439

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: majiayu000 <1835304752@qq.com>

majiayu000 and others added 2 commits December 30, 2025 16:40

majiayu000 wants to merge 2 commits into tmc:main from majiayu000:fix-1439-markdowntextsplitter-produces--1230-1631
