Bug Description
When converting certain Chinese annual report PDFs (Wuliangye, 000858) to markdown or JSON format, HeadingProcessor.setHeadings recurses without bound
and throws a StackOverflowError.
Root Cause (bytecode analysis):
The setHeadings method uses an enhanced for loop over getListItems():
for (IObject item : getListItems()) {
    if (item instanceof TableBorder) {
        setHeadings(cellContents); // ← recursive call
    }
}
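Each iteration calls addTextNode() on the list's children. The loop reads the list returned by getListItems() rather than a pre-fetched copy (an enhanced
for loop evaluates getListItems() once per loop, but every recursive setHeadings call starts a new loop), so the traversal sees the growing list and the
recursion never terminates.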
Affected pages: Pages with deeply nested list structures (e.g., shareholder/equity structure diagrams in Chinese annual reports).
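As a sanity check on the loop semantics involved, here is a minimal sketch, not taken from the opendataloader-pdf sources. ListHolder and its getListItems() are hypothetical stand-ins, and returning the live ArrayList is an assumption. It shows two standard Java behaviours: the iterable expression of an enhanced for loop is evaluated once per loop, and appending to a live ArrayList while iterating it fails fast with ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for the object that owns the list; not the real HeadingProcessor.
class ListHolder {
    private final List<String> items = new ArrayList<>(List.of("a", "b", "c"));
    int getterCalls = 0;

    List<String> getListItems() {
        getterCalls++;
        return items; // assumption: the live list is returned, not a copy
    }
}

class EnhancedForSemanticsDemo {
    // Returns how many times getListItems() is evaluated by one enhanced for loop.
    static int getterCallsForOneLoop() {
        ListHolder holder = new ListHolder();
        for (String item : holder.getListItems()) {
            // body intentionally empty; only the getter call count matters
        }
        return holder.getterCalls;
    }

    // Returns true if appending to the live ArrayList during iteration fails fast.
    static boolean appendDuringIterationFailsFast() {
        ListHolder holder = new ListHolder();
        try {
            for (String item : holder.getListItems()) {
                holder.getListItems().add(item + "-child");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("getter calls: " + getterCallsForOneLoop());
        System.out.println("fails fast: " + appendDuringIterationFailsFast());
    }
}
```

Since the reported failure is a StackOverflowError rather than a ConcurrentModificationException, the growth is presumably observed across recursive calls (or through a non-fail-fast collection) rather than within a single loop over one ArrayList.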
Steps to Reproduce
- Install: pip install opendataloader-pdf==2.2.1
- Run:
opendataloader-pdf 000858.SH-2016-annual.pdf -f markdown -o output --pages 33
Expected: Page 33 extracted successfully
Actual: java.lang.StackOverflowError at HeadingProcessor.setHeadings(HeadingProcessor.java:182)
Second sample: Page 34 of 000858.SH-2017-annual.pdf triggers the same bug.
Error Traceback
java.lang.StackOverflowError
at org.opendataloader.pdf.processors.HeadingProcessor.setHeadings(HeadingProcessor.java:171)
at org.opendataloader.pdf.processors.HeadingProcessor.setHeadings(HeadingProcessor.java:182) ← infinite recursion
... (repeats until stack overflow)
Version
- opendataloader-pdf: 2.2.1
- JAR: opendataloader-pdf-cli.jar (23MB, built 2026-04-03)
Java Version
openjdk 17.0.18 2026-01-20
OpenJDK Runtime Environment (build 17.0.18+8-Ubuntu-124.04.1)
Suggested Fix
// Before (buggy):
for (IObject item : getListItems()) { ... }
// After (fixed):
List<IObject> items = new ArrayList<>(getListItems());
for (IObject item : items) { ... }
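To illustrate what the snapshot buys, here is a self-contained sketch of the same pattern on plain strings. SnapshotIterationDemo and visitWithSnapshot are hypothetical names, and the add() call merely stands in for addTextNode() growing the list; none of this is the real HeadingProcessor code.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotIterationDemo {
    // Iterates over a copy, so elements appended to the live list
    // during the loop are not visited by this loop.
    static int visitWithSnapshot(List<String> liveItems) {
        List<String> snapshot = new ArrayList<>(liveItems);
        int visited = 0;
        for (String item : snapshot) {
            liveItems.add(item + "-text"); // stand-in for addTextNode() growing the list
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(List.of("a", "b", "c"));
        int visited = visitWithSnapshot(live);
        System.out.println("visited=" + visited + ", live size=" + live.size());
    }
}
```

The loop is bounded by the size of the list at the moment of the copy, regardless of how many nodes are added afterwards. Note that a snapshot bounds each loop but not the recursion depth itself; whether it fully resolves the overflow depends on how cellContents is derived, which is not shown in this report.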
Workaround
Using the text format (-f text) bypasses HeadingProcessor, and extraction succeeds.
Attachments
- 000858.SH-2016-annual.pdf: Wuliangye 2016 annual report (page 33 triggers bug)
- 000858.SH-2017-annual.pdf: Wuliangye 2017 annual report (page 34 triggers bug)