[BUG] StackOverflowError in HeadingProcessor.setHeadings on PDFs with nested list structures #417

@juechen703

Bug Description

When converting certain Chinese annual report PDFs (e.g., Wuliangye, 000858) to Markdown or JSON, HeadingProcessor.setHeadings recurses without bound
and the conversion fails with a StackOverflowError.

Root Cause (bytecode analysis):

The setHeadings method uses an enhanced for loop over getListItems():

  for (IObject item : getListItems()) {
      if (item instanceof TableBorder) {
          setHeadings(cellContents);  // ← recursive call
      }
  }

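Each iteration calls addTextNode() on the list's children, so the list grows while it is being traversed. Note that a Java enhanced for loop evaluates
getListItems() only once and then walks the returned collection's Iterator; it does not call the getter again on each iteration. If getListItems()
returns the live backing list rather than a copy, the loop is still iterating over a collection that is being mutated, and the recursive setHeadings
call (HeadingProcessor.java:182) is presumably reached again on content that was just added, so the recursion never bottoms out.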
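
For reference, the loop semantics can be checked in isolation. The sketch below is not code from opendataloader-pdf: the class name, the String element type, and this getListItems() stand-in are all hypothetical. It shows two things that are standard Java behavior: the iterable expression of an enhanced for loop is evaluated once, and appending to an ArrayList while iterating over it fails fast with a ConcurrentModificationException rather than looping forever.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

class EnhancedForDemo {
    static int getterCalls = 0;
    static final List<String> ITEMS = new ArrayList<>(List.of("a", "b", "c"));

    // Hypothetical stand-in for getListItems(); counts how often it is invoked.
    static List<String> getListItems() {
        getterCalls++;
        return ITEMS;
    }

    // Runs a read-only enhanced for loop and reports how many times the getter ran.
    static int countGetterCalls() {
        getterCalls = 0;
        for (String item : getListItems()) {
            // read-only body: nothing is added or removed
        }
        return getterCalls;
    }

    // Appends to an ArrayList while iterating over it and reports whether
    // the iterator detected the modification.
    static boolean growingListFailsFast() {
        List<String> items = new ArrayList<>(List.of("a", "b", "c"));
        try {
            for (String item : items) {
                items.add(item + "'"); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("getter calls: " + countGetterCalls());
        System.out.println("fails fast: " + growingListFailsFast());
    }
}
```

Whether the real getListItems() returns an ArrayList, another collection type, or a fresh copy is not something this report establishes, so the sketch only narrows down which loop behaviors are possible.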

Affected pages: Pages with deeply nested list structures (e.g., shareholder/equity structure diagrams in Chinese annual reports).

Steps to Reproduce

  1. Install: pip install opendataloader-pdf==2.2.1
  2. Run:
  opendataloader-pdf 000858.SH-2016-annual.pdf -f markdown -o output --pages 33

Expected: Page 33 extracted successfully
Actual: java.lang.StackOverflowError at HeadingProcessor.setHeadings(HeadingProcessor.java:182)

Second sample: Page 34 of 000858.SH-2017-annual.pdf triggers the same bug.

Error Traceback

  java.lang.StackOverflowError
      at org.opendataloader.pdf.processors.HeadingProcessor.setHeadings(HeadingProcessor.java:171)
      at org.opendataloader.pdf.processors.HeadingProcessor.setHeadings(HeadingProcessor.java:182)  ← infinite recursion
      ... (repeats until stack overflow)

Version

  • opendataloader-pdf: 2.2.1
  • JAR: opendataloader-pdf-cli.jar (23MB, built 2026-04-03)

Java Version

  openjdk 17.0.18 2026-01-20
  OpenJDK Runtime Environment (build 17.0.18+8-Ubuntu-124.04.1)

Suggested Fix

  // Before (buggy):
  for (IObject item : getListItems()) { ... }

  // After (fixed):
  List<IObject> items = new ArrayList<>(getListItems());
  for (IObject item : items) { ... }
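
If the recursion turns out to revisit a container it has already processed, copying the list may not be sufficient on its own. One possible complement is an identity-based visited set, sketched below. This is an assumption about the failure mode, not a confirmed diagnosis, and Node, children, and visit are hypothetical names rather than the project's IObject or TableBorder classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class RecursionGuardDemo {
    // Hypothetical container type standing in for a list or table node.
    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    // Visits every reachable node exactly once and returns the number visited.
    static int visit(Node root) {
        // Identity semantics: two distinct nodes are never treated as the same entry.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        return visit(root, seen);
    }

    private static int visit(Node node, Set<Node> seen) {
        if (!seen.add(node)) {
            return 0; // already processed: stop instead of recursing again
        }
        int count = 1;
        // Iterate over a snapshot so changes to node.children made during
        // the traversal cannot invalidate the iterator.
        for (Node child : new ArrayList<>(node.children)) {
            count += visit(child, seen);
        }
        return count;
    }
}
```

With this guard, a structure in which two nodes reference each other is traversed once per node and then stops, instead of recursing until the stack is exhausted.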

Workaround

Passing -f text bypasses HeadingProcessor, and the page is extracted successfully.
