[BUG] StackOverflowError in Pattern$BmpCharPredicate.union() caused by malformed List structure in tagged PDF #423

@ueno-labo

Bug Description

When converting certain EtherCAT specification PDFs (ETG1000 Part 6, ETG1005) to markdown format, the JAR crashes with a java.lang.StackOverflowError deep inside java.util.regex.Pattern$BmpCharPredicate.lambda$union$2.

This appears to be triggered by BulletedParagraphUtils.isLabeledLine() building a regex character class from list label characters. The PDF has a malformed tagged structure where some /LI (list items) are connected to multiple /L (list) parents — which the JAR itself warns about as "List item is connected with different lists". This causes the same characters to be added to the regex union repeatedly, leading to an infinitely deep union() call chain.
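One possible mitigation for the suspected cause above is to deduplicate the label characters before they reach the regex compiler. The sketch below is not the project's code: `LabelClassSketch` and `buildCharClass` are hypothetical names, and it assumes the character class is assembled from a list of label strings. It relies on the JDK compiling each non-Latin-1 entry of a character class into another chained predicate, so a class with tens of thousands of repeated entries can recurse once per entry at match time.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class LabelClassSketch {

    // Hypothetical helper (not the project's code): collect each distinct
    // code point once, then emit a character class such as [\x{2022}\x{2D}].
    static String buildCharClass(List<String> labels) {
        Set<Integer> distinct = new LinkedHashSet<>();
        for (String label : labels) {
            label.codePoints().forEach(cp -> distinct.add(cp));
        }
        if (distinct.isEmpty()) {
            // "[]" is not a valid pattern, so refuse to build an empty class.
            throw new IllegalArgumentException("no label characters");
        }
        StringBuilder sb = new StringBuilder("[");
        for (int cp : distinct) {
            // \x{h...h} escapes avoid having to quote regex metacharacters.
            sb.append(String.format("\\x{%X}", cp));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Simulates the same bullet label being collected 50,000 times.
        List<String> labels = Collections.nCopies(50_000, "\u2022");
        String charClass = buildCharClass(labels);
        System.out.println(charClass); // prints [\x{2022}]
        System.out.println(Pattern.compile(charClass).matcher("\u2022").matches()); // prints true
    }
}
```

With deduplication, the size of the character class is bounded by the number of distinct label characters rather than by how many times a malformed list structure causes the same label to be visited.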

Note: This is different from #415 and #417. Those involve recursion in the project's own Java methods. This one is recursion inside the JDK regex engine itself (java.util.regex.Pattern$BmpCharPredicate.union()), and increasing -Xss does not help.

Steps to Reproduce

  1. pip install opendataloader-pdf==2.2.1
  2. Convert ETG1000_6_V1i0i4_S_R_EcatALProtocols.pdf (EtherCAT Specification Part 6, publicly available from ethercat.org)
  3. Run:
