Bug Description
When converting certain EtherCAT specification PDFs (ETG1000 Part 6, ETG1005) to markdown format, the JAR crashes with a java.lang.StackOverflowError deep inside java.util.regex.Pattern$BmpCharPredicate.lambda$union$2.
This appears to be triggered by BulletedParagraphUtils.isLabeledLine() building a regex character class from list label characters. The PDF has a malformed tagged structure in which some /LI (list item) elements are connected to multiple /L (list) parents; the JAR itself warns about this with "List item is connected with different lists". As a result, the same characters are added to the regex union repeatedly, leading to an infinitely deep union() call chain.
Note: This is different from #415 and #417. Those are recursive Java method calls. This one is infinite recursion inside the JVM regex engine itself (Pattern$BmpCharPredicate.union()), so increasing -Xss does not help.
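For illustration only, here is a minimal sketch of one possible mitigation: deduplicating the label characters before the character class is built, so repeated labels cannot lengthen the union chain. I have not confirmed how BulletedParagraphUtils actually assembles its pattern; the class LabelClassSketch and the method buildCharClass are hypothetical names that do not exist in the library.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the library's code: the class and method names are invented.
final class LabelClassSketch {

    // Builds a regex character class containing each distinct code point exactly once.
    static String buildCharClass(List<String> labels) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<Integer> unique = new LinkedHashSet<>();
        for (String label : labels) {
            label.codePoints().forEach(cp -> unique.add(cp));
        }
        if (unique.isEmpty()) {
            // "[]" is not a valid pattern; an empty negative lookahead never matches.
            return "(?!)";
        }
        StringBuilder sb = new StringBuilder("[");
        for (int cp : unique) {
            // \x{h...h} escapes the code point, so regex metacharacters stay literal.
            sb.append("\\x{").append(Integer.toHexString(cp)).append('}');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Simulate the same bullet label being collected many times over.
        List<String> labels = Collections.nCopies(100_000, "\u2022");
        String charClass = buildCharClass(labels);
        System.out.println(charClass);
        System.out.println(Pattern.compile(charClass).matcher("\u2022 item").lookingAt());
    }
}
```

With this approach the size of the character class depends on the number of distinct label characters rather than on how many times a list item is visited.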
Steps to Reproduce
- Install: pip install opendataloader-pdf==2.2.1
- Convert ETG1000_6_V1i0i4_S_R_EcatALProtocols.pdf (EtherCAT Specification Part 6, publicly available from ethercat.org)
- Run: