Search in files and Find do not work with non-Latin characters #3892

@GBR-613

Problem description

In my Eclipse-based product (sorry, it is not open source), we found that in particular scenarios (such as "whole word only"), Search in files and Find do not work with non-Latin characters on JDK 21 or later.
To implement "whole word only", Eclipse builds a regex in which the search text is surrounded by "\b". With non-Latin text, that regex no longer matches.
We assumed that the problem was in the Java SDK and opened a ticket for it, but the ticket was rejected with the following explanation:

The behaviour of the '\b' metacharacter changed in Semeru 21 and later versions. The change was made because of the OpenJDK issue below.

OpenJDK Issue and Java 19 Release Note:
-----------------------------------------------------

https://bugs.openjdk.org/browse/JDK-8282129

https://www.oracle.com/java/technologies/javase/19-relnote-issues.html#JDK-8264160

In Semeru 18 and earlier versions, \b was Unicode-aware by default, so it was able to process the Hebrew characters successfully.

However, in Semeru 21 and later versions, the \b metacharacter recognizes only ASCII word characters by default. To match a Hebrew string, UNICODE_CHARACTER_CLASS must be set, because the string contains non-ASCII characters.

How to set UNICODE_CHARACTER_CLASS?
-----------------------------------------------------------------

Pattern pattern = Pattern.compile(findString, Pattern.UNICODE_CHARACTER_CLASS);
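
As a minimal sketch of the line above: the flag can be combined with other flags using bitwise OR, or enabled inside the expression itself with the embedded flag (?U). The class and method names here are made up for illustration, and the Hebrew sample text is written with \u escapes only so that the source file stays ASCII.

```java
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Hypothetical helper: whole-word pattern with the flag passed to Pattern.compile.
    static Pattern viaFlag(String findString) {
        return Pattern.compile("\\b" + Pattern.quote(findString) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
    }

    // Hypothetical helper: same flag enabled with the embedded expression (?U).
    static Pattern viaInline(String findString) {
        return Pattern.compile("(?U)\\b" + Pattern.quote(findString) + "\\b");
    }

    public static void main(String[] args) {
        // Two Hebrew words separated by a space ("shalom olam").
        String text = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";
        String word = "\u05E9\u05DC\u05D5\u05DD";

        System.out.println(viaFlag(word).matcher(text).find());   // true
        System.out.println(viaInline(word).matcher(text).find()); // true
    }
}
```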

Reason behind the change:
------------------------------------------

In Semeru 18 and earlier versions, the behaviour of \b (word boundary) was inconsistent with that of \w (word character). When UNICODE_CHARACTER_CLASS is not set, \w matches [a-zA-Z_0-9]. \b, however, relied on j.l.Character.isLetterOrDigit plus a check for underscores, and isLetterOrDigit matches some Unicode characters beyond the range specified by \w. Only when UNICODE_CHARACTER_CLASS was set were the character ranges of \w and \b consistent.

In Semeru 21 and later versions, \b and \w behave consistently whether or not UNICODE_CHARACTER_CLASS is set. The \b matcher now uses the ASCII_WORD() predicate in java.util.regex.CharPredicates to get the same range of characters as \w when determining word boundaries.
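
The mismatch described above can be observed with a small probe. This is only a sketch with hypothetical names (WordClassDemo, isWordChar, hasBoundaryAtStart). It checks a single Hebrew letter against \w and against \b, with and without the flag. The one result that depends on the JDK version is left without an expected value.

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // Hebrew letter shin, written as an escape so the source stays ASCII.
    static final String SHIN = "\u05E9";

    // True if the whole string is a single \w character under the given flags.
    static boolean isWordChar(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // True if \b reports a word boundary at index 0 of the string.
    static boolean hasBoundaryAtStart(String s, int flags) {
        return Pattern.compile("\\b.", flags).matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        System.out.println(Character.isLetterOrDigit(SHIN.charAt(0))); // true
        System.out.println(isWordChar(SHIN, 0));                       // false
        System.out.println(isWordChar(SHIN, unicode));                 // true
        System.out.println(hasBoundaryAtStart(SHIN, unicode));         // true

        // Without the flag the result depends on the JDK version in use.
        System.out.println(hasBoundaryAtStart(SHIN, 0));
    }
}
```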

I am aware of two places where the change is required: the FindReplaceDocumentAdapter class in org.eclipse.jface.text and PatternConstructor.java in org.eclipse.search.core.
Please set UNICODE_CHARACTER_CLASS there.
AFAIK the change has no side effects.
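
To illustrate the kind of change being requested, here is a rough sketch of a whole-word pattern builder that always sets the flag. It is not the actual Eclipse code: WholeWordSearch, wholeWordPattern and countWholeWords are hypothetical names, and the real classes build their patterns differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordSearch {
    // Hypothetical helper, NOT the Eclipse implementation: wraps the quoted
    // search text in \b and always enables UNICODE_CHARACTER_CLASS.
    static Pattern wholeWordPattern(String findString, boolean caseSensitive) {
        int flags = Pattern.UNICODE_CHARACTER_CLASS;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile("\\b" + Pattern.quote(findString) + "\\b", flags);
    }

    // Counts whole-word occurrences of findString in text (case-sensitive).
    static int countWholeWords(String text, String findString) {
        Matcher m = wholeWordPattern(findString, true).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "bayit beiti bayit": the middle word merely starts with the search word.
        String word = "\u05D1\u05D9\u05EA";
        String text = word + " " + word + "\u05D9" + " " + word;
        System.out.println(countWholeWords(text, word));              // 2
        System.out.println(countWholeWords("cat catalog cat", "cat")); // 2
    }
}
```

One thing that may be worth checking when the flag is applied to user-supplied regular expressions: according to the Pattern Javadoc, UNICODE_CHARACTER_CLASS also makes the predefined classes such as \w, \d and \s follow their Unicode definitions, and it may carry a performance penalty.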

Tested under this environment:

  • OS & version: Windows 11
  • Eclipse IDE/Platform version (as shown in Help > About): Version: 2024-03 (4.31.0)

Community

  • I understand reporting an issue to this OSS project does not mandate anyone to fix it. Other contributors may consider the issue, or not, at their own convenience. The most efficient way to get it fixed is that I fix it myself and contribute it back as a good quality patch to the project.

    Labels: bug (Something isn't working)
