Hi, a few days ago we switched from requiring Java/JDK 8 to Java/JDK 17. If you want to compile the code yourself, you need JDK 17 from now on. Please use JDK 17 and not a later version, as there are known issues with JDK 21.
In the jdk19_regexp_fix branch, I’ve adjusted \b handling (mostly in PatternRuleHandler, segment.srx, and some language-specific Java code) for the 7 languages that were failing with JDK>=19.
All languages except German now pass the unit tests with JDK 23.
Unfortunately, I could not fix German, as it requires more knowledge of the language and the rules. @dnaber, could you please take a look?
Note: for command-line tests you’ll need to add the “-Djava.security.manager=allow” parameter.
Note: I only adjusted the regexps in segment.srx that were needed to pass the tests. Many regexps still include \b without (?U); they don’t affect the unit tests, probably because the tests for some languages don’t have good coverage. Once all the tests pass, we can adjust the rest of them too.
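To illustrate why \b needs (?U), here is a minimal sketch in plain java.util.regex. The class name, the helper and the German sample sentence are made up for this post and are not LT code. From JDK 19 on, \b follows the ASCII-only \w by default, so the plain pattern behaves differently depending on the JDK, while the (?U) variant treats "ü" as a word character everywhere.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of LanguageTool.
class WordBoundaryDemo {
    // Plain pattern for the word "über" (written with a Unicode escape).
    // Its result differs between JDK versions, so no expected value is given.
    static final Pattern PLAIN = Pattern.compile("\\b\u00fcber\\b");

    // Same pattern with the embedded (?U) flag (UNICODE_CHARACTER_CLASS).
    static final Pattern UNICODE = Pattern.compile("(?U)\\b\u00fcber\\b");

    // True if the pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "das ist \u00fcber alles";
        System.out.println("plain:     " + found(PLAIN, text));
        System.out.println("with (?U): " + found(UNICODE, text)); // true
    }
}
```

Running it on JDK 17 and on a newer JDK should show the difference in the first line of output.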
Ideally we could just specify UNICODE_CHARACTER_CLASS for all regexps in the segmentation module so we don’t have to adjust them one by one, but I could not find a way to do that, so I’ve filed a request on net.loomchild.segment to allow it. If they ever provide this feature, we can remove all the (?U) we had to add.
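The idea of setting the flag in one place rather than per regex can be sketched with plain java.util.regex. This is only an illustration: UnicodeRegex and its compile methods are hypothetical names, and this is not the API of net.loomchild.segment.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of LanguageTool or net.loomchild.segment.
class UnicodeRegex {
    // Every caller compiles through this method, so the flag is set once here
    // and the regexes themselves need no (?U) prefix.
    static Pattern compile(String regex) {
        return compile(regex, 0);
    }

    // Same, but keeps any flags the caller already uses.
    static Pattern compile(String regex, int flags) {
        return Pattern.compile(regex, flags | Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        // \w is Unicode-aware here, so the umlaut counts as a word character.
        System.out.println(compile("\\w+").matcher("\u00fcber").matches()); // true
    }
}
```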
Jarek Lipski from net.loomchild.segment was kind (and efficient) enough to provide a new flag that allows us to trigger UNICODE_CHARACTER_CLASS for all rules in segment.srx in one place.
I’ve pushed a change that sets this flag on segment 2.0.4 and reverted all the changes to segment.srx.
So once we fix [de], we should be all good for the latest JDK support in LT.
@dnaber, the jdk-21 branch was manually fixing individual pieces, which unfortunately was not compatible with the global use of UNICODE_CHARACTER_CLASS.
I was able to debug it and made two more fixes for [de] that make all tests pass in the jdk_19 branch:
the UNICODE_CHARACTER_CLASS flag for a regex with \b in AbstractUnitConversionRule that I had missed before
a change to a rule group in grammar.xml: it used \s for whitespace. In non-Unicode mode \s does not cover U+00A0, but with the flag set \s matched U+00A0 and tripped the correct examples.
I’ve changed \s to [ \t] and it all works now, but note that \s also included vertical whitespace (e.g. \n) and possibly some other whitespace chars. Please feel free to adjust.
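For anyone who wants to check the \s behaviour described above, here is a small standalone sketch (the class and helper names are made up, this is not LT code). It compares \s with and without UNICODE_CHARACTER_CLASS against U+00A0 and shows what [ \t] no longer covers.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of LanguageTool.
class WhitespaceDemo {
    static final String NBSP = "\u00A0";
    static final int UNICODE = Pattern.UNICODE_CHARACTER_CLASS;

    // True if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default mode: \s is ASCII whitespace only, so U+00A0 is not matched.
        System.out.println(matches("\\s", 0, NBSP));          // false
        // Unicode mode: \s follows the White_Space property, which has U+00A0.
        System.out.println(matches("\\s", UNICODE, NBSP));    // true
        // [ \t] matches neither U+00A0 nor a newline, in either mode.
        System.out.println(matches("[ \\t]", UNICODE, NBSP)); // false
        System.out.println(matches("[ \\t]", UNICODE, "\n")); // false
        // \s did match the newline.
        System.out.println(matches("\\s", UNICODE, "\n"));    // true
    }
}
```

If the rule should still accept line breaks, a class such as [ \t\n] might be closer to the old behaviour, but that depends on what the rule is meant to catch.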
P.S. If we merge this branch, we may want to squash the commits: there were (lots of) segment.srx changes that I had to revert, so squashing will keep the history a bit cleaner.