Sentence splitter

Ruud_Baars · August 2, 2020, 4:42am

There must have been a change in sentence splitting (SRX) that affects Dutch unexpectedly.
A split is suddenly missed in
https://internal1.languagetool.org/regression-tests/via-http/2020-08-02/nl/index.html

Since I am not at home, I cannot check what happened. Who changed the SRX lately, or is it something else?

dnaber · August 2, 2020, 9:01am

Close to the affected input there was a sentence with a soft hyphen, and handling those has changed. Should this occur again, please let me know.

Ruud_Baars · August 3, 2020, 6:20am

Unexpected results again. It looks like the sentence split has improved, but now there is a sentence-whitespace match giving an invalid suggestion, and Morfologik suddenly accepts a weird word.

Must be some side effects…

dnaber · August 3, 2020, 5:39pm

It might be caused by not all servers running exactly the same version of LT (they differed by 2 or 3 days). This should be fixed soon, so please keep an eye on it.

Ruud_Baars · August 6, 2020, 11:30am

I am keeping an eye on it, but there seem to be no nightlies.

Ruud_Baars · August 7, 2020, 5:17am

Tonight's os nightly has a lot of entries, mostly true positives. Did the test input change?

dnaber · August 7, 2020, 5:49am

No, but KORT_1 and KORT_2 have lower priority now, causing other (hopefully more specific) matches to appear.

Ruud_Baars · August 7, 2020, 6:19pm

Okay, that is good. Priorities are difficult to tune, since they are in Java, while most rules are XML. Would it be possible to have a rule priority number, e.g. 0-99, in the rule XML?

dnaber · August 7, 2020, 6:36pm

Technically yes, but with the current approach all the priorities are in the same place, which makes it easy to compare them. Syntax-wise, it’s trivial to add priorities. The code is here:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/nl/src/main/java/org/languagetool/language/Dutch.java#L141-L142
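For readers who do not work in the Java code, here is a rough standalone sketch of what "all priorities in the same place" looks like. It is not the actual content of Dutch.java: the class name, the method names and the numeric values are made up for illustration, and only the rule IDs come from this thread. It assumes that rules without an entry default to priority 0 and that, when two matches compete, the one with the higher priority is the one shown.

```java
// Standalone illustration, not LanguageTool code.
final class RulePriorities {

  // Every priority lives in this one switch, so values are easy to compare.
  // The numbers are hypothetical; only the rule IDs are taken from the thread.
  static int priorityFor(String ruleId) {
    switch (ruleId) {
      case "KORT_1":
        return -5;
      case "KORT_2":
        return -5;
      case "EINDE_ZIN_ONVERWACHT":
        return -10;
      default:
        return 0; // assumed default for rules without an explicit entry
    }
  }

  // Of two competing rule IDs, return the one with the higher priority.
  // On a tie the first argument is kept.
  static String winner(String first, String second) {
    return priorityFor(second) > priorityFor(first) ? second : first;
  }

  public static void main(String[] args) {
    // SOME_SPECIFIC_RULE is a placeholder ID with the default priority.
    System.out.println(winner("KORT_1", "SOME_SPECIFIC_RULE"));
  }
}
```

With values like these, lowering the priority of a broad rule is enough to let a more specific rule win whenever both match the same text.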

Ruud_Baars · August 7, 2020, 6:53pm

Yes, but it assumes constant IDs, which is not unreasonable, but not always the case during maintenance. I will just do without it until it proves essential. Java code is something I do not feel competent with.

dnaber · August 7, 2020, 7:27pm

We shouldn’t change IDs. If we do, it breaks users’ configurations, i.e. rules they turned off will suddenly become active again. The only exception I can think of is a very new rule which was just introduced.
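To make that failure mode concrete, here is a minimal sketch that assumes a user's configuration is simply a set of disabled rule IDs. The class, the method and the renamed ID `KORT_1_RENAMED` are hypothetical and not part of LanguageTool.

```java
import java.util.Collections;
import java.util.Set;

// Standalone illustration, not LanguageTool code.
final class DisabledRules {

  // A rule is active unless its exact ID is in the user's disabled set.
  static boolean isActive(String ruleId, Set<String> disabledIds) {
    return !disabledIds.contains(ruleId);
  }

  public static void main(String[] args) {
    // The user turned off KORT_1 at some point in the past.
    Set<String> disabled = Collections.singleton("KORT_1");

    // Under its original ID the rule stays off.
    System.out.println(isActive("KORT_1", disabled));

    // After a rename (hypothetical new ID) the stored setting no longer matches.
    System.out.println(isActive("KORT_1_RENAMED", disabled));
  }
}
```

The stored configuration only knows the old ID, so the renamed rule is treated like any rule the user never disabled.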

Ruud_Baars · August 11, 2020, 4:55am

More unexpected true positives in the nightly. What is the cause?

dnaber · August 11, 2020, 5:34am

EINDE_ZIN_ONVERWACHT's priority has been decreased so it doesn’t hide more important errors.