More data for the nightly tests

dnaber · August 22, 2019, 3:10pm

Hi, I have increased the number of sentences for the nightly regression test (email subject "LanguageTool (open-source) nightly diff test"). This means the next email might contain many changes that are caused not by changes in the rules/code, but by the larger test data. It would be great if you could check it anyway to find false alarms.

dnaber · August 28, 2019, 1:25pm

I have increased the number of sentences once more, so tonight there will probably be many new matches.

dnaber · September 4, 2019, 7:50am

Tonight’s check will again use more data (10,000 additional sentences).

jaumeortola · September 5, 2019, 8:48am

Hi Daniel,

This increase is very welcome.

There is only one thing I would ask. The Wikipedia dump you are using is somewhat old, isn’t it? I guess it is at least five years old, because a lot of Wikipedia sentences have errors that were corrected a long time ago. If you use a current dump, we will get sentences of better quality and fewer true positives in the tests.

dnaber · September 5, 2019, 9:47am

While old errors on Wikipedia have been removed, new ones will have been added. But I can replace the Catalan Wikipedia dump if you like. It also means you’ll get one huge list the day I replace it. Let me know if you want me to do that.

jaumeortola · September 5, 2019, 10:07am

Yes, I would like to update the Catalan Wikipedia dump. Thank you.

dnaber · September 5, 2019, 10:19am

I’ve replaced the Catalan Wikipedia dump (which was from the end of 2012, BTW) with a current one. Let’s see whether that will work tonight… please let me know if it doesn’t.

dnaber · September 10, 2019, 1:41pm

I have increased the number of test sentences for all languages by another 5000 per language.

dnaber · September 16, 2019, 9:42am

I’ve done this once more for tonight’s check (we’re at 100,000 sentences per language now).

dnaber · September 20, 2019, 8:02am

More data was added again yesterday and is showing up in the emails now.

arysin · September 24, 2019, 5:16pm

Daniel, in this regression I see this error:

+Title: 200
+Line 1, column 41, Rule ID: ABBREVIATIONS_WRONG_DOT[1]
+Message: Стягнені скорочення та метричні одиниці пишуться без крапки: 'млн'
+Suggestion: млн
+Rule source: /org/languagetool/rules/uk/grammar-spelling.xml
+За оцінками населення Землі налічує 257 млн.
+                                        ^^^^

But when I look for this sentence on Wikipedia (including the downloaded XML dump) I see that the dot is followed by a space and then two new lines (so the rule should not be triggered). And when I copy those lines and test them locally I don’t see an error.
Where is the best place to look to see how the regression tests are run (most importantly, what the text around this sentence looks like)?

dnaber · September 24, 2019, 7:12pm

The command we run is this:

java -Xmx4500M -jar languagetool-wikipedia.jar check-data -l $lang -f $wikiFile -f $tatoebaFile --max-sentences $maxSentences --languagemodel $ngramDir

However, the Wikipedia dump for Ukrainian is from 2012-12-30, so it might be difficult to find it online (I can send it if you want).

arysin · September 24, 2019, 7:59pm

Thanks, I’ve found the problem.
In the rule I have an exception for
<token negate_pos="yes" postag_regexp="yes" postag="SENT_END.*">.</token>
And when I check text extracted from Wikipedia directly (even with two new lines and next paragraph) it’s tagged as
.[</S>]
But when I run languagetool-wikipedia.jar it gets tagged as
.[</S><P/>]
so my exception fails. I am not sure what the right fix is here (postag="SENT_END.*" didn’t seem to help here).
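
To illustrate why the extra paragraph-end tag could matter, here is a minimal Python sketch of one possible explanation. It is not LanguageTool's matcher. It assumes that </S> and <P/> are the displayed forms of the tags SENT_END and PARA_END, and that a negated postag regex succeeds as soon as any single reading of the token fails to match. The function name and the tag lists are hypothetical and exist only for this illustration.

```python
import re


def negated_postag_matches(readings, postag_regexp):
    """Return True if at least one reading does NOT fully match the regex.

    Assumption: negation is evaluated per reading. This is a simplified
    model for illustration, not LanguageTool's actual matching code.
    """
    pattern = re.compile(postag_regexp)
    return any(pattern.fullmatch(tag) is None for tag in readings)


# Hypothetical tag lists for the final "." in the two scenarios above.
direct_check = ["SENT_END"]               # shown as .[</S>]
wikipedia_jar = ["SENT_END", "PARA_END"]  # shown as .[</S><P/>]

# Only reading is SENT_END, so the negated token does not match.
print(negated_postag_matches(direct_check, "SENT_END.*"))

# PARA_END does not match the regex, so the negated token matches.
print(negated_postag_matches(wikipedia_jar, "SENT_END.*"))
```

Under these assumptions the first call returns False and the second returns True, which would line up with the rule firing only in the languagetool-wikipedia.jar run. In the same simplified model, a regex that also covers PARA_END (for example "SENT_END|PARA_END") would return False for both tag lists; whether that holds in the real matcher would need testing.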

BTW I’ve seen some other complexities with handling SENT_END in rules and I feel like it would be cleaner if we move SENT_END/PARA_END beyond last token (like we move SENT_START before first token).

dnaber · October 11, 2019, 2:54pm

More data (+ 5000 sentences per language) has been added again and the results should be in tomorrow’s emails.