More data for the nightly tests

Hi, I have increased the number of sentences for the nightly regression test (email subject "LanguageTool (open-source) nightly diff test"). This means the next email might contain many changes that are caused not by changes in the rules/code but by the larger test data. It would be great if you could check it anyway to find false alarms.

I have increased the number of sentences once more, so tonight there will probably be many new matches.

Tonight’s check will again use more data (10,000 additional sentences).

Hi Daniel,

This increase is very welcome.

There is only one thing I would ask. The Wikipedia dump you are using is somewhat old, isn’t it? I guess it is at least five years old, because a lot of the Wikipedia sentences have errors that were corrected a long time ago. If you use a current dump, we will get sentences of better quality and fewer true positives in the tests.

While old errors on Wikipedia have been removed, new ones will have been added :) But I can replace the Catalan Wikipedia dump. It also means you’ll get one huge list the day I replace it, so let me know if you want me to do that.

Yes, I would like to update the Catalan Wikipedia dump. Thank you.

I’ve replaced the Catalan Wikipedia dump (which was from the end of 2012, BTW) with a current one. Let’s see whether that will work tonight… please let me know if it doesn’t.

I have increased the number of test sentences for all languages by another 5000 per language.

I’ve done this once more for tonight’s check (we’re at 100,000 sentences per language now).

More data was added again yesterday and is showing up in the emails now.

Daniel, in this regression I see this error:

+Title: 200
+Line 1, column 41, Rule ID: ABBREVIATIONS_WRONG_DOT[1]
+Message: Стягнені скорочення та метричні одиниці пишуться без крапки: 'млн'
+Suggestion: млн
+Rule source: /org/languagetool/rules/uk/grammar-spelling.xml
+За оцінками населення Землі налічує 257 млн.
+                                        ^^^^

But when I look for this sentence on Wikipedia (including the downloaded XML dump), I see that the dot is followed by a space and then two newlines, so the rule should not be triggered. And when I copy those lines and test them locally, I don’t see an error.
Where is the best place to look to see how the regression tests are run (most importantly, what the text around this sentence looks like)?

The command we run is this:

java -Xmx4500M -jar languagetool-wikipedia.jar check-data -l $lang -f $wikiFile -f $tatoebaFile --max-sentences $maxSentences --languagemodel $ngramDir

However, the Wikipedia dump for Ukrainian is from 2012-12-30, so it might be difficult to find it online (I can send it if you want).
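To reproduce a single language locally, the command above can be assembled in a small script. This is only a minimal sketch: the flags are the ones from the command quoted above, while the function name, the file names, and the sentence limit in the example call are hypothetical placeholders, not the values used for the nightly run.

```python
import shlex


def build_check_command(lang, wiki_file, tatoeba_file, max_sentences, ngram_dir,
                        jar="languagetool-wikipedia.jar", heap="4500M"):
    """Return the argument list for the check-data call quoted above.

    All parameter values are supplied by the caller; nothing here is
    read from the real nightly configuration.
    """
    return [
        "java", f"-Xmx{heap}", "-jar", jar, "check-data",
        "-l", lang,
        "-f", wiki_file,
        "-f", tatoeba_file,
        "--max-sentences", str(max_sentences),
        "--languagemodel", ngram_dir,
    ]


if __name__ == "__main__":
    # Placeholder paths and limit, for illustration only.
    cmd = build_check_command("uk", "ukwiki.xml", "tatoeba-uk.txt", 1000, "ngrams/")
    # Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting problems if a dump path contains spaces.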

Thanks, I’ve found the problem.
In the rule I have an exception for
<token negate_pos="yes" postag_regexp="yes" postag="SENT_END.*">.</token>
When I check the text extracted from Wikipedia directly (even with the two newlines and the next paragraph), it’s tagged as
.[</S>]
But when I run languagetool-wikipedia.jar it gets tagged as
.[</S><P/>]
so my exception fails. I am not sure what the right fix is here (postag="SENT_END.*" didn’t seem to help).
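One possible explanation, sketched here as a toy model and not taken from the LanguageTool code: if a negated POS check succeeds as soon as any single reading of the token fails to match the pattern, then the extra paragraph-end reading is enough to flip the result. The sketch assumes that `</S>` corresponds to the tag `SENT_END` and `<P/>` to `PARA_END`; both function names are hypothetical.

```python
import re


def negated_pos_any_reading(readings, pattern):
    """Negation as 'at least one reading does NOT match the pattern'.

    Assumed semantics, used only to illustrate the observed behaviour.
    """
    return any(re.fullmatch(pattern, tag) is None for tag in readings)


def negated_pos_all_readings(readings, pattern):
    """Negation as 'no reading matches the pattern' (the stricter variant)."""
    return all(re.fullmatch(pattern, tag) is None for tag in readings)


PATTERN = "SENT_END.*"

# Readings of the final dot in the two situations described above
# (tag names are an assumed mapping of </S> and <P/>).
local_check = ["SENT_END"]
wikipedia_jar = ["SENT_END", "PARA_END"]

if __name__ == "__main__":
    for name, readings in [("local", local_check), ("wikipedia jar", wikipedia_jar)]:
        print(name,
              "any-reading:", negated_pos_any_reading(readings, PATTERN),
              "all-readings:", negated_pos_all_readings(readings, PATTERN))
```

Under the any-reading model the token with both tags passes the negated check, so the rule fires; under the all-readings model it would not. If the real matcher behaves like the first variant, a pattern covering both tags (for example a regexp alternation of the two tag names) may be worth trying, but that is untested.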

BTW, I’ve seen some other complexities with handling SENT_END in rules, and I feel it would be cleaner if we moved SENT_END/PARA_END beyond the last token (like we move SENT_START before the first token).

More data (+ 5000 sentences per language) has been added again and the results should be in tomorrow’s emails.