
Grammar.xml size


(Ruud Baars) #1

Is it bad for technical and performance reasons to have a large grammar.xml? Do comments load into memory, or are those discarded in production?

The rationale behind the question is that it is great for experimentation to have a large number of examples per rule; checking the effect of a rule change is then immediate.
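For context, a rule entry in grammar.xml looks roughly like this (element names follow the LanguageTool rule format as I understand it; the rule itself is made up):

```xml
<!-- Comments like this are only seen by the XML parser at load time;
     they do not need to survive into the in-memory rule objects. -->
<rule id="EXAMPLE_TEH" name="Example rule">
  <pattern>
    <token>teh</token>
  </pattern>
  <message>Did you mean <suggestion>the</suggestion>?</message>
  <example correction="the">This is <marker>teh</marker> example.</example>
  <example>This is the example.</example>
</rule>
```

Each extra `<example>` makes the file bigger and slower to parse, but gives an immediate regression check when the rule changes.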


(Daniel Naber) #2

You can watch performance for Dutch at https://languagetool.org/regression-tests/performance-nl.png

I’m quite sure they are not loaded into memory.


(Ruud Baars) #3

That is quite a steep increase. Since the number of rules is equal, the examples and/or comments must cause quite a bit of load. A good reason to trim them down to the essentials.


(Dominique PELLÉ) #4

The XML file has to be parsed at startup. If it’s large, it causes a delay at startup.
Then of course, many rules have their cost when checking text.

I remember measuring LT startup time for several languages and several releases
a few years ago. I should probably do that again. Actually, using LT with my Vim
plugin, it seems that LT startup is slow now, but I have not measured it.

Comments are probably cheap to parse and not loaded in memory so I would not worry about those.

I’m pretty sure that XML files are loaded lazily, so if I only check grammar in French for example,
I won’t be penalized with large Dutch XML files.


(Ruud Baars) #5

It appears the increase was around the 23rd to 25th of last month. That is before the grammar file size increased, so it must be one of the other changes, like the coherency check or the preferred words rule file.
I will temporarily remove the preferred words checks, so we can see what that does to performance.


(Ruud Baars) #6

Is that graph updated automagically?


(Daniel Naber) #7

Yes, just make sure your browser doesn’t cache it.


(Ruud Baars) #8

Which makes clear it is the ‘preferred words’ file that creates the load. With 8000 entries, that is no surprise. There is an intrinsic issue in it. Part could be solved by not adding the least used words to the spellchecker (but that means removing a valid word). And of course the less frequent words are rarely found and reported. So it has to be reduced to those spelling variants where both forms are relatively common. Needs some work…
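The filtering step could look something like this (a minimal sketch; the pair list, frequency table, and threshold are made-up placeholders, not the real file format or corpus counts):

```python
# Sketch: keep only preferred-word pairs where BOTH the discouraged
# variant and the preferred form are reasonably common in a corpus.

MIN_FREQ = 1000  # hypothetical frequency threshold

# Toy frequency table standing in for real corpus word counts.
freq = {
    "kado": 1500, "cadeau": 90000,
    "produkt": 40, "product": 120000,
}

# (discouraged variant, preferred form) pairs, as in the rule file.
pairs = [("kado", "cadeau"), ("produkt", "product")]

kept = [
    (variant, preferred)
    for variant, preferred in pairs
    if freq.get(variant, 0) >= MIN_FREQ
    and freq.get(preferred, 0) >= MIN_FREQ
]
print(kept)
```

Rare variants like "produkt" drop out, since they are almost never found and reported anyway, while common pairs stay in.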

For now, I will leave the file empty until I have time to filter it.