
Grammar.xml size


(Ruud Baars) #1

Is it bad for technical and performance reasons to have a large grammar.xml? Do comments load into memory, or are those discarded in production?

The rationale behind the question is that it is great for experimentation to have a large number of examples per rule; checking the effect of a rule change is then immediate.
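For context, a rule entry in grammar.xml looks roughly like this (element names follow the LanguageTool rule format as I understand it; the rule itself is made up):

```xml
<!-- Comments like this are only seen by the XML parser at load time;
     they do not need to survive into the in-memory rule objects. -->
<rule id="EXAMPLE_TEH" name="Example rule">
  <pattern>
    <token>teh</token>
  </pattern>
  <message>Did you mean <suggestion>the</suggestion>?</message>
  <example correction="the">This is <marker>teh</marker> example.</example>
  <example>This is the example.</example>
</rule>
```

Each extra `<example>` makes the file bigger and slower to parse, but gives an immediate regression check when the rule changes.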


(Daniel Naber) #2

You can watch performance for Dutch at https://languagetool.org/regression-tests/performance-nl.png

I’m quite sure they are not loaded into memory.


(Ruud Baars) #3

That is quite a steep increase. Since the number of rules is equal, the examples and/or comments must cause quite a bit of load. A good reason to trim them down to the essentials.


(Dominique PELLÉ) #4

The XML file has to be parsed at startup. If it’s large, it causes a delay at startup.
Then of course, many rules have their cost when checking text.

I remember measuring LT startup time for several languages and several releases
a few years ago. I should probably do that again. Actually, using LT with my Vim
plugin, it seems that LT startup is slow now, but I have not measured it.

Comments are probably cheap to parse and not loaded in memory so I would not worry about those.

I’m pretty sure that XML files are loaded lazily, so if I only check grammar in French for example,
I won’t be penalized with large Dutch XML files.


(Ruud Baars) #5

It appears the increase was around the 23rd to 25th of last month. That is before the grammar file size increased, so it must be one of the other changes, like the coherency check or the preferred words rule file.
I will temporarily remove the preferred words checks, so we can see what that does to performance.


(Ruud Baars) #6

Is that graph updated automagically?


(Daniel Naber) #7

Yes, just make sure your browser doesn’t cache it.


(Ruud Baars) #8

Which makes clear it is the ‘preferred words’ file that creates the load. With 8000 entries, that is no surprise. There is an intrinsic issue in it. Part could be solved by not adding the least used words to the spellchecker (but that means removing a valid word). And of course the less frequent words are rarely found and reported. So it has to be reduced to those spelling variants where both forms are relatively common. Needs some work…
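The filtering step could look something like this (a minimal sketch; the pair list, frequency table, and threshold are made-up placeholders, not the real file format or corpus counts):

```python
# Sketch: keep only preferred-word pairs where BOTH the discouraged
# variant and the preferred form are reasonably common in a corpus.

MIN_FREQ = 1000  # hypothetical frequency threshold

# Toy frequency table standing in for real corpus word counts.
freq = {
    "kado": 1500, "cadeau": 90000,
    "produkt": 40, "product": 120000,
}

# (discouraged variant, preferred form) pairs, as in the rule file.
pairs = [("kado", "cadeau"), ("produkt", "product")]

kept = [
    (variant, preferred)
    for variant, preferred in pairs
    if freq.get(variant, 0) >= MIN_FREQ
    and freq.get(preferred, 0) >= MIN_FREQ
]
print(kept)
```

Rare variants like "produkt" drop out, since they are almost never found and reported anyway, while common pairs stay in.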

For now, I will leave the file empty until I have time to filter it.