Out of memory

Because I added a huge generated rule group that suggests the most common alternative for a plural, the number of rules in the grammar file grew from around 1600 to about 8000. And there is more where this came from.
Apart from no longer being able to edit comfortably in the IDE, there might be other consequences. If you notice any, please let me know.

Could I put those generated rule sets in some kind of include for the grammar.xml?

One consequence is that checking becomes slower. What do these auto-generated rules look like?

Of course. But quality always takes time. There are multiple types of rules:

        <rule>
            <pattern>
                <marker>
                    <token>aanbestedingsplatformen</token>
                </marker>
            </pattern>
            <message>Het gebruikelijker meervoud voor 'aanbestedingsplatformen' is <suggestion>aanbestedingsplatforms</suggestion>.</message>
            <example correction="aanbestedingsplatforms">Ik vind <marker>aanbestedingsplatformen</marker> ongebruikelijk.</example>
        </rule>

And the other one (so far) is:

        <rule>
            <pattern>
                <marker>
                    <token inflected="yes">liberalisatie</token>
                </marker>
            </pattern>
            <message>Een gebruikelijker woord voor 'liberalisatie' is 'liberalisering'.</message>
            <example type="incorrect">Ik vind <marker>liberalisatie</marker> ongebruikelijk.</example>
            <example>Een gebruikelijker woord is <marker>liberalisering</marker>.</example>
        </rule>

These rules are generated when the word frequencies of the two equivalent forms differ considerably.
If they don’t differ that much, the forms become consistency-checking pairs instead.
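The split described above could be sketched roughly as follows. This is only an illustration in Python, not the actual generator: the function name `classify_pair`, the ratio threshold of 5, and the frequencies in the usage note are all assumptions made for the example.

```python
# Hypothetical threshold: how many times more frequent one form must be
# before the rarer form gets its own "unusual" rule. Not from the thread.
RATIO_THRESHOLD = 5.0


def classify_pair(form_a, freq_a, form_b, freq_b, threshold=RATIO_THRESHOLD):
    """Return ("suggest", rare, common) or ("consistency", form_a, form_b)."""
    if freq_a >= freq_b:
        common, rare, high, low = form_a, form_b, freq_a, freq_b
    else:
        common, rare, high, low = form_b, form_a, freq_b, freq_a
    # Without any frequency data there is nothing to prefer.
    if high == 0:
        return ("consistency", form_a, form_b)
    # A form that never occurs counts as arbitrarily rarer than the other.
    if low == 0 or high / low >= threshold:
        return ("suggest", rare, common)
    return ("consistency", form_a, form_b)
```

With made-up counts, `classify_pair("aanbestedingsplatformen", 10, "aanbestedingsplatforms", 200)` falls in the "suggest" group, while two forms with counts of 100 and 80 end up as a consistency pair. Raising the threshold produces fewer suggestion rules and more consistency pairs.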

So either way, they will have an effect on performance.

Of course, one option is to switch the rule off by default. I could also limit the number of generated rules with a word-frequency threshold, at the cost of quality.

What is your idea?

But having this list as a Java rule might be much faster. Also, you don’t have the issue with the XML becoming huge. If you send me the new XML by email, I can test its effects on performance.

Okay, I will.

I understand. But there will be multiple rules:

  • More common plural: gaven vs gaven; does not need a regexp.
  • More common diminutive: kindjes vs kindertjes; needs a regexp for the final s, or more entries.
  • More common synonym: reservering vs reservatie, boekenworm vs bookworm, fiets vs rijwiel.
  • Form with or without a linking s: drugstest vs drugtest.

Or… one data file with a flag per entry, and multiple messages possible, defined either in code or in a configuration header or file.
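To make the single-data-file idea concrete, here is a minimal sketch in Python. It is not LanguageTool code: the tab-separated format, the flag names, and the functions `load_entries` and `check` are assumptions for illustration. The message texts are taken from the two XML rules quoted earlier. Unlike the `inflected="yes"` rule, this sketch only matches surface forms.

```python
# Hypothetical message templates, one per flag.
MESSAGES = {
    "plural": "Het gebruikelijker meervoud voor '{found}' is '{better}'.",
    "synonym": "Een gebruikelijker woord voor '{found}' is '{better}'.",
}

# Hypothetical data file layout: unusual<TAB>preferred<TAB>flag
DATA = """\
# unusual form, preferred form, flag
aanbestedingsplatformen\taanbestedingsplatforms\tplural
liberalisatie\tliberalisering\tsynonym
"""


def load_entries(text):
    """Parse the data file into a dict: unusual form -> (preferred, flag)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        found, better, flag = line.split("\t")
        entries[found.lower()] = (better, flag)
    return entries


def check(tokens, entries):
    """Return (token position, message) for every token found in the list."""
    matches = []
    for position, token in enumerate(tokens):
        hit = entries.get(token.lower())
        if hit:
            better, flag = hit
            message = MESSAGES[flag].format(found=token, better=better)
            matches.append((position, message))
    return matches
```

Each token costs one dictionary lookup regardless of how many entries the file holds, which is the reason a list-based rule may scale better than thousands of separate pattern rules. New rule types would only need a new flag and a message template.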

Results - all times are milliseconds per sentence, on average:

  • the grammar.xml in git: 25ms
  • the grammar.xml in git, without ONGEBRUIKELIJK_MEERVOUD: 22ms
  • the file you sent me: 30ms
  • the file you sent me, without UNUSUAL_WORDS: 23ms

So while UNUSUAL_WORDS adds 25% or so to the runtime, I think this is still okay. Ukrainian splits its rules into several files; if you want, we can do that for Dutch, too (it requires a small change to Dutch.java).
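For reference, the relative overhead can be worked out from the four timings listed above. The helper name `overhead` is only for this example. Measured against the run without the rule, UNUSUAL_WORDS costs about 30%; measured as a share of the total 30ms it is about 23%, so "25% or so" sits between the two readings.

```python
def overhead(with_rule_ms, without_rule_ms):
    """Extra runtime relative to the run without the rule, in percent."""
    return (with_rule_ms - without_rule_ms) / without_rule_ms * 100


# grammar.xml in git, with and without ONGEBRUIKELIJK_MEERVOUD
git_overhead = overhead(25, 22)
# the emailed file, with and without UNUSUAL_WORDS
sent_overhead = overhead(30, 23)

print(f"{git_overhead:.1f}% {sent_overhead:.1f}%")  # prints "13.6% 30.4%"
```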

That might be an idea for manageability reasons. I will have a look at Ukrainian first, to see what it looks like.

Here’s another measurement, which looks worse, but I guess we can live with it:

https://languagetool.org/regression-tests/performance-nl.png

Okay. It is all in the Java rule now.

There are a lot of (almost equally frequent) spelling variants for many more words. I hesitate to add those to the consistency rule, since I read in the report that it might slow things down for larger documents.

I will have to find a test for that.