Out of memory

Ruud_Baars · February 23, 2018, 7:15am

Because I added a huge generated rulegroup that suggests the most common alternative for a plural, the number of rules in the grammar file grew from around 1600 to about 8000. And there is more where this comes from.
Apart from no longer being able to edit comfortably in the IDE, there might be other consequences. If you notice any, please let me know.

Could I put those generated rule sets in some kind of include file that grammar.xml pulls in?

dnaber · February 23, 2018, 8:34am

One consequence is that checking becomes slower. What do these auto-generated rules look like?

Ruud_Baars · February 23, 2018, 8:50am

Of course. But quality always takes time. There are multiple types of rules:

        <rule>
            <pattern>
                <marker>
                    <token>aanbestedingsplatformen</token>
                </marker>
            </pattern>
            <message>Het gebruikelijker meervoud voor 'aanbestedingsplatformen' is <suggestion>aanbestedingsplatforms</suggestion>.</message>
            <example correction="aanbestedingsplatforms">Ik vind <marker>aanbestedingsplatformen</marker> ongebruikelijk.</example>
        </rule>

And the other one (so far) is

        <rule>
            <pattern>
                <marker>
                    <token inflected="yes">liberalisatie</token>
                </marker>
            </pattern>
            <message>Een gebruikelijker woord voor 'liberalisatie' is 'liberalisering'.</message>
            <example type="incorrect">Ik vind <marker>liberalisatie</marker> ongebruikelijk.</example>
            <example>Een gebruikelijker woord is <marker>liberalisering</marker>.</example>
        </rule>

These rules are generated when the word frequencies of the two equivalent forms differ considerably.
If they don’t differ that much, the forms become consistency-checking pairs instead.

So either way, they will have an effect on performance.

Of course there is the option of switching the rule off by default. But I could also limit the number of rules being generated by using a word-frequency threshold, at the cost of quality.

What is your idea?

dnaber · February 23, 2018, 9:14am

But having this list as a Java rule might be much faster. Also, you don’t have the issue with the XML becoming huge. If you send me the new XML by email, I can test its effects on performance.
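To illustrate why a list-based Java rule could be faster: instead of trying thousands of XML patterns on each sentence, the rule can do one hash lookup per token. The following is only a minimal sketch in plain Java. The class name `UnusualWordLookup` and its methods are hypothetical and not part of LanguageTool's API, and the two entries are taken from the XML rules quoted above. It matches exact word forms only, so the `inflected="yes"` case would need extra entries or a lemma lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not LanguageTool's API.
class UnusualWordLookup {
    // Maps an unusual form to its more common alternative.
    private final Map<String, String> preferred = new HashMap<>();

    UnusualWordLookup() {
        // Entries copied from the XML examples in this thread.
        preferred.put("aanbestedingsplatformen", "aanbestedingsplatforms");
        preferred.put("liberalisatie", "liberalisering");
    }

    // Returns one hint per flagged token; each token costs a single hash lookup.
    List<String> check(String sentence) {
        List<String> hints = new ArrayList<>();
        // Naive tokenization on non-letters, an assumption made for this sketch.
        for (String token : sentence.split("[^\\p{L}]+")) {
            String suggestion = preferred.get(token.toLowerCase(Locale.ROOT));
            if (suggestion != null) {
                hints.add("'" + token + "' -> '" + suggestion + "'");
            }
        }
        return hints;
    }

    public static void main(String[] args) {
        UnusualWordLookup lookup = new UnusualWordLookup();
        System.out.println(lookup.check("Ik vind aanbestedingsplatformen ongebruikelijk."));
    }
}
```

With this layout the cost per sentence depends on the number of tokens, not on the number of entries in the list.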

Ruud_Baars · February 23, 2018, 9:19am

Okay, I will.

Ruud_Baars · February 23, 2018, 3:11pm

I understand. But there will be multiple rules:
More common form:
plural: gaven vs gaven, does not need a regexp
diminutive: kindjes vs kindertjes, needs a regexp for the final s, or more entries
More common synonyms: reservering vs reservatie, boekenworm vs bookworm, fiets vs rijwiel
Form with or without linking s: drugstest vs drugtest.

Or… one data file with a flag per entry, and multiple possible messages, defined either in code or in a configuration header or file.
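A rough sketch of what the single-file variant might look like. Everything here is an assumption for illustration: the `unusual;common;FLAG` line format, the flag letters `P` (plural) and `S` (synonym), and the class name `FlaggedWordList`. The two message texts are adapted from the XML rules quoted earlier; further flags (diminutive, linking s) would just be additional entries in the message table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one data file, a flag per entry, one message per flag.
class FlaggedWordList {
    // Flag -> message template (first %s = unusual form, second %s = common form).
    static final Map<String, String> MESSAGES = new HashMap<>();
    static {
        MESSAGES.put("P", "Het gebruikelijker meervoud voor '%s' is '%s'.");
        MESSAGES.put("S", "Een gebruikelijker woord voor '%s' is '%s'.");
    }

    // Unusual form -> {common form, flag}.
    private final Map<String, String[]> entries = new HashMap<>();

    // Parses lines of the assumed form "unusual;common;FLAG"; '#' starts a comment line.
    FlaggedWordList(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split(";");
            if (parts.length != 3 || !MESSAGES.containsKey(parts[2])) {
                throw new IllegalArgumentException("Bad line: " + line);
            }
            entries.put(parts[0], new String[] {parts[1], parts[2]});
        }
    }

    // Returns the message for a listed word, or null if the word is not in the list.
    String messageFor(String word) {
        String[] entry = entries.get(word);
        if (entry == null) {
            return null;
        }
        return String.format(MESSAGES.get(entry[1]), word, entry[0]);
    }

    public static void main(String[] args) {
        FlaggedWordList list = new FlaggedWordList(Arrays.asList(
                "# unusual;common;flag",
                "aanbestedingsplatformen;aanbestedingsplatforms;P",
                "liberalisatie;liberalisering;S"));
        System.out.println(list.messageFor("liberalisatie"));
    }
}
```

Keeping the messages in a table keyed by flag means a new rule type only needs a new flag and template, not a new data file.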

dnaber · February 23, 2018, 3:50pm

Results - all times are milliseconds per sentence, on average:

the grammar.xml in git: 25ms
the grammar.xml in git, without ONGEBRUIKELIJK_MEERVOUD: 22ms
the file you sent me: 30ms
the file you sent me, without UNUSUAL_WORDS: 23ms

So while UNUSUAL_WORDS adds 25% or so to the runtime, I think this is still okay. Ukrainian splits its rules into several files; if you want, we can do that for Dutch, too (it requires a small change to Dutch.java).
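For reference, the relative cost of each rule group can be worked out from the four averages above. In this small sketch the class and method names are made up for illustration. Measured against the run without the rule group, ONGEBRUIKELIJK_MEERVOUD comes to roughly 14% and UNUSUAL_WORDS to roughly 30%; measured against the 30ms total, the 7ms difference is about 23%.

```java
import java.util.Locale;

// Hypothetical helper: only does the arithmetic on the averages quoted above.
class RuleOverhead {
    // Extra runtime as a percentage of the baseline (the run without the rule group).
    static double overheadPercent(double withRuleMs, double withoutRuleMs) {
        return (withRuleMs - withoutRuleMs) / withoutRuleMs * 100.0;
    }

    public static void main(String[] args) {
        // grammar.xml in git: 25ms with, 22ms without ONGEBRUIKELIJK_MEERVOUD
        System.out.println(String.format(Locale.ROOT, "%.1f%%", overheadPercent(25, 22)));
        // the file sent by email: 30ms with, 23ms without UNUSUAL_WORDS
        System.out.println(String.format(Locale.ROOT, "%.1f%%", overheadPercent(30, 23)));
    }
}
```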

Ruud_Baars · February 23, 2018, 6:16pm

That might be an idea for manageability reasons. I will have a look at Ukrainian first, to see what it looks like.

dnaber · February 26, 2018, 8:19am

Here’s another measurement, which looks worse, but I guess we can live with it:

https://languagetool.org/regression-tests/performance-nl.png

Ruud_Baars · February 26, 2018, 9:11am

Okay. It is all in the Java rule now.

There are a lot of (almost equally frequent) spelling variations for many more words. I am hesitant to add those to the consistency rule, since I read in the report that it might slow things down for larger documents.

Will have to find a test for that.
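As a starting point for such a test, here is a minimal sketch of what a consistency check has to do: it remembers, across the whole document, which variant of a pair was seen first and flags the other one later on. This is plain Java under my own assumptions, not the actual consistency rule; the class name `VariantConsistency` is hypothetical and the pair drugstest/drugtest is only borrowed from the list above as an example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a document-level consistency check.
class VariantConsistency {
    // Each variant maps to an identifier shared by both members of its pair.
    private final Map<String, String> variantToGroup = new HashMap<>();

    void addPair(String a, String b) {
        String group = a + "/" + b;
        variantToGroup.put(a, group);
        variantToGroup.put(b, group);
    }

    // Returns every token that conflicts with the variant seen first in the document.
    List<String> check(List<String> sentences) {
        Map<String, String> firstSeen = new HashMap<>(); // group -> first variant used
        List<String> conflicts = new ArrayList<>();
        for (String sentence : sentences) {
            // Naive tokenization on non-letters, an assumption made for this sketch.
            for (String raw : sentence.split("[^\\p{L}]+")) {
                String token = raw.toLowerCase(Locale.ROOT);
                String group = variantToGroup.get(token);
                if (group == null) {
                    continue;
                }
                String first = firstSeen.putIfAbsent(group, token);
                if (first != null && !first.equals(token)) {
                    conflicts.add(token);
                }
            }
        }
        return conflicts;
    }
}
```

Timing `check` on documents of increasing length, with and without the extra pairs loaded, would be one way to see whether the larger list matters in practice.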