
Out of memory


(Ruud Baars) #1

Because I added a huge generated rule group that suggests the most common alternative for a plural, the number of rules in the grammar file grew from around 1600 to about 8000. And there is more where this comes from.
Apart from no longer being able to edit comfortably in the IDE, there might be more consequences. If you notice any, please let me know.

Could I put those generated rule sets in some kind of include file that is referenced from grammar.xml?


(Daniel Naber) #2

One consequence is that checking becomes slower. What do these auto-generated rules look like?


(Ruud Baars) #3

Of course. But quality always takes time. There are multiple types of rules:

        <rule>
            <pattern>
                <marker>
                    <token>aanbestedingsplatformen</token>
                </marker>
            </pattern>
            <message>Het gebruikelijker meervoud voor 'aanbestedingsplatformen' is <suggestion>aanbestedingsplatforms</suggestion>.</message>
            <example correction="aanbestedingsplatforms">Ik vind <marker>aanbestedingsplatformen</marker> ongebruikelijk.</example>
        </rule>

And the other one (so far) is

        <rule>
            <pattern>
                <marker>
                    <token inflected="yes">liberalisatie</token>
                </marker>
            </pattern>
            <message>Een gebruikelijker woord voor 'liberalisatie' is 'liberalisering'.</message>
            <example type="incorrect">Ik vind <marker>liberalisatie</marker> ongebruikelijk.</example>
            <example>Een gebruikelijker woord is <marker>liberalisering</marker>.</example>
        </rule>

These rules are generated when the word frequencies of the two equivalent forms differ considerably.
If they don’t differ that much, the forms become consistency-checking pairs instead.
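
The decision described above could be sketched roughly as follows. This is only an illustration: the class name, the enum, and above all the ratio threshold are assumptions, since the thread does not say how large the frequency difference has to be, and the frequencies in `main` are made up.

```java
// Hypothetical sketch: the threshold and all names are assumptions,
// not part of LanguageTool or of the actual generator script.
final class VariantPairClassifier {

    enum Kind { SUGGEST_MORE_COMMON, CONSISTENCY_PAIR }

    // Assumed cut-off; the thread only says the frequencies "differ quite a bit".
    static final double ASSUMED_RATIO_THRESHOLD = 5.0;

    // Compares the corpus frequencies of two equivalent forms.
    static Kind classify(long freqA, long freqB) {
        long high = Math.max(freqA, freqB);
        long low = Math.max(1, Math.min(freqA, freqB)); // avoid division by zero
        double ratio = (double) high / low;
        return ratio >= ASSUMED_RATIO_THRESHOLD
                ? Kind.SUGGEST_MORE_COMMON
                : Kind.CONSISTENCY_PAIR;
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        System.out.println(classify(120, 9000));   // large gap
        System.out.println(classify(4000, 5000));  // small gap
    }
}
```

Raising the threshold would move pairs from the suggestion rule to the consistency check, so fewer suggestion rules get generated.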

So either way, they will have an effect on performance.

Of course, there is the option to switch the rule off by default. I could also limit the number of generated rules using word frequency, at the cost of some quality.

What is your idea?


(Daniel Naber) #4

But having this list as a Java rule might be much faster. Also, you don’t have the issue with the XML becoming huge. If you send me the new XML by email, I can test its effects on performance.
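
To illustrate why a word list in Java can be cheaper than thousands of XML patterns, here is a minimal sketch of the lookup idea. It is plain Java and deliberately does not use any LanguageTool classes; the class and method names are hypothetical, and an actual rule would still need to be wired into the project's rule API. The two word pairs are taken from the XML examples earlier in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: plain Java, not an actual LanguageTool rule.
final class UnusualWordLookup {

    // Maps an unusual form (lower-cased) to its more common alternative.
    private final Map<String, String> preferredByUnusual = new HashMap<>();

    void add(String unusual, String moreCommon) {
        preferredByUnusual.put(unusual.toLowerCase(Locale.ROOT), moreCommon);
    }

    // One hash lookup per token, independent of how many pairs are loaded.
    Optional<String> suggestionFor(String token) {
        return Optional.ofNullable(
                preferredByUnusual.get(token.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        UnusualWordLookup lookup = new UnusualWordLookup();
        lookup.add("aanbestedingsplatformen", "aanbestedingsplatforms");
        lookup.add("liberalisatie", "liberalisering");
        for (String token : "Ik vind liberalisatie ongebruikelijk".split(" ")) {
            lookup.suggestionFor(token)
                  .ifPresent(s -> System.out.println(token + " -> " + s));
        }
    }
}
```

Note that this sketch matches surface forms only. The `inflected="yes"` behaviour of the XML rule would need either a lemma lookup or extra entries for each inflected form.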


(Ruud Baars) #5

Okay, I will.


(Ruud Baars) #6

I understand. But there will be multiple rules:

  • More common plural: gaven vs gaven; does not need a regexp
  • More common diminutive: kindjes vs kindertjes; needs a regexp for the final s, or more entries
  • More common synonyms: reservering vs reservatie, boekenworm vs bookworm, fiets vs rijwiel
  • Form with or without a linking s: drugstest vs drugtest

Or… one data file with a flag per entry, and multiple messages possible, defined either in code or in a configuration header or file.
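
A rough sketch of that single-file idea, under stated assumptions: the tab-separated layout (flag, unusual form, preferred form), the flag names, and the English message templates are all invented here for illustration, and none of this is LanguageTool code. The two entries reuse the word pairs from the XML examples earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: file layout, flag names and messages are assumptions.
final class FlaggedWordData {

    record Entry(String flag, String unusual, String preferred) {}

    // One message template per flag, kept in code in this sketch.
    static final Map<String, String> MESSAGES = Map.of(
            "PLURAL", "A more common plural for '%s' is '%s'.",
            "DIMINUTIVE", "A more common diminutive for '%s' is '%s'.",
            "SYNONYM", "A more common word for '%s' is '%s'.",
            "LINKING_S", "A more common spelling of '%s' is '%s'.");

    // Parses lines of the assumed form: FLAG <tab> unusual <tab> preferred
    static List<Entry> parse(String data) {
        List<Entry> entries = new ArrayList<>();
        for (String line : data.split("\n")) {
            line = line.strip();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] cols = line.split("\t");
            if (cols.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            entries.add(new Entry(cols[0], cols[1], cols[2]));
        }
        return entries;
    }

    static String messageFor(Entry e) {
        String template = MESSAGES.get(e.flag());
        if (template == null) {
            throw new IllegalArgumentException("Unknown flag: " + e.flag());
        }
        return String.format(template, e.unusual(), e.preferred());
    }

    public static void main(String[] args) {
        String data = "# flag\tunusual\tpreferred\n"
                + "PLURAL\taanbestedingsplatformen\taanbestedingsplatforms\n"
                + "SYNONYM\tliberalisatie\tliberalisering\n";
        for (Entry e : parse(data)) {
            System.out.println(messageFor(e));
        }
    }
}
```

Keeping the templates in a map means a new category only needs a new flag and one message, while the data file stays a flat list that is easy to regenerate.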


(Daniel Naber) #7

Results - all times are milliseconds per sentence, on average:

  • the grammar.xml in git: 25ms
  • the grammar.xml in git, without ONGEBRUIKELIJK_MEERVOUD: 22ms
  • the file you sent me: 30ms
  • the file you sent me, without UNUSUAL_WORDS: 23ms

So while UNUSUAL_WORDS adds 25% or so to the runtime, I think this is still okay. Ukrainian splits its rules into several files; if you want, we can do that for Dutch, too (it requires a small change to Dutch.java).
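
The timings above can be turned into a percentage in two ways, which is worth keeping apart when comparing later measurements: as extra time relative to the run without the rule group (roughly 30% for UNUSUAL_WORDS), or as the share of the total time spent in the rule group (roughly 23%). A small sketch of that arithmetic follows; the helper class is hypothetical and only the four timings come from the list above.

```java
import java.util.Locale;

// Hypothetical helper; only the timings in main come from the measurements above.
final class RuleOverhead {

    // Extra time relative to the run without the rule group.
    static double overheadPercent(double withRule, double withoutRule) {
        return (withRule - withoutRule) / withoutRule * 100.0;
    }

    // Share of the total per-sentence time spent in the rule group.
    static double sharePercent(double withRule, double withoutRule) {
        return (withRule - withoutRule) / withRule * 100.0;
    }

    public static void main(String[] args) {
        // Milliseconds per sentence: the file sent by email, with and without UNUSUAL_WORDS.
        System.out.printf(Locale.ROOT, "UNUSUAL_WORDS: +%.1f%% overhead, %.1f%% of total%n",
                overheadPercent(30, 23), sharePercent(30, 23));
        // Milliseconds per sentence: grammar.xml in git, with and without ONGEBRUIKELIJK_MEERVOUD.
        System.out.printf(Locale.ROOT, "ONGEBRUIKELIJK_MEERVOUD: +%.1f%% overhead, %.1f%% of total%n",
                overheadPercent(25, 22), sharePercent(25, 22));
    }
}
```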


(Ruud Baars) #8

That might be a good idea for manageability reasons. I will have a look at Ukrainian first to see what it looks like.


(Daniel Naber) #9

Here’s another measurement, which looks worse, but I guess we can live with it:


(Ruud Baars) #10

Okay. It is all in the Java rule now.

There are a lot of (almost equally frequent) spelling variations for more words. I am in doubt about adding those to the consistency rule, since I read in the report that it might slow things down for larger documents.

Will have to find a test for that.