Performance issues

Hi, we’re having performance issues on languagetool.org. Running a profiler on the live system, I cannot see a specific reason. It looks as if the increasing traffic at peak times is just too much for the server. I’ll continue investigating this. For the next few hours, the server will run without the ngram data to see if that makes a difference.

I found one reason for the performance issue: the rule SPACE_BEFORE_PUNCTUATION in pt/grammar.xml, which I have commented out for now. It can cause exponential runtime on input that ends in thousands of line breaks (\n). @tiagosantos, maybe you can find a regex that stays fast even for such strange documents.
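To illustrate the failure mode (the pattern below is a made-up example, not the actual SPACE_BEFORE_PUNCTUATION regexp): a nested quantifier lets a backtracking engine split a long run of whitespace in exponentially many ways before giving up.

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
    public static void main(String[] args) {
        // Hypothetical pattern with a nested quantifier; NOT the rule's actual regexp.
        // (\s+)+ can split a run of n whitespace characters in 2^(n-1) ways,
        // and a backtracking engine tries them all once [,;:] fails to match.
        Pattern evil = Pattern.compile("(\\s+)+[,;:]");
        StringBuilder input = new StringBuilder("word");
        for (int i = 0; i < 26; i++) {
            input.append('\n');
        }
        input.append('x'); // no punctuation at the end, so every attempt fails

        long start = System.nanoTime();
        boolean found = evil.matcher(input).find();
        System.out.printf("found=%b after %.0f ms%n", found,
                (System.nanoTime() - start) / 1e6);
        // Each additional \n roughly doubles the runtime.
    }
}
```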

As discussed in the GitHub comments, it is fixed.

It might also be worth checking:

There were 15 rules with performance warnings before the logging was removed. Search the logs for "ms for rule".

It might also be a good idea to optimize all the long regexps in all languages.
https://languagetool.org/regression-tests/performance-pt-PT.png

Is it due to my module, which heavily depends on line breaks to split up the formatting?

@dnaber
Regression test performance seems stable:

20170914 pt-PT 787.87 9.92
20170914 pt-BR 767.08 9.81
20170915 pt-PT 813.80 9.75
20170915 pt-BR 724.92 9.50

The slight increase in pt-PT may be due to the extra anti-pattern.
I have to add unification to the said anti-pattern, so I will test the other heavy hitters the day after tomorrow and then resume optimizations.

@KnowZero
Unless the splitter produces hundreds of \n in a row, it is unlikely.
Anyway, the rule now checks for only one space, and in my limited tests no issue exists with multiple line endings (at least when using up to 4900 \n in a row). Please confirm whether there are still hangs or runtime time-outs when checking Portuguese.

Yesterday I tested the dash rule. It still has some performance issues, so pt-BR will be slower today. I have disabled the rule anyway, and it is once again an optional rule, to be used offline if required.

20170917 pt-PT 782.05 11.57
20170917 pt-BR 2635.93 20.86

That rule is used in other languages, and its performance depends on the database size.
It may be a good place to look for server performance improvements, since some of those languages are popular.

@danielnaber
I have been running tests on the 20170914 builds and on the ones from today.
Testing several long English and Portuguese texts shows negligible differences between the builds.
Since you mentioned that the entire server was slowed down, it seems that this was a directed action: the exponential runtime effect you found would only be achieved by running that “tailored” text (a word + hundreds of line breaks + punctuation) on the Portuguese page.

You were able to pinpoint the rule that was being exploited while debugging the live system. Were you also able to pinpoint the source of those “spellchecks”?


I’ve not tried finding the source of the requests. I don’t see any reason to believe that this was malicious. If it was, why didn’t they send hundreds of these requests instead of a few?

Wouldn’t just a few produce a server slowdown? I thought it would be continuous requests, since the server times out once in a while.

There’s a timeout for any request of about 10 seconds or so. The slowdown was 10-15 minutes and could be resolved by restarting the LT server process.

I was imagining the time-out mechanism working in a different way. I am not sure how doable it is, but setting a hard limit of 10 s (for example) on any request would safeguard against this type of situation.
I have been looking at the code, but I still have trouble figuring out how that regexp triggered the effect. Only the \b matches \n, and it would fail against the [\p{L}\d]+ that follows it.
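For illustration, a hard per-request limit could be enforced by running each check in a task and cancelling it after a deadline. This is a minimal sketch, not LanguageTool’s actual server code; note the caveat that java.util.regex matching only stops on cancellation if the input cooperates:

```java
import java.util.concurrent.*;

public class HardTimeoutSketch {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    // Runs a check task and gives up after the deadline. Hypothetical wrapper,
    // not the real server's request handling.
    static String checkWithDeadline(Callable<String> checkTask, long seconds)
            throws Exception {
        Future<String> future = POOL.submit(checkTask);
        try {
            return future.get(seconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            // Caveat: a regex stuck in backtracking ignores interrupts unless
            // the input CharSequence's charAt() checks Thread.interrupted().
            return "Error: request exceeded " + seconds + " s limit";
        }
    }
}
```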

Even though the issue is solved now, a long-term solution that prevents these kinds of problems would maybe be to use dk.brics.automaton (finite-state automata and regular expressions for Java) or google/re2j (linear-time regular expression matching in Java) in the future. Both have linear runtime behavior.
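As a quick illustration of the difference, here is a small sketch assuming the re2j artifact (com.google.re2j) is on the classpath; its API mirrors java.util.regex:

```java
import com.google.re2j.Pattern; // linear-time engine, API mirrors java.util.regex

public class Re2jDemo {
    public static void main(String[] args) {
        // A classic pathological pattern for backtracking engines.
        String pathological = "(a+)+b";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append('a');
        }
        String input = sb.append('c').toString(); // forces a failed match

        long start = System.nanoTime();
        boolean found = Pattern.compile(pathological).matcher(input).find();
        System.out.printf("re2j: found=%b in %.2f ms%n", found,
                (System.nanoTime() - start) / 1e6);
        // The same call with java.util.regex.Pattern would effectively never
        // terminate here, as the work roughly doubles with every extra 'a'.
    }
}
```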

Only the symptom. There are no tests for it, and a new exploit or regexp may cause similar problems.
Changing the regexp engine to a DFA could solve these extreme cases, but might worsen performance under the usual combined load.
Another problem is that a review of all the patterns may be needed. Can it be automated?
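A crude first pass could be automated: scan every rule’s pattern for nested quantifiers, which are the usual backtracking trigger. This is a hedged sketch with made-up inputs; the detection regexp is a rough heuristic and will produce false positives and negatives:

```java
import java.util.List;
import java.util.regex.Pattern;

public class NestedQuantifierScan {
    // Rough heuristic: a quantified group whose body itself ends in a
    // quantifier, e.g. (a+)+ or (\s*\s)*. Not a precise ReDoS analysis.
    private static final Pattern NESTED =
            Pattern.compile("\\([^()]*[+*}]\\)[+*{]");

    public static void main(String[] args) {
        // Hypothetical stand-ins for patterns extracted from grammar.xml files.
        List<String> rulePatterns = List.of(
                "(\\s+)+[,;:]",
                "\\bword\\b",
                "(a|b)+c");
        for (String p : rulePatterns) {
            if (NESTED.matcher(p).find()) {
                System.out.println("suspicious: " + p);
            }
        }
    }
}
```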

PS - As an example, try a thousand 0s, a full stop (.), a thousand 0s, a space, and then paste the whole thing again. Try it even in German. You will get the same exponential runtime effect. It is an obviously silly example that could hardly be used in a legitimate way.

Sorry for reopening an old thread. I had wondered whether the XML rules are run serially on every token, or whether there are some optimisations. It appears to me that every XML rule is checked against every possible part of a sentence, although I may be missing something in my interpretation of the code (Java programming is not my forte). It would appear to be possible to combine the rules at startup (or maybe at compile time) by identifying the least probable matching token of each rule (easy for a fixed string, harder for a regexp with an ngram lookup). A fixed-string token can easily be used as a key in a Map, whilst an FSA could be used via this unmaintained multiregexp library (of course, strings can easily be converted to regexps); see the sketch below the link:
https://fulmicoton.com/posts/multiregexp/
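In Java-ish pseudocode, the Map idea could look like the following. This is a hedged sketch with a hypothetical Rule type, not LanguageTool’s actual rule classes:

```java
import java.util.*;

// Hedged sketch of the anchor-token index; Rule is a hypothetical stand-in
// for LanguageTool's real XML rule objects.
final class IndexedRules {
    static final class Rule {
        final String id;
        final String anchorToken; // least probable fixed-string token of the rule
        Rule(String id, String anchorToken) {
            this.id = id;
            this.anchorToken = anchorToken;
        }
    }

    private final Map<String, List<Rule>> byAnchor = new HashMap<>();

    void add(Rule rule) {
        byAnchor.computeIfAbsent(rule.anchorToken.toLowerCase(Locale.ROOT),
                k -> new ArrayList<>()).add(rule);
    }

    // Only rules whose anchor token occurs in the sentence become candidates;
    // all other rules are skipped without evaluating their patterns at all.
    List<Rule> candidatesFor(List<String> sentenceTokens) {
        List<Rule> candidates = new ArrayList<>();
        for (String token : sentenceTokens) {
            candidates.addAll(byAnchor.getOrDefault(token.toLowerCase(Locale.ROOT),
                    Collections.emptyList()));
        }
        return candidates;
    }
}
```

The hard part would be extracting a good anchor automatically: trivial for fixed strings, something like a least-likely-literal analysis for regexps.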

I wonder whether this approach would bring a significant speed-up, as the benchmark in that post suggests a drastic improvement over checking 3 different regexps separately.

There are several optimizations, e.g. running several threads and using shortcuts where possible. Whether using a different regex engine would help, I don’t know.

An advanced version of knowitall/openregex (an efficient and flexible token-based regular expression language and engine) might also help: it doesn’t work on the character level, but on the token level. All rules could then be compiled into one large automaton.