I ran some rule profiling for Ukrainian text, and I see that the morfologik spelling rule accounts for over 30% of the total time of all 745 rules:
Analyze time: 8463 ms, 3.0 sent/sec
Disambig time: 19297 ms, 1.322952 sent/sec
Testing 745 rules
Rule ID | Time (ms) | Sentences | Matches | Sentences per sec.
…
MORFOLOGIK_RULE_UK_UA 76694 25529 4725 332.9
…
Total rule time: 236857 ms
It’s kinda similar for en-US (7% of the total across 3210 rules):
MORFOLOGIK_RULE_EN_US 30151 7853 5955 260.5
Total rule time: 444435 ms
So the speller alone takes as much time as roughly 200 average rules combined.
I know we added some cool logic for splits/joins with previous/next word but I am wondering if we could optimize that.
I don’t think the split/join logic is the reason the spell checker is slow; the slow part is generating the suggestions. Just detecting whether a word is misspelled should be very fast. I don’t have a simple solution. There are spell checkers like SymSpell (https://github.com/wolfgarbe/SymSpell, “1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm”) which are supposed to be very fast, but they don’t work with hunspell dictionaries. For some languages, the hunspell dict could be expanded into a simple list of words (without flags), which could then be used with SymSpell.
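For illustration, the symmetric-delete idea behind SymSpell can be sketched in a few lines of Java. All class and method names here are hypothetical (this is not SymSpell’s or LanguageTool’s actual API): precompute every single-character delete of each dictionary word, then look up the deletes of the input term.

```java
import java.util.*;

// Minimal sketch of symmetric-delete candidate lookup (edit distance 1).
// Illustrative only; real SymSpell also verifies candidates with a true
// edit-distance check and supports larger distances.
public class SymDelete {
    private final Map<String, Set<String>> index = new HashMap<>();

    public SymDelete(Collection<String> dictionary) {
        for (String word : dictionary) {
            // Map the word itself and each single-character delete to the word.
            for (String key : deletes(word)) {
                index.computeIfAbsent(key, k -> new HashSet<>()).add(word);
            }
        }
    }

    // The word plus every variant with one character removed.
    private static List<String> deletes(String word) {
        List<String> result = new ArrayList<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    // Candidate corrections near the input term: hash lookups only,
    // no per-word edit-distance scan over the whole dictionary.
    public Set<String> suggestions(String term) {
        Set<String> candidates = new TreeSet<>();
        for (String key : deletes(term)) {
            candidates.addAll(index.getOrDefault(key, Collections.emptySet()));
        }
        return candidates;
    }
}
```

The point of the expanded word list mentioned above is exactly this: the index is built from plain words, so hunspell’s affix flags would have to be expanded away first.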
I actually just looked, and the top unknown words are single letters and name initials (with dots). Suggestions for initials are mostly useless, and for single letters the suggestions coming directly from morfologik are not useful either. So I could probably optimize there.
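A cheap pre-filter along those lines could skip suggestion generation entirely for such tokens. This is a hypothetical helper, not the actual morfologik rule code:

```java
// Sketch of a guard that skips expensive suggestion generation for tokens
// where morfologik suggestions are known to be useless.
// Hypothetical helper, not LanguageTool code.
public class SuggestionFilter {

    // Returns false for bare single letters ("a") and dotted name
    // initials ("T."), for which suggestions add no value.
    public static boolean shouldGenerateSuggestions(String token) {
        if (token.length() == 1 && Character.isLetter(token.charAt(0))) {
            return false; // bare single letter
        }
        if (token.length() == 2 && Character.isUpperCase(token.charAt(0))
                && token.charAt(1) == '.') {
            return false; // name initial like "T."
        }
        return true;
    }
}
```

Since these tokens are the most frequent unknown words in the profile above, even this trivial check would cut out a disproportionate share of the suggestion work.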
There is more that’s possible. One could analyze a corpus for the words most commonly misspelled and generate their suggestions with the current spell checker. By saving the suggestions together with the word in a separate data file, it would be easy to check this file for suggestions first, and only generate them if the word is not in the list. Looking them up should be a lot faster than generating them. Another plus: this list could be manually edited to get the desired word order as well as to remove unlikely suggestions.
@Ruud_Baars That’s a good point! We are already doing this for thousands of words, but I guess I can extend it to cover very short ones. I didn’t realize it had a performance benefit as well.