I ran some rule profiling for Ukrainian text, and I see that the morfologik spelling rule accounts for over 30% of the total time of all 745 rules:
Analyze time: 8463 ms, 3.0 sent/sec
Disambig time: 19297 ms, 1.322952 sent/sec
Testing 745 rules
Rule ID | Time (ms) | Sentences | Matches | Sentences per sec.
…
MORFOLOGIK_RULE_UK_UA 76694 25529 4725 332.9
…
Total rule time: 236857 ms
It’s kinda similar for en-US (7% of the total across 3210 rules):
MORFOLOGIK_RULE_EN_US 30151 7853 5955 260.5
Total rule time: 444435 ms
So the speller alone takes as much time as roughly 200 average rules combined.
I know we added some cool logic for splits/joins with previous/next word but I am wondering if we could optimize that.
I don’t think the split/join logic is the reason the spell checker is slow; the slow part is generating the suggestions. Just detecting whether a word is misspelled should be very fast. I don’t have a simple solution. There are spell checkers like SymSpell (https://github.com/wolfgarbe/SymSpell, “1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm”) which are supposed to be very fast, but they don’t work with hunspell dictionaries. For some languages, the hunspell dict could be expanded into a simple list of words (without flags), which could then be used with SymSpell.
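For illustration, the symmetric-delete idea behind SymSpell can be sketched in a few lines of Java. All class and method names here are hypothetical (this is not SymSpell’s or LanguageTool’s actual API): precompute every single-character delete of each dictionary word, then look up the deletes of the input term.

```java
import java.util.*;

// Minimal sketch of symmetric-delete candidate lookup (edit distance 1).
// Illustrative only; real SymSpell also verifies candidates with a true
// edit-distance check and supports larger distances.
public class SymDelete {
    private final Map<String, Set<String>> index = new HashMap<>();

    public SymDelete(Collection<String> dictionary) {
        for (String word : dictionary) {
            // Map the word itself and each single-character delete to the word.
            for (String key : deletes(word)) {
                index.computeIfAbsent(key, k -> new HashSet<>()).add(word);
            }
        }
    }

    // The word plus every variant with one character removed.
    private static List<String> deletes(String word) {
        List<String> result = new ArrayList<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    // Candidate corrections near the input term: hash lookups only,
    // no per-word edit-distance scan over the whole dictionary.
    public Set<String> suggestions(String term) {
        Set<String> candidates = new TreeSet<>();
        for (String key : deletes(term)) {
            candidates.addAll(index.getOrDefault(key, Collections.emptySet()));
        }
        return candidates;
    }
}
```

The point of the expanded word list mentioned above is exactly this: the index is built from plain words, so hunspell’s affix flags would have to be expanded away first.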
I actually just looked, and the top unknown words are single letters and name initials (with dots). Suggestions for initials are mostly useless, and for single letters the suggestions coming directly from morfologik are not useful either. So I could probably optimize there.
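A cheap pre-filter along those lines could skip suggestion generation entirely for such tokens. This is a hypothetical helper, not the actual morfologik rule code:

```java
// Sketch of a guard that skips expensive suggestion generation for tokens
// where morfologik suggestions are known to be useless.
// Hypothetical helper, not LanguageTool code.
public class SuggestionFilter {

    // Returns false for bare single letters ("a") and dotted name
    // initials ("T."), for which suggestions add no value.
    public static boolean shouldGenerateSuggestions(String token) {
        if (token.length() == 1 && Character.isLetter(token.charAt(0))) {
            return false; // bare single letter
        }
        if (token.length() == 2 && Character.isUpperCase(token.charAt(0))
                && token.charAt(1) == '.') {
            return false; // name initial like "T."
        }
        return true;
    }
}
```

Since these tokens are the most frequent unknown words in the profile above, even this trivial check would cut out a disproportionate share of the suggestion work.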
There is more that’s possible. One could analyze a corpus for the words most commonly misspelled and generate their suggestions with the current spell checker. By saving the suggestions together with the word in a separate data file, it would be easy to check this file for suggestions first, and only generate them if the word is not in the list. Looking them up should be a lot faster than generating them. Another plus: this list could be manually edited to get the desired word order as well as to remove unlikely suggestions.
@Ruud_Baars That’s a good point! We are already doing this for thousands of words, but I guess I can extend it to cover very short ones. I didn’t realize it had a performance benefit as well.