Is LanguageTool suitable for my use case?

Hi All,

Just started playing with LanguageTool as part of a home project for the kids and was wondering if one of you would be able to help with a few questions. If I’ve missed some documentation somewhere, please just point me to it…

Use case: Ideally I would like to submit a (very large) number of ‘words’ to LanguageTool (most of which are misspelt or not words at all) and get back just the correctly spelt ones. I thought I’d be able to adapt langTool.check(wordList) to my needs, but it’s not working as I’d expect. If it called out every ‘word’ that isn’t actually a word in the English language, I could wrap logic around this to make it work.

Using the example code, I submit:-

Word List: "OWL PIG DFH BAB COD CID TIN TIM ABC DEF GHI JKL"

Example code for reference:-

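// Needs: org.languagetool.JLanguageTool, org.languagetool.language.BritishEnglish,
//        org.languagetool.rules.Rule, org.languagetool.rules.RuleMatch,
//        java.util.List, java.io.IOException
try {
    JLanguageTool langTool = new JLanguageTool(new BritishEnglish());
    // Keep only the dictionary-based spelling rule; disable every other rule.
    for (Rule rule : langTool.getAllRules()) {
        if (!rule.isDictionaryBasedSpellingRule()) {
            langTool.disableRule(rule.getId());
        }
    }
    // wordList is the space-separated string shown above.
    List<RuleMatch> matches = langTool.check(wordList);
    for (RuleMatch match : matches) {
        // match.getSuggestedReplacements() holds the suggested corrections, if any.
        logger.debug("Potential typo at characters " +
            match.getFromPos() + "-" + match.getToPos() + ": " +
            match.getMessage());
    }
} catch (IOException e) {
    // check() can throw IOException
    logger.error("Spell check failed", e);
}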

The output is:-

Potential typo at characters 8-11: Possible spelling mistake found.
Potential typo at characters 12-15: Possible spelling mistake found.
Potential typo at characters 44-47: Possible spelling mistake found.

So whilst DFH, BAB & JKL are called out as possible spelling mistakes, it lets CID, TIM, ABC, DEF and GHI through. Is this because these words are so ‘badly spelt’ that it has no correction? Is there another rule I could use?
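
For reference, a minimal sketch of the kind of wrapping logic meant above, assuming the flagged character ranges have already been collected from match.getFromPos() and match.getToPos(). It treats each range as start-inclusive and end-exclusive, which is consistent with the output shown (8-11 for a three-letter word). The class and method names (AcceptedWords, keepUnflagged) are made up for illustration and are not part of LanguageTool; the sketch itself uses only the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LanguageTool.
final class AcceptedWords {

    /**
     * Returns the whitespace-separated tokens of text that do not overlap
     * any flagged range. Each range is {from, to}: from inclusive, to exclusive.
     */
    static List<String> keepUnflagged(String text, int[][] flagged) {
        List<String> kept = new ArrayList<>();
        Matcher token = Pattern.compile("\\S+").matcher(text);
        while (token.find()) {
            boolean isFlagged = false;
            for (int[] span : flagged) {
                // Two half-open ranges overlap when each starts before the other ends.
                if (span[0] < token.end() && token.start() < span[1]) {
                    isFlagged = true;
                    break;
                }
            }
            if (!isFlagged) {
                kept.add(token.group());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String wordList = "OWL PIG DFH BAB COD CID TIN TIM ABC DEF GHI JKL";
        // Ranges copied from the three matches in the output above.
        int[][] flagged = { {8, 11}, {12, 15}, {44, 47} };
        // prints [OWL, PIG, COD, CID, TIN, TIM, ABC, DEF, GHI]
        System.out.println(keepUnflagged(wordList, flagged));
    }
}
```

This only removes whatever the spelling rule flags, so anything the rule accepts (CID, TIM, ABC and so on) still comes through.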

Also, can anyone give me a rough idea of the performance limitations when running on a reasonably specced Windows machine? What is a sensible limit to the number of words I can supply in one go? My use case calls for several thousand, which seems to be problematic…
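
One way to experiment with the "how many words in one go" question is to split the list into fixed-size batches and pass each batch to check() separately. The sketch below shows only the splitting step, using the standard library; WordBatcher and toBatches are hypothetical names, and the batch size is a tuning knob to try out, not a known LanguageTool limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
final class WordBatcher {

    /** Splits words into space-joined chunks of at most batchSize words each. */
    static List<String> toBatches(List<String> words, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batches = new ArrayList<>();
        for (int i = 0; i < words.size(); i += batchSize) {
            int end = Math.min(i + batchSize, words.size());
            batches.add(String.join(" ", words.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> words = List.of("OWL", "PIG", "DFH", "BAB", "COD");
        // prints [OWL PIG, DFH BAB, COD]
        System.out.println(toBatches(words, 2));
    }
}
```

Each batch string would then be checked on its own. Note that the character positions in the resulting matches are relative to the batch, not to the full list.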

And finally, if anyone knows of a better (free) solution or API that would be better placed to meet my requirements, could you let me know?

Thanks very much in advance
Steve

You can use English Speller Word Lookup to see if a word is in the dictionary. It’s not exactly the same dataset as the one used by LT, but close. I think many three-character words with all-uppercase letters will be accepted, as these are acronyms or uppercased variants of common words.
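
If a plain word list is available, the "is this word in the dictionary" check can also be sketched as a simple set lookup. This is only an illustration using the standard library: DictionaryFilter and keepKnown are hypothetical names, the word list is assumed to be one word per line, and it is not LanguageTool's own dictionary. Matching is case-insensitive here, so uppercased variants of common words are accepted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
final class DictionaryFilter {

    private final Set<String> known = new HashSet<>();

    /** Builds the lookup set from dictionary entries, ignoring case. */
    DictionaryFilter(Collection<String> dictionaryWords) {
        for (String word : dictionaryWords) {
            known.add(word.trim().toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the whitespace-separated candidates found in the dictionary. */
    List<String> keepKnown(String candidates) {
        List<String> kept = new ArrayList<>();
        for (String token : candidates.trim().split("\\s+")) {
            if (!token.isEmpty() && known.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy stand-in for a real word list.
        DictionaryFilter filter = new DictionaryFilter(List.of("owl", "pig", "cod", "tin"));
        // prints [OWL, PIG, COD, TIN]
        System.out.println(filter.keepKnown("OWL PIG DFH BAB COD CID TIN TIM"));
    }
}
```

A real word list could be loaded with java.nio.file.Files.readAllLines and passed to the constructor. Acronyms would only be accepted or rejected according to whether the chosen list contains them.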