I want to use LanguageTool to correct a large number of product descriptions (fashion online shops). We need to eliminate spelling errors (sometimes grammar errors as well) for further processing. Of course, this should mostly happen automatically.
After quite some time researching this forum, the API, and the documentation, I've realized that LanguageTool is not really good at that (and that isn't LT's purpose, if I'm right?). So I want to implement this with my own method.
The approach would be:
- Build a word list from our existing product descriptions (all single words with their number of occurrences)
- Check new product descriptions with LT
- If LT returns multiple suggestions for a word, look each suggestion up in our own word list and take the one that appears most often there as the correction
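The lookup step could be sketched roughly like this (the corpus and the LT suggestion list are hard-coded placeholders here; in practice the suggestions would come from an LT check call):

```python
import re
from collections import Counter

def build_word_counts(descriptions):
    """Count how often each word appears across all product descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"\w+", text))
    return counts

def pick_best_suggestion(suggestions, counts):
    """Among LT's suggestions, pick the one seen most often in our own corpus.
    Falls back to LT's first suggestion if none of them is known to us."""
    known = [s for s in suggestions if counts.get(s, 0) > 0]
    if not known:
        return suggestions[0]
    return max(known, key=lambda s: counts[s])

# Placeholder corpus and a suggestion list like the one in the example below:
corpus = ["Das schwarze Hemd ist aus Baumwolle.",
          "Das Hemd hat lange Ärmel."]
counts = build_word_counts(corpus)
print(pick_best_suggestion(["Amt", "Hut", "Hemd", "Hd"], counts))  # → Hemd
```

The fallback to LT's first suggestion is one possible design choice; you could instead flag the word for manual review when none of the suggestions occurs in your corpus.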
Example - Correct sentence:
Das schwarze Hemd ist aus Baumwolle. ("The black shirt is made of cotton.")
Example - Sentence with a typo ("Hmd" instead of "Hemd"):
Das schwarze Hmd ist aus Baumwolle.
Suggestions by LT:
Amt; Hut; Hemd; Hd; …
Obviously, the suggested word "Amt" is not the correct one.
My point is: what's the best way to tell LT which word is the right one to choose?
- Use the approach I described above (there's still a risk of choosing the wrong word: what if "Amt" really were the correct word? And it gives no grammar correction)
- Use more n-gram data: I thought the LT n-gram dataset could recognize the context, but unfortunately that's not the case, at least not for the example above.
- Go through each error, correct it manually, and afterwards write a rule (Java or XML, which is better?) for this particular error.
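For the last option, a rule for this particular typo could be written in LT's XML rule format (the rule id and German message below are made up for illustration):

```xml
<rule id="HMD_HEMD" name="Hmd (Hemd)">
  <pattern>
    <token>Hmd</token>
  </pattern>
  <message>Meinten Sie <suggestion>Hemd</suggestion>?</message>
  <example correction="Hemd">Das schwarze <marker>Hmd</marker> ist aus Baumwolle.</example>
</rule>
```

As far as I understand, XML rules are the recommended route for simple pattern-based corrections like this, while Java rules are only needed for logic that XML patterns can't express.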
I hope someone can help me and tell me the best way to approach this issue. Do you think comparing the suggestions against our own word list is a reasonable way to do it?
Thanks in advance and best regards