I think this is the core of the task, as detecting errors and creating suggestions is something LT already does, using morfologik. It basically sorts the suggestions by a kind of Levenshtein distance, i.e. without considering context. I think the key question is: how do you combine the distance (word with typo <> correction candidate) with the ngram data (plus maybe pure frequency data from yet another source)?
- Just sorting by frequency and not considering the distance is dangerous, as the ngram data is created mostly from books.
- Just sorting by distance totally ignores the context (that’s what we do now).
So these values somehow need to be combined, and a neural network could learn how to combine them to get the best result. The best result is the one where the correct suggestion ranks first. We have data about that, collected on languagetool.org.
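To make the combination idea concrete, here is a minimal sketch of a linear scoring function that mixes edit distance and ngram probability. Everything here is hypothetical: the function name, the fixed weights `w_dist` and `w_ngram`, and the example data are illustrative only — in practice the weights (or a more complex combination) would be learned, e.g. by the neural network mentioned above, from the languagetool.org data.

```python
import math

def rank_suggestions(candidates, edit_distance, ngram_prob,
                     w_dist=1.0, w_ngram=0.5):
    """Rank correction candidates by a weighted combination of
    edit distance (lower is better) and log ngram probability
    (higher is better). Weights are hypothetical placeholders."""
    def score(cand):
        # Unseen ngrams get a small floor probability to avoid log(0).
        prob = ngram_prob.get(cand, 1e-9)
        return -w_dist * edit_distance[cand] + w_ngram * math.log(prob)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical example: typo "ther" in the context "over ___";
# both candidates have the same edit distance, so the ngram data decides.
candidates = ["their", "there"]
distances = {"their": 1, "there": 1}
probs = {"there": 0.01, "their": 0.001}
print(rank_suggestions(candidates, distances, probs))  # "there" ranks first
```

With equal distances, the candidate that fits the context better wins; with very different distances, a high ngram probability alone is not enough to push an implausible edit to the top — which is exactly the balance described above.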
I don’t know spring-boot, but I would prefer something lightweight. The issue is not to make it work, but to make it run stably in an environment where requests arrive at random and answering a request can take anywhere between 20ms and 10 seconds. In the past, we had issues with the server becoming overloaded. This needs to be avoided, but at the same time, we don’t want to reject every request as soon as there’s some load. So currently we use an approach where requests queue up, but the queue has a maximum length. Once it’s full, clients get an error. I think this approach makes sense.
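The bounded-queue behaviour can be sketched in a few lines. This is not LT’s actual server code — the names `submit`, `MAX_QUEUE`, and the return values are made up for illustration — but it shows the essential policy: accept requests while there is room, and fail fast with an error once the queue is full instead of letting work pile up unboundedly.

```python
import queue

MAX_QUEUE = 50  # hypothetical maximum queue length

request_queue = queue.Queue(maxsize=MAX_QUEUE)

def submit(request):
    """Enqueue a request if there is room; otherwise reject it
    immediately so clients get a fast error instead of a timeout."""
    try:
        request_queue.put_nowait(request)
        return "queued"
    except queue.Full:
        return "error: server busy"
```

Worker threads would then drain `request_queue` at whatever rate the 20ms-to-10s checks allow; the queue bound is what keeps a burst of slow requests from overloading the server.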