Spellchecker improvement discussion

That’s how it works, “Stand-alone for your desktop” is the download version, but it doesn’t contain the (huge) ngram data. Also see the comparison table.

Yes, see here.

The very high-level description of the ML part of the task from my point of view looks like the following:

  • from the n-gram corpus extract the n-grams containing the misspellings (find the misspellings using LT)
  • for each mistake-containing n-gram find all the possible corrections and these corrections frequencies (probabilities)
  • train the model (nn most likely) to order the corrections by their frequencies

What do you think about that (many heuristics can be added to the described algo, but it’s a little early to think about them in my opinion)?

Few questions more:
having fixed the bug and opened the pull request, should I ask somebody to review that or the opened PR is all I have to do?
LT wants to use a modern framework for embedded HTTP server and the mentioned framework is the sparkjava. Is this the preferred tool? What do you think abut the spring-boot (it’s heavier, but much more functional)?

I think this is the core of the task, as detecting errors and creating suggestions is something LT already does, using morfologik. It’s basically sorting the suggestions by a kind of Levenshtein distance, i.e. not considering context. I think the key questions is: how do you combine the distance (word with typo <> correction candidate) with the ngram data (plus maybe pure frequency data from yet another source)?

  • Just sorting by frequency and not considering the distance is dangerous, as the ngram data is created mostly from books.
  • Just sorting by distance totally ignores the context (that’s what we do now).

So these values somehow need to be combined, and a NN could learn how to combine them to get the best result. The best result is when the most valid suggestion is on top. We have data about that, collected on languagetool.org.

I don’t know spring-boot, but would prefer something lightweight. The issue is not to make it work, but to make it work stable in an environment where requests are arriving randomly and answering the requests can take anywhere between 20ms and 10 seconds. In the past, we had issues with the server becoming overloaded. This needs to be avoided, but at the same time, we don’t want to reject every request as soon as there’s some load. So currently we use an approach where requests queue up, but the queue has a maximum length. Once it’s full, clients get an error. I think this approach makes sense.

You’ve described the current implementation and it sounds to work stable, so what’s the goal? The increase of performance at all?
Is the server-side concurrent? Maybe the implementation or the load-balancing have to be tuned?

I thought that since there are many ways to combine these data, the best order of subtasks is the following: start with the integration of the LT spelling corrector with the ML-models prediction mechanism, then create and continuously improve the model.
For the first version of the model I think there’s enough to

  • learn the distance-feature weight related to all the others features summarized weight on the LT’s user-choosed-the-correction dataset
  • learn the other features (non-including the distance) relative weight on the n-gram data

The next heuristic is to learn the weight of the non-distance features separately for each language, then it’s possible to group the languages, then to include the keyboard-distance-between-letters feature etc. But since there are various possible implementations of the model and I’d like to try all of them and to be able to publish them without pain, the first thing that has to be done is the implementation of the model applying mechanism.

And the last question. The LT’s user-choosed-the-correction dataset for some less popular than English languages sounds to be relatively small. Is it possible to download that data to check the hypothesis and to explore it?

  • To get cleaner code that’s not as low-level as it is now.
  • To get a system that deals better with the load - there are still situations in which simple requests take 1-2 seconds. It’s not clear whether this is due to the approach we’re currently using or due to the fact that requests come in randomly and thus there will always be peaks.

Yes.

We don’t have direct pairs word1 -> word2, instead we have sentence1 -> sentence2 and the word pairs will first need to be extracted from those. But anyway, there will indeed not be enough data for the languages used less often on languagetool.org.

Having the suggestions sorting algorithm it would be useful to receive some feedback from the users, so the A|B testing is needed in the perfect world. Will it be possible to add the ABtesting to the LT webpage (for example half of the users will face the new algo and the others will receive the old one, and all of them will be asked whether they’re satisfied by the suggestions order)?

I don’t know whether asking users makes sense, but we can log their choices in the UI. When they select the first suggestion, that means the first suggestion is indeed the best one. Yes, we could probably combine this with A/B testing.

just had an idea: Perhaps we could A/B test it with an opt-in double list,
with the results from both methods side by side.
(and, to reduce bias, randomly select which method goes to which side)

sorry, maybe I’ve misunderstood. Will the suggestions pop-up look like this: (suggeted corrections are “a”, “b”, “c”, “d”)?

(old algo) (new algo)
a d
b a
c c
s b

yes, but with a small bit of randomization in regard to which algo gets which side to make it a (double) blind test.

that would be very useful from the developers’ point of view, but wouldn’t it provoke questions from the users’ side? Maybe this behavior should be easy to switch-off then?

That’s why I suggested it as an opt-in.

And the measure of quality of the sorting algorithms when comparing them is the number of times when the top 1 suggestion predicted was selected by user. We could take in account the weighted positions of the suggestions selected by user but I don’t see an obvious way to choose the weights etc.

another few ideas:
A. score both algos instead of just the one from which the word was selected.

B. set score for both sides as 1000*pow(pow(0.001,1/(listSize+20)),wordPosition)

A. – That’s what I wanted to do, maybe my message was unclear.
B. – The convex-down decreasing function (isn’t it?) is good approach but the constants are unclear to me as I said. I’ll give it a try, anyway.

allows the function to stretch along with the size of the list, while +20 allows small lists to not be too punishing. (but perhaps T&E is needed to get good constants)

This is the way I’ll take :slight_smile:

There’s a thing confusing me.
The stored pair <not corrected sentence, corrected sentence> may be impossible to reproduce using the current version of the LT (e.g. the pair was received using the old version of the LT and the rules are changed now and the current version of the LT suggests other replacements etc). How possible do you think it is?
Since I cannot explore that private stored data, I could provide the command-line tool to check whether the mentioned problem exists. What do you think about?