Spellchecker improvement discussion

oserikov · February 25, 2018, 7:45pm

Continuing the discussion from LT and GSoC 2018 - looking for students:

if we have small data, users can download it and use LT offline.

Offline usage of LT is working via local deployment of the server-side? or did you mean the offline version of the spellchecker only or something else? Is that ability to use LT offline implemented now?
From the architectural point of view, I think, there should always be the way to download the data and use LT offline, but depending on the amount of data the default behavior should vary – that gives us the behavior not depending of the language and avoids the hardcode.

Having looked at the spellchecker’s code I see that it uses neither n-grams nor word-specific feautures such as POS-tags. N-grams are mentioned in the ideas list, but POS-tags are not. I think that POS-tags could be be very useful

There are various examples of the ML-based spellcheckers in the web so that’s not hard to implement the new one, but then it has to be integrated and with java and continously updated. Does LT use trained models somewhere else (to follow these cases technologies stack and style)?

dnaber · February 25, 2018, 9:36pm

That’s how it works, “Stand-alone for your desktop” is the download version, but it doesn’t contain the (huge) ngram data. Also see the comparison table.

Yes, see here.

oserikov · February 26, 2018, 3:41pm

The very high-level description of the ML part of the task from my point of view looks like the following:

from the n-gram corpus extract the n-grams containing the misspellings (find the misspellings using LT)
for each mistake-containing n-gram find all the possible corrections and these corrections frequencies (probabilities)
train the model (nn most likely) to order the corrections by their frequencies

What do you think about that (many heuristics can be added to the described algo, but it’s a little early to think about them in my opinion)?

Few questions more:
having fixed the bug and opened the pull request, should I ask somebody to review that or the opened PR is all I have to do?
LT wants to use a modern framework for embedded HTTP server and the mentioned framework is the sparkjava. Is this the preferred tool? What do you think abut the spring-boot (it’s heavier, but much more functional)?

dnaber · February 26, 2018, 4:43pm

I think this is the core of the task, as detecting errors and creating suggestions is something LT already does, using morfologik. It’s basically sorting the suggestions by a kind of Levenshtein distance, i.e. not considering context. I think the key questions is: how do you combine the distance (word with typo <> correction candidate) with the ngram data (plus maybe pure frequency data from yet another source)?

Just sorting by frequency and not considering the distance is dangerous, as the ngram data is created mostly from books.
Just sorting by distance totally ignores the context (that’s what we do now).

So these values somehow need to be combined, and a NN could learn how to combine them to get the best result. The best result is when the most valid suggestion is on top. We have data about that, collected on languagetool.org.

I don’t know spring-boot, but would prefer something lightweight. The issue is not to make it work, but to make it work stable in an environment where requests are arriving randomly and answering the requests can take anywhere between 20ms and 10 seconds. In the past, we had issues with the server becoming overloaded. This needs to be avoided, but at the same time, we don’t want to reject every request as soon as there’s some load. So currently we use an approach where requests queue up, but the queue has a maximum length. Once it’s full, clients get an error. I think this approach makes sense.

oserikov · February 26, 2018, 6:31pm

You’ve described the current implementation and it sounds to work stable, so what’s the goal? The increase of performance at all?
Is the server-side concurrent? Maybe the implementation or the load-balancing have to be tuned?

oserikov · February 26, 2018, 7:23pm

I thought that since there are many ways to combine these data, the best order of subtasks is the following: start with the integration of the LT spelling corrector with the ML-models prediction mechanism, then create and continuously improve the model.
For the first version of the model I think there’s enough to

learn the distance-feature weight related to all the others features summarized weight on the LT’s user-choosed-the-correction dataset
learn the other features (non-including the distance) relative weight on the n-gram data

The next heuristic is to learn the weight of the non-distance features separately for each language, then it’s possible to group the languages, then to include the keyboard-distance-between-letters feature etc. But since there are various possible implementations of the model and I’d like to try all of them and to be able to publish them without pain, the first thing that has to be done is the implementation of the model applying mechanism.

And the last question. The LT’s user-choosed-the-correction dataset for some less popular than English languages sounds to be relatively small. Is it possible to download that data to check the hypothesis and to explore it?

dnaber · February 26, 2018, 8:24pm

To get cleaner code that’s not as low-level as it is now.
To get a system that deals better with the load - there are still situations in which simple requests take 1-2 seconds. It’s not clear whether this is due to the approach we’re currently using or due to the fact that requests come in randomly and thus there will always be peaks.

Yes.

dnaber · February 26, 2018, 8:40pm

We don’t have direct pairs word1 -> word2, instead we have sentence1 -> sentence2 and the word pairs will first need to be extracted from those. But anyway, there will indeed not be enough data for the languages used less often on languagetool.org.

oserikov · March 25, 2018, 8:56am

Having the suggestions sorting algorithm it would be useful to receive some feedback from the users, so the A|B testing is needed in the perfect world. Will it be possible to add the ABtesting to the LT webpage (for example half of the users will face the new algo and the others will receive the old one, and all of them will be asked whether they’re satisfied by the suggestions order)?

dnaber · March 25, 2018, 9:25am

I don’t know whether asking users makes sense, but we can log their choices in the UI. When they select the first suggestion, that means the first suggestion is indeed the best one. Yes, we could probably combine this with A/B testing.

SkyCharger001 · March 25, 2018, 9:32am

just had an idea: Perhaps we could A/B test it with an opt-in double list,
with the results from both methods side by side.
(and, to reduce bias, randomly select which method goes to which side)

oserikov · March 25, 2018, 9:50am

sorry, maybe I’ve misunderstood. Will the suggestions pop-up look like this: (suggeted corrections are “a”, “b”, “c”, “d”)?

(old algo)	(new algo)
a	d
b	a
c	c
s	b

SkyCharger001 · March 25, 2018, 9:57am

yes, but with a small bit of randomization in regard to which algo gets which side to make it a (double) blind test.

oserikov · March 25, 2018, 10:05am

that would be very useful from the developers’ point of view, but wouldn’t it provoke questions from the users’ side? Maybe this behavior should be easy to switch-off then?

SkyCharger001 · March 25, 2018, 11:03am

That’s why I suggested it as an opt-in.

oserikov · March 25, 2018, 3:19pm

And the measure of quality of the sorting algorithms when comparing them is the number of times when the top 1 suggestion predicted was selected by user. We could take in account the weighted positions of the suggestions selected by user but I don’t see an obvious way to choose the weights etc.

SkyCharger001 · March 25, 2018, 3:42pm

another few ideas:
A. score both algos instead of just the one from which the word was selected.

B. set score for both sides as 1000*pow(pow(0.001,1/(listSize+20)),wordPosition)

oserikov · March 25, 2018, 4:06pm

A. – That’s what I wanted to do, maybe my message was unclear.
B. – The convex-down decreasing function (isn’t it?) is good approach but the constants are unclear to me as I said. I’ll give it a try, anyway.

SkyCharger001 · March 25, 2018, 4:25pm

allows the function to stretch along with the size of the list, while +20 allows small lists to not be too punishing. (but perhaps T&E is needed to get good constants)

oserikov · March 25, 2018, 4:47pm

This is the way I’ll take