Neural Network Rules

I hope so. Everyone would benefit. What I meant was more similar to the way LT deals with n-grams at the moment.
http://wiki.languagetool.org/finding-errors-using-n-gram-data

Add the code to allow it, and if the user has a folder with neural network language models, their plugin/standalone tool/server will use them.
What I meant by an independent dependency is adding that code as a separate entity that binds to the regular install (a Maven dependency).

I do use an adblocker, and the site works well with it. I just got a random “unavailable server” type of error, but only on my second visit. It is probably specific to the filters you use.
The pattern of the offending filter could be useful for diagnosing the problem.

The Java process wasn’t running on the server any more when I checked it.

I will have a look at the implementation of the 3-gram rules this week. One simple approach I have in mind is shipping the rules (but not the language model) in the default LT distribution, but keeping them disabled unless LT is started with a --neural-network-languagemodel parameter pointing to a directory that contains the language models.
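A minimal sketch of that activation idea (class name, method, and the `confusion_sets.txt` marker file are hypothetical, not the actual LT API):

```java
import java.io.File;

// Hypothetical sketch: only enable the neural network rules when the user
// has pointed LT at a directory that actually contains the models.
public class NeuralNetworkRuleLoader {

    /** Returns true if the given directory looks like a usable model folder. */
    static boolean modelsAvailable(File dir) {
        return dir != null && dir.isDirectory()
                && new File(dir, "confusion_sets.txt").exists();
    }

    public static void main(String[] args) {
        // e.g. started as: java ... --neural-network-languagemodel /path/to/models
        File modelDir = args.length > 0 ? new File(args[0]) : null;
        if (modelsAvailable(modelDir)) {
            System.out.println("neural network rules enabled");
        } else {
            System.out.println("neural network rules disabled");
        }
    }
}
```

With no parameter given, the rules simply stay off, so users who have not downloaded the models notice no difference.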


So, here’s a little update:

First of all, the demo page supports Portuguese now and the list of supported confusion pairs has been extended. (NB: The rules for Portuguese are not calibrated, so there can be lots of false alarms.)

It is no longer necessary to create a Java file for each confusion pair; instead, the rules are generated dynamically from a neuralnetwork/confusion_sets.txt file which has the same format as the ngram confusion_sets.txt file. This makes adding new rules much easier. Furthermore, the word2vec language models are no longer part of the LanguageTool zip, but are loaded from a folder given in the configuration (just like ngram data can be loaded from a directory).
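To make the dynamic generation concrete, here is a hedged sketch of parsing that file format. I am assuming the ngram-style layout with semicolon-separated fields and `#` comments (e.g. `their; there; 10`); the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: read an ngram-style confusion_sets.txt (lines like
// "their; there; 10", "#" starts a comment) and extract the word pairs,
// so one rule can be generated per pair instead of one Java file per pair.
public class ConfusionSetParser {

    static List<String[]> parse(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String cleaned = line.replaceFirst("#.*", "").trim();
            if (cleaned.isEmpty()) {
                continue;  // skip comments and blank lines
            }
            String[] parts = cleaned.split(";");
            if (parts.length < 2) {
                continue;  // malformed line, ignore
            }
            // the first two fields are the confusable words
            pairs.add(new String[] { parts[0].trim(), parts[1].trim() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# a comment line",
                "their; there; 10",
                "schon; schön; 5"
        );
        for (String[] pair : parse(lines)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

In the real code each parsed pair would then be turned into a rule instance at startup rather than printed.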

It keeps looking better!
Many thanks for adding the Portuguese page. It seems pretty great now, so… is there a roadmap for inclusion in the main LT release? I am eager to start working with this.

@gulp21 Is there already a complete quality comparison with the existing ngram approach? As soon as the new approach is as good or better (or very close), we can start switching to it. Actually, we can integrate it even now and activate / make it default later.

Thank you for the great work. Does the command-line API on GitHub already support the neural network messages in the JSON output? Thanks!

I will do a comparison as part of the first summary for my project work. I start writing it this week and will share it here. My feeling is that there are some structures where ngrams work better, and some others where the neural network works better.

The neural network rules should work with every API which also supports the languagemodel parameter/config entry, as long as the word2vec parameter/config entry is set correctly. I have noticed that there is some inconsistency with the names (some places use word2vecDir, others use word2vecmodel), which I will fix this week.

So here is the summary of my work. tl;dr: The neural network rules I’ve created are as good as the 3-gram rules, but need less memory and are faster.

I can prepare a pull request for including the neural network rules in LT.

That would be great. We should work on a smooth migration, so one day we can disable the ngram rule and enable the new rule, without the users noticing a difference.

BTW, about the low recall for schon/schön: are you sure that there’s no encoding bug in your setup? It’s the only pair with an umlaut and the only one that’s a lot worse than the ngram approach.

At the moment, the word2vec models (around 70 MB per language) are part of the neuralnetwork branch. Do you think it would be better to remove them and have a separate download page for them in order to keep the repository smaller?

I’ve just verified it: There are no encoding problems.

Yes, that’s better. We already have a very large repo size due to the other binary files we use.

OK, I will rebase my branch so that the big files are not part of the history. Should we ship the smaller files (13 KB per confusion set) together with the word2vec model, or put them into the resources folders of the language modules?

As the confusion set files cannot be used without the models, I think we can store them with the models?

I have created a pull request. The word2vec models can be downloaded here. Should we create a repository for the word2vec data?

There’s a directory bak~ in de/neuralnetwork/, is that on purpose? How have the confusion pairs been selected? I’m asking because not all confusion sets from the ngram confusion_sets.txt are covered yet, is that correct?

Well, not really on purpose. I simply compressed my word2vec folder, which also contains log files and weight files for my experiments which are not part of the final version.

You are referring to the confusion_sets.txt for the 3-gram rules, aren’t you? I have excluded those confusion pairs for which I didn’t have enough training data.

Will it be possible to get this working for Dutch without me having to edit the Dutch code? And if so, when and how?
It looks very promising, especially since there are a lot of good cases for Dutch.

You need to add two methods to the Java code, as documented here. If you point me to a list of confusion pairs for Dutch, I can run a big job on the computation cluster I have access to in order to test how well the confusion pairs work.
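As a rough sketch of what those two methods could look like in a Dutch language class, with stub types standing in for the real LanguageTool classes (the actual names and signatures are in the linked documentation, so treat everything here as an assumption):

```java
import java.io.File;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Stub types standing in for the real LT classes.
class Word2VecModel {
    final File dir;
    Word2VecModel(File dir) { this.dir = dir; }
}

class Rule { }

public class Dutch {

    // 1) Tell LT where/how to load the word2vec model for this language.
    public Word2VecModel getWord2VecModel(File indexDir) throws IOException {
        return new Word2VecModel(new File(indexDir, "nl"));
    }

    // 2) Return the neural network rules built from that model.
    public List<Rule> getRelevantWord2VecModelRules(Word2VecModel model) throws IOException {
        if (model == null) {
            return Collections.emptyList();
        }
        // In the real code this would create one rule per confusion pair
        // listed in the Dutch confusion_sets.txt.
        return Collections.singletonList(new Rule());
    }

    public static void main(String[] args) throws IOException {
        Dutch nl = new Dutch();
        Word2VecModel model = nl.getWord2VecModel(new File("/path/to/word2vec"));
        System.out.println("rules: " + nl.getRelevantWord2VecModelRules(model).size());
    }
}
```

The point is only that the language class wires up (a) model loading and (b) rule creation; no Dutch-specific logic beyond that should be needed.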

Tokens and words are used interchangeably in the text, except for the one remark. Better to use “tokens” consistently. But even then, it is unclear whether these are LT tokens. Dutch has a tuned tokenizer in LT that DOES see oma’s as one token, as it should.
So I guess the preparation of the files for the neural network should use the same tokenizer to get correct results? Maybe the preparation should use the same code as LT?
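To illustrate the concern with a toy comparison (this is not LT’s actual Dutch tokenizer, just a naive split versus an apostrophe-aware one):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the tokenization mismatch: a naive tokenizer splits
// Dutch "oma's" into two tokens, while an apostrophe-aware tokenizer (like
// the tuned Dutch tokenizer in LT) keeps it as one token. If the training
// data for the neural network is tokenized differently from how LT
// tokenizes at runtime, the tokens will not match.
public class TokenizerDemo {

    static List<String> naive(String text) {
        // splits on anything that is not a letter
        return Arrays.asList(text.split("[^\\p{L}]+"));
    }

    static List<String> apostropheAware(String text) {
        // keeps apostrophes inside words, as Dutch needs
        return Arrays.asList(text.split("[^\\p{L}']+"));
    }

    public static void main(String[] args) {
        System.out.println(naive("oma's fiets"));            // [oma, s, fiets]
        System.out.println(apostropheAware("oma's fiets"));  // [oma's, fiets]
    }
}
```

This is exactly why preparing the training files with LT’s own tokenizer code seems like the safest route.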