Neural Network Rules

Hi,

as part of my project work at university, I looked into detecting confused words (e.g. to/too, than/then) using the word2vec model by Mikolov et al. and neural networks that take 5-grams as input. I’ve integrated some neural network word confusion rules into LanguageTool (source code), and a small online demo for German and English is available here. (The forms are sometimes buggy, and if my LanguageTool server crashes I have to restart it by hand.)

So in what way are the neural-network-based rules better than the existing 3-gram rules?

  1. Smaller files: The zipped n-gram data for German has a size of 1.6 GB, whereas the language model for the neural network has a size of 83 MiB (+ 12 KB for each confusion pair).

  2. The n-gram rules can only detect an error if the correct 3-gram is part of the corpus, but the neural network can also detect errors if a similar 5-gram was part of the training corpus.

And what are the disadvantages?

  1. As always with neural networks: you cannot really say what they actually learned or what they “think”. We can only see that it works.

  2. Compared with the 3-gram rules, recall is worse for the same level of precision. That’s why the rules are calibrated such that precision is > 99 % (> 99.9 % would be better for everyday use, though).
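To give an idea of what that calibration means in practice, here is a simplified sketch (all names are made up for the example; this is not the actual code):

```python
# Simplified sketch of the calibration idea: pick the smallest decision
# threshold at which the rule's precision on a labelled validation set
# reaches the target precision. All names are illustrative.

def calibrate_threshold(samples, target_precision=0.99):
    """samples: list of (preference, is_real_error) pairs, where `preference`
    says how strongly the network favours the alternative word over the word
    that is actually in the text, and `is_real_error` says whether that word
    really was wrong."""
    for threshold in (x / 10 for x in range(100)):
        flagged = [is_err for pref, is_err in samples if pref > threshold]
        if not flagged:
            return None  # the rule never fires at this threshold
        precision = sum(flagged) / len(flagged)
        if precision >= target_precision:
            return threshold  # stricter thresholds trade recall for precision
    return None
```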


Many thanks for sharing this here.

This is very promising, since this allows the creation of reasonably sized language packs.
This seems the way to go, as long as the rule creation process is fast and easy to automate.

If my understanding of this technology is correct, rule creation is not fast, since you will need to tag a big corpus for each pair you are looking for.
Can you confirm whether my guess is correct and give a time estimate for rule creation, as well as a broad summary of the tasks involved?

There are some issues that this solves, and I would like to try it on the Portuguese module, but time is always a constraint. That information would help me decide.

PS - I tested yesterday and it was working great, but today the neural network server/text box is down.

Wow! The demo is really, really impressive.

This is great! Do you have any idea whether results can be improved by training on larger corpora, e.g. by adding Wikipedia and Tatoeba?

PS: I currently get “Error: Did not get response from service”

The most time-consuming task is creating the language model which is shared by all rules of one language. Creating the language model for English, which contains 42,121 tokens and was created from newspaper articles containing 7,298,836 tokens, took 40 minutes.
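Roughly speaking, this step builds the word2vec embeddings and the vocabulary that all confusion-pair networks of one language share. A sketch of that step with gensim (file names and parameters here are only illustrative, and gensim is just one way to do it):

```python
# Rough gensim-based sketch of the language-model step: train word2vec
# embeddings on a tokenized corpus and save the vectors plus vocabulary.
from gensim.models import Word2Vec

# one tokenized sentence per line, e.g. exported newspaper articles
with open("english_newspaper_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=64,  # dimensionality of the word embeddings
    window=2,        # two tokens of context on each side, matching the 5-gram view
    min_count=5,     # drop rare tokens to keep the vocabulary small
    workers=4,
)

# the embedding matrix plus the vocabulary is what all confusion-pair
# networks of one language share
model.wv.save_word2vec_format("languagemodel.txt")
```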

Training a new neural network for a confusion pair takes around 5 minutes, depending on how many sentences in the training corpus contain the tokens of the pair; without a GPU it is slower. For example, of/off was trained with around 30,000 sentences in 4 minutes.
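For illustration, extracting the per-pair training samples from a tokenized corpus could look roughly like this (a simplified sketch; the names are made up and not the ones from my code):

```python
# Sketch of extracting per-pair training samples, following the 5-gram setup
# described above: two tokens of context on each side are the input, the word
# that actually occurred is the label.

PAIR = ("of", "off")
PAD = "<pad>"

def extract_samples(sentences, pair=PAIR):
    samples = []
    for tokens in sentences:
        padded = [PAD, PAD] + tokens + [PAD, PAD]
        for i in range(2, len(padded) - 2):
            if padded[i] in pair:
                context = padded[i - 2:i] + padded[i + 1:i + 3]
                label = pair.index(padded[i])  # 0 = first word, 1 = second word
                samples.append((context, label))
    return samples

with open("english_newspaper_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

training_data = extract_samples(sentences)
print(len(training_data), "samples for", "/".join(PAIR))
```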

I haven’t published the code for training the neural network yet, since I have to clean it up a bit and add some decent documentation. @tiagosantos If you want, I can add support for some Portuguese confusion pairs; just give me a few of them. I could take that as a chance to write good documentation for the training process and the initial setup.

I’ve just restarted my LanguageTool server. I do not really understand why it stops running every now and then.

Yes, I think the rules would benefit from larger corpora. Currently I use newspaper articles for learning and Wikipedia for validation. The problem with the newspaper corpus is that 1st and 2nd person verb forms are rare, so for instance training a network for the German word pair seid/seit didn’t work because there are only 49 sentences containing “seid”. Tatoeba has 667 sentences with “seid”; I’ll give it a try, but even more data would be better.

Are you training on plain text or can you train on n-grams? With n-grams you could use the Google n-gram data.

The same holds for Wikipedia.

As you probably know, Tatoeba contains a lot of common errors. From my understanding, this might harm the quality.

Yeah, that’s true. I did a test with seid/seit using Tatoeba data and it works well. It’s now also available on the demo page.

As my current neural network design uses 5-grams as input, I could use the 5-grams for learning rules. The Google data needs some preprocessing, but I think it is worth looking into, so I’ll try to write some code to convert the 5-gram data to my input format.
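A rough sketch of such a conversion (assuming the tab-separated “ngram, year, match_count, volume_count” layout of the downloadable files, so check the version you actually get; the file name is made up):

```python
# Keep only 5-grams whose middle token is one of the confusion pair and
# aggregate the match counts over the years.
from collections import Counter

PAIR = ("seid", "seit")

def convert_5grams(path, pair=PAIR):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 4:
                continue
            ngram, _year, match_count, _volume_count = parts
            tokens = ngram.lower().split()
            if len(tokens) == 5 and tokens[2] in pair:
                context = tuple(tokens[:2] + tokens[3:])
                counts[(context, pair.index(tokens[2]))] += int(match_count)
    return counts

samples = convert_5grams("german-5gram-chunk.txt")
```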

That would be a great opportunity to learn a bit more about how that works. Regarding neural networks, I only know the concept, and it really interests me. I also prefer using only newspapers for rule validation and creation, and I can point you to proper sources. I would stay away from Wikipedia and Tatoeba when creating the model, because they merge 4 different Portuguese standards, which would lower the quality of the model.

Google doesn’t have a publicly available Portuguese n-gram data set, even though I prepared Portuguese to be ready to accept n-gram data if the user builds it or buys it online. Some disabled confusion pairs can be found here:

NOTE: The header says # English confusion sets, because I forgot to change that when creating the dummy file. The pairs are Portuguese.

The best pairs for me are not there. The e->é and por->pôr pairs are best suited for this rule, since pattern rules are not good at detecting these confusions.

The best free corpus I know of for this task is the DCEP: Digital Corpus of the European Parliament.
It is available for all major European languages, and even some minor ones.
You can find it at DCEP: Digital Corpus of the European Parliament - European Commission

If you guide me through the process, I can do some grunt work, and I will definitely give feedback on your documentation if you wish.

I had a look at 1/4 of Google’s German 5-gram data for the seid/seit pair: there are only around 20 5-grams where “seid” is in the middle, so the whole corpus probably has fewer than 100 usable 5-grams, which is less than Tatoeba.

Thank you for the link, but unfortunately, the download is very slow and stops after a while.
I will have a look at Portuguese (Portugal) tomorrow, using data from Downloads unless you point me to another corpus.

I am not aware of any other better big corpus. Personally, I also use newspapers as my major source of validation, but it would be too time-consuming for you to compile articles for this task.
The resource you pointed out is great. Many thanks. Added to my list.

Hi, this is indeed very promising and challenging. I would be interested for Dutch, not just for confusion pairs, but for every deviation from ‘common language use’.

But apart from that, I have a reasonably large text collection for quite a few languages. They are dirty, but they can all be polished by checking the number of spelling errors per source, and individual lines with spelling and grammar errors could be removed to clean them up further.

Furthermore, I would like to learn more about how this work is being done!

Could this also be used for finding incorrectly separated words (the “English disease”)?

I can do training on a Norwegian corpus (I cannot share the corpus), and probably also on Swedish and Danish ones (I must ask the provider).

On corpora with 1st and 2nd person verb forms: I can probably get access to an even larger Norwegian corpus, but I’m not quite sure about the quality, as it is scanned and OCR-read.

It could be easier for this specific use case to try to find anomalies instead of correct sentence structures. Anomalies would be non-traversable patterns in the model.

That would be great. This is an important problem for (I believe) Dutch and (certainly) German.

So, source code and readme for the neural network are available here.

I created a language model for Portuguese and added a por/pôr rule. I don’t speak Portuguese, but I think the result is good:

(sentences from Tatoeba)

This is not possible with the current architecture, which suffers from the same problem as the n-gram rules: The input is tokenized and the neural network gets 4 tokens of the context as input and tries to get the word in the middle right.
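To make that concrete, here is a minimal sketch of such a network (written with Keras purely for illustration; the layer sizes and the framework are not necessarily what I actually use):

```python
# Minimal sketch of the per-pair classifier: 4 context token ids go in,
# the network outputs one score per word of the confusion pair.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 42_121   # vocabulary of the shared language model
EMBEDDING_DIM = 64
CONTEXT_TOKENS = 4    # two tokens before and two after the middle position

model = tf.keras.Sequential([
    tf.keras.Input(shape=(CONTEXT_TOKENS,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),  # word2vec-style lookup
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2),  # one score per word of the pair
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Dummy data just to show the expected shapes:
# x holds context token ids, y holds 0 or 1 for the word of the pair.
x = np.random.randint(0, VOCAB_SIZE, size=(1000, CONTEXT_TOKENS))
y = np.random.randint(0, 2, size=(1000,))
model.fit(x, y, epochs=1, verbose=0)
```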

I know nothing of the methods. Can you explain why n-grams are used and not full sentences? And why is the middle relevant?

It is working great. This is awesome, and you chose a good pair.
When I gave you the list, I forgot that 4 min * 3000 pairs is a lot of computing time, especially when those pairs were not triaged for usefulness.

Your documentation is great and easy to follow, but there are still a few questions pending.

Are you planning on integrating this later with LT, maybe as an independent dependency? Since there are server crashes, maybe due to user misuse, are you going to work further on this project, and is the “rules” profile final?

Sorry for asking so many questions. I would really like to use this technology, but before starting, I would like to know how final this is at the moment.

This is my first experiment with detecting grammar errors using machine learning technologies. I wanted to start with something simple, so I took the task of detecting confusion of words using 5-grams as input. Using whole sentences as input is also possible, but requires a different neural network architecture; I think recurrent neural networks (which are capable of amazing things) are a good starting point, but I do not have much experience with that kind of neural network.

OK, 3000 pairs is a lot. I already felt that the rule creation workflow of creating a new Java class for each confusion pair is not ideal. I plan to change it and make it similar to the approach we use for the 3-gram rules, so that you only have to copy the network’s txt output to the resources folder and write the calibration parameter into a text file (or maybe even calibrate semi-automatically). Maybe I’ll have time to implement it this week.

It would be great if my work could make it into LanguageTool. Probably not as part of the official distribution, because the language models are still quite big, but maybe as part of the web service. I’m not sure how to implement it as an optional dependency, but I think it could be done similarly to the optional 3-gram rules.

And concerning the stability of the demo page: I don’t know what is going on on the server. Nothing suspicious in the logs and no hint that LanguageTool has crashed.

When I first mentioned that it didn’t work for me, it was actually my ad blocker (NoScript). For some reason, it gives me an error even when I turn off the blocker for your site. Maybe others have similar issues.