The document states you will have to rebuild LT for every AI rule. That makes it unpleasant for non-programmers like me.
Thanks for the remark, I will change the text and make clear that it always refers to LT tokens. That's why the corpus must be tokenized by LT before neural network training can start.
As the neural network rules are now separate from the LT package, recompilation is no longer necessary. I made some changes to the code anyway (not yet published), and I will have updated the readme by the end of the week.
@dnaber I will work on the remarks in the pull request today.
That sounds great.
I guess I will first have to prepare a 'corpus', which probably means a text file with a sentence a line. Or will a paragraph per line also do?
Yes, you need a large corpus. It's not important if it's one sentence per line or one paragraph per line.
I have got a corpus. Next step is tokenizing. There is nothing about tokenizing text in the LT help. How is that done? If space is the separator, there is no need for tokenizing I guess.
See the instructions here. I've updated them, but haven't had the time to test them on a clean system.
I am sorry, but this is all written for (Java) developers. The development environment for LT has become much too difficult for me.
@gulp21 Sorry if I missed this in the documentation, but is there a suggested minimum number of occurrences to add a new pair? What's the number you have been using?
If there are at least 500 occurrences for each token of a confusion pair, you should get good results, but more is usually better. The number also depends on the number of ways the token can be used (e.g. the German word “sein” can be a verb or a pronoun, so both usages must be covered often enough by the training corpus).
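To check the 500-occurrence rule of thumb before training, something like the following sketch could be used. It is not part of the languagetool-neural-network repository; the function names are made up for illustration, and it assumes the corpus has already been tokenized so that tokens are separated by whitespace.

```python
from collections import Counter


def count_tokens(lines, tokens):
    """Count how often each token of interest occurs in whitespace-tokenized lines."""
    wanted = set(tokens)
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token in wanted:
                counts[token] += 1
    return counts


def has_enough_data(lines, pair, minimum=500):
    """Return True if every token of the confusion pair occurs at least `minimum` times."""
    counts = count_tokens(lines, pair)
    return all(counts[token] >= minimum for token in pair)


if __name__ == "__main__":
    # Hypothetical in-memory corpus; a real run would read the tokenized corpus file.
    sample = ["their house is there", "there they are", "their car"]
    print(count_tokens(sample, ("their", "there")))
    print(has_enough_data(sample, ("their", "there"), minimum=500))
```

Note that this only counts surface forms. It cannot tell whether all usages of a token (such as “sein” as verb and as pronoun) are covered.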
@gulp21 thanks, I have two more questions:
- Does nn_words.py do the same evaluation that we would do in LT? I.e. is it enough to get good values from the Python script, or is further evaluation inside LT needed?
- I see the evaluation values change a lot when nn_words.py is called more than once with the same input. One reason is that shuf uses a different random seed every time. Is that on purpose? When I use the same seed, values vary a lot less, but still. Is this due to random initial values when training?
Calling validate_error_detection(suggestion_threshold=t, error_threshold=t) is basically the same as having the rule in LT with a threshold of t. So it is enough to get good values from the Python program, and if you try different thresholds in Python, you can use an appropriate threshold for the rule in LT.
Yes, the neural network is initialized with random values and the training/test split isn’t deterministic. This randomness is used to be able to assess the general structure of the neural network by feeding it repeatedly with the same data. If you just want to use the network, there is no harm in using a fixed random seed.
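For the data side of that randomness, a fixed seed can be illustrated with a small sketch. This is not the code used in nn_words.py; `split_corpus`, its parameters and the default seed are assumptions made up for this example, and it only makes the shuffle and the training/test split repeatable. The random initial weights of the network are a separate source of variation that would have to be seeded in the training library itself.

```python
import random


def split_corpus(lines, test_fraction=0.1, seed=42):
    """Shuffle lines with a fixed seed and split them into (training, test) lists."""
    rng = random.Random(seed)  # private generator, so other code is not affected
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    sentences = ["sentence %d" % i for i in range(10)]
    train_a, test_a = split_corpus(sentences, test_fraction=0.2, seed=1)
    train_b, test_b = split_corpus(sentences, test_fraction=0.2, seed=1)
    # Same seed, same split on every call.
    print(train_a == train_b and test_a == test_b)
```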
I have to correct my statement: This function is tailored to evaluation of confusion sets with more than 1 token, where more than 1 token could be correct. The best indicator for the performance of a confusion pair rule is the validate function, but it does not use the same algorithm as LT (basically because it calculates precision and recall for the task of choosing the best word, not for detecting the wrong word).
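To make the difference concrete, here is a sketch of how precision and recall would be computed for the detection task (flagging a wrong word), using the standard definitions. It is not the repository's validate function; the function name and the input format are assumptions for illustration only.

```python
def detection_precision_recall(examples):
    """Compute (precision, recall) for error detection.

    `examples` is a sequence of (is_actual_error, was_flagged) boolean pairs,
    one pair per occurrence of a confusion pair token in the test data.
    """
    examples = list(examples)
    true_pos = sum(1 for is_error, flagged in examples if is_error and flagged)
    false_pos = sum(1 for is_error, flagged in examples if not is_error and flagged)
    false_neg = sum(1 for is_error, flagged in examples if is_error and not flagged)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical outcomes: two errors caught, one missed, one false alarm.
    outcomes = [(True, True), (True, True), (True, False), (False, True), (False, False)]
    print(detection_precision_recall(outcomes))
```

In the "choosing the best word" task, every occurrence gets a prediction, whereas in detection only occurrences where the written word is wrong count as positives, so the two sets of numbers are not directly comparable.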
I decided to use an Amazon Web Services (AWS) instance to try to create my own word2vec stuff and found the neural network readme helped, but didn’t get me all the way there.
So below are the steps I’ve used that should get you to the end of the “Creating a language model” step…
Requirements = An AWS account. Knowledge of AWS, SSH, Linux command line.
Issues = Not everything is explained, so if it goes wrong you’re on your own. Others can flesh this out if it’s useful.
Create an EC2 instance using a ‘Deep Learning AMI (Ubuntu) Version 3.0’. Other versions may work, but to be certain: these instructions are for the AMI image ‘ami-6d385e14’. I used a ‘large’, 64-bit instance without GPU. (I have not verified any other instance size; too little memory may cause issues, but I don’t know.)
Lock down the security group so only you can SSH into it. No other ports need to be open. (Be safe).
SSH into the instance.
Yes - we’re doing everything as ROOT - I do NOT recommend doing this, but I didn’t have time to figure out the normal user way of installing everything in the right place. I’m sure others will point out the best way to fix this.
apt-get install gradle python3-pip python3-dev
Only do ONE of the following…
The instance I used above does not have any GPUs, so issue the following command and skip the GPU one.
pip3 install tensorflow
-OR- If you are using an instance with GPUs, do not use the above command, instead use this one. (I haven’t tested this!)
pip3 install tensorflow-gpu
git clone https://github.com/gulp21/languagetool-neural-network.git
Change directory into the “languagetool-neural-network” folder that’s created by the clone command above.
You need to have or create your own corpus.
I used the ‘tatoeba’ corpus which I had extracted all the English sentences from. (Removed ID numbers and language code from each line). Let’s call it “corpus.txt”. If you use the “example-corpus.txt” you’ll get an error at the end because there’s not enough data in it.
sed -E "s/^[0-9]+\W+//" corpus.txt > training_corpus.txt
shuf training_corpus.txt | head -n300000 > language_model_corpus.txt
./gradlew tokenizeFile -PlanguageCode="en-US" -PsentencesFile="language_model_corpus.txt"
Change the lines that look like…
… to …
TF_INC=$(python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
TF_LIB=$(python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')
g++ -std=c++11 -fPIC -O2 -D_GLIBCXX_USE_CXX11_ABI=0 -shared -o word2vec_ops.so word2vec_ops.cc word2vec_kernels.cc -I $TF_INC -L$TF_LIB -ltensorflow_framework
python3 src/main/python/embedding/word2vec.py --train_data language_model_corpus.txt-tokens --eval_data src/main/python/embedding/question-words.txt --save_path . --epochs_to_train 10
The biggest issue was the ‘g++’ step: until I altered the ‘mutex.h’ file and added the ‘TF_LIB’ and ‘LD_LIBRARY_PATH’ settings, things would compile but not work.
I hope this is useful to someone else.
I found the README.md very useful, so thanks for doing all the hard work Markus/gulp21!!
I finally started fiddling with this, though progress is slow, since I am too dispersed across projects. Not enough time to study and be productive.
I noticed one simple thing that can be addressed, and that will greatly improve this extension’s usefulness.
The help button in the options directs correctly to your page, where nearly all the steps to create new neural networks are available, but I believe that a link to the downloadable files you created is not found there. It is referred to in LanguageTool CHANGES.md, where you introduced the feature, though.
Neural network based rules for confusion pair disambiguation using the word2vec model are available for English, German, and Portuguese. The necessary data must be downloaded separately from https://fscs.hhu.de/languagetool/word2vec.tar.gz. For details, please see:
Forum discussion: Neural Network Rules
Paper: “Development of neural network based rules for confusion set disambiguation in LanguageTool” by Markus Brenneis and Sebastian Krings: https://fscs.hhu.de/languagetool/summary.pdf
Is it possible to add a reference and link to the pre-packaged files in your project README.md? This would allow more people to take advantage of this great work, until further sets are created or improved.
I’ve added a section for users of LT to the README, thank you for the suggestion. The repository now also contains some experimental code, which shouldn’t affect the methods described in the README, so I hope I didn’t break anything.
You mean that I download https://fscs.hhu.de/languagetool/word2vec.tar.gz, re-zip its sub-directories and upload them separately at http://languagetool.org/download/, is that correct? If so, I’ll do that.
That’s correct. Thank you.
Done, they are now at http://languagetool.org/download/word2vec/. I have also removed the bak~ directory that was in some of the sub directories.
Can I just clear up one point of confusion for me…
Is it sufficient to just have the “dictionary.txt” and final “final_embeddings.txt” files that are in the various word2vec archives to train new “confusion_sets.txt” candidates?
Would the average developer not also need the whole set of sentences that were used in creation of those files? That is - would the language specific wiki dump XML file and tatoeba sentences etc. used when making the dictionary/embeddings not also be required?
To be specific… Would the “training-corpus.txt” file not need to match? I guess another training corpus could be used, but then the dictionary may not contain all the words used in the corpus - is this an issue?
Thanks in advance for your help!
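One way to get a feeling for the last point would be to measure how many tokens of a different training corpus are missing from an existing dictionary. The sketch below is only an illustration: it assumes that “dictionary.txt” holds one token per line and that the corpus is whitespace-tokenized, neither of which I have verified against the actual files, and the function names are made up.

```python
def load_vocabulary(dictionary_lines):
    """Build a set of known tokens, assuming one token per line."""
    return {line.strip() for line in dictionary_lines if line.strip()}


def oov_rate(corpus_lines, vocabulary):
    """Return the fraction of corpus tokens that are not in the vocabulary."""
    total = 0
    unknown = 0
    for line in corpus_lines:
        for token in line.split():
            total += 1
            if token not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical data; a real check would read dictionary.txt and the new corpus.
    vocabulary = load_vocabulary(["the\n", "cat\n", "\n"])
    print(oov_rate(["the cat sat", "the dog"], vocabulary))
```

A high rate would suggest that many words in the new corpus have no embedding in the downloaded files.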