Problem while creating synthesizer dictionary

I am trying to modify an English synthesizer dictionary as described on the site below.

http://wiki.languagetool.org/developing-a-tagger-dictionary#toc4

I have created it in the 3-column tab-separated format and the dictionary built successfully. But I cannot use the dictionary: I get no replacement when I use the newly built one.
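
For reference, the input uses the 3-column format from the wiki page above, i.e. inflected form, lemma and POS tag separated by tabs; a couple of illustrative entries:

boyar	boyar	NN
boyard	boyard	NN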

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i english_synth.dict -info english_synth.info -o dictionary.dump

java -cp languagetool.jar org.languagetool.tools.SynthDictionaryBuilder -i dictionary.dump -info org/languagetool/resource/en/english_synth.info -o result.dict

Please let me know if I did anything incorrectly while building the dictionary.

It is necessary to export words from the standard dictionary english.dict, not from the synthesizer dictionary english_synth.dict.

Thank you for your response, Yakov. But I am not trying to modify the spell checker dictionary; I am trying to modify the synthesizer dictionary. Please let me know if you have any suggestions regarding this.

For the synthesizer dictionary you need to export the POS-tag dictionary english.dict from /languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en:

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i org/languagetool/resource/en/english.dict -info org/languagetool/resource/en/english.info -o dictionary.dump

and then build the synthesizer dictionary:

java -cp languagetool.jar org.languagetool.tools.SynthDictionaryBuilder -i dictionary.dump -info org/languagetool/resource/en/english_synth.info -o result_synth.dict

Format of dictionary.dump:

boyar boyar NN
boyard boyard NN

If the format of the dictionary.dump file is different, the resulting dictionary will be broken.
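
A quick way to sanity-check the dump before building (just a sketch using awk) is to print any line that does not have exactly three fields:

awk 'NF != 3 {print NR": "$0}' dictionary.dump | head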


Thank you very much Yakov. I will try this immediately.

This method works fine. Thank you.
I have seen that there is also an option to export the spell checker dictionary.
But I am facing an issue while trying to export en_US.dict.
I guess you can help me with this too.

For a spellchecker dictionary with frequency data it is impossible to export the word list, due to a bug.
The English spellcheck dictionaries contain frequency data.
But you can get the word list from the source Hunspell dictionary using the Hunspell utility unmunch.

./unmunch en_US.dic en_US.aff > en_US1.txt

You can get the source en_US dictionary from:

https://github.com/marcoagpinto/aoo-mozilla-en-dict/tree/master/en_US%20(Kevin%20Atkinson)

Please let me know how you are creating the en_US.dict file using the en_US.dic and en_US.aff files, if possible.

First, do what Yakov said above.

Then follow this recipe:
http://wiki.languagetool.org/hunspell-support

For the English dictionary you can use the script
make_en_us_dict.sh

This script extracts the Hunspell dictionary, removes useless words (like “isn’t”, “aren’t”) from it and creates a new dictionary with word frequency data from https://github.com/mozilla-b2g/gaia/raw/master/apps/keyboard/js/imes/latin/dictionaries/en_us_wordlist.xml

This script requires languagetool-dev-4.2-SNAPSHOT.jar from the languagetool-dev package and unmunch from Hunspell.
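
Roughly, the steps inside the script look like this (only a sketch with hypothetical intermediate file names en_US_words.txt and en_US_final.txt; the real make_en_us_dict.sh may differ):

unmunch en_US.dic en_US.aff > en_US_words.txt
grep -v "'" en_US_words.txt | java -cp languagetool.jar:languagetool-dev-4.2-SNAPSHOT.jar org.languagetool.dev.archive.WordTokenizer en | sort -u > en_US_final.txt
java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i en_US_final.txt -info en_US.info -freq en_us_wordlist.xml -o en_US.dict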

Thank you Yakov. I will try this way to create the dict file.

@Yakov: There are bugs in unmunch, i.e. it does not support the full feature set of Hunspell, only the affix part of it. So it can produce words that are actually wrong.
The step needed to ensure this is not the case is to spellcheck the resulting file using Hunspell and remove the words reported by -L (wrong), e.g.:

unmunch en_AU.dic en_AU.aff > en_AU1.txt

hunspell -d en_AU -L en_AU1.txt > wrong.txt
while read line; do cat en_AU1.txt | grep -a -v "^$line$" > tmp.txt; mv tmp.txt en_AU1.txt; done < wrong.txt
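
As a side note, the same filtering can be done with a single grep call instead of the per-word loop (assuming wrong.txt contains one word per line):

grep -a -F -x -v -f wrong.txt en_AU1.txt > tmp.txt && mv tmp.txt en_AU1.txt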

cat en_AU1.txt | java -cp languagetool.jar:languagetool-dev-4.0-SNAPSHOT.jar org.languagetool.dev.WordTokenizer en | sort -u > en_AU.txt
java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i en_AU.txt -info en_AU.info -freq en_us_wordlist.xml -o en_AU_spell.dict

I updated the build scripts for dictionaries to avoid the unmunch bug:


Thank you very much Yakov. I will try this.

Please let me know from which path I need to execute this command.

You can use the updated version of the script for en_US directly:
make_en_us_dict.sh

I am getting the error below. Please let me know how to resolve it.

Error: Could not find or load main class org.languagetool.dev.archive.WordTokenizer

What version of LT are you using?
The main class org.languagetool.dev.archive.WordTokenizer exists in LT 4.2-SNAPSHOT and higher.

In LT 4.0 and below, this class is named org.languagetool.dev.WordTokenizer.
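
So for LT 4.2 and higher the tokenizer step from the earlier post would look roughly like this (a sketch; use the languagetool-dev jar that matches your LT version):

cat en_AU1.txt | java -cp languagetool.jar:languagetool-dev-4.2-SNAPSHOT.jar org.languagetool.dev.archive.WordTokenizer en | sort -u > en_AU.txt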

The error is occurring in LT 4.4.
Could you please tell me from which path I have to execute this command?