Problem while creating synthesizer dictionary

I am trying to modify an English synthesizer dictionary as described on the site below.

http://wiki.languagetool.org/developing-a-tagger-dictionary#toc4

I have created it in the 3-column tab-separated format and the dictionary built successfully. But I cannot use the dictionary: I get no replacement when I use the newly built one.
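
For reference, the input uses the 3-column format from the wiki page above, i.e. inflected form, lemma and POS tag separated by tabs; a couple of illustrative entries:

boyar	boyar	NN
boyard	boyard	NN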

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i english_synth.dict -info english_synth.info -o dictionary.dump

java -cp languagetool.jar org.languagetool.tools.SynthDictionaryBuilder -i dictionary.dump -info org/languagetool/resource/en/english_synth.info -o result.dict

Please let me know if I did anything incorrectly while building the dictionary.

It is necessary to export words from the standard dictionary english.dict, not from the synthesizer dictionary english_synth.dict.

Thank you for your response, Yakov. But I am not trying to modify the spell checker dictionary; I am trying to modify the synthesizer dictionary. Please let me know if you have any suggestions regarding this.

For the synthesizer dictionary you need to export the POS-tag dictionary english.dict from /languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en:

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i org/languagetool/resource/en/english.dict -info org/languagetool/resource/en/english.info -o dictionary.dump

and then build the synthesizer dictionary:

java -cp languagetool.jar org.languagetool.tools.SynthDictionaryBuilder -i dictionary.dump -info org/languagetool/resource/en/english_synth.info -o result_synth.dict

Format of dictionary.dump:

boyar boyar NN
boyard boyard NN

If the format of the dictionary.dump file is different, the resulting dictionary will be broken.
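
A quick way to sanity-check the dump before building (just a sketch using awk) is to print any line that does not have exactly three fields:

awk 'NF != 3 {print NR": "$0}' dictionary.dump | head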


Thank you very much Yakov. I will try this immediately.

This method works fine. Thank you.
I have seen that there is also an option to export the spell checker dictionary.
But I am facing an issue while trying to export en_US.dict.
I guess you can help me with this too.

For a spellchecker dictionary with frequency data it is impossible to export the word list, due to a bug.
The English spellcheck dictionaries contain frequency data.
But you can get the word list from the source Hunspell dictionary using the Hunspell utility unmunch.

./unmunch en_US.dic en_US.aff > en_US1.txt

You can get the source en_US dictionary from:

https://github.com/marcoagpinto/aoo-mozilla-en-dict/tree/master/en_US%20(Kevin%20Atkinson)

Please let me know how you are creating the en_US.dict file using the en_US.dic and en_US.aff files, if possible.

First, do what Yakov said above.

Then follow this recipe:
http://wiki.languagetool.org/hunspell-support

For the English dictionary you can use the script
make_en_us_dict.sh

This script extracts the Hunspell dictionary, removes useless words (like “isn’t”, “aren’t”) from it and creates a new dictionary with word frequency data from https://github.com/mozilla-b2g/gaia/raw/master/apps/keyboard/js/imes/latin/dictionaries/en_us_wordlist.xml

This script requires languagetool-dev-4.2-SNAPSHOT.jar from the languagetool-dev package and unmunch from Hunspell.
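
Roughly, the steps inside the script look like this (only a sketch with hypothetical intermediate file names en_US_words.txt and en_US_final.txt; the real make_en_us_dict.sh may differ):

unmunch en_US.dic en_US.aff > en_US_words.txt
grep -v "'" en_US_words.txt | java -cp languagetool.jar:languagetool-dev-4.2-SNAPSHOT.jar org.languagetool.dev.archive.WordTokenizer en | sort -u > en_US_final.txt
java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i en_US_final.txt -info en_US.info -freq en_us_wordlist.xml -o en_US.dict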

Thank you Yakov. I will try this way to create the dict file.

@Yakov: There are bugs in unmunch, i.e. it does not support the full feature set of Hunspell, only the affix part of it. So it can produce words that are actually wrong.
The step needed to ensure this is not the case is to spellcheck the resulting file using Hunspell and remove the words reported by -L (wrong), e.g.:

unmunch en_AU.dic en_AU.aff > en_AU1.txt

hunspell -d en_AU -L en_AU1.txt > wrong.txt
while read line; do cat en_AU1.txt | grep -a -v "^$line$" > tmp.txt; mv tmp.txt en_AU1.txt; done < wrong.txt
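
As a side note, the same filtering can be done with a single grep call instead of the per-word loop (assuming wrong.txt contains one word per line):

grep -a -F -x -v -f wrong.txt en_AU1.txt > tmp.txt && mv tmp.txt en_AU1.txt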

cat en_AU1.txt | java -cp languagetool.jar:languagetool-dev-4.0-SNAPSHOT.jar org.languagetool.dev.WordTokenizer en | sort -u > en_AU.txt
java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i en_AU.txt -info en_AU.info -freq en_us_wordlist.xml -o en_AU_spell.dict

I updated the build scripts for dictionaries to avoid the unmunch bug:


Thank you very much Yakov. I will try this.

Please let me know from which path I need to execute this command.

You can use the updated version of the script for en_US directly:
make_en_us_dict.sh

I am getting the error below. Please let me know how to resolve it.

Error: Could not find or load main class org.languagetool.dev.archive.WordTokenizer

What version of LT are you using?
The main class org.languagetool.dev.archive.WordTokenizer exists in LT 4.2-SNAPSHOT and higher.

In LT 4.0 and below, this class is named org.languagetool.dev.WordTokenizer.
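
So for LT 4.2 and higher the tokenizer step from the earlier post would look roughly like this (a sketch; use the languagetool-dev jar that matches your LT version):

cat en_AU1.txt | java -cp languagetool.jar:languagetool-dev-4.2-SNAPSHOT.jar org.languagetool.dev.archive.WordTokenizer en | sort -u > en_AU.txt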

The error is occurring in LT 4.4.
Could you please tell me from which path I have to execute this command?