I’m trying to generate a spelling file for Asturian from a plain-text UTF-8 file containing 98,025,136 entries (1.8 GB). When I run create_dict.sh (which calls java […] SpellDictionaryBuilder), the process fails with the following errors:
Final size:
98025136 /tmp/lt-dictionary.new
Running Morfologik FSACompile.main with these options: [--exit, false, -i, /tmp/SpellDictionaryBuilder3684917049841591582.txt, -o, /tmp/ast_ES.dict, -f, CFSA2]
An unhandled exception occurred. Stack trace below.
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3689)
at java.base/java.util.ArrayList.grow(ArrayList.java:238)
at java.base/java.util.ArrayList.grow(ArrayList.java:243)
at java.base/java.util.ArrayList.add(ArrayList.java:486)
at java.base/java.util.ArrayList.add(ArrayList.java:499)
at morfologik.tools.BinaryInput$1.process(BinaryInput.java:82)
at morfologik.tools.BinaryInput.forAllLines(BinaryInput.java:111)
at morfologik.tools.BinaryInput.readBinarySequences(BinaryInput.java:66)
at morfologik.tools.FSACompile.call(FSACompile.java:64)
at morfologik.tools.FSACompile.call(FSACompile.java:21)
at morfologik.tools.CliTool.main(CliTool.java:133)
at morfologik.tools.FSACompile.main(FSACompile.java:78)
at org.languagetool.tools.DictionaryBuilder.buildFSA(DictionaryBuilder.java:105)
at org.languagetool.tools.SpellDictionaryBuilder.build(SpellDictionaryBuilder.java:72)
at org.languagetool.tools.SpellDictionaryBuilder.main(SpellDictionaryBuilder.java:65)
Done. The binary dictionary has been written to /tmp/ast_ES.dict
Do you know how I could build the binary file from such a huge text file? Could it be caused by a lack of disk space (I’ve got 11 GB free)? Thanks!
Are there really that many forms in Asturian, including all inflected forms? Most languages have around 1 million forms. A few languages have 3-4 million forms. Even including all apostrophized forms and clitics, you might reach 10 million forms in some languages. 98 million entries seems out of range. Is that correct?
java.lang.OutOfMemoryError
That suggests the problem is a lack of RAM (Java heap space), not disk space.
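To make the RAM point concrete, here is a rough back-of-the-envelope sketch in Python. The stack trace shows the input lines being appended to an ArrayList before compilation, so the whole word list has to fit in the heap at once. The numbers below are assumptions, not measurements: I take 1.8 GB as 1.8e9 bytes, assume each line is stored as its own Java byte array with a 16-byte header and 8-byte alignment, and assume 4-byte references in the list. The function name `estimate_heap_bytes` is only for illustration.

```python
import math

def estimate_heap_bytes(entries, file_bytes, array_header=16, ref_size=4, align=8):
    """Very rough lower bound on heap needed to hold every line in memory.

    Assumptions (not measured): one byte array per line, a 16-byte array
    header, objects aligned to 8 bytes, and a 4-byte reference per list slot.
    """
    avg_line = file_bytes / entries           # average line length, newline included
    payload = max(avg_line - 1, 0)            # drop the newline byte
    per_array = math.ceil((array_header + payload) / align) * align
    return entries * (per_array + ref_size)

# Figures from the first post; 1.8 GB is taken as 1.8e9 bytes (assumption).
needed = estimate_heap_bytes(98_025_136, 1_800_000_000)
print(f"about {needed / 1e9:.1f} GB of heap just for the raw entries")
```

This ignores the temporary copies made while the list grows and everything the automaton builder needs afterwards, so the real requirement is higher. If the JVM heap limit is below that figure, the build will fail no matter how much disk space is free.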
Thanks, Jaume, for your reply. Apart from all inflected forms, all verb clitics, and all apostrophized forms (many words take an apostrophe: “el, la, en, pa, per, por, de, que, me, te, se”; “el” is also apostrophized after the preceding word; and there are a lot of contractions), the extra entries probably come from the diminutive and augmentative suffixes added in Hunspell. Five suffix families of this kind are implemented (-ín, -ina, -ino, -inos, -ines / -ucu, -uca, -uco, -ucos, -uques / -iquín, -iquina, -iquino, -iquinos, -iquines / -ón, -ona, -ono, -onos, -ones / -acu, -aca, -aco, -acos, -aques), and they are applied to each of the inflected forms of nouns and adjectives.
Should I remove this functionality to save space? Do other sister languages have this feature disabled?
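To show how quickly those suffixes multiply the list, here is a small Python sketch. It counts the suffix endings listed above and gives an upper bound on the growth, assuming (which is surely too pessimistic) that every ending attaches to every noun or adjective form. The base count of 1,000 forms is a made-up number used only for illustration.

```python
# Suffix families as listed in the post above (5 families, 5 endings each).
SUFFIX_FAMILIES = {
    "-ín":    ["-ín", "-ina", "-ino", "-inos", "-ines"],
    "-ucu":   ["-ucu", "-uca", "-uco", "-ucos", "-uques"],
    "-iquín": ["-iquín", "-iquina", "-iquino", "-iquinos", "-iquines"],
    "-ón":    ["-ón", "-ona", "-ono", "-onos", "-ones"],
    "-acu":   ["-acu", "-aca", "-aco", "-acos", "-aques"],
}

derived_per_form = sum(len(endings) for endings in SUFFIX_FAMILIES.values())

def expanded_count(base_forms):
    """Upper bound: each base form plus one derived form per suffix ending."""
    return base_forms * (1 + derived_per_form)

# Hypothetical example: 1,000 noun/adjective forms before suffixation.
print(derived_per_form, expanded_count(1_000))
```

In practice Hunspell affix rules only fire on matching word endings, so the real factor is lower, but the order of magnitude explains why the noun and adjective entries dominate the file.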
I’ve removed the diminutive and augmentative forms, as well as several rules concerning apostrophes and uppercase. With these changes the raw file has 51 million entries (900 MB). I also increased the RAM from 8 to 12 GB (I’m working in a VirtualBox machine), and that fixed the problem. The final binary file is 600 KB. I’ve tested it with a LibreOffice .oxt extension and it works fine. I’m going to open a pull request against the main project.
Thanks for the help!
In other Romance languages, we usually don’t generate all possible diminutives, augmentatives, or prefixed forms. This has pros and cons: we avoid nonsense and rare words, but we must actively search the corpus for common derivatives.
As for apostrophized forms, we rely on better tokenization to avoid adding all of them. Sometimes there is a trade-off between the number of forms in the spelling dictionary and good spelling suggestions.
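A minimal Python sketch of the tokenization idea: split the elided particle from the word and look up only the remaining part, so the dictionary does not need one entry per combination. This is not the actual LanguageTool tokenizer. The prefix list and the two-word dictionary are toy assumptions for illustration, and `split_elision` and `is_known` are hypothetical helper names.

```python
# Illustrative assumptions, not a complete list for any language.
ELIDED_PREFIXES = ("l'", "d'", "n'", "m'", "t'", "s'")
BASE_WORDS = {"home", "agua"}  # toy dictionary with two base forms

def split_elision(token):
    """Split a leading elided particle from the rest of the token."""
    token = token.replace("\u2019", "'")  # normalize the typographic apostrophe
    lowered = token.lower()
    for prefix in ELIDED_PREFIXES:
        if lowered.startswith(prefix) and len(token) > len(prefix):
            return [token[:len(prefix)], token[len(prefix):]]
    return [token]

def is_known(token):
    """Check only the part after the elided particle against the dictionary."""
    return split_elision(token)[-1].lower() in BASE_WORDS

print(split_elision("l'home"), is_known("d'agua"), is_known("l'xyz"))
```

With six particles and two base words, the dictionary holds 2 entries instead of up to 14. The cost is the trade-off mentioned above: suggestions for a misspelled apostrophized form have to be rebuilt from the parts.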