
Corrupted characters in dumped dictionaries

(DArek) #1


I need to export dictionary contents to a text format that includes the inflected form and the base form of each word.
I have tried the morfologik fsa_dump command as well as DictionaryExporter from the LanguageTool 2.4 snapshot. In both cases I got files with corrupted diacritical characters. I have tried dumping the dictionaries for Polish, Slovak, and Romanian.

I have tried opening the dumped dictionaries in Notepad++ and EditPad Lite, but the characters are not displayed correctly in any of the available encodings.
What is the encoding of dictionaries dumped this way? Or what should I do to get correct characters in the dumped dictionaries?


(Daniel Naber) #2

The encoding is given in the *.info file. When I export polish.dict using DictionaryExporter, I get a UTF-8 file with correct encoding. Please let me know if that doesn't work for you.

(DArek) #3

The .info file specifies UTF-8 (fsa.dict.encoding=utf-8), but Polish characters in the output file are not displayed correctly; for example, I get 'ADM-├│w' instead of 'ADM-ów'.

Does the line 'fsa.dict.encoding=utf-8' specify the encoding of the FSA dictionary itself, or the encoding that the exported dictionary will use? I used the standard dictionaries from the LanguageTool package (org\languagetool\resources).
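
The garbled 'ADM-├│w' reported above is consistent with correct UTF-8 bytes being interpreted through a DOS code page. The two bytes that encode 'ó' in UTF-8 (C3 B3) map to '├' and '│' in CP852. The following Python sketch reproduces that effect and reverses it. Which code page was actually involved is an assumption (the thread does not confirm it), and the function names garble and repair are made up for illustration.

```python
# Assumption: the mojibake comes from UTF-8 bytes decoded as CP852, the DOS
# code page commonly used for Polish. The thread does not confirm this.

def garble(text: str, wrong_codec: str = "cp852") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "cp852") -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


if __name__ == "__main__":
    broken = garble("ADM-ów")
    print(broken)          # the two-character sequence that replaces 'ó'
    print(repair(broken))  # round-trips back to the original word
```

If repair() restores readable words from a dumped file, the dump itself was probably valid UTF-8 that got re-encoded somewhere between the Java process and the file, for example by console output redirection.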

(Daniel Naber) #4

fsa.dict.encoding=utf-8 tells Morfologik (which we use internally) how to interpret the data in the binary file, I think. Does it help if you call the export command with "-Dfile.encoding=utf8"? Are you sure the editor you open the file with works with UTF-8?
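
To avoid depending on a platform default when reading a dump, the encoding can be taken from the .info file and passed explicitly. Below is a minimal Python sketch, assuming the .info file is a plain key=value properties file, as the line fsa.dict.encoding=utf-8 suggests. The function name read_dict_encoding and the fallback value are hypothetical and not part of Morfologik or LanguageTool.

```python
def read_dict_encoding(info_text: str, default: str = "utf-8") -> str:
    """Return the value of fsa.dict.encoding from the text of a .info file.

    Assumes simple key=value lines; lines starting with '#' are treated as
    comments. Falls back to `default` (a hypothetical choice) if the key
    is missing.
    """
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "fsa.dict.encoding":
            return value.strip()
    return default


if __name__ == "__main__":
    sample = "# sample .info content\nfsa.dict.encoding=utf-8\n"
    encoding = read_dict_encoding(sample)
    print(encoding)
    # The result can then be passed explicitly when opening a dump, e.g.
    # open("dict_PL.txt", encoding=encoding), instead of relying on defaults.
```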

(DArek) #5

I tried your suggestion about -Dfile.encoding, and the results seem strange to me. I don't know Java well.
I added '-Dfile.encoding=UTF8' to JAVA_TOOL_OPTIONS. When I run 'java -version', I get a message confirming that the option is picked up (Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8), but the exported dictionaries still contain corrupted characters.

When I run the command: java -Dfile.encoding=UTF8 -cp languagetool-standalone.jar org/languagetool/resource/pl/polish.dict >dict_PL.txt I get the message: Could not find or load main class .encoding=UTF8.

I think it may be a problem with my Java configuration.
My editors (I have checked all the ones I have) seem to work fine otherwise. EditPad Lite, Notepad++, and Windows Notepad all fail to display the exported dictionaries properly.
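
When several editors disagree with expectations, inspecting the raw bytes of the dump can settle whether the file is valid UTF-8 at all, independently of any editor. Here is a minimal Python sketch of such a check; the function names are invented for illustration, and the file name in the usage comment is taken from the command above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_ascii_bytes(data: bytes, limit: int = 8) -> str:
    """Return the first `limit` non-ASCII bytes as space-separated hex."""
    hexes = [f"{b:02x}" for b in data if b >= 0x80][:limit]
    return " ".join(hexes)


if __name__ == "__main__":
    sample = "ADM-ów".encode("utf-8")
    print(looks_like_utf8(sample))
    print(non_ascii_bytes(sample))
    # For a real dump: data = open("dict_PL.txt", "rb").read()
```

A file that passes this check but still shows box-drawing characters in a UTF-8 editor may have been encoded twice, which a validity check alone cannot detect.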

(Daniel Naber) #6

This will only work with the current snapshots, not with LT 2.3. Also, you need to run it from the directory that contains languagetool-standalone.jar.