Corrupted characters in dumped dictionaries

Hello,

I need to export the content of dictionaries to a text format that includes the inflected form and the base form of each word.
I have tried to do it with the morfologik fsa_dump command as well as with DictionaryExporter from the LanguageTool 2.4 snapshot. In both cases I got files with corrupted diacritical characters. I have tried to dump the dictionaries for Polish, Slovak, and Romanian.

I have tried to open the dumped dictionaries in Notepad++ and EditPad Lite, but the characters are not displayed correctly in any of the available encodings.
What is the encoding of dictionaries dumped this way? Or what should I do to get correct characters in the dumped dictionaries?

Regards,
Darek

The encoding is given in the *.info file, e.g. polish.info. When I export polish.dict using DictionaryExporter, I get a UTF-8 file with correct encoding. Please let me know if that doesn’t work for you.

I have utf-8 encoding in the polish.info file (fsa.dict.encoding=utf-8), but Polish characters in the output file are not displayed correctly; for example, there is ‘ADM-├│w’ instead of ‘ADM-ów’.
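
The corrupted form quoted here is what appears when the UTF-8 bytes of ‘ó’ are read with a DOS/OEM code page. The Python sketch below reproduces it, assuming cp852 (the Central European OEM code page) as the wrong decoder; the thread does not confirm which code page was actually involved.

```python
# UTF-8 bytes of the correct word, as a UTF-8 export would contain them.
correct = "ADM-ów"
utf8_bytes = correct.encode("utf-8")

# Decoding those bytes with an OEM code page (cp852 is an assumption)
# reproduces the corrupted form from the post.
garbled = utf8_bytes.decode("cp852")
print(garbled)   # ADM-├│w

# Reversing the wrong step recovers the original text.
repaired = garbled.encode("cp852").decode("utf-8")
print(repaired)  # ADM-ów
```

If the file on disk really contains the box-drawing characters themselves, no editor encoding setting will display it correctly, and a round trip like the last two lines would be needed to repair it.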

Does the line ‘fsa.dict.encoding=utf-8’ specify the encoding of the FSA dictionary itself, or the encoding of the exported dictionary? I used the standard dictionaries from the LanguageTool package (org\languagetool\resources).

fsa.dict.encoding=utf-8 tells Morfologik (which we use internally) how to interpret the data in the binary file, I think. Does it help if you call the export command with “-Dfile.encoding=utf8”? Are you sure the editor you open the file with works with UTF-8?
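
To make the role of that property concrete, here is a small Python sketch that reads fsa.dict.encoding from a .info file. The helper name read_dict_encoding is hypothetical and this is not how Morfologik itself parses the file; it only shows that the value describes the bytes stored in the dictionary, not the encoding of whatever a tool prints afterwards.

```python
import tempfile
from pathlib import Path


def read_dict_encoding(info_path, default="utf-8"):
    """Return the fsa.dict.encoding value from a .info file (hypothetical helper)."""
    # Java .properties files are conventionally ISO-8859-1.
    text = Path(info_path).read_text(encoding="iso-8859-1")
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines without key=value
        key, value = line.split("=", 1)
        if key.strip() == "fsa.dict.encoding":
            return value.strip()
    return default


# Demo with a throwaway file that mimics the line quoted in the thread.
with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "polish.info"
    info.write_text("# sample\nfsa.dict.encoding=utf-8\n", encoding="iso-8859-1")
    encoding = read_dict_encoding(info)
    print(encoding)  # utf-8
```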

I tried your suggestion about the -Dfile.encoding option, and the results are strange to me. I don’t know Java well.
I added ‘-Dfile.encoding=UTF8’ to JAVA_TOOL_OPTIONS. When I run ‘java -version’, I get a message showing that the option was picked up (Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8), but there are still corrupted characters in the exported dictionaries.

When I run the command:
java -Dfile.encoding=UTF8 -cp languagetool-standalone.jar org.languagetool.dev.DictionaryExporter org/languagetool/resource/pl/polish.dict >dict_PL.txt
I get the message: Could not find or load main class .encoding=UTF8.

I think it may be a problem with my Java configuration.
My editors (I have checked all of them) probably work fine. None of EditPad Lite, Notepad++, or Windows Notepad displays the exported dictionaries properly.
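
One way to take the editors out of the picture is to inspect the bytes of the exported file directly. The Python sketch below is a rough heuristic and not part of LanguageTool; the function name diagnose and the box-drawing check are assumptions made for illustration.

```python
def diagnose(raw: bytes) -> str:
    """Classify raw file bytes (hypothetical helper, heuristic only)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    # Box-drawing characters (U+2500..U+257F) in a word list usually mean
    # UTF-8 bytes were misread through an OEM code page and then re-encoded.
    if any("\u2500" <= ch <= "\u257f" for ch in text):
        return "valid UTF-8, but looks double-encoded"
    return "valid UTF-8"


print(diagnose("ADM-ów".encode("utf-8")))    # valid UTF-8
print(diagnose("ADM-├│w".encode("utf-8")))   # valid UTF-8, but looks double-encoded
print(diagnose("ADM-ów".encode("cp1250")))   # not UTF-8
```

In practice the bytes would come from something like open("dict_PL.txt", "rb").read(). A "double-encoded" result would suggest the damage happened while the file was written, not in the editor.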

This will only work with the current snapshots, not with LT 2.3. Also, you need to call this from the directory where languagetool-standalone.jar is located.