
Developing a tagger dictionary for Icelandic


(starkadur) #1

I am trying to follow the instructions on http://wiki.languagetool.org/developing-a-tagger-dictionary but I run into problems already at the first step.

I have created a test file with two lines, each containing the inflected form, the base form and the POS tag, separated by tabs. I have tried saving it as ANSI and as UTF-8, and I have also converted it using dos2unix. But I always get the same error when I try to export the data by running the command
java -cp languagetool.jar org.languagetool.dev.DictionaryExporter org/languagetool/resource/is/icelandic.dict >dictionary.dump

The error I get is:

Unhandled program error occurred.
java.io.IOException: Invalid file header magic bytes.
at morfologik.fsa.FSAHeader.read(FSAHeader.java:42)
at morfologik.fsa.FSA.read(FSA.java:262)
at morfologik.tools.FSADumpTool.dump(FSADumpTool.java:112)
at morfologik.tools.FSADumpTool.go(FSADumpTool.java:75)
at morfologik.tools.Tool.go(Tool.java:45)
at morfologik.tools.FSADumpTool.main(FSADumpTool.java:286)
at org.languagetool.dev.DictionaryExporter.main(DictionaryExporter.java:41)

Do you know what could be the problem?

Many thanks,
Starkaður


(Daniel Naber) #2

Could you attach your files here, preferably zipped so the software doesn't change them? (you can use "More -> Upload file" for that).


(starkadur) #3

Thanks for your quick reply. Here is one version of the file (UTF-8): icelandic.7z (156 Bytes)


(Daniel Naber) #4

You shouldn't call the input *.dict, as that's already the name of the binary file that needs to be generated (using the POSDictionaryBuilder command on the wiki page). However, the next issue will be that Icelandic never had a tagger, so even if you have a *.dict file, LanguageTool will ignore it. Are you familiar with Java? The file Icelandic.java will need to be adapted to use IcelandicTagger.java (which needs to be created, but can almost be a copy of FrenchTagger.java).
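
Before running the builder, it may help to sanity-check the plain-text input described in the first post (inflected form, base form and POS tag, separated by tabs). The following is only a minimal sketch: TaggerInputCheck and badLines are hypothetical names, not part of LanguageTool, and the check assumes exactly three non-empty fields per line.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool: reports lines of the
// plain-text tagger input that are not "form<TAB>lemma<TAB>tag".
class TaggerInputCheck {

    /** Returns the 1-based numbers of lines without exactly three non-empty fields. */
    static List<Integer> badLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // Limit -1 keeps trailing empty fields, so "a\tb\t" is reported too.
            String[] fields = lines.get(i).split("\t", -1);
            boolean ok = fields.length == 3;
            for (String field : fields) {
                if (field.trim().isEmpty()) {
                    ok = false;
                }
            }
            if (!ok) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the plain-text input, assumed to be UTF-8.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("Malformed lines: " + badLines(lines));
    }
}
```

A line whose fields are separated by spaces instead of tabs would show up in the reported list.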


(starkadur) #5

I am a bit confused about this command line that is given as an example on the wiki page (http://wiki.languagetool.org/developing-a-tagger-dictionary):

java -cp languagetool.jar org.languagetool.dev.DictionaryExporter org/languagetool/resource/en/english.dict >dictionary.dump

Here it seems that a file called english.dict is used to create dictionary.dump. The next step seems to use dictionary.dump to create a temporary file, using POSDictionaryBuilder, that is then renamed to *.dict.

It doesn't seem to matter what I call the file, I always get the same error ("Invalid file header..."). If, on the other hand, I run the command using the file english.dict in resource/en, then it works.

I don't have a lot of experience with Java. But if I want to look at FrenchTagger.java, where do I find it? (I have only found FrenchTagger.class.)


(Daniel Naber) #6

The first command on the wiki is for exporting an existing tagger dictionary. As LT doesn't have such a dictionary for Icelandic, you won't be able to call that command unless you've built one yourself (using the second command on the Wiki, POSDictionaryBuilder).
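
To tell a compiled dictionary apart from a plain-text file before calling the exporter, one can look at the first bytes of the file. This is a rough sketch only: DictHeaderCheck is a hypothetical helper, and it assumes that Morfologik's binary FSA files start with the four bytes '\', 'f', 's', 'a', which is what the "Invalid file header magic bytes" error appears to be checking.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of LanguageTool or Morfologik.
class DictHeaderCheck {

    // Assumption: a Morfologik FSA file begins with the bytes '\', 'f', 's', 'a'.
    static final byte[] FSA_MAGIC = {'\\', 'f', 's', 'a'};

    /** True if the given leading bytes start with the assumed magic sequence. */
    static boolean startsWithFsaMagic(byte[] head) {
        if (head.length < FSA_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < FSA_MAGIC.length; i++) {
            if (head[i] != FSA_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first four bytes of the file and compares them to the magic sequence. */
    static boolean looksLikeBinaryDict(Path file) throws IOException {
        byte[] head = new byte[FSA_MAGIC.length];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n < 0) {
                    break;
                }
                off += n;
            }
        }
        return off == head.length && startsWithFsaMagic(head);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + " looks like a binary dictionary: " + looksLikeBinaryDict(file));
    }
}
```

A tab-separated text file renamed to *.dict would fail this check, which matches the behaviour described above.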

You can look at FrenchTagger.java at https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/java/org/languagetool/tagging/fr/FrenchTagger.java


(starkadur) #7

I managed to create the *.dict file. I have taken FrenchTagger.java and modified it:

package org.languagetool.tagging.is;

import java.util.Locale;

import org.languagetool.tagging.BaseTagger;

public class IcelandicTagger extends BaseTagger {

  // Resource path of the file with manually added tagger entries.
  @Override
  public String getManualAdditionsFileName() {
    return "/is/added.txt";
  }

  public IcelandicTagger() {
    // The java.util.Locale constant is spelled ENGLISH, not English.
    super("/is/icelandic.dict", Locale.ENGLISH, false);
  }
}

The three errors I get when compiling are all related to BaseTagger, which comes as no surprise since I haven't been able to locate it. I looked in the folder org.languagetool.tagging but did not find it. As I said, I am not a very advanced programmer, so simple things are probably getting in my way. If there is a simple answer to my problem, I would appreciate it.


(Daniel Naber) #8

How exactly did you try to compile the code? Here's some documentation: http://wiki.languagetool.org/development-overview#toc2 - basically it's just calling "mvn clean package".
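
If the code is compiled by hand rather than through Maven, errors about BaseTagger usually mean the class is simply not visible on the classpath. A small sketch to check that, assuming it is run with the same classpath that was passed to the compiler; ClasspathCheck is a hypothetical helper, and the default class name is taken from the import in the post above.

```java
// Hypothetical helper, not part of LanguageTool: reports whether a class
// can be found on the classpath this program is started with.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the import statement of IcelandicTagger.
        String name = args.length > 0 ? args[0] : "org.languagetool.tagging.BaseTagger";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

If this prints false, the jar or module that contains BaseTagger is missing from the classpath, and building the whole project with Maven as described in the documentation is probably the easier route.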