[de] POS dictionary neglected?

Discostu · December 19, 2017, 1:46pm

I don’t want to step on anyone’s toes, especially not those of @Jan_Schreiber who does a great job with all his additions to the German LanguageTool. But I’ve noticed something that I’d like to discuss.

Almost daily, Jan adds words to the German spelling.txt that have been suggested by users of the LanguageTool website. I think this is a really important task. But I’ve noticed that additions to added.txt are made much less often. Many new words in spelling.txt are automatically recognized by the POS tagger because they are compounds. But in every batch Jan adds, there are some words that don’t get a POS tag (a recent example is the adjective “unflüssig”).

I think that LanguageTool’s grammar checking is as important as its spell checking, maybe even more so. Therefore, adding words to added.txt should have at least the same priority as adding them to spelling.txt. I even think we should add fewer words to the spell checker if that frees up time to also add them to the POS dictionary.

I know this post might be a little presumptuous coming from someone who commits not much more than three or four small things each month, compared to the many hours others spend on improving LT. But now I’ve written it, and I’m looking forward to your comments.

dnaber · December 19, 2017, 2:16pm

We’re getting help from korrekturen.de (their “Flexion | Suche nach Wortformen” word-form search). They are using our POS dictionary, and we can get updates from them. However, it’s a manual process for which I need some time. So if you want to extend added.txt, please make sure the word is also still missing from korrekturen.de, to avoid duplicated work. Actually, reporting the word to korrekturen.de might make more sense, as they have an internal user interface for adding words and don’t need to edit text files.

Discostu · December 19, 2017, 2:35pm

Interesting, thanks for the info. Are you adding their words to added.txt or do you import them differently? Could you link to the last commit where you added words from them? It would be interesting to me.

I think I will keep adding words that are missing from both databases to added.txt instead of sending them to korrekturen.de. It’s quite satisfying to have LT correctly analyze the word the very next day.
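
To find candidates, something like the following rough sketch could list the spelling.txt entries that have no entry in added.txt yet. It rests on assumptions that are not confirmed in this thread: that added.txt holds tab-separated lines with the full form in the first column, that spelling.txt holds one word per line with an optional “/” flag suffix, and that “#” starts a comment in both files. The function names and the tag in the usage example are made up for illustration.

```python
def first_field(line):
    """Return the word at the start of a line, or None for blank/comment lines."""
    # Assumption: '#' starts a comment in both files.
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    # Assumption: added.txt is tab-separated (full form first) and
    # spelling.txt entries may carry a '/FLAGS' suffix.
    return text.split("\t", 1)[0].split("/", 1)[0].strip()


def missing_pos_entries(spelling_lines, added_lines):
    """List words from spelling.txt that have no full form in added.txt."""
    tagged = {first_field(line) for line in added_lines}
    tagged.discard(None)
    missing = []
    for line in spelling_lines:
        word = first_field(line)
        if word and word not in tagged and word not in missing:
            missing.append(word)
    return missing


if __name__ == "__main__":
    # Hypothetical sample data; real use would read the two files instead.
    spelling = ["# user suggestions", "unflüssig", "Testwort/S", ""]
    added = ["Testwort\tTestwort\tSUB:NOM:SIN:NEU"]  # tag is illustrative only
    print(missing_pos_entries(spelling, added))
```

The result still needs manual review: compounds that the POS tagger already recognizes on its own will show up in the list too, and each remaining word should be checked against korrekturen.de before it is added.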

dnaber · December 19, 2017, 4:06pm

Updates happen with the script in the languagetool-org/german-pos-dict repository on GitHub (German part-of-speech dictionary); then a new Maven release of that project is built and referenced from LT as a dependency. The reasons for this workflow are that 1) other developers might be interested in the data without needing LT, and 2) we keep the binary files out of the LT repo.

Discostu · December 19, 2017, 4:20pm

So the last update of the dictionary was one year ago? Or do I misunderstand the process?

dnaber · December 19, 2017, 4:31pm

That’s right. Any help with this is welcome (I’ll handle the part where this gets released as a Maven project, since only I can do that).

Discostu · December 19, 2017, 9:11pm

I’d love to but sadly don’t have enough time at the moment.