[de] Improvement of POS analysis

Discostu · August 25, 2017, 6:34am

It’s great that it’s possible to use the POS analysis of LanguageTool when creating new rules. But the analysis itself is far from perfect. If I enter a sentence like

Zwei Putzfrauen wurde der Regenschirm gestohlen.

the POS tagger does not know whether “Putzfrauen” is the accusative, dative, genitive or nominative plural of the lemma “Putzfrau”. As a result, the gender-neutral alternative “Reinigungskraft” is suggested in the wrong case (accusative instead of dative). Additionally, the tagger thinks that “Regenschirm” is the nominative case of the lemma, but in this case it’s the accusative (maybe it generally has a problem with sentences that are missing a subject?).

So my question is: Is there a possibility to improve the POS analysis?

tiagosantos · August 25, 2017, 9:09am

I can think of 3 ways:

Case by case scenario: Adding words to the POS tagger

wiki.languagetool.org/tips-and-tricks#toc24

Develop the POS dictionary:

http://wiki.languagetool.org/developing-a-tagger-dictionary

Improve the tagger:

Jan_Schreiber · August 25, 2017, 9:52am

As a workaround, you can add several suggestions and leave the job of picking the right one to the user, like so:

<message>Möchten Sie 'Putzfrau' durch eines der geschlechtsneutralen Wörter <suggestion>Reinigungskraft</suggestion>, <suggestion>Reinigungskräfte</suggestion> oder <suggestion>Reinigungskräften</suggestion> ersetzen?</message>

Or make a rulegroup that can at least distinguish between plural and singular.

Discostu · August 25, 2017, 10:09am

@tiagosantos Adding words to the POS tagger wouldn’t help in this case because the word is tagged correctly when the context is ignored. Sadly, my technical knowledge isn’t sufficient to use the other two ways.

This is of course the much easier way. But I have to admit that your suggestion isn’t very satisfying either. If somebody uses LT because he’s unsure about German grammar, it won’t help him very much if he has to find out for himself whether he needs the accusative or the dative form of “Reinigungskraft”.

Because in the end this really hasn’t much to do with suggestions at all. The problem is that LT doesn’t realize that the sentence “Zwei Reinigungskräfte wurde der Regenschirm gestohlen.” is wrong. This should be corrected and then (I hope) the suggestions would automatically be corrected too.

Discostu · August 25, 2017, 10:15am

The rule already distinguishes that. I used <token inflected="yes"> and <match no="1" postag="SUB.*" postag_regexp="yes"> to transfer the POS from Putzfrau to Reinigungskraft, and it works quite well. The example mentioned above is the only one where it doesn’t work, because the dative isn’t obvious enough (“den Putzfrauen” → “den Reinigungskräften” does work, for example).
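For readers who want to see how those two elements fit together, here is a minimal sketch of such a rule. It is not the actual rule from my file: the rule id, name, message text and example sentence are made up for illustration, and only the <token inflected="yes"> and <match ...> parts are the ones described above.

```xml
<!-- Hypothetical sketch; id, name, message and example are illustrative only. -->
<rule id="PUTZFRAU_REINIGUNGSKRAFT" name="Putzfrau (geschlechtsneutrale Alternative)">
  <pattern>
    <!-- inflected="yes" matches every inflected form of the lemma "Putzfrau" -->
    <token inflected="yes">Putzfrau</token>
  </pattern>
  <message>Möchten Sie eine geschlechtsneutrale Alternative verwenden?
    <!-- The synthesizer inflects "Reinigungskraft" using the SUB.* readings of token 1 -->
    <suggestion><match no="1" postag="SUB.*" postag_regexp="yes">Reinigungskraft</match></suggestion>
  </message>
  <example correction="Reinigungskraft">Die <marker>Putzfrau</marker> kommt morgen.</example>
</rule>
```

Because the suggestion is generated from whatever readings the tagger assigned to the matched token, an ambiguous form such as “Putzfrauen” can end up with a suggestion in the wrong case, which is exactly the problem described in this thread.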

tiagosantos · August 25, 2017, 10:48am

This seems to be another issue, then. For situations where the suggestion no longer agrees with the rest of the sentence after being ‘corrected’, you can add a rule that checks for exactly that.
I imagine it will be very tricky to write a rule that identifies the case of a word, but that would be one way to go.

There is a workaround I use. Sometimes a rule provides good advice (it points in the right direction) but isn’t perfect (as in your case, where the word agreement is missing). After the first correction, another rule picks up what is missing and fixes that as well.

This works better in LO and the standalone tool, since in the online tool people have to check the document at least twice.

Jan_Schreiber · August 25, 2017, 3:09pm

I think this could be done by filtering tags based on the context in disambiguation.xml.
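As a rough, untested sketch of what I mean: a disambiguation rule could keep only the dative plural reading of a noun that is directly followed by the singular verb form “wurde”, since a nominative plural subject would require “wurden”. The rule id and name are invented, and the sketch assumes that German noun tags have the form SUB:DAT:PLU:FEM and that action="filter" keeps only the readings whose tag matches the given regular expression.

```xml
<!-- Hypothetical sketch for disambiguation.xml; id and name are illustrative only. -->
<rule id="SUB_DAT_PLU_VOR_WURDE" name="Dativ Plural vor 'wurde'">
  <pattern>
    <marker>
      <!-- A noun that has, among its readings, a dative plural one -->
      <token postag="SUB:DAT:PLU:.*" postag_regexp="yes"/>
    </marker>
    <token>wurde</token>
  </pattern>
  <!-- Assumption: "filter" drops all readings that do not match this tag pattern -->
  <disambig action="filter" postag="SUB:DAT:PLU:.*"/>
</rule>
```

A pattern this simple would be too broad in practice. In “Der Bruder der Putzfrauen wurde entlassen”, for instance, “Putzfrauen” is a genitive, so a real rule would need more context or exceptions.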

tiagosantos · August 25, 2017, 6:32pm

If it works, it is indeed the easiest solution. I have found some odd results on the disambiguator-synthesizer interaction, but many other factors might have played a role. I look forward to seeing it.