Postag dictionary single quotes

Ruud_Baars · March 20, 2018, 10:30am

How could I get the postag dictionary to treat the various single apostrophes equally? Now, foto's is correctly tagged, but foto´s and foto’s are not.

dnaber · March 20, 2018, 10:52am

For German, we have rules that complain about the wrong kinds of apostrophes. (I know it doesn’t directly answer your question, but maybe it’s a pragmatic approach anyway.)

Ruud_Baars · March 20, 2018, 10:57am

Hm. Depends on what is considered correct. In plain text, the simple ' is commonly used. In the word processor, the character is often automagically altered into something else (‘more beautiful’). And some newspapers have their own style of single quoting.
I do warn about this. But since the rule can be switched off, the characters remain. And I still want postagging to work correctly then, so grammar checking can still work.

I could work around it by generating all common forms from the existing words and tags, and adding those to the postagger. But that is a bit of a trick. In hunspell, characters can be changed when coming in, and back when going out (the ICONV and OCONV options in the affix file).

For now, I will just add the entries.
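
The workaround of generating all common forms could be scripted roughly as follows. This is only a sketch: the list of apostrophe variants, the tab-separated form/lemma/tag layout, and the tag names (NOUN:PL, NOUN:SG) are assumptions for illustration, not the actual Dutch dictionary format.

```python
# Apostrophe-like characters to generate entries for (an assumed list;
# adjust it to whatever actually shows up in real input).
APOSTROPHE_VARIANTS = ["'", "\u2019", "\u00b4", "\u2018", "`"]


def expand_apostrophes(entries):
    """Return the entries plus one copy per apostrophe variant.

    Each entry is a (word form, lemma, tag) tuple. Only the word form is
    varied; lemma and tag are copied unchanged from the source entry.
    """
    expanded = []
    seen = set()
    for form, lemma, tag in entries:
        for variant in APOSTROPHE_VARIANTS:
            candidate = (form.replace("'", variant), lemma, tag)
            # Words without an apostrophe yield the same candidate every
            # time, so duplicates are skipped.
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Hypothetical source entries, written with the typewriter apostrophe.
entries = [("foto's", "foto", "NOUN:PL"), ("huis", "huis", "NOUN:SG")]
for form, lemma, tag in expand_apostrophes(entries):
    print(f"{form}\t{lemma}\t{tag}")
```

The drawback is the one already mentioned: the dictionary grows by one entry per variant for every word containing an apostrophe, and any variant missing from the list stays untagged.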

jaumeortola · March 20, 2018, 10:18pm

This is what I do in Catalan.

Everything (grammar rules, dictionary, etc.) uses typewriter apostrophes (').

If the input text has a typographical apostrophe (’), it is converted to a typewriter apostrophe ('). This is for words containing an apostrophe, not for quotation marks. See CatalanTagger.java (master branch) in the languagetool-org/languagetool repository on GitHub.

A ChunkTag is added to a token if it contains a typewriter apostrophe (see CatalanTagger.java in the languagetool-org/languagetool repository on GitHub). This tag can be used later in a typographical rule (see the rule "Exigeix l'apòstrof tipogràfic (’)", that is, "Require the typographical apostrophe (’)").
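
The idea behind this approach can be sketched independently of the Java code. The snippet below is not the CatalanTagger implementation; the function names, the dictionary layout, and the list of look-alike characters are assumptions made for illustration.

```python
import re

# Characters treated as apostrophe look-alikes (an assumed list).
LOOKALIKES = "\u2019\u00b4\u2018"

# Match a look-alike only when it sits between two word characters, so
# opening and closing quotation marks around a word are left alone.
WORD_INTERNAL = re.compile(rf"(?<=\w)[{LOOKALIKES}](?=\w)")


def normalize_token(token):
    """Replace word-internal look-alikes with the typewriter apostrophe."""
    return WORD_INTERNAL.sub("'", token)


def tag_token(token, dictionary):
    """Look up the normalized token, but remember what the input contained.

    The flag plays the role of the extra tag described above: a typography
    rule can use it later to suggest the preferred apostrophe.
    """
    return {
        "token": token,
        "readings": dictionary.get(normalize_token(token), []),
        "typewriter_apostrophe": "'" in token,
    }


# Hypothetical dictionary that only knows the typewriter form.
dictionary = {"foto's": ["NOUN:PL"]}
for word in ["foto's", "foto\u2019s", "foto\u00b4s"]:
    print(tag_token(word, dictionary))
```

Because the dictionary only needs the typewriter form, no extra entries are required, and the same normalize_token step could in principle be placed in front of a spell checker lookup as well.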

Ruud_Baars · March 21, 2018, 5:59am

That is an option; unfortunately it is Java. And the same replacement would be needed in the spellchecker as well.
I had a look at the code. Though the replacement itself is quite simple, the rest of it is much too complicated for me. Maybe someone would like to do this (and commit it after the upcoming release)?