Postag dictionary single quotes


(Ruud Baars) #1

How could I get the postag dictionary to treat the various single apostrophes equally? Now, foto's is correctly tagged, but foto´s and foto’s are not.


(Daniel Naber) #2

For German, we have rules that complain about the wrong kinds of apostrophes. (I know it doesn’t directly answer your question, but maybe it’s a pragmatic approach anyway.)


(Ruud Baars) #3

Hm. Depends on what is considered correct. In plain text, the simple ' is commonly used. In the word processor, the character is often automagically altered into something else (‘more beautiful’). And some newspapers have their own style of single quoting.
I do warn about this. But since the rule can be switched off, the characters remain. And I still want postagging to work correctly then, so grammar checking can still work.

I could work around it by generating all common forms from the existing words and tags, and adding those to the postagger. But that is a bit of a trick. In Hunspell, characters can be changed when coming in, and changed back when going out.

For now, I will just add the entries.
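
The workaround of generating the variant forms could be sketched roughly as follows. This is only an illustration: the class name, the tab-separated "word, lemma, tag" line layout, and the list of apostrophe-like characters are assumptions, not the actual dictionary tooling.

```java
import java.util.ArrayList;
import java.util.List;

final class ApostropheVariants {

  // Characters often typed instead of the typewriter apostrophe (assumed list):
  // U+2019 right single quote, U+00B4 acute accent, U+2018 left single quote, grave accent.
  private static final char[] ALTERNATIVES = {'\u2019', '\u00B4', '\u2018', '`'};

  /**
   * Expands one dictionary line "word<TAB>lemma<TAB>tag" into the original line
   * plus one copy per alternative apostrophe. Only the word form is changed;
   * lemma and tag are copied as they are.
   */
  static List<String> expand(String line) {
    List<String> result = new ArrayList<>();
    result.add(line);
    String[] parts = line.split("\t", 2);
    String word = parts[0];
    if (word.indexOf('\'') < 0) {
      return result; // no apostrophe in the word form, nothing to vary
    }
    String rest = parts.length > 1 ? "\t" + parts[1] : "";
    for (char alt : ALTERNATIVES) {
      result.add(word.replace('\'', alt) + rest);
    }
    return result;
  }
}
```

The drawback is the one mentioned above: the dictionary grows by one entry per variant character for every word that contains an apostrophe.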


(jaumeortola) #4

This is what I do in Catalan.

Everything (grammar rules, dictionary, etc.) uses typewriter apostrophes (').

If the input text has a typographical apostrophe (’), it is converted to a typewriter apostrophe ('). This is for words containing an apostrophe, not for quotation marks. See https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/java/org/languagetool/tagging/ca/CatalanTagger.java#L91
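
The conversion step amounts to something like the following. This is a simplified sketch rather than the code in CatalanTagger; the class and method names are made up for illustration, and the length check is an assumption that keeps a lone apostrophe token (which may be a quotation mark) untouched.

```java
final class ApostropheNormalizer {

  private static final char TYPOGRAPHICAL = '\u2019'; // ’
  private static final char TYPEWRITER = '\'';        // '

  /** Returns the form that would be looked up in the dictionary. */
  static String normalizeForLookup(String token) {
    // A single-character token is a punctuation mark, not a word with an apostrophe.
    if (token.length() <= 1) {
      return token;
    }
    return token.replace(TYPOGRAPHICAL, TYPEWRITER);
  }
}
```

Because only the lookup form is changed, the original text and its character positions stay as the user typed them.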

A ChunkTag is added to a token if it contains a typewriter apostrophe (see https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/java/org/languagetool/tagging/ca/CatalanTagger.java#L120). This tag can be used later in a typographical rule (see: https://community.languagetool.org/rule/show/APOSTROF_TIPOGRAFIC?lang=ca&subId=1)
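
Flagging the tokens could look roughly like this. The sketch is self-contained and does not use the LanguageTool classes; `TokenInfo` and the tag name `containsTypewriterApostrophe` are placeholders chosen for illustration and stand in for whatever the tagger actually attaches to the token.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenInfo {
  final String lookupForm;      // form used for the dictionary lookup
  final List<String> chunkTags; // extra markers a later rule can test for

  TokenInfo(String lookupForm, List<String> chunkTags) {
    this.lookupForm = lookupForm;
    this.chunkTags = chunkTags;
  }

  static TokenInfo analyze(String token) {
    List<String> tags = new ArrayList<>();
    String lookup = token;
    if (token.length() > 1) {
      // Remember that the user typed ' so a typographical rule can suggest ’ later.
      if (token.indexOf('\'') >= 0) {
        tags.add("containsTypewriterApostrophe");
      }
      // The dictionary only knows the typewriter form.
      lookup = token.replace('\u2019', '\'');
    }
    return new TokenInfo(lookup, tags);
  }
}
```

With this split, tagging works the same for both apostrophes, while the style rule can still be switched on or off independently.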


(Ruud Baars) #5

That is an option; unfortunately it is Java. And the same replace would be needed in the spellchecker as well.
I had a look at the code. Though the replace itself is quite simple, the rest of it is much too complicated for me. Maybe someone would like to do this (and commit it after the upcoming release)?