
Word is POS-tagged but still shows as a spelling error

I keep finding cases like this and can’t figure out why:

Example:

“Quando, numa manhã de início de novembro de 1961, Heloísa Silva acordou,”

The word (a name) “Heloísa” is correctly POS-tagged (NPFS000), but it still shows up as a spelling error.

Ahhhhh… it is missing from spelling.txt.

I am about to add it.

Fixed:

Ok, but I was really trying to understand how it got tagged in the first place. If you change the language to pt_BR, the spelling error doesn’t show, even without adding the word to spelling.txt. What am I missing?

Portuguese (Portugal) and Portuguese (Brazilian) use two different spelling dictionaries.

This means that they don’t have the same wordlist.

When a word is missing, it needs to be added to the file: spelling.txt

I am also the one building the morphological dictionary, but only for missing entries, since we already have a morphological dictionary made by someone else.

Well, nice, then maybe you can help me understand?

Where did the word “Heloísa” get POS-tagged?

Like I said before, the tag dictionary was created by a third party, someone who creates open-source tag dictionaries. I can’t remember his name.

Ok, can you point me to where that tag dictionary is? Is it Hunspell? Is it a regular file? Is it a Java file? A combination of these? I want to take a look at how it’s done, and it would help to know where to start looking. Can you help?

The tagger dictionaries are here in binary format (.dict and .info extensions):

Instructions for exporting and building these dictionaries here:

Many thanks. I’ve already found it. It’s based on a somewhat old work (the last file version is from 2009) and predates the new orthographic agreement, AO90. I’ve found a few errors I would like to correct, and the second link seems to be what I need.

Some more information that can be helpful.

The files added.txt and removed.txt contain changes to the tagger dictionary. Eventually, these changes can be merged into the binary dictionary.

There is a setting that could be enabled for Portuguese: ignore any tagged word in the speller, so you don’t need to add new words to spelling.txt. This is enabled for some languages, but not for all. The reason for not enabling it is that the speller dictionaries take care of regional spelling variants. I don’t know if that is the case for Portuguese.
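To make the idea concrete, here is a minimal sketch of what "ignore any tagged word in the speller" amounts to. It is not LanguageTool code: `TaggedSpellerSketch`, `Token` and `misspelled` are hypothetical names, the word set stands in for a real speller dictionary, and "Silva" is left untagged only to show the difference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not LanguageTool code: a speller that can skip tagged tokens.
class TaggedSpellerSketch {

    // A token plus the POS tag the tagger assigned to it (null when untagged).
    static final class Token {
        final String text;
        final String posTag;
        Token(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    // Returns the tokens the speller would flag as errors.
    static List<String> misspelled(List<Token> tokens, Set<String> spellerWords,
                                   boolean ignoreTaggedWords) {
        List<String> flagged = new ArrayList<>();
        for (Token t : tokens) {
            if (ignoreTaggedWords && t.posTag != null) {
                continue; // the tagger knows this word, so the speller trusts it
            }
            if (!spellerWords.contains(t.text)) {
                flagged.add(t.text);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("Helo\u00edsa", "NPFS000"), // tagged, but not in the speller list
            new Token("Silva", null));            // assumed untagged for illustration
        Set<String> spellerWords = Set.of("acordou");
        System.out.println(misspelled(tokens, spellerWords, false));
        System.out.println(misspelled(tokens, spellerWords, true));
    }
}
```

With the flag off, both names are flagged; with it on, only the untagged one is. The downside is visible too: any word the tagger knows passes the speller, whichever regional variant it belongs to.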

Hi,
It seems interesting, but in the PT case that would probably mean that all pre-AO90 words would be validated unless there was a rule to match them. Do you know if there’s a way to have a different POS dictionary for each language variant?
Thanks

The current procedure is the correct one.

No need to change it.

How can I enable that setting to test it?

Thanks

I see now that it is not immediately available. It has to be implemented. We would need to add something to HunspellRule.java similar to setIgnoreTaggedWords() in MorfologikSpellerRule.java.

The current state of the Portuguese dictionaries is far from perfect.

We have different Hunspell dictionaries for spelling (with diverging formats for every variant). Hunspell dictionaries have downsides, and they are not our preferred format.

If someone can take over the job of comparing, fixing and completing the dictionaries, we could do either 1 or 2:

  1. Create a single tagger dictionary (for tagging and spelling), plus some rules for the differences among language varieties.
  2. Create a tagger dictionary (for tagging and spelling) for every language variety.

The Hunspell dictionaries should be expanded, and we should make sure that every word is tagged properly in the new dictionaries.

I took a look at HunspellRule.java and I couldn’t find a way to easily reproduce what’s done in MorfologikSpellerRule.java because of the way compound words are checked (using “-”).

The fastest way I see is to remove all POS-tagged words from the sentence first, and then feed it to the Hunspell rule. That’s kind of an ugly hack. Also, HunspellRule has something to handle English words, but I didn’t get to analyze that, so removing POS-tagged words may break that part.
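A rough sketch of that hack, to show what I mean. It is not based on the actual HunspellRule code: `MaskTaggedWordsSketch`, `Span` and `maskTaggedWords` are hypothetical names. Instead of deleting the tagged words, this variant overwrites them with spaces, on the assumption that the rule reports errors by character offset and those offsets should stay valid.

```java
import java.util.List;

// Hypothetical sketch, not LanguageTool code: hide tagged words from a speller
// by overwriting them with spaces, so offsets of the remaining words do not shift.
class MaskTaggedWordsSketch {

    // Character range [start, end) of a word the tagger recognised.
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns a copy of the sentence with every tagged span blanked out.
    static String maskTaggedWords(String sentence, List<Span> taggedSpans) {
        char[] chars = sentence.toCharArray();
        for (Span s : taggedSpans) {
            for (int i = s.start; i < s.end; i++) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String sentence = "Helo\u00edsa Silva acordou";
        // Assume only the first word (characters 0 to 6) was tagged.
        String masked = maskTaggedWords(sentence, List.of(new Span(0, 7)));
        // The masked string would then be handed to the speller rule.
        System.out.println("[" + masked + "]");
    }
}
```

The speller never sees the masked words, so it cannot flag them. Anything in the rule that looks at neighbouring words would see blanks instead, which is where I expect the breakage.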

Any thoughts?

Thanks