Word POS tagged, shows as an error.

Miguel.Andrade · March 24, 2021, 11:22am

I’m finding some of these, can’t figure out why:

Example:

“Quando, numa manhã de início de novembro de 1961, Heloísa Silva acordou,”

The word (name) “Heloísa” gets correctly POS tagged (NPFS000), but still shows as a spelling error.

marcoagpinto · March 24, 2021, 11:32am

Ahhhhh… it is missing in spelling.txt

I am about to add it.

marcoagpinto · March 24, 2021, 11:36am

Fixed:

Miguel.Andrade · March 24, 2021, 1:24pm

Ok, but I was really trying to understand how it got tagged in the first place. If you currently change language to pt_BR the spelling error doesn’t show, even without adding it to spelling.txt. What am I missing?

marcoagpinto · March 24, 2021, 1:43pm

Portuguese (Portugal) and Portuguese (Brazil) use two different spelling dictionaries.

This means that they don’t have the same wordlist.

When a word is missing, it needs to be added to the file: spelling.txt

I am also the one maintaining the morphological dictionary, but only for missing morphological entries, since we already have a morphological dictionary made by someone else.
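As a rough illustration of why a word can be tagged and still flagged, here is a minimal sketch in Python. It assumes the tagger dictionary and the per-variant speller wordlists are looked up independently. The names `TAGGER`, `SPELLER` and `check` and the toy data are hypothetical, not LanguageTool code.

```python
# Toy data (hypothetical): one tagger dictionary shared by the variants,
# and one speller wordlist per language variant.
TAGGER = {"Heloísa": "NPFS000"}
SPELLER = {
    "pt-PT": {"Quando", "acordou"},
    "pt-BR": {"Quando", "acordou", "Heloísa"},
}


def check(word, variant):
    """Return (pos_tag, is_spelling_error) from two independent lookups."""
    tag = TAGGER.get(word)                      # tagging ignores the variant
    misspelled = word not in SPELLER[variant]   # spelling depends on it
    return tag, misspelled


print(check("Heloísa", "pt-PT"))  # ('NPFS000', True)
print(check("Heloísa", "pt-BR"))  # ('NPFS000', False)
```

In this sketch, adding the word to the pt-PT wordlist (the role spelling.txt plays here) clears the error without touching the tagger.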

Miguel.Andrade · March 24, 2021, 5:17pm

Well, nice, then maybe you can help me understand?

Where did the word “Heloísa” get POS tagged?

marcoagpinto · March 24, 2021, 9:03pm

Like I said before, the tag dictionary was created by a third party, a guy who creates open-source tag dictionaries whose name I can’t remember.

Miguel.Andrade · March 24, 2021, 9:33pm

Ok, can you point me to where that tag dictionary is? Is it hunspell? Is it a regular file? Is it a java file? A combination of these? I want to take a look at how it’s done, and it would help me get started if I knew where to look. Can you help?

jaumeortola · March 24, 2021, 10:37pm

The tagger dictionaries are here in binary format (.dict and .info extensions):

Instructions for exporting and building these dictionaries here:

Miguel.Andrade · March 24, 2021, 10:51pm

Many thanks. I’ve already found it. It’s based on somewhat old work (last file version from 2009) that predates the new spelling agreement, AO90. I’ve found a few errors I would like to correct; the second link seems to be what I need.

jaumeortola · March 25, 2021, 8:33am

Some more information that can be helpful.

The files added.txt and removed.txt contain changes to the tagger dictionary. Eventually, these changes can be merged into the binary dictionary.
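A minimal sketch of what such a merge could look like, in Python. It assumes each line holds a tab-separated `form`, `lemma` and `tag`, that blank lines and lines starting with `#` are skipped, and that removals are applied after additions. The function names and the toy entries are hypothetical, and the real build process may differ.

```python
def parse_entries(text):
    """Parse 'form<TAB>lemma<TAB>tag' lines, skipping blanks and # comments."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        form, lemma, tag = line.split("\t")
        entries.add((form, lemma, tag))
    return entries


def merge(base, added_text, removed_text):
    """Return the base entries plus the additions, minus the removals."""
    return (base | parse_entries(added_text)) - parse_entries(removed_text)


# Toy example: the base entries stand in for the exported binary dictionary.
base = {("oldword", "oldword", "TAG1"), ("keepword", "keepword", "TAG2")}
added = "# new entries\nHeloísa\tHeloísa\tNPFS000\n"
removed = "oldword\toldword\tTAG1\n"
merged = merge(base, added, removed)
print(sorted(merged))
```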

There is a setting that could be enabled for Portuguese: ignore any tagged word in the speller, so you don’t need to add new words to spelling.txt. This is enabled for some languages, but not for all. The reason for not enabling it is that the speller dictionaries take care of regional spelling variants. I don’t know if that is the case for Portuguese.
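The idea behind that setting can be sketched as follows, in Python rather than LanguageTool's Java. The function `spelling_errors`, its `ignore_tagged_words` parameter and the toy data are hypothetical; this is not the actual rule implementation.

```python
def spelling_errors(tokens, wordlist, tagger, ignore_tagged_words=False):
    """Return the tokens the speller would flag.

    With ignore_tagged_words=True, any token known to the tagger is
    accepted without consulting the speller wordlist.
    """
    errors = []
    for token in tokens:
        if ignore_tagged_words and token in tagger:
            continue  # tagged word: skip the spelling check
        if token not in wordlist:
            errors.append(token)
    return errors


tokens = ["Heloísa", "Silva", "acordou", "acordow"]
wordlist = {"Silva", "acordou"}
tagger = {"Heloísa": "NPFS000"}

print(spelling_errors(tokens, wordlist, tagger))
# ['Heloísa', 'acordow']
print(spelling_errors(tokens, wordlist, tagger, ignore_tagged_words=True))
# ['acordow']
```

The trade-off mentioned above shows up here too: a form that is tagged but not valid in a given regional variant would be skipped as well.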

Miguel.Andrade · March 25, 2021, 8:47am

Hi,
It seems interesting, but in the PT case that would probably mean that pre-AO90 words would all be validated if there was no rule to match them. Do you know if there’s a way to have a different POS dictionary for each language variant?
Thanks

marcoagpinto · March 25, 2021, 11:15am

The current procedure is the correct one.

No need to change it.

Miguel.Andrade · March 26, 2021, 5:25pm

How can I enable that setting to test it?

Thanks

jaumeortola · March 26, 2021, 7:03pm

I see now that it is not immediately available. It has to be implemented. We would need to do something in HunspellRule.java similar to setIgnoreTaggedWords() in MorfologikSpellerRule.java.

jaumeortola · March 26, 2021, 7:16pm

The current state of the Portuguese dictionaries is far from perfect.

We have different Hunspell dictionaries for spelling (with diverging formats for every variant). Hunspell dictionaries have downsides, and it is not our preferred format.

If someone can take over the job of comparing, fixing and completing the dictionaries, we could do either 1 or 2:

1. Create a single tagger dictionary (for tagging and spelling), and some rules for the differences among language varieties.
2. Create a tagger dictionary (for tagging and spelling) for every language variety.

The Hunspell dictionaries should be expanded, and we should make sure that every word is tagged properly in the new dictionaries.

Miguel.Andrade · March 27, 2021, 12:38pm

I took a look at HunspellRule.java and I couldn’t find a way to easily reproduce what’s done in MorfologikSpellerRule.java because of the way compound words are checked (using “-”).

The fastest way I see is to remove all POS-tagged words from the sentence first, and then feed it to the Hunspell rule. That’s kind of an ugly hack. Also, HunspellRule has something to manage English words, but I didn’t get to analyze that, so removing POS-tagged words may break that part.

Any thoughts?

Thanks