Not identified words


(Ruud Baars) #12

By the way; this is about pt_PT, right? Doing it for all those variants is quite a bit of work…


(Marco A.G.Pinto) #13

Yes, pt_PT :slight_smile:


(Ruud Baars) #14

I am almost done. To my surprise, the pt Hunspell dictionary is very tolerant. For one thing, it allows - to break a word, so any word consisting of valid parts separated by - is accepted. That is rather tolerant.
So you will find words like --a and a-- accepted as valid.
I used the words found in my collection of PT texts for the frequencies, and I unmunched the Hunspell dictionary to get the maximum set of tolerated words.
I dumped the postag dictionary and used that to check whether each word had a postag.
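
The selection described above (accepted by the speller, no postag, ranked by corpus frequency) can be sketched roughly as follows. This is only an illustration: the function name and the toy word lists are hypothetical, and the real run worked on the unmunched dictionary and the dumped postag dictionary rather than on in-memory sets.

```python
from collections import Counter


def accepted_without_tag(corpus_tokens, accepted_words, tagged_words):
    """Return (word, frequency) pairs for accepted words that lack a postag."""
    frequency = Counter(corpus_tokens)
    untagged = set(accepted_words) - set(tagged_words)
    # Most frequent first; ties are broken alphabetically for a stable order.
    return sorted(
        ((word, frequency[word]) for word in untagged),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Toy data standing in for the corpus, the unmunched speller list
# and the postag dump (all hypothetical).
tokens = ["casa", "driver", "casa", "t-shirts", "driver", "driver"]
accepted = {"casa", "driver", "t-shirts"}
tagged = {"casa"}
result = accepted_without_tag(tokens, accepted, tagged)
```

Words that never occur in the corpus simply get a frequency of 0 in this sketch.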

If you send me an email at info at taaltik.nl I will return the results zipped; it is about 200 MB of words and frequency numbers.


(Ruud Baars) #15

You can download it here: taaltik.xs4all.nl/POR/pt_accepted_no_tag.zip
Please let me know once you have downloaded it; I will remove it then.


(Marco A.G.Pinto) #16

Hello!

I have just downloaded it.

Thanks!


(Ruud Baars) #17

By the way… don’t be surprised if some of the words in it are incorrect. I used Hunspell -G to list the correct words; it has a bug that makes it also list the parts of hyphenated words that are correct as a whole, even when the part on its own is not.
If you want those gone, you can run Hunspell -L -d pt_PT on the list to find and remove them.
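
A minimal sketch of that clean-up step, assuming the hunspell command is on the PATH, a pt_PT dictionary is installed, and the list holds one word per line. The helper names are hypothetical; only the -L and -d options come from the post above.

```python
import subprocess


def hunspell_rejected(words, dictionary="pt_PT"):
    """Feed one word per line to `hunspell -L -d <dictionary>` and return
    the lines it prints, i.e. the lines containing a misspelled word."""
    completed = subprocess.run(
        ["hunspell", "-L", "-d", dictionary],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,  # uses the locale encoding; assumed to match the dictionary
        check=True,
    )
    return {line.strip() for line in completed.stdout.splitlines() if line.strip()}


def drop_rejected(words, rejected):
    """Keep only the words that were not reported as misspelled."""
    return [word for word in words if word not in rejected]
```

Typical use would be `clean = drop_rejected(words, hunspell_rejected(words))`.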


(Marco A.G.Pinto) #18

@tiagosantos

Hello!

The following symbols are reported as unidentified words in Portuguese, at least in LibreOffice, and I noticed the last one in MS Word as well:
∑ ≠ →

How do I add them?

Thanks!


(Tiago F. Santos) #19

@marcoagpinto

Have you already finished the postag task? I have only seen a dozen new postags added in a handful of commits.


(Marco A.G.Pinto) #20

I have been doing it very slowly, I know :frowning:

So many things going on at the same time.


(Tiago F. Santos) #21

No problem at all. It is just better to focus on one task at a time, for efficiency’s sake.
Regarding the symbols (e.g. ∑ ≠ →), I can see the issue now. You can add those as _PUNCT, but it may be a disheartening task if you do it on a case-by-case basis. It is better to dump a Unicode math symbol table and do a regexp “replace all” of (.) with \1\t\1\t_PUNCT\n.
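
If the regexp route feels too fiddly, the same table can be generated with a few lines of Python. This is a sketch under two assumptions: that “math symbol” means Unicode general category Sm, and that the three tab-separated columns produced by the regexp above (symbol, symbol, _PUNCT) are the format the target file expects. The function name is hypothetical.

```python
import sys
import unicodedata


def math_symbol_entries(tag="_PUNCT"):
    """Build 'symbol<TAB>symbol<TAB>tag' lines for every code point whose
    Unicode general category is Sm (Symbol, math)."""
    entries = []
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        if unicodedata.category(char) == "Sm":
            entries.append(f"{char}\t{char}\t{tag}")
    return entries


entries = math_symbol_entries()
```

Category Sm also contains ASCII characters such as + and <, so the result should be reviewed before it is pasted anywhere.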


(Marco A.G.Pinto) #22

It is too hard/complex for me.

Tiago,

There are some unidentified words that are verb forms:
abarcá-lo-á
abarcá-lo-emos
abarcá-lo-ão
abarcar-lhos

How should I add them to added.txt?

Thanks!


(Tiago F. Santos) #23

Hi Marco,

Regarding POSs, I think they are considered different particles now, since I introduced some tokenization changes. You should do it via disambiguation, if you deem it fit, but the main issue with these word forms is actually their spelling recognition. There are many verb forms that are still not recognized by Hunspell when ‘mesoclises’ are used.
These ‘mesoclises’ forms are an issue I haven’t yet found a good solution to. Even our Hunspell dictionary lists uncompressed inflected verb forms with prefixes (uppercase L and P) to recognize all ‘mesoclises’ forms of some verbs, which is computationally very expensive and also requires a great deal of manual input. I am still looking for a solution that does not involve adding all base forms of a verb, as is done at the moment, or that, if it does, adds them with an automated script for all relevant verbs. If you have feasible ideas, I am very happy to hear them.

For examples see:
https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/hunspell/pt_PT.dic

assentes/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
assente/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
assentimos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
assentis/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
assentem/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
assinto/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
assintas/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
assintamos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]
assintais/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=pc]
assintam/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=pc]
assente/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=i]
assintamos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=i]
assenti/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=i]
assintam/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=i]
consentes/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
consente/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
consentimos/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
consentis/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
consentem/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
consinto/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
consinta/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
consintas/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
consinta/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
consintamos/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]
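
As one possible direction for the automated script mentioned above, here is a rough sketch that generates mesoclitic forms for a regular verb directly from its infinitive. It is not how the current dictionary works. It assumes regular -ar/-er/-ir verbs only, ignores irregular future stems (e.g. dizer, fazer, trazer), accent exceptions and contracted clitics, and all names in it are hypothetical.

```python
FUTURE = ["ei", "ás", "á", "emos", "eis", "ão"]
CONDITIONAL = ["ia", "ias", "ia", "íamos", "íeis", "iam"]
L_CLITICS = ["lo", "la", "los", "las"]  # forms of o/a/os/as after a dropped -r
OTHER_CLITICS = ["me", "te", "lhe", "nos", "vos", "lhes"]
# Final vowel of the infinitive once the -r is dropped before lo/la/los/las.
DROPPED_R_VOWEL = {"ar": "á", "er": "ê", "ir": "i"}


def mesoclitic_forms(infinitive):
    """Return future and conditional mesoclitic forms of a regular verb."""
    ending = infinitive[-2:]
    if ending not in DROPPED_R_VOWEL:
        raise ValueError(f"unsupported infinitive: {infinitive}")
    stem_before_l = infinitive[:-2] + DROPPED_R_VOWEL[ending]
    forms = []
    for suffix in FUTURE + CONDITIONAL:
        for clitic in L_CLITICS:
            forms.append(f"{stem_before_l}-{clitic}-{suffix}")
        for clitic in OTHER_CLITICS:
            forms.append(f"{infinitive}-{clitic}-{suffix}")
    # 1st and 3rd person singular conditional coincide; drop the duplicates.
    return list(dict.fromkeys(forms))


forms = mesoclitic_forms("abarcar")
```

For abarcar this yields, among others, the forms abarcá-lo-á, abarcá-lo-emos and abarcá-lo-ão reported earlier in the thread.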

(Marco A.G.Pinto) #24

@tiagosantos

Hello!

Tonight’s diff gives a hit in “os estudantes que possuem diploma de uma escola profissionalizante também podem entrar.”
https://languagetool.org/regression-tests/20190127/result_pt-PT_20190127.html

I added the POS entries as:

|profissionalizante|profissionalizante|AQ0MS0|
|profissionalizantes|profissionalizante|AQ0MP0|
|profissionalizantes|profissionalizante|AQ0FP0|


To get a valid POS, I look for other words that the Priberam dictionary lists as being of the same kind and that are already in LanguageTool’s morphological database.

Could you confirm if the three entries I added are the most correct ones?

Note that for the plural above, Priberam says “masculine and feminine”, so I added two entries, one masculine and the other feminine, as I was not sure how to do it in one POS.

Thanks!


(Tiago F. Santos) #25

Hi Marco,

Given that profissionalizante is an ungendered adjective, you can either add more POS entries with the feminine form or change M to C. Notice that in your list you forgot to add the feminine entry for the singular form of profissionalizante, as you did for the plural.
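
To illustrate the “change M to C” option, here is a small sketch that collapses masculine/feminine pairs into a single C entry. It assumes adjective tags shaped like AQ0MS0, with the gender letter at index 3; the function name is hypothetical.

```python
from collections import defaultdict

GENDER_INDEX = 3  # assumed position of the gender letter in tags like AQ0MS0


def merge_genders(entries):
    """Collapse M/F pairs of (form, lemma, tag) entries into one C tag."""
    groups = defaultdict(set)
    for form, lemma, tag in entries:
        key = (form, lemma, tag[:GENDER_INDEX], tag[GENDER_INDEX + 1:])
        groups[key].add(tag[GENDER_INDEX])
    merged = []
    for (form, lemma, head, tail), genders in groups.items():
        if {"M", "F"} <= genders:
            merged.append((form, lemma, head + "C" + tail))
        else:
            # Only one gender present: keep the entry unchanged.
            for gender in sorted(genders):
                merged.append((form, lemma, head + gender + tail))
    return merged


entries = [
    ("profissionalizante", "profissionalizante", "AQ0MS0"),
    ("profissionalizantes", "profissionalizante", "AQ0MP0"),
    ("profissionalizantes", "profissionalizante", "AQ0FP0"),
]
merged = merge_genders(entries)
```

With the three entries from the previous post, the plural becomes AQ0CP0, while the singular stays AQ0MS0 because its feminine counterpart is missing.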


(Marco A.G.Pinto) #26

Hello Tiago,

I have just fixed it:

Thank you!


(Marco A.G.Pinto) #27

@tiagosantos

Hello!

A few days ago I added the POS for “driver” and “drivers”.

Could you make LanguageTool flag it as a foreign word and suggest replacing it with “controlador” or “controladores”?

Thanks!


(Marco A.G.Pinto) #28

Hello @tiagosantos

I am adding POS to words.

The word “t-shirts” triggers a false positive in LibreOffice.

Could you check?

Thanks!


(Tiago F. Santos) #29

Hi @marcoagpinto,

This needs the dictionary to be changed. Have you tried replacing the standard Hunspell LibreOffice dictionaries with the ones I am maintaining (https://github.com/TiagoSantos81/PortugueseLibreOfficeExtension)?
They are a bit outdated now, and I will push a new version one of these days, but they should work.


(Marco A.G.Pinto) #30

@tiagosantos
I am using the Minho university speller.

I am about to download and install your version.

The bad thing is that while adding POS to words, several words (from the list generated by the other LT member) appear as typos, and I have only been adding POS to words that appear as not identified and not to the ones that appear as typos :frowning:

My silly idea was to first process all based on the Minho speller and then do a second check with your speller.

This was a silly idea since I should have done it from the beginning with yours.

Now I will have twice the work.

:slight_smile:


(Tiago F. Santos) #31

Marco, both ideas are good. It is a daunting task. If you are already using those dictionaries, I may suggest one way to accelerate the task.
You can replace the U.Minho tags by POS if you decode them. For example:

|...|[CAT=punctj]|
|---|---|
|à|[$ao$CAT=cp,Prep=a,Art=o$G=f,N=s]|
|abacateiro/p|[CAT=nc,G=m,N=s]|
|abacate/p|[CAT=nc,G=m,N=s]|
|abacaxi/p|[CAT=nc,G=m,N=s]|
|ábaco/p|[CAT=nc,G=m,N=s]|

[CAT=punct*] is equivalent to POS _PUNCT
[CAT=nc,G=m,N=s] is equivalent to POS NCMS000

If you replace all those with their POS equivalents, remove the affixes and open the result in Calc (for example), you can create a simple “POS dictionary”. Then you just have to run LT on it to triage the words that don’t have a POS. It still takes a lot of time, but it is faster because you can just delete large chunks of the table.
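
A minimal sketch of that conversion, covering only the two equivalences listed above. It assumes each dictionary line looks like word/flags followed by whitespace and a bracketed tag, and that for base forms the lemma equals the word. The names are hypothetical, and any tag without a known mapping is left for manual review.

```python
import re

# Only the equivalence given above; further mappings have to be added by hand.
TAG_MAP = {"CAT=nc,G=m,N=s": "NCMS000"}

LINE_RE = re.compile(r"^(?P<word>[^/\s\[]+)(?:/\S+)?\s*\[(?P<tag>[^\]]*)\]")


def decode_tag(minho_tag):
    """Translate a U.Minho tag into a POS tag, or None if it is unknown."""
    if minho_tag.startswith("CAT=punct"):
        return "_PUNCT"
    return TAG_MAP.get(minho_tag)


def convert_line(line):
    """Turn 'word/flags [tag]' into 'word<TAB>lemma<TAB>POS', or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    pos = decode_tag(match.group("tag"))
    if pos is None:
        return None
    word = match.group("word")  # affix flags after '/' are dropped
    return f"{word}\t{word}\t{pos}"
```

Lines for which `convert_line` returns None (such as the à entry above) can be collected separately and handled manually.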
If you need help I can provide a baseline with part of the dictionary with this conversion.