[pt] Postreform compounds

@tiagosantos @Daniel

Can I add 600 000 compounds to pt_PT postreform compounds?
For example, words such as:
“dar-nos-ás”

Or is there a special procedure to do it?

Thanks!

As long as the license allows the words to be added and we properly document the license, it should be okay. It might add to the start-up time of Portuguese, as the file will need to be loaded first. How large would the file be?

Daniel, I have just done it.

It is a 10 MB file.

I extracted all compound words from the Minho University speller, deleted a few that could create false positives, and merged the 600 000 words with the ones in Tiago Santos’ list.

Impressive.
But why not be bandwidth-conscious and try a slimmer solution?
For example:

<!-- COMPOSTAS Colocações verbais -->
<!-- Created by Tiago F. Santos, Portuguese rule, 2016-11-04 -->
<rulegroup id='HIFENIZADOR_VERBOS' name='Colocações pronominais'>
    <rule>
        <pattern>
            <token postag='V.*' postag_regexp='yes'></token>
            <token regexp='yes'>m[aeo]|t[aeo]|s[aeo]|lh[aeo]s?|n[ao]s|vos</token>
            <token regexp='yes'>á|ão|ás|ei|eis|emos|ia|iam|íamos|ias|íeis</token>
        </pattern>
        <message>Palavra composta. Pretende dizer</message>
        <suggestion>\1‑\2‑\3</suggestion>
        <url>https://pt.wikipedia.org/wiki/Coloca%C3%A7%C3%A3o_pronominal</url>
        <!-- TODO Write better examples -->
        <example correction='dar‑nos‑á'><marker>dar nos á</marker>.</example>
    </rule>
    <rule>
        <pattern>
            <token postag='V.*' postag_regexp='yes'></token>
            <token regexp='yes'>m[aeo]|t[aeo]|s[aeo]|lh[aeo]s?|n[ao]s|vos</token>
        </pattern>
        <message>Palavra composta. Pretende dizer</message>
        <suggestion>\1‑\2</suggestion>
        <url>https://pt.wikipedia.org/wiki/Coloca%C3%A7%C3%A3o_pronominal</url>
        <!-- TODO Write better examples -->
        <example correction='dar‑nos'><marker>dar nos</marker>.</example>
    </rule>
</rulegroup>

Think about the trees… and the polar bears.

@tiagosantos

Well, text files have a very high compression ratio :slight_smile:

I have just compressed it into .zip and it became ~ 1 MB :slight_smile:

So, 1 MB isn’t much :slight_smile:
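For what it’s worth, the claim that plain-text word lists compress very well is easy to check. A minimal sketch (the sample data is synthetic, standing in for post-reform-compounds.txt):

```python
import gzip

# Synthetic, highly repetitive word list standing in for the real file
words = "\n".join(f"dar-{pron}-á" for pron in ["me", "te", "nos", "vos", "lhe"]) * 20000
raw = words.encode("utf-8")
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
print(f"ratio: {len(raw) / len(packed):.1f}x")
```

Real word lists are less repetitive than this sample, so the ratio will be lower, but text lists still typically shrink several-fold.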

No point appealing to reason, right?

@tiagosantos

Don’t worry… I have just fixed it by only keeping words with one hyphen.

Now it is only 2 MB.
And I kept yours:
á-bê-cê*
á-bê-cês*
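A minimal sketch of that filter in Python (the exception handling and the trailing ‘*’ marker are assumptions based on the entries above; real code would stream post-reform-compounds.txt line by line):

```python
# Keep only compounds with exactly one hyphen, plus explicit exceptions
EXCEPTIONS = {"á-bê-cê", "á-bê-cês"}  # kept even though they have two hyphens

def keep(word: str) -> bool:
    base = word.rstrip("*")  # some entries carry a trailing '*' marker
    return base in EXCEPTIONS or base.count("-") == 1

words = ["dar-nos", "dar-nos-á", "á-bê-cê*", "guarda-chuva"]
print([w for w in words if keep(w)])  # ['dar-nos', 'á-bê-cê*', 'guarda-chuva']
```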

@tiagosantos

I have seen the nightly diff results regarding the postreform compounds.

On Monday I will remove the false positives as I have the weekend job and won’t have the chance to do much.

@marcoagpinto @matheuspoletto @dnaber @macios

False positives in hyphenation have been a complaint since:

Unless there is a way to selectively disable a list of terms in post-reform-compounds.txt (for pt-BR), you need to revert all changes to it since 82083f9

@tiagosantos

I have been up since 6am as I can’t sleep.

So, I have dedicated some time to fix the compounds.

[pt] PT_COMPOUNDS_POST_REFORM (post-reform-compounds.txt)

  • purge all a, o, as, os, se, nos, nas endings
    Now it only has 120K compound words.
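The purge step could be sketched like this (the ending list comes from the post above; the sample words and in-memory list are illustrative):

```python
# Drop compounds whose final hyphenated part is one of the listed clitic endings
ENDINGS = {"a", "o", "as", "os", "se", "nos", "nas"}

def purge(words):
    return [w for w in words if w.rsplit("-", 1)[-1] not in ENDINGS]

sample = ["dar-nos", "vê-se", "guarda-chuva", "fá-lo"]
print(purge(sample))  # ['guarda-chuva', 'fá-lo']
```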

I will look at tonight’s Nightly Diff to see the new results.

@tiagosantos

While checking the new code in grammar.xml, I get an exception in one of your rules:

Running pattern rule tests for Portuguese… The Portuguese rule: HIFENIZADOR_VERBOS_1[1] (exception in token [1]), token [1], contains “como|para|casa” that is not marked as regular expression but probably is one.

I did the test after implementing Yakov’s “há n tempo atrás” rule improvement.

Fixed.

Not fixed AND you reverted my former changes.

@Daniel
If, in the past, simple unfitness could be an acceptable excuse, it is now quite obvious to anyone following this that Marco is intentionally hindering the project.
Sorry to keep pulling you into this, but I believe there is a need for arbitration here.

@tiagosantos
I am terribly sorry… I have just re-added all your words.

What happened is that I used my code to remove the problematic compounds and forgot to remove the one-hyphen-only condition.

Is it now the way you want?

@tiagosantos

So, I have removed:

purge all a,o,as,os,se,nos,nas terminations

Removing all the two term verbal form compounds isn’t the answer.

We shall see with today’s diff if the purge I did solved the problem.

And Brazilian speakers usually place the pronoun before the verb, so I believe my fix will work.

@dnaber @tiagosantos
I am not hindering the project; I just want it to work as well and as accurately as possible. There is no point in having tons of rules if they produce tons of false positives.

This is not about what I want.
I have nothing against mass additions to the files. Had you done such work in the past, I might not be here in the first place.
But, your solution is not “fixable” because it excludes users of Brazilian Portuguese.

I am from Portugal. I speak Portuguese. Brazilian users also have CGooGR and Lightproof, both solutions that I had considered porting from scratch to European Portuguese.
Despite all that, common language assets must be developed accommodating both language variants, and the Brazilian Portuguese user base is 20 times larger than the one from Portugal.

You might have noticed that I created a directory for pt-PT. It is not active yet, but once I figure out how, it will be. After that, solutions like yours can become more palatable, if done right.

Please, let us avoid this type of Monty Pythonesque situation again.

@tiagosantos

Tiago, I saw the nightly diff and all compounds seem to be working okay.
https://languagetool.org/regression-tests/20161105/result_pt_20161105.html

Could you confirm?

Thanks!

Compare with this:
https://languagetool.org/regression-tests/20161104/result_pt_20161104.html

Search HIFENIZADOR_VERBOS[2] (you will notice that all of them were reverted)
Search PT_COMPOUNDS_POST_REFORM

Find the 113 matches that produce an increase in total matches.

Can you tell me what the remaining matches are?

@tiagosantos

I was looking at it, but the only added hits I saw were the ones from the “há n tempo atrás” rule which Yakov improved.

In short, I simply removed all the compounds with the endings you told me to remove, and replaced the rule Yakov improved.

So, with only those two changes done, it can only remove false positives and increase the hits from Yakov’s rule.

Yakov helped you with a tip for suggestions, not patterns. The tip was great and I even used it yesterday on the paronyms group.

That rule was not edited by anyone that day, nor the day before, as anyone can see on:

Even if it did, HÁ_N_TEMPO returns exactly 6 matches, all of them in hidden text, on:
https://languagetool.org/regression-tests/20161104/result_pt_20161104.html

I will give you those 6 matches. There are still 107 others to figure out.

@tiagosantos @dnaber @Yakov

Tiago, sorry to contradict you, but you probably didn’t notice this at 6am or so:

Branch: refs/heads/master
Home: GitHub - languagetool-org/languagetool: Style and Grammar Checker for 25+ Languages
Commit: 928a952c89bcd94f5d4f457a7ab153e5f80941d4
[pt] Fixed rule "há n tempo atrás" thanks to Yakov. · languagetool-org/languagetool@928a952 · GitHub
Author: Marco A.G.Pinto marcoagpinto@mail.telepac.pt
Date: 2016-11-05 (Sat, 05 Nov 2016)

Changed paths:
M languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/grammar.xml

Log Message:

[pt] Fixed rule “há n tempo atrás” thanks to Yakov.

I really don’t know what besides this could cause the extra hits, because I only added Yakov’s fix and removed the compound endings the way you told me to.

Tiago, if you find out, please tell me as I would like to know it too :slight_smile: