[pt] Postreform compounds

tiagosantos · November 4, 2016, 4:04pm

Impressive.
But why not be bandwidth conscious and try a slimmer solution?
For example:

<rulegroup id='HIFENIZADOR_VERBOS' name='Colocações pronominais'>
  <rule>
    <pattern>
      <token postag='V.*' postag_regexp='yes'></token>
      <token regexp='yes'>m[aeo]|t[aeo]|s[aeo]|lh[aeo]s?|n[ao]s|vos</token>
      <token regexp='yes'>á|ão|ás|ei|eis|emos|ia|iam|íamos|ias|íeis</token>
    </pattern>
    <message>Palavra composta. Pretende dizer</message>
    <suggestion>\1‑\2‑\3</suggestion>
    <url>https://pt.wikipedia.org/wiki/Coloca%C3%A7%C3%A3o_pronominal</url>
    <example correction='dar‑nos‑á'><marker>dar nos á</marker>.</example>
  </rule>
  <rule>
    <pattern>
      <token postag='V.*' postag_regexp='yes'></token>
      <token regexp='yes'>m[aeo]|t[aeo]|s[aeo]|lh[aeo]s?|n[ao]s|vos</token>
    </pattern>
    <message>Palavra composta. Pretende dizer</message>
    <suggestion>\1‑\2</suggestion>
    <url>https://pt.wikipedia.org/wiki/Coloca%C3%A7%C3%A3o_pronominal</url>
    <example correction='dar‑nos'><marker>dar nos</marker>.</example>
  </rule>
</rulegroup>

Think about the trees… and the polar bears.
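
For anyone who wants to try the two token regexes outside LanguageTool, here is a minimal Python sketch. It assumes each token regex has to match the whole token, and it skips the `postag='V.*'` check on the first token entirely, since that needs a tagger. The names `suggest`, `PRONOUN` and `ENDING` are made up for illustration.

```python
import re

# Token regexes copied from the rulegroup above.
PRONOUN = re.compile(r"m[aeo]|t[aeo]|s[aeo]|lh[aeo]s?|n[ao]s|vos")
ENDING = re.compile(r"á|ão|ás|ei|eis|emos|ia|iam|íamos|ias|íeis")

# The rule's suggestions use U+2011 (non-breaking hyphen).
HYPHEN = "\u2011"


def suggest(tokens):
    """Return the hyphenated form for a 2- or 3-token candidate, else None.

    Assumption: the first token is already known to be a verb form;
    the POS tag test from the real rule is not reproduced here.
    """
    if (len(tokens) == 3
            and PRONOUN.fullmatch(tokens[1])
            and ENDING.fullmatch(tokens[2])):
        return HYPHEN.join(tokens)
    if len(tokens) == 2 and PRONOUN.fullmatch(tokens[1]):
        return HYPHEN.join(tokens)
    return None
```

Calling `suggest(["dar", "nos", "á"])` mirrors the first `<example>` in the rulegroup, and `suggest(["dar", "nos"])` mirrors the second.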

marcoagpinto · November 4, 2016, 4:29pm

@tiagosantos

Well, text files have a very high compression ratio.

I have just compressed it into a .zip and it came to ~1 MB.

So, 1 MB isn’t much.

tiagosantos · November 4, 2016, 4:43pm

No point appealing to reason, right?

marcoagpinto · November 4, 2016, 5:11pm

@tiagosantos

Don’t worry… I have just fixed it by only keeping words with one hyphen.

Now it is only 2 MB big.
And I kept yours:
á-bê-cê*
á-bê-cês*

marcoagpinto · November 4, 2016, 10:27pm

@tiagosantos

I have seen the nightly diff results regarding the postreform compounds.

On Monday I will remove the false positives as I have the weekend job and won’t have the chance to do much.

tiagosantos · November 4, 2016, 11:48pm

@marcoagpinto @matheuspoletto @dnaber @macios

github.com/languagetool-org/languagetool

[pt] HIFENIZADOR_VERBOS false positive fix

committed 10:36PM - 04 Nov 16 UTC

TiagoSantos81

+16 -9

* rulegroup split in two due to pt-PT and pt-BR differences
* two word verbal co…locations set as default='off' due to language variant differences
* exclusion for se, nos, nas due to excessive false positives

@Marco PT_COMPOUNDS_POST_REFORM (post-reform-compounds.txt)
* purge all a,o,as,os,se,nos,nas terminations
* all remaining two term verbal form compound words MUST be removed due to conflict with inverted Brazilian Portuguese colocations

False positives in hyphenation have been a complaint since:

Unless there is a way to selectively disable a list of terms in post-reform-compounds.txt (for pt-BR), you need to revert all changes made to it since 82083f9.

marcoagpinto · November 5, 2016, 6:38am

@tiagosantos

I have been up since 6am as I can’t sleep.

So, I have dedicated some time to fix the compounds.

[pt] PT_COMPOUNDS_POST_REFORM (post-reform-compounds.txt)

purge all a,o,as,os,se,nos,nas terminations
Now it only has 120K compound words.
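
For reference, a purge like this can be sketched in a few lines of Python. This is only an illustration, not the code that was actually run: it assumes one compound per line, an optional trailing `*` marker (as in `á-bê-cê*`), and that "termination" means the last hyphen-separated part. The function names are hypothetical.

```python
# Terminations listed in the commit message.
PURGED_ENDINGS = {"a", "o", "as", "os", "se", "nos", "nas"}


def keep_entry(line):
    """Return True if the list entry should stay after the purge."""
    word = line.strip().rstrip("*")  # drop the optional trailing marker
    if not word or word.startswith("#"):
        return True  # leave blank lines and comments untouched
    if "-" not in word:
        return True  # not a hyphenated compound, nothing to purge
    last_part = word.rsplit("-", 1)[-1]
    return last_part.lower() not in PURGED_ENDINGS


def purge(lines):
    """Filter a list of entries, preserving order."""
    return [line for line in lines if keep_entry(line)]
```

With this sketch, `purge(["dar-nos", "á-bê-cê*", "guarda-chuva", "levou-a"])` keeps only `á-bê-cê*` and `guarda-chuva`.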

I will look at tonight’s Nightly Diff to see the new results.

marcoagpinto · November 5, 2016, 6:56am

@tiagosantos

While checking the new code in grammar.xml, I get an exception in one of your rules:

Running pattern rule tests for Portuguese… The Portuguese rule: HIFENIZADOR_VERBOS_1[1] (exception in token [1]), token [1], contains “como|para|casa” that is not marked as regular expression but probably is one.

I did the test after implementing Yakov’s “há n tempo atrás” rule improvement.
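
A quick way to spot this kind of problem before running the full test suite is to scan the XML for `|` inside a `<token>` or `<exception>` that lacks `regexp='yes'`. The sketch below is a rough approximation in Python; the actual heuristic used by the pattern rule test is not shown in this thread, and `unmarked_regex_tokens` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def unmarked_regex_tokens(xml_text):
    """List token/exception texts that contain '|' but are not marked as regexp.

    Assumption: a '|' in the text is taken as the sign of an intended
    regular expression; other metacharacters are not checked.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if elem.tag in ("token", "exception"):
            text = (elem.text or "").strip()
            if "|" in text and elem.get("regexp") != "yes":
                hits.append(text)
    return hits
```

Run on a rule whose exception reads `<exception>como|para|casa</exception>`, it reports `como|para|casa`; once the element carries `regexp='yes'`, the list comes back empty.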

tiagosantos · November 5, 2016, 8:20am

Fixed.

Not fixed AND you reverted my former changes.

@Daniel
If, in the past, simple unfitness could be an acceptable excuse, it is now quite obvious to anyone following this that Marco is intentionally hindering the project.
Sorry to keep pulling you into this, but I believe there is a need for arbitration here.

marcoagpinto · November 5, 2016, 8:45am

@tiagosantos
I am terribly sorry… I have just readded all your words.

What happened is that I used my code to remove the problematic compounds and forgot to remove the one-hyphen-only condition.

Is it now the way you want?

marcoagpinto · November 5, 2016, 8:56am

@tiagosantos

So, I have removed:

purge all a,o,as,os,se,nos,nas terminations

Removing all the two term verbal form compounds isn’t the answer.

We shall see with today’s diff if the purge I did solved the problem.

And the Brazilian guys usually have the second part of the verb before the verb, so I believe my fix will work.

@dnaber @tiagosantos
I am not hindering the project, I just want it to work as well and as accurately as possible. There is no point in having tons of rules if they produce tons of false positives.

tiagosantos · November 5, 2016, 8:57am

This is not about what I want.
I have nothing against mass additions to the files. Had you done such work in the past, I might not be here in the first place.
But your solution is not “fixable” because it excludes users of Brazilian Portuguese.

I am from Portugal. I speak Portuguese. Brazilian users also have CGooGR and Lightproof, both solutions that I had considered porting from scratch to European Portuguese.
Despite all that, common language assets must be developed accommodating both language variants, and the Brazilian Portuguese user base is 20 times larger than the one from Portugal.

You might have noticed that I created a directory for pt-PT. It is not active yet, but it will be once I figure out how. After that, solutions like yours can become more palatable, if done right.

Please, let me avoid this type of Monty Pythonesque situation in the future.

marcoagpinto · November 6, 2016, 11:23am

@tiagosantos

Tiago, I saw the nightly diff and all compounds seem to be working okay.
https://languagetool.org/regression-tests/20161105/result_pt_20161105.html

Could you confirm?

Thanks!

tiagosantos · November 6, 2016, 3:00pm

Compare with this:
https://languagetool.org/regression-tests/20161104/result_pt_20161104.html

Search HIFENIZADOR_VERBOS[2] (you will notice that all of them were reversed)
Search PT_COMPOUNDS_POST_REFORM

Find the 113 matches that produce an increase in total matches.

Can you tell me what the remaining matches are?
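
One way to attribute a change in total matches to individual rules is to count matches per rule ID in both reports and subtract. A minimal Python sketch, assuming the rule IDs have already been extracted from each report into a plain list (that extraction step is not shown, and `match_delta` is a hypothetical helper):

```python
from collections import Counter


def match_delta(before, after):
    """Return {rule_id: change in match count} for rules whose count changed.

    `before` and `after` are lists with one rule ID per match.
    """
    b, a = Counter(before), Counter(after)
    return {
        rule: a[rule] - b[rule]
        for rule in sorted(set(a) | set(b))
        if a[rule] != b[rule]
    }
```

Summing the values of the returned dictionary gives the net change in total matches, so any rule with a non-zero entry is a candidate for explaining the difference.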

marcoagpinto · November 6, 2016, 11:15pm

@tiagosantos

I was looking at it, but the only added hits I saw were the ones from the “há n tempo atrás” rule which Yakov improved.

In simple words, I removed all the compounds with the endings you told me to, and replaced the rule Yakov improved.

So, with only those two changes done, it can only remove false positives and increase the hits from Yakov’s rule.

tiagosantos · November 7, 2016, 10:02am

Yakov helped you with a tip for suggestions, not patterns. The tip was great and I even used it yesterday on the paronyms group.

That rule was not edited by anyone that day, nor the day before, as anyone can see on:

Even if it had been, a search for HÁ_N_TEMPO returns exactly 6 matches, all of them in hidden text, on:
https://languagetool.org/regression-tests/20161104/result_pt_20161104.html

I will give you those 6 matches. There are still 107 others to figure out.

marcoagpinto · November 7, 2016, 12:15pm

@tiagosantos @dnaber @Yakov

Tiago, sorry to contradict you, but you probably didn’t notice this at 6am or so:

Branch: refs/heads/master
Home: GitHub - languagetool-org/languagetool: Style and Grammar Checker for 25+ Languages
Commit: 928a952c89bcd94f5d4f457a7ab153e5f80941d4
[pt] Fixed rule "há n tempo atrás" thanks to Yakov. · languagetool-org/languagetool@928a952 · GitHub
Author: Marco A.G.Pinto marcoagpinto@mail.telepac.pt
Date: 2016-11-05 (Sat, 05 Nov 2016)

Changed paths:
M languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/grammar.xml

Log Message:

[pt] Fixed rule “há n tempo atrás” thanks to Yakov.

I really don’t know what besides this could cause extra hits, because I only added Yakov and removed the compounds endings the way you told me to.

Tiago, if you find out, please tell me as I would like to know it too

tiagosantos · November 7, 2016, 2:06pm

That commit is from the 5th of November.
If we consider the detections introduced by rule HÁ-ATRÁS, there were 43 hits on the regression test of the 5th of November. But they don’t change the results, since the commit only changes suggestions and whitespace in the pattern.

So… 113 matches whose origin you do not even know, and you still think that adding 120k random words is a good solution? It seems to me that the list is not reviewable even by you, since you cannot find the errors it introduces, and I believe it is not up to me to find them, is it?

I told you before, and I reaffirm:
Put your name in the file, if you want to, but revert your changes.
They do not add anything that is not already done, in an easier-to-maintain way, by a rule.

Moreover, if you find any way to compress my rules into a lower number of total rules, please tell me and I will do it too.

marcoagpinto · November 7, 2016, 2:33pm

@tiagosantos
Tiago, I am very stressed and can’t think properly right now.

Could you please revert my changes in the 120k compounds then?

In January I will try to add compounds to it but only words and not verbs.

I will have to analyse the words in the pt_PT speller to see which ones are not verbs and it will take a long time.

Right now I need to dedicate more time to the PhD project since next week I will be a few days in the North with the PhD coordinator and won’t have the chance to do much.

Tiago, I noticed the other day that one of the redundancy rules you added, “hemorragia de sangue”, already existed: I added it months ago, which means there are now two rules for the same thing.

Thanks!

Kind regards,

tiagosantos · November 7, 2016, 3:04pm

Sure Marco. I will just replace the word list and leave the header as you saw fit.
I am certain you will add many more words in due time. The verbs are already covered.

This is an extra, so focus on the PhD. No need to do much. Just consistent improvement over time. We will get there.

No worries. I will review coverage and comment out the redundant one.

Cheers.