[pt] à-há rule issues

marcoagpinto · October 21, 2016, 7:11am

Hello!

The rule I created yesterday caused a lot of false positives.

Tiago Santos sent me an e-mail explaining how to fix it, but it still doesn't work.

How should I code it?

    <!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
    <rule id="À-HÁ_N_TEMPO" name="há n tempo">
      <pattern>
        <marker>
            <token skip="-1">à</token>
        </marker>    
        <token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
      </pattern>
      <message>Substituir «à» por <suggestion><match no="1" include_skipped="all"/> <match no="2"/></suggestion>.</message>
      <example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
    </rule>

For example: Conheço a Rita à muito perto de 30 anos
it should match:

à
one or more words
segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?

and suggest replacing “à” with “há”.

Thanks!

marcoagpinto · October 21, 2016, 7:24am

The old rule was:

    <rule id="À_QUASE" name="há quase">
      <pattern>
        <marker>
          <token>à</token>
        </marker>
        <token skip="-1">quase</token>
        <token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
      </pattern>
      <message>Substituir «à» por <suggestion>há</suggestion>.</message>
      <example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
    </rule>

But I wanted it to accept more words than just “quase”.

marcoagpinto · October 21, 2016, 8:14am

Anyway, it is now fixed:

    <!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
    <rule id="À-HÁ_N_TEMPO" name="há n tempo">
      <pattern>
        <marker>
            <token>à</token>        
        </marker>            
        <or>
            <token skip="-1" regexp="yes">quase|poucos?|alguns</token>        
            <token min="0"></token>
        </or>    
        <token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
      </pattern>
      <message>Substituir «à» por <suggestion>há</suggestion>.</message>
      <example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
    </rule>

tiagosantos · October 21, 2016, 5:29pm

@marcoagpinto

The rule makes sense, but it covers too many situations because of:

<token skip="-1">, which makes it fit a long sentence that has ‘segundos’ in another clause.

I suggested <token min='0'></token> to you, but that can make the rule very strict.
As I understand it, another way to avoid this kind of overgeneralization is an exception:

<exception scope="next">and</exception></token> (replacing ‘and’ with a very broad verbal postag such as VM[CIS][CFIMPS][123][SP]0)

or, more easily:

replace ‘skip="-1"’ (meaning all tokens until the time expression is found)
with
‘skip="X"’, where X is any number you find reasonable (meaning skip 0 to X of the next words).

I would test the second possibility first and see how it goes.

tiagosantos · October 21, 2016, 5:34pm

This applies to the first rule (the one that I had seen):

    <rule id="À-HÁ_N_TEMPO" name="há n tempo">
      <pattern>
        <marker>
          <token skip="-1">à</token>
        </marker>
        <token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
      </pattern>
      <message>Substituir «à» por <suggestion>há</suggestion>.</message>
      <example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
    </rule>

and all related rules.
This will not require you to specify quase|poucos?|alguns.

marcoagpinto · October 21, 2016, 7:41pm

@tiagosantos
It is not that simple because I was looking at the nightly diff.

One of the lines there was:
“A contagem dos anos assemelha-se à ordem dos números inteiros”

Imagine that they added to the sentence “e das horas.”
“A contagem dos anos assemelha-se à ordem dos números inteiros e das horas.”

This would generate tons of false positives all over.
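
To make the trade-off concrete, here is a minimal Python sketch of how a bounded skip behaves on the sentences quoted in this thread. It is only a toy model of the rule's two tokens, not LanguageTool's matcher: the tokenizer and the function names `tokenize` and `flags_a` are made up for illustration, and `skip` is assumed to mean "up to that many tokens may sit between à and the time unit".

```python
import re

# Same alternatives as the second token of the rule.
TIME_UNIT = re.compile(r"segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?")


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens.

    Hyphenated words such as 'assemelha-se' stay in one token.
    (Assumption: a crude stand-in for a real tokenizer.)
    """
    return re.findall(r"\w+(?:-\w+)*", sentence.lower())


def flags_a(sentence, skip):
    """Return True if some 'à' is followed by a time unit with at most
    `skip` tokens in between; skip=-1 means no limit."""
    tokens = tokenize(sentence)
    for i, tok in enumerate(tokens):
        if tok != "à":
            continue
        rest = tokens[i + 1:]
        # With skip=N the time unit may be the 1st to the (N+1)th token after 'à'.
        window = rest if skip < 0 else rest[:skip + 1]
        if any(TIME_UNIT.fullmatch(t) for t in window):
            return True
    return False


TRUE_ERROR = "Conheço a Ana à quase 30 anos."
LONGER_ERROR = "Conheço a Rita à muito perto de 30 anos"
CORRECT = "A contagem dos anos assemelha-se à ordem dos números inteiros e das horas."

for s in (TRUE_ERROR, LONGER_ERROR, CORRECT):
    print(s, [flags_a(s, k) for k in (-1, 3, 4)])
```

In this model the unlimited skip flags the correct sentence about "horas", while a limit of 3 or 4 does not. The price is recall: "à muito perto de 30 anos" needs a limit of at least 4 to be caught.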

tiagosantos · October 21, 2016, 7:48pm

3 seems a good number. Or this:

<token skip="-1">à<exception scope="next" postag="VM[CIS][CFIMPS][123][SP]0" postag_regexp="yes"></exception></token>

Or both. Like this:

    <rule id="À-HÁ_N_TEMPO" name="há n tempo">
      <pattern>
        <marker>
          <token skip="4">à<exception scope="next" postag="VM[CIS][CFIMPS][123][SP]0" postag_regexp="yes"></exception></token>
        </marker>
        <token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
      </pattern>
      <message>Substituir «à» por <suggestion>há</suggestion>.</message>
      <example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
    </rule>
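
As a rough illustration of what the verb exception adds on top of the skip limit, here is a small Python sketch. It is a toy model, not LanguageTool code: the function `flags_a`, the hand-written `DEMO_TAGS` lookup and the sentence about the beach are all invented for the demo, and the only thing taken from the rule is the tag pattern and the list of time units. The idea modelled is that skipping stops as soon as a skipped word carries a matching verb tag.

```python
import re

# Same alternatives as the second token of the rule.
TIME_UNIT = re.compile(r"segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?")
# Tag pattern quoted from the rule above.
VERB_TAG = re.compile(r"VM[CIS][CFIMPS][123][SP]0")

# Hypothetical, hand-written tags for the demo; a real tagger would supply them.
DEMO_TAGS = {"ficou": "VMIS3S0"}


def flags_a(tokens, skip, tags):
    """Return True if 'à' reaches a time unit within `skip` skipped tokens
    without passing over a token whose tag matches VERB_TAG."""
    for i, tok in enumerate(tokens):
        if tok != "à":
            continue
        for skipped, nxt in enumerate(tokens[i + 1:]):
            if skipped > skip:
                break  # skip budget used up
            if TIME_UNIT.fullmatch(nxt):
                return True
            if VERB_TAG.fullmatch(tags.get(nxt, "")):
                break  # a verb in between: treat it as a clause boundary
    return False


# Invented correct sentence: "He went to the beach and stayed two days."
correct = "ele foi à praia e ficou dois dias".split()
# Example from the rule, with the real mistake.
mistake = "conheço a ana à quase 30 anos".split()

print(flags_a(correct, 4, {}))         # skip limit only
print(flags_a(correct, 4, DEMO_TAGS))  # skip limit plus verb exception
print(flags_a(mistake, 4, DEMO_TAGS))
```

With the limit alone, "dias" is still within reach of "à" in the invented sentence, so it would be flagged; once "ficou" counts as a verb the search stops there, while the real mistake is still found.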

marcoagpinto · October 21, 2016, 7:59pm

@tiagosantos

Could you commit the improvements?

I work on weekends, so I will only have time for my projects next week.

Thanks!

Kind regards,

tiagosantos · October 21, 2016, 8:06pm

I was actually coming here to review the suggestion, since I was still thinking about it.

I missed the postag when advising. It should be <exception scope="next" postag='VM[CIS][CFIMPS][123][SP]0' postag_regexp='yes'></exception>.

If you have tested it, I do not mind pushing. That is, if DNS resolves GitHub again, because after a couple of commits it stopped finding the servers. Can you check whether this is also happening on your side?

tiagosantos · October 21, 2016, 8:12pm

I edited the post so that there are no mistakes. Note that I have not tested the rule (nor will I).

tiagosantos · October 21, 2016, 8:34pm

I could not keep my hands in my pockets and tested it… not working… at all.

It would make sense…

My apologies. I am putting this on hold for now, but I will get back to this rule once I manage to push and test the other verb rules.

tiagosantos · October 21, 2016, 9:18pm

OK. I have finally pushed my rules and found the issue that made testrules.sh complain.

The rule itself is fine, but its ‘rule id’ duplicates that of another rule.
However, with the 4-word limit this rule is unable to find the issue in any of your examples. You have to decide how many false positives you find acceptable.
Still, after testing, the exception for all intermediate verbs does restrict the rule to its own clause.

Leaving now.

marcoagpinto · October 21, 2016, 10:21pm

I believe it is better for a rule to detect only a few of the mistakes, but accurately, than to detect many of them with a lot of inaccuracy.

marcoagpinto · October 21, 2016, 11:03pm

“Few but good is better than many but bad!”

tiagosantos · October 22, 2016, 2:17pm

Then I believe these suggestions are perfectly suitable.
You may want to consider reviewing your remaining rules (on a case-by-case basis) with the same guideline:

Search for skip="-1"
See if it makes sense to reduce the scope to skip="3" or skip="4"
See if there are exceptions that cancel the rule and add them with:
<exception scope="next" regexp='yes'>something|other</exception>
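
The first step of that checklist can be automated. The sketch below uses Python's standard `xml.etree.ElementTree` to list the ids of rules that contain a token with skip="-1". The sample XML and the rule ids `DEMO_UNBOUNDED` and `DEMO_BOUNDED` are invented for the demo. It assumes the file parses as plain XML (a grammar file that relies on custom entities would need those resolved first) and it only looks at the `id` attribute on `<rule>` itself, so rules that take their id from an enclosing group show up as `None`.

```python
import xml.etree.ElementTree as ET

# Invented sample; in practice you would read the grammar file instead.
SAMPLE = """
<rules>
  <rule id="DEMO_UNBOUNDED" name="demo">
    <pattern>
      <token skip="-1">à</token>
      <token regexp="yes">horas?|dias?</token>
    </pattern>
  </rule>
  <rule id="DEMO_BOUNDED" name="demo">
    <pattern>
      <token skip="3">à</token>
      <token regexp="yes">horas?|dias?</token>
    </pattern>
  </rule>
</rules>
"""


def rules_with_unbounded_skip(xml_text):
    """Return the ids of all <rule> elements that contain a token with skip="-1"."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for rule in root.iter("rule"):
        if any(tok.get("skip") == "-1" for tok in rule.iter("token")):
            hits.append(rule.get("id"))
    return hits


print(rules_with_unbounded_skip(SAMPLE))
```

Each id it reports is a candidate for the case-by-case review: either lower the skip value or add an exception.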

Later on, after you see the regression test results and if you are satisfied with the drop in false positives, you may want to review the scope of each rule and see if it can be further extended with postag_regexp.

But first things first.

tiagosantos · October 24, 2016, 2:32pm

I agree entirely with this proposition, but I believe it is important to clarify that, despite the large number of detections in the first iteration of my rules, they are still more accurate than average. Or at least to explain why I think they are.

The set of rules that I have committed so far tests verbal agreement with the subject, and number and gender agreement between all articles, nouns and adjectives. In all Romance languages (and many other language groups too), these relations occur at least three times per sentence, and sometimes many more.

So, let us crunch numbers for better understanding.

After the first set of rules was committed, the WikiChecks results were:
https://languagetool.org/regression-tests/20161018/result_pt_20161018.html

-Portuguese: 2899 total matches
-Portuguese: ø0,07 rule matches per sentence
+Portuguese: 4468 total matches
+Portuguese: ø0,11 rule matches per sentence

If I understand the numbers correctly, the set of 4 rules added 0,04 matches per sentence (0,11 - 0,07). We can estimate the false positive rate by:
rule accuracy = rule matches per sentence / average number of tests
(if there is a technical formula more commonly used in this field, please advise)

So, assuming 2 rule tests per sentence, this first set of rules (which did not account for words that had dictionary issues or that have both a noun and a verbal meaning) actually caused 1 false positive per 50 rule tests (1 / (0,04 / 2)). I am not an expert, but it looks quite good to me. Still, with the first iteration I lowered this value to about 0,02 matches per sentence (1 false positive per 100 rule tests: 1 / ((0,09 - 0,07) / 2)) with just a few tweaks to the rules, as seen below.

The full set of rules I described before, present in yesterday’s results, shows:
https://languagetool.org/regression-tests/20161023/result_pt_20161023.html

-Portuguese: 3760 total matches
-Portuguese: ø0,09 rule matches per sentence
+Portuguese: 4962 total matches
+Portuguese: ø0,12 rule matches per sentence

Even without touching anything else, my overall ‘mismatch’ contribution is 0,05 matches per sentence (0,12 - 0,07), or rule accuracy = 0,05 / 3 ≈ 0,017 (about 1 mismatch in 60 rule tests, or 1 in 50 after rounding to 0,02).

All this assumes that 100% of the detections are false positives, to simplify the calculation. Notice that they are not all false positives. Sometimes they detect real grammar mistakes and, more often, they detect a disagreement where the clauses should have been separated by a comma.
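
For anyone who wants to redo the arithmetic, here is a small Python sketch of the estimate used in this post. The function name `false_positive_rate` is made up. The per-sentence averages are the ones from the regression results quoted here, and the 2 or 3 checks per sentence are my rough guesses, not measured values. Python uses decimal points where the reports use commas.

```python
def false_positive_rate(extra_matches_per_sentence, checks_per_sentence):
    """Extra matches per rule check, assuming every extra match is a false positive."""
    return extra_matches_per_sentence / checks_per_sentence


estimates = {
    # first commit: 0,07 -> 0,11 matches per sentence, 2 checks per sentence assumed
    "first set of rules": false_positive_rate(0.11 - 0.07, 2),
    # after the first tweaks: 0,07 -> 0,09, same assumption
    "after first tweaks": false_positive_rate(0.09 - 0.07, 2),
    # full rule set: 0,07 -> 0,12, 3 checks per sentence assumed
    "full rule set": false_positive_rate(0.12 - 0.07, 3),
}

for label, rate in estimates.items():
    print(f"{label}: {rate:.3f} per check, about 1 in {round(1 / rate)}")
```

Because the published averages are rounded to two decimals, these figures are only good to about one significant digit.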

Also, notice that yesterday’s increase is mostly due to the noun/verb ambiguity of ‘ser’. This, among other things, was fixed immediately after the results. See:

‘Ser’ is also a singular masculine noun. To show the importance of this single word, here are the match counts per rule id in the last regression tests:

    ERRO_DE_CONCORDNCIA_DO_GÉNERO_MASCULINO[1]      855
    ERRO_DE_CONCORDNCIA_DO_GÉNERO_MASCULINO_2[1]    115
    ERRO_DE_CONCORDNCIA_DO_GÉNERO_FEMININO[1]       253
    ERRO_DE_CONCORDNCIA_DO_GÉNERO_FEMININO_2[1]      77

Last but not least, I have been very conservative in the estimate of rule checks per sentence, but until I find a tool that automates this counting, these values seem appropriate.

24-10 Regression tests

-Portuguese: 4962 total matches
-Portuguese: ø0,12 rule matches per sentence
+Portuguese: 4323 total matches
+Portuguese: ø0,11 rule matches per sentence

ERRO_DE_CONCORDNCIA_DO_GÉNERO_MASCULINO[1] -563
ERRO_DE_CONCORDNCIA_DO_GÉNERO_MASCULINO_2[1] -107

marcoagpinto · October 25, 2016, 7:54pm

@tiagosantos

Tiago, the rules “há atrás” and “à-há” aren’t working.

I tested them with last night’s build:

Isto aconteceu há cerca de dez horas atrás!
Isto aconteceu à dez horas!

Neither is flagged as a grammar error.

Could you see what is wrong?

Thanks!

tiagosantos · October 25, 2016, 8:18pm

Please see today’s regression tests to confirm whether it is fixed.

marcoagpinto · October 28, 2016, 9:21am

@tiagosantos

Sorry for taking so long to reply.

I have just tested with LanguageTool-20161027-snapshot.oxt and it is working.

Thank you!