Hello!
The rule I created yesterday caused a lot of false positives.
Tiago Santos sent me an e-mail explaining how to fix it, but it still doesn't work.
How should I code it?
<!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
<rule id="À-HÁ_N_TEMPO" name="há n tempo">
<pattern>
<marker>
<token skip="-1">à</token>
</marker>
<token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
</pattern>
<message>Substituir «à» por <suggestion><match no="1" include_skipped="all"/> <match no="2"/></suggestion>.</message>
<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
</rule>
For example, it should match: Conheço a Rita à muito perto de 30 anos
The old rule was:
<!-- À QUASE há quase -->
<rule id="À_QUASE" name="há quase">
<pattern>
<marker>
<token>à</token>
</marker>
<token skip="-1">quase</token>
<token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
</pattern>
<message>Substituir «à» por <suggestion>há</suggestion>.</message>
<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
</rule>
But I wanted it to accept more words than just "quase":
<!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
<rule id="À-HÁ_N_TEMPO" name="há n tempo">
<pattern>
<marker>
<token>à</token>
</marker>
<or>
<token skip="-1" regexp="yes">quase|poucos?|alguns</token>
<token min="0"></token>
</or>
<token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
</pattern>
<message>Substituir «à» por <suggestion>há</suggestion>.</message>
<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
</rule>
The rule makes sense, but it covers too many situations because of:
<token skip="-1"> — this will make it match a long sentence that has 'segundos' in another clause.
I suggested the <token min='0'></token> to you, but that can make the rule very strict.
From my understanding, another way to avoid this type of overgeneralization is with an exception:
<exception scope="next">and</exception></token> (changing 'and' to a very broad verbal postag such as VM[CIS][CFIMPS][123][SP]0)
or, more easily:
by replacing 'skip="-1"' (meaning: skip all tokens until the time expression is found)
with 'skip="X"', where X is any number you find reasonable (meaning: skip 0 to X of the following words).
I would test the second possibility first and see how it goes.
This applies to the first rule (the one that I had seen):
<!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
<rule id="À-HÁ_N_TEMPO" name="há n tempo">
<pattern>
<marker>
<token skip="-1">à</token>
</marker>
<token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
</pattern>
<message>Substituir «à» por <suggestion>há</suggestion>.</message>
<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
</rule>
and all related rules.
This will not require you to specify quase|poucos?|alguns:
<!-- À n SEGUNDOS/MINUTOS/HORAS/DIAS/SEMANAS/MESES/ANOS há n tempo-->
<rule id="À-HÁ_N_TEMPO" name="há n tempo">
<pattern>
<marker>
<token skip="4">à<exception scope="next" postag='VM[CIS][CFIMPS][123][SP]0' postag_regexp='yes'></exception></token>
</marker>
<token regexp="yes">segundos?|minutos?|horas?|dias?|semanas?|mês|meses|anos?</token>
</pattern>
<message>Substituir «à» por <suggestion>há</suggestion>.</message>
<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
</rule>
I was actually coming here to review the suggestion while I was thinking about it.
I missed the postag when advising. It should be <exception scope="next" postag='VM[CIS][CFIMPS][123][SP]0' postag_regexp='yes'></exception>.
If you have tested it, I do not mind pushing. That is, if the DNS resolves GitHub again, because after a couple of commits it stopped finding the servers. Can you check whether this is also happening on your side?
OK. I have finally pushed my rules and found the issue that made testrules.sh complain.
The rule is fine, but its 'rule id' is duplicated by another rule.
However, this rule is unable to find the issue in any of your examples because of the 4-word skip limit. You have to decide how many false positives you find acceptable.
Still, after testing, the exception for all intermediate verbs does restrict the rule to its own clause.
So, I believe these suggestions are perfectly suitable.
You may want to consider reviewing your remaining rules (on a case-by-case basis) with the same guideline:
Search for skip="-1".
See if it makes sense to reduce the scope to skip="3" or skip="4".
See if there are exceptions that cancel the rule, and add them with: <exception scope="next" regexp='yes'>something|other</exception>
Later on, after you see the regression test results, and if you are satisfied with the drop in false positives, you may want to review the scope of each rule and see whether it can be further extended with postag_regexp.
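To make the guideline concrete, here is a minimal before/after sketch of the token in question (the skip value of 4 and the postag regexp are the ones from this thread; treat both as starting points to tune against your corpus):

```xml
<!-- Before: unbounded skip matches across clause boundaries -->
<token skip="-1">à</token>

<!-- After: bounded skip, plus an exception that cancels the match
     when a finite verb occurs among the skipped tokens -->
<token skip="4">à<exception scope="next" postag="VM[CIS][CFIMPS][123][SP]0" postag_regexp="yes"></exception></token>
```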
I agree entirely with this proposition, but I believe it is important to clarify that, despite the large number of detections on the first iteration of my rules, they are still more accurate than average. Or at least, here is why I think they are.
The set of rules that I have committed so far tests: verbal agreement with the subject, and number and gender agreement between all articles, nouns and adjectives. In all Romance languages (and many other language groups too), these relations occur at least three times per sentence, and sometimes many more.
So, let us crunch numbers for better understanding.
-Portuguese: 2899 total matches
-Portuguese: ø0,07 rule matches per sentence
+Portuguese: 4468 total matches
+Portuguese: ø0,11 rule matches per sentence
If I understand the numbers correctly, the set of 4 rules increased false detections by 0,04 matches per sentence. We can estimate the false positive rate by:
rule accuracy = rule matches per sentence / average number of tests
(if there is a technical formula more commonly used in this field, please advise)
So, this means that this first set of rules (which did not yet account for words with dictionary issues, or words with both a noun and a verbal meaning) actually caused 1 false positive for each 50 rule tests (1 / (0,04 / 2)). I am not an expert, but it looks quite good to me. Still, with the first iteration of fixes I lowered this value to 0,02 matches per sentence, that is, 1 in 100 false positives per rule test (1 / ((0,09 − 0,07) / 2)), with just a few tweaks to the rules, as seen below.
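Making the arithmetic explicit (a sketch that assumes, as in the text, roughly 2 rule tests per sentence and that every extra match is a false positive):

```latex
% False-positive estimate from the regression deltas,
% under the two assumptions stated in the lead-in.
\[
\text{initial rule set: } \frac{0{,}11 - 0{,}07}{2} = 0{,}02
\;\Rightarrow\; 1 \text{ false positive per } \frac{1}{0{,}02} = 50 \text{ tests}
\]
\[
\text{after tweaks: } \frac{0{,}09 - 0{,}07}{2} = 0{,}01
\;\Rightarrow\; 1 \text{ false positive per } \frac{1}{0{,}01} = 100 \text{ tests}
\]
```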
-Portuguese: 3760 total matches
-Portuguese: ø0,09 rule matches per sentence
+Portuguese: 4962 total matches
+Portuguese: ø0,12 rule matches per sentence
Even without touching anything else, my overall 'mismatch' contribution is 0,05 matches per sentence (0,12 − 0,07), or rule accuracy = 0,05 / 3 ≈ 0,02 (1 in 50 mismatches).
All this assumes, for ease of calculation, that 100% of the detections are false positives. Notice that they are not all false positives: sometimes they detect real grammar mistakes and, more often, they detect an incongruency where the clauses should have been separated by a comma.
Also, notice that yesterday's increase is mostly due to the noun/verb ambiguity of 'ser'. This, among other things, was fixed immediately after the results. See:
'Ser' is also a singular masculine noun. To understand the importance of this single word, a table with the rule_id search results from the last regression tests follows:
Last but not least, I have been very conservative in my estimate of rule checks per sentence, but until I find a tool that automates this counting, these values seem appropriate.
24-10 Regression tests
-Portuguese: 4962 total matches
-Portuguese: ø0,12 rule matches per sentence
+Portuguese: 4323 total matches
+Portuguese: ø0,11 rule matches per sentence