I agree entirely with this proposition, but I believe it is important to clarify that, despite the large number of detections on the first iteration of my rules, their accuracy is still above average. Or at least to explain why I think it is.
The set of rules I have committed so far tests verbal agreement with the subject, and number and gender agreement between articles, nouns and adjectives. In all Romance languages (and many other language groups too), these relations occur at least three times per sentence, and sometimes many times more.
So, let us crunch the numbers for a better understanding.
After the first set of rules was committed, the WikiChecks results were:
-Portuguese: 2899 total matches
-Portuguese: ø0,07 rule matches per sentence
+Portuguese: 4468 total matches
+Portuguese: ø0,11 rule matches per sentence
If I understand the numbers correctly, the set of 4 rules increased false detections by 0,04 matches per sentence. We can estimate the false positive rate by:
rule accuracy = rule matches per sentence / average number of tests
(if there is a technical formula more commonly used in this field, please advise)
So, this means that this first set of rules (which did not yet account for words with dictionary issues, or words that have both a noun and a verbal meaning) caused 1 false positive per 50 rule tests (1 / (0,04 / 2)). I am not an expert, but it looks quite good to me. Still, in the first iteration I lowered this value to 0,02 matches per sentence, i.e. 1 in 100 false positives per rule test (1 / ((0,09 - 0,07) / 2)), with just a few tweaks to the rules, as seen below.
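The estimate above can be sketched as a few lines of Python (the helper name is mine; the figures are the ones from the WikiChecks results, and the worst-case assumption that every new match is a false positive is kept):

```python
# Back-of-the-envelope false positive estimate from WikiChecks deltas.
# Worst-case assumption: every extra match introduced by the rules is a false positive.

def false_positives_per_test(after_rate, before_rate, tests_per_sentence):
    """Estimated fraction of individual rule tests that produce a false positive."""
    extra_matches = after_rate - before_rate       # new matches per sentence
    return extra_matches / tests_per_sentence      # matches per single rule test

# First committed set: 0,07 -> 0,11 matches/sentence, ~2 rule tests/sentence.
first = false_positives_per_test(0.11, 0.07, 2)
print(f"first commit: 1 false positive per {1 / first:.0f} rule tests")

# After the tweaks: 0,07 -> 0,09 matches/sentence.
tweaked = false_positives_per_test(0.09, 0.07, 2)
print(f"after tweaks: 1 false positive per {1 / tweaked:.0f} rule tests")
```

This prints 1 per 50 for the first commit and 1 per 100 after the tweaks, matching the figures above.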
The full set of rules I described before, reflected in yesterday's results, shows:
-Portuguese: 3760 total matches
-Portuguese: ø0,09 rule matches per sentence
+Portuguese: 4962 total matches
+Portuguese: ø0,12 rule matches per sentence
Even without touching anything else, my overall 'mismatch' contribution is 0,05 matches per sentence (0,12 - 0,07), or rule accuracy = 0,05 / 3 ≈ 0,017 (roughly 1 mismatch in 60 rule tests).
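The same arithmetic for yesterday's full set, this time with three rule tests per sentence, gives the unrounded figure (variable names are mine):

```python
# Contribution of the full rule set: baseline 0,07 matches/sentence,
# full set 0,12, with ~3 rule tests per sentence.
contribution = 0.12 - 0.07          # extra matches per sentence
accuracy = contribution / 3         # estimated false positives per rule test
print(f"{accuracy:.3f} false positives per rule test")   # ~0.017
```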
All this assumes, for ease of calculation, that 100% of the detections are false positives. Notice that they are not all false positives: sometimes they detect real grammar mistakes, and, more often, they flag an incongruency where the clauses should have been separated by a comma.
Also, notice that yesterday's increase is mostly due to the noun/verb duality of 'ser'. This, among other things, was fixed immediately after the results. See:
'Ser' is, besides the verb, a singular masculine noun. To understand the importance of this single word, a table with the per-rule_id match deltas from the last regression tests follows:
Last but not least, I have been very conservative in my estimate of rule checks per sentence, but until I find a tool that automates this counting, these values seem appropriate.
24-10 Regression tests
-Portuguese: 4962 total matches
-Portuguese: ø0,12 rule matches per sentence
+Portuguese: 4323 total matches
+Portuguese: ø0,11 rule matches per sentence
ERRO_DE_CONCORDÂNCIA_DO_GÉNERO_MASCULINO -563
ERRO_DE_CONCORDÂNCIA_DO_GÉNERO_MASCULINO_2 -107