[pt] Problem creating rule – 2022-01-11

marcoagpinto · January 11, 2022, 12:08am

I was trying to clean up and improve the “à → há” rule by writing a new, better one:

	<!-- À há -->
    <rule id='CONFUSÃO_À_HÁ_V2' name="Confusão: à/há (expressões de tempo/quantidade) V2">
    <!--      Created by Marco A.G.Pinto, Portuguese rule 2022-01-08 (1-JAN-2022+)      -->
	<!--
Ainda à muito para fazer. → Ainda há muito para fazer.
Ainda à pouco para fazer. → Ainda há pouco para fazer.
Ainda à bastante para fazer. → Ainda há bastante para fazer.
Ainda à imenso para fazer. → Ainda há imenso para fazer.
Conheço a Ana à quase 30 anos. → Conheço a Ana há quase 30 anos.
O show aconteceu à aproximadamente dois meses. → O show aconteceu há aproximadamente dois meses.
Esteve à margem à aproximadamente dois meses. → Esteve à margem há aproximadamente dois meses.
Tudo à bastante tempo. → Tudo há bastante tempo.
Conheço a Ana à 30 anos. → Conheço a Ana há 30 anos.
Conhecemos os maias à séculos. → Conhecemos os maias há séculos.
Nos barcos à toneladas de peixe pescado. → Nos barcos há toneladas de peixe pescado.
Tudo à minutos. → Tudo há minutos.

NOTE THAT THIS RULE HAS BEEN REWRITTEN AND IT IS NOW VERY SMALL IN CODE AND MORE ACCURATE.
	-->
		<pattern>
			<token postag='SENT_START|_PUNCT|NC.+|AQ.+|NP.+|RG|PP.+|V.+|PI0NN000' postag_regexp='yes'/>
			<marker>
				<token skip="3" regexp='no'>à
					<exception scope="next" regexp="yes">à|há</exception>
				</token>
			</marker>
			<token regexp="yes">&adverbios_de_intensidade;&expressoes_de_tempo;|&unidades_de_medida;|&unidades_de_medida_por_extenso;</token>
<!--			
				<exception postag_regexp='no' postag='RN'/>
			</token>
-->
		</pattern>
		<message>Para expressões de tempo/quantidade utilize 'há'.</message>
		<suggestion>há</suggestion>
		<example correction="há">Ainda <marker>à</marker> muito para fazer.</example>
		<example correction="há">Ainda <marker>à</marker> pouco para fazer.</example>
		<example correction="há">Ainda <marker>à</marker> bastante para fazer.</example>
		<example correction="há">Ainda <marker>à</marker> imenso para fazer.</example>
		<example correction="há">Conheço a Ana <marker>à</marker> quase 30 anos.</example>
		<example correction="há">O show aconteceu <marker>à</marker> aproximadamente dois meses.</example>		
		<example correction="há">Esteve à margem <marker>à</marker> aproximadamente dois meses.</example>
		<example correction="há">Tudo <marker>à</marker> bastante tempo.</example>
		<example correction="há">Conheço a Ana <marker>à</marker> 30 anos.</example>
		<example correction="há">Conhecemos os maias <marker>à</marker> séculos.</example>
		<example correction="há">Nos barcos <marker>à</marker> toneladas de peixe pescado.</example>
		<example correction="há">Tudo <marker>à</marker> minutos.</example>	
	</rule>

However, TESTRULES PT throws errors:

Testing rule 2800…
Skipped 0 rules for variant language to avoid checking rules more than once
2816 rules tested.
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUS?O_?_H?_V2[1] in file /org/languagetool/rules/pt/grammar.xml: "Conheço a Ana à 30 anos."
Errors expected: 1
Errors found : 0
Message: Para express?es de tempo/quantidade utilize 'há'.
Analyzed token readings: [/SENT_START*] Conheço[conhecer/VMIP1S0*] [ /null*] a[a/SPS00] [ /null*] Ana[Ana/NPFS000,Ana/NPFSS00] [ /null*] à[a+a/SPS00+] [ /null] 30[30/Z0CN0] [ /null*] anos[ano/NCMP000] .[./SENT_END*,./_PUNCT*]
Matches: []
at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:454)
at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUS?O_?_H?_V2[1] in file /org/languagetool/rules/pt/grammar.xml: "Tudo à minutos."
Errors expected: 1
Errors found : 0
Message: Para express?es de tempo/quantidade utilize 'há'.
Analyzed token readings: [/SENT_START*] Tudo[tudo/PI0NN000*] [ /null*] à[a+a/SPS00+] [ /null] minutos[minuto/NCMP000] .[./SENT_END*,./_PUNCT*]
Matches: []
at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:454)
at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Running disambiguator rule tests…
Running disambiguation tests for Portuguese…

But if I check those two sentences in the standalone tool, the rule flags both of them correctly:

“Conheço a Ana à 30 anos.”
“Tudo à minutos.”

What am I doing wrong?

As a last resort, I can simply remove those two sentences from the rule's examples and the rule will still work, but there must be a way to avoid that.
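
One way to narrow this down before dropping the examples might be a throwaway rule that keeps the same first two tokens but replaces the entity list with literal words. This is only a sketch: the id `CONFUSÃO_À_HÁ_TESTE` and its name are made up for the experiment, and the literal `anos|minutos` stands in for whatever the entities really expand to. If this version passes TESTRULES, the difference lies in the entity-based third token; if it also fails, the problem is elsewhere in the pattern.

```xml
<!-- Hypothetical debugging rule, not meant to be committed. -->
<rule id="CONFUSÃO_À_HÁ_TESTE" name="Teste: à/há (caso isolado)">
    <pattern>
        <!-- Only the two tags seen in the failing sentences (Ana = NP..., Tudo = PI0NN000). -->
        <token postag="NP.+|PI0NN000" postag_regexp="yes"/>
        <marker>
            <!-- Same skip and exception as in the V2 rule. -->
            <token skip="3">à
                <exception scope="next" regexp="yes">à|há</exception>
            </token>
        </marker>
        <!-- Assumption: literal words instead of the &...; entities. -->
        <token regexp="yes">anos|minutos</token>
    </pattern>
    <message>Para expressões de tempo/quantidade utilize 'há'.</message>
    <suggestion>há</suggestion>
    <example correction="há">Conheço a Ana <marker>à</marker> 30 anos.</example>
    <example correction="há">Tudo <marker>à</marker> minutos.</example>
</rule>
```

Temporarily disabling the older “à → há” rule in the standalone tool while testing would also show whether it is really the V2 rule that matches there.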

Thanks!

marcoagpinto · January 11, 2022, 10:00am

Ahhhhh… I have just decided not to go ahead with this new version of the rule, since it produces tons of false positives compared to the more complex one.