[pt] Problem developing rule "Vou apresentar de seguida"

Hello!

While looking at my thesis I found out that I could improve the grammar at the start of subchapters, by replacing for example:
“Vamos apresentar a seguir” with “Apresentamos a seguir”

So, I have spent some two hours creating the rule.

The problem is that TESTRULES PT gives errors everywhere in it.

So I tried to break it down and fix the first-person verb first:

<rulegroup id='IR_VERBO-A_DE-SEGUIR-SEGUIDA-INFINITIVO' name="Ir_verbo + a/de seguir/seguida + Verbo_inf">
	<!--      Created by Marco A.G.Pinto, Portuguese rule 2021-02-1 (1-JAN-2021+)      -->
<!--
=EU=
Vou explicar a seguir o processo.
Vou a seguir explicar o processo.
Vou de seguida explicar o processo.
=TU=
Vais explicar a seguir o processo.
Vais a seguir explicar o processo.
Vais de seguida explicar o processo.
=ELE=
Vai explicar a seguir o processo.
Vai a seguir explicar o processo.
Vai de seguida explicar o processo.
=NÓS=
Vamos explicar a seguir o processo.
Vamos a seguir explicar o processo.
Vamos de seguida explicar o processo.
=VÓS/VOCÊS/ELES=
Vão explicar a seguir o processo.
Vão a seguir explicar o processo.
Vão de seguida explicar o processo.
-->	

	  <!-- EU -> APRESENTEI -->
	  <rule> 
		<pattern>
		   <marker>
			<and>
				<token inflected='yes'>ir</token>
				<token postag='VMIP1S0' postag_regexp='no'/>
			</and>
			<token min="0" max="1" regexp='yes'>a|de</token>
			<token min="0" max="1" regexp='yes'>seguir|seguida</token>
			<token postag='VMN0000' postag_regexp='no'/>
		   </marker>
		</pattern>
		<message>Em certos contextos, esta perífrase pode ser simplificada.</message>
		<suggestion><match no='4' postag='VMIP1S0' postag_regexp="yes" postag_replace='VMN0000'/> \2 \3</suggestion>
		<example correction='Explico a seguir'><marker>Vou a seguir explicar</marker> o processo.</example>
  </rule>
  
	</rulegroup>

But I get the error:

Testing rule 2600…
Skipped 0 rules for variant language to avoid checking rules more than once
2671 rules tested.
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule IR_VERBO-A_DE-SEGUIR-SEGUIDA-INFINITIVO[1] in file /org/languagetool/rules/pt/grammar.xml: Incorrect match position markup (expected match position: 0 - 21, actual: 0 - 12) in sentence: Vou a seguir explicar o processo.
at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:310)
at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:447)
at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:339)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Running disambiguator rule tests…
Running disambiguation tests for Portuguese…
100…
200…
290 rules tested (397ms)
Disambiguator tests successful.
Running XML bitext pattern tests…
Bitext pattern tests successful.
Validating false-friends.xml…
Validation successfully finished.

Could Jaume or someone give me a tip on how to fix this?

Then I will apply the fix to the other parts.

Maybe it is something very simple that is missing or wrong.

Thank you!

@jaumeortola

Any clues?

Thanks!

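The first problem is that “seguir” can match the third token and also the fourth token (as an infinitive). The numbers in the error are character offsets in the example sentence: the rule matched 0 - 12 (“Vou a seguir”, with “seguir” taken as the infinitive) instead of the expected 0 - 21 (“Vou a seguir explicar”).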

The second problem is that the POS tags in the suggestion are swapped. This is the fixed rule:

<rule> 
     <pattern>
       <marker>
         <and>
           <token inflected='yes'>ir</token>
           <token postag='VMIP1S0' postag_regexp='no'/>
         </and>
         <token min="0" max="1" regexp='yes'>a|de</token>
         <token min="0" max="1" regexp='yes'>seguir|seguida</token>
         <token postag='VMN0000' postag_regexp='no'><exception>seguir</exception></token>
       </marker>
     </pattern>
     <message>Em certos contextos, esta perífrase pode ser simplificada.</message>
     <suggestion><match no='4' postag='VMN0000' postag_regexp="yes" postag_replace='VMIP1S0'/> \2 \3</suggestion>
     <example correction='Explico a seguir'><marker>Vou a seguir explicar</marker> o processo.</example>
   </rule>

If I understand the rule, the suggestion is synthesized with the lemma of the fourth token and the POS tag of the first one. This cannot be done with the usual synthesizer, but it can be done with a new filter. Write just one rule (instead of six rules for every verb person and number), and I will add the filter to Portuguese.

@jaumeortola

Thank you!

At 5am I will create the rule.

I have created similar rules using six rules, one for each person.

I don’t know what a filter is or how it works. Do you mean that I write one rule that suggests the verb in all persons, and then you add something that shows only the right person in the results?

Something like:
“Vamos a seguir apresentar”
would suggest in one rule all the 6 persons and then you will create a filter to show just one?
“Apresento a seguir”
“Apresentas a seguir”
“Apresenta a seguir”
“Apresentamos a seguir”
“Apresentam a seguir”

At 5am I will do it.

Thanks!

@jaumeortola

Hello!

I have just committed the rule:

Can you add the filter?

Please notice that there is an issue:

=EU=
Vou explicar a seguir o processo.
Vou a seguir explicar o processo.
Vou de seguida explicar o processo.
=TU=
Vais explicar a seguir o processo.
Vais a seguir explicar o processo.
Vais de seguida explicar o processo.
=ELE/ELA=
Vai explicar a seguir o processo.
Vai a seguir explicar o processo.
Vai de seguida explicar o processo.
=NÓS=
Vamos explicar a seguir o processo.
Vamos a seguir explicar o processo.
Vamos de seguida explicar o processo.
=VÓS=
Ides explicar a seguir o processo.
Ides a seguir explicar o processo.
Ides de seguida explicar o processo.
=ELES/ELAS=
Vão explicar a seguir o processo.
Vão a seguir explicar o processo.
Vão de seguida explicar o processo.

The first of each gets an extra blank space when we apply the suggestion. I don’t have a clue why it is happening.

Vou explicar a seguir o processo.
Vais explicar a seguir o processo.
Vai explicar a seguir o processo.
Vamos explicar a seguir o processo.
Ides explicar a seguir o processo.
Vão explicar a seguir o processo.

Also, it suggests too many replacements and only 5 are shown (I believe the filter will solve that?)

Thanks!

I have a hunch this is the same kind of problem I’m encountering in a rule that requires exchanging the position of two tokens (O+Vb → Vb+O):

org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CLITICOS[1] in file /org/languagetool/rules/gl/grammar.xml: Incorrect match position markup (expected match position: 0 - 6, actual: 0 - 2) in sentence: Mo deu onte.

Here is the rule:

	<rule name="Clíticos en inicio de oración">
		<pattern>
			<token postag='SENT_START'></token>
			<marker>
				<token postag='PP.[MFC][SP].00(:PP.[MFC][SP].00)*' postag_regexp='yes'><exception postag='PP.[MF]S000|PP1CSN00|PP1CPO00|PP2CSO00|PP3.P[A0]00|PP3CSO00|PPC[PS]000|DA0.S0' postag_regexp='yes'></exception></token>				
			</marker>
			<token postag='V.*' postag_regexp='yes'></token>
		</pattern>
		<message>Os pronomes clíticos non poden ir no inicio da oración. Por regra xeral, o pronome átono vai colocado detrás do verbo: <suggestion><match no="3" case_conversion="startupper" /><match no="2" case_conversion="startlower" /></suggestion> no canto de «\2 \3»).</message>
		<example correction="Deumo"><marker>Mo deu</marker> onte.</example>
		<example><marker>Deumo</marker> onte.</example>
	</rule>

This is the first of about a dozen rules that will tackle one of the hardest parts of Galician grammar. I don’t understand what the ‘match position’ is or what 0 - 6 and 0 - 2 mean, so any pointers will be welcome.

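The <marker> positions inside <pattern> should match the <marker> positions inside <example>. The numbers are character offsets in the example sentence: the example marks “Mo deu” (0 - 6), but the marker in your pattern only covers the pronoun “Mo” (0 - 2). The verb token has to go inside the marker too: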

<rule name="Clíticos en inicio de oración">
		<pattern>
			<token postag='SENT_START'></token>
			<marker>
				<token postag='PP.[MFC][SP].00(:PP.[MFC][SP].00)*' postag_regexp='yes'><exception postag='PP.[MF]S000|PP1CSN00|PP1CPO00|PP2CSO00|PP3.P[A0]00|PP3CSO00|PPC[PS]000|DA0.S0' postag_regexp='yes'></exception></token>				
    			<token postag='V.*' postag_regexp='yes'></token>
			</marker>
		</pattern>
		<message>Os pronomes clíticos non poden ir no inicio da oración. Por regra xeral, o pronome átono vai colocado detrás do verbo: <suggestion><match no="3" case_conversion="startupper" /><match no="2" case_conversion="startlower" /></suggestion> no canto de «\2 \3»).</message>
		<example correction="Deumo"><marker>Mo deu</marker> onte.</example>
		<example><marker>Deumo</marker> onte.</example>
	</rule>

If you just need to change the order of the elements, you probably don’t need a filter here (at least in this rule).

(Feeling a bit silly, it was so obvious.) Thanks!

The problem I’m facing now is that enclitic pronouns change the accentual pattern of the verb. Unlike Portuguese, they’re not added using a hyphen, but attached directly. When moving them from before the verb to after it I’ve simply copied the verb and added the pronouns. Sometimes it works:

  • Te chamei -> Chameite
  • Me dis -> Disme

but, more often than not, an extra accent is needed:

  • Mo dixeras -> Dixérasmo
  • Volo dixen -> Díxenvolo
  • Me contas -> Cóntasme
  • Me sorprende -> Sorpréndeme

‘Dixérasmo’, for instance, is tagged VMIM2S0:PP1CS000:PP3MSA00 (‘Mo dixeras’ is PP1CS000:PP3MSA00 followed by VMIM2S0). I’ve tried this:

<match no="3" case_conversion="startupper" postag="$3$2" postag_regexp="yes" />

but without any success: I get the verb in parentheses, without the accent, followed by the pronoun: ‘(Dixeras)mo’.

This cannot be done with simple XML rules. You need some filter. A filter is just an extension to XML rules.

I think that one existing filter can be adapted to do this. Add the rule, and I will add the suggestions.

I’m not sure I understand what a filter is. Does it take the output of a rule and process it further so as to, for instance, check the spelling?
Do I submit the code on GitHub?

Done here: [gl] new rule CLITICOS_INICIO with AdvancedSynthesizerFilter · languagetool-org/languagetool@1143282 · GitHub
It is a new feature, and it is a bit experimental. It is explained here: Advanced synthesizer: another feature · Issue #4325 · languagetool-org/languagetool · GitHub

The suggestions should be capitalized, but they are not. I have to think of a solution for this.

A filter can add conditions for a rule to match, and it can change or synthesize suggestions. It is written in Java.

I think I understand. However, when launching the standalone java GUI I get this:

As there is new Java code, you need to rebuild with: mvn package -DskipTests.

Excellent! I’ve run the tests against my corpus of errors and Wikipedia and it works fine. From what I gather from Advanced synthesizer: another feature · Issue #4325 · languagetool-org/languagetool · GitHub, the only issue now would be that <match> cannot be used until this tag gets the ability to use those attributes. It will do as it is in the meantime, I think. I’ll update the pared-down version of the rule with the full one.

For the rest of the rules, I need the reverse (without capitals):

  • dixérasmo → mo dixeras
  • díxenvolo → volo dixen
  • cóntasme → me contas
  • sorpréndeme → me sorprende

For instance:

<rule id="CLITICS_03" name="Cliticos en oracións negativas">			
  	<pattern>
  			<token regexp="yes">endexamais|nada|ninguén|non|nunca|xamais</token>
  		<marker>
  			<token postag="VM.*:PP.*" postag_regexp="yes"></token>
  		</marker>
  	</pattern>
    <filter class="org.languagetool.rules.gl.AdvancedSynthesizerFilter" args="lemmaFrom:2 lemmaSelect:V(.*):(PP.*) postagFrom:2 postagSelect:(V.*) postagReplace:V\a1"/>
	<message>Nas oracións negativas os pronomes átonos van antes do verbo.</message>
    <suggestion>{suggestion}</suggestion>
	<example correction="te coñezo">Non <marker>coñézote</marker>.</example>
	<example>Non <marker>te</marker> coñezo.</example>
</rule>

I can retrieve the verb but I don’t know how to get the pronoun.

This can be solved with two <match> elements, without the filter:

<rule id="CLITICS_03" name="Cliticos en oracións negativas">			
  	<pattern>
  			<token regexp="yes">endexamais|nada|ninguén|non|nunca|xamais</token>
  		<marker>
  			<token postag="VM.*:PP.*" postag_regexp="yes"><exception postag="VM....." postag_regexp="yes"/></token>
  		</marker>
  	</pattern>
    
	<message>Nas oracións negativas os pronomes átonos van antes do verbo.</message>
    <suggestion><match no="2" regexp_match="(?iu).*(te)$" regexp_replace="$1"/> <match no="2" postag="(VM.....):PP.*" postag_regexp="yes" postag_replace="$1"/></suggestion>
	<example correction="te coñezo">Non <marker>coñézote</marker>.</example>
	<example>Non <marker>te</marker> coñezo.</example>
	<example>Non creo que ela volva antes das cinco.</example>
</rule>

You need to add a regular expression with all possible pronoun forms (in the suggestion, but probably also in the pattern). In my example, it is only for “te”.
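
As a rough, untested sketch of that last point, the same rule could be widened to several pronouns as shown below. The pronoun list (me, te, che, lle, nos, vos, lles) is an assumption chosen for illustration and is not exhaustive; contracted forms such as “mo” or “volo” are left out, and only the example already given above is kept.

```xml
<!-- Sketch only: the pronoun alternation is an illustrative assumption, not a complete list -->
<rule id="CLITICS_03" name="Cliticos en oracións negativas">
  <pattern>
    <token regexp="yes">endexamais|nada|ninguén|non|nunca|xamais</token>
    <marker>
      <!-- Restrict the pattern to the same endings that the suggestion can extract -->
      <token regexp="yes" postag="VM.*:PP.*" postag_regexp="yes">.*(me|te|che|lle|nos|vos|lles)<exception postag="VM....." postag_regexp="yes"/></token>
    </marker>
  </pattern>
  <message>Nas oracións negativas os pronomes átonos van antes do verbo.</message>
  <!-- First match: keep only the pronoun ending. Second match: synthesize the bare verb form -->
  <suggestion><match no="2" regexp_match="(?iu).*(me|te|che|lle|nos|vos|lles)$" regexp_replace="$1"/> <match no="2" postag="(VM.....):PP.*" postag_regexp="yes" postag_replace="$1"/></suggestion>
  <example correction="te coñezo">Non <marker>coñézote</marker>.</example>
  <example>Non <marker>te</marker> coñezo.</example>
</rule>
```

Because `.*` is greedy, the capture group ends up with the shortest listed ending that fits the word, so adding overlapping forms (for instance both “o” and “volo”) would need extra care.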