Question about conjugation matching between incorrect and suggestion sentences

baimel · September 21, 2019, 7:33am

Olá! I was having a look through some of the Continental Portuguese rules in the grammar.xml and I came across a couple things I didn’t quite understand in the group about cliches.

Here’s the relevant XML from <rule id='CLICHE_319_BOTAR_2' name='Clichê: Botar para quebrar|mandar ver'>:

      <pattern>
          <token inflected='yes'>mandar</token>
          <token>ver</token>
      </pattern>
      <message>Frase-feita. Procure alternativas.</message>
      <suggestion><match no='1' postag='V.+' postag_regexp='yes'>fazer</match> algo com extrema intensidade</suggestion>
      <example correction='fi algo com extrema intensidade|fiz algo com extrema intensidade'><marker>mandei ver</marker>.</example>

When you put any form of the verb mandar into LT on the homepage, it spits out a matching form of fazer, i.e.:

mando → faço
mandas → fazes
manda → faz
etc.

Is there something expressly in the XML here that is doing that matching, or is it something deeper in LT’s code making this happen?

Secondly, I was curious to know why, in the example correction, one of the options is ‘fi algo…’. There’s also an instance of ‘faze algo’ in the related rule about ‘botar para quebrar’. Does this have something to do with the matching that happens somewhere (maybe?) behind the scenes?

Thanks so much for your help!

jaumeortola · September 21, 2019, 9:08am

It is in the XML: the “inflected” attribute. <token inflected='yes'>mandar</token> matches any form of the verb mandar. “mando, mandas…” are called “forms”, and the infinitive “mandar” is called the “lemma”.
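As a rough illustration of that lemma lookup (a toy sketch, not LanguageTool's actual code): the dictionary below is hand-written for this example, the function name `matches_pattern` is hypothetical, and the tag for “mandei” is assumed to mirror the tag shown for “fiz” in this thread.

```python
# Toy tagger dictionary: surface form -> (lemma, POS tag).
# Hand-written illustration data, not LanguageTool's real dictionary.
TAGGER = {
    "mandei": ("mandar", "VMIS1S0"),    # assumed tag, same shape as "fiz"
    "mandares": ("mandar", "VMSF2S0"),  # tag quoted in this thread
}


def matches_pattern(tokens):
    """Mimic <token inflected='yes'>mandar</token><token>ver</token>.

    The first token matches if its lemma is "mandar" (any inflected form);
    the second token must be the literal word "ver".
    """
    if len(tokens) != 2:
        return False
    first, second = (t.lower() for t in tokens)
    lemma, _tag = TAGGER.get(first, (None, None))
    return lemma == "mandar" and second == "ver"


print(matches_pattern(["mandei", "ver"]))    # inflected form of mandar
print(matches_pattern(["mandares", "ver"]))  # another inflected form
print(matches_pattern(["mandei", "vir"]))    # second token is not "ver"
```

The point is only that the comparison happens on the lemma rather than on the surface form, so one token element covers the whole conjugation.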

Both forms are in the Portuguese dictionary with the same part-of-speech tag. I don’t know whether that is correct.

fi fazer VMIS1S0
fiz fazer VMIS1S0

tiagosantos · September 21, 2019, 11:51am

Those are compressed verbal forms. They happen when the pronouns ‘o’, ‘os’, ‘a’ and ‘as’ occur after some verbal forms (the pronoun is also modified in those cases).
The part-of-speech dictionary and the synthesizer are based on the FreeLing project. Even after several adaptations, some POS information still needs improvement. However, flagging those compressed forms with extra information is very low on the priority list.
If you happen to work on that, pull requests to the TiagoSantos81/FreeLing repository on GitHub (FreeLing project source code) are welcome and will eventually be integrated into LanguageTool.

baimel · September 24, 2019, 5:33am

Those are compressed verbal forms. They happen when the pronouns ‘o’, ‘os’, ‘a’ and ‘as’ occur after some verbal forms (the pronoun is also modified in those cases).

So, just to make sure I’ve got it, that would be in instances where you have an infinitive plus a pronoun, e.g. in this case, it would be mandá-lo/mandá-las → fazê-lo/fazê-las, correct? It seems like the rule, as written, can’t catch anything like this because those pronouns are separated by the tokenizer. Are my intuitions here correct?

baimel · September 24, 2019, 5:34am

It is in the XML, the “inflected” attribute. <token inflected='yes'>mandar</token> means any form of the verb mandar. “mando, mandas…” are named “forms”, and the infinitive “mandar” is named “lemma”.

Right, I figured that this would be able to pick up any form of mandar. What I’m really curious to know about is how, based on that XML, LT knows to spit out a matching form in the suggestion. For example, if I put in something like ‘se tu mandares ver’, LT parses that ‘mandares’ as VMSF2S0, and then spits out a corresponding ‘fizeres’ in the suggestion, with the same VMSF2S0.

tiagosantos · September 24, 2019, 6:47am

This rule is for cliches only, but the general Portuguese rule, which should be in LT, covers not only infinitives but any verb form that ends in r, s or z, if memory serves. ‘Fiz → Fi’ is one of those cases.

Correct. The main issue is that the synthesizer has the same tag for compressed forms as it has for full forms. When it applies the tag to the suggested verb, it always returns every form that matches the postag, so it shows more than it should.
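To make the over-generation concrete, here is a minimal sketch under the assumption that the synthesis dictionary behaves like a map from (lemma, tag) to a list of forms. The `SYNTH` table and the `suggestions` helper are hypothetical; only the ‘fi’/‘fiz’ entries and their shared tag come from this thread.

```python
# Toy synthesis dictionary: (lemma, POS tag) -> every form carrying that tag.
# "fi" (compressed) and "fiz" (full) share one tag, as quoted in this thread.
SYNTH = {
    ("fazer", "VMIS1S0"): ["fi", "fiz"],
    ("fazer", "VMSF2S0"): ["fizeres"],
}


def suggestions(lemma, tag, tail="algo com extrema intensidade"):
    """Return one suggestion per form stored under (lemma, tag)."""
    return [f"{form} {tail}" for form in SYNTH.get((lemma, tag), [])]


# "mandei ver" carries the first-person preterite tag, so both forms come back.
print("|".join(suggestions("fazer", "VMIS1S0")))
```

With this toy data the joined result is the same ‘fi algo…|fiz algo…’ pair that appears in the rule’s example correction. A tag that distinguished compressed forms would let the lookup return only ‘fiz’.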

This rule is XML only. Only rules with a <filter> element have an extra Java component to them.

jaumeortola · September 24, 2019, 7:59am

This is done by a synthesizer, which uses a synthesis dictionary.

The tagger dictionary takes mandares as input and gives mandar VMSF2S0 as output. You can see this analysis on LanguageTool’s Text Analysis page.

The synthesizer does the inverse operation: it takes fazer VMSF2S0 as input and gives fizeres. In the XML rule, the tags are kept, but the lemma is changed:
<suggestion><match no='1' postag='V.+' postag_regexp='yes'>fazer</match></suggestion>.
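The round trip can be sketched roughly as follows. This is a toy model under stated assumptions, not LanguageTool’s implementation: both dictionaries are hand-written, `suggest` is a hypothetical helper, and the `postag_regexp` parameter only imitates what postag='V.+' with postag_regexp='yes' appears to do, namely keep the verb tag of the matched token.

```python
import re

# Toy tagger dictionary: form -> (lemma, tag). Illustration data only.
TAGGER = {"mandares": ("mandar", "VMSF2S0")}

# Toy synthesis dictionary: (lemma, tag) -> forms. Illustration data only.
SYNTH = {("fazer", "VMSF2S0"): ["fizeres"]}


def suggest(form, new_lemma="fazer", postag_regexp=r"V.+"):
    """Tag the matched form, keep its tag, and synthesize the new lemma."""
    entry = TAGGER.get(form)
    if entry is None:
        return []  # unknown word: nothing to synthesize
    _old_lemma, tag = entry
    if not re.fullmatch(postag_regexp, tag):
        return []  # tag does not match the verb pattern
    return SYNTH.get((new_lemma, tag), [])


print(suggest("mandares"))
```

Because the tag is carried over unchanged, whatever person, number, tense and mood the matched form of mandar had is what the synthesizer looks up for fazer.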