Improving word synthesis

jaumeortola · February 27, 2017, 1:06pm

It would be very useful if we could synthesize a word taking the POS tag from one token and the lemma from another one. This need is very common, for example, in Romance languages. This feature would save us from writing dozens of rules to give a proper suggestion. So I would like to implement it.

First, we should agree on the syntax. A <match> inside a <match> is probably not a good idea. Could it be something like this?

<suggestion><match no="2" lemma_from_no="1" postag="N.(..).*" postag_regexp="yes" postag_replace="D..$1.*"/> <match no="2"/></suggestion>

For example, in Spanish, this could be used for generating suggestions like these:
un hombres > unos hombres / un hombre
algún hombres > algunos hombres / algún hombre
este hombres > estos hombres / este hombre
el hombres > los hombres / el hombre
...

All these suggestions can be generated by writing more and more XML rules, but at some point that becomes unmanageable.
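
To make the idea concrete, here is a minimal Python sketch of the intended behaviour. It is not LanguageTool code: the lexicon, the EAGLES-style tags, and the helper names `rewrite_tag` and `synthesize` are all assumptions made up for illustration. The tag of the noun is rewritten into a determiner tag pattern, and the pattern is then used to synthesize a form of the determiner's lemma.

```python
import re

# Toy lexicon (an assumption for illustration): (lemma, POS tag) -> word form.
LEXICON = {
    ("uno", "DI0MS0"): "un",
    ("uno", "DI0MP0"): "unos",
    ("el", "DA0MS0"): "el",
    ("el", "DA0MP0"): "los",
    ("este", "DD0MS0"): "este",
    ("este", "DD0MP0"): "estos",
}


def rewrite_tag(source_tag, postag, postag_replace):
    """Build the target tag pattern from the source token's tag.

    Mirrors postag / postag_replace: groups captured by `postag`
    are substituted for $1, $2, ... in `postag_replace`.
    Returns None when the source tag does not match `postag`.
    """
    m = re.fullmatch(postag, source_tag)
    if m is None:
        return None
    return re.sub(r"\$(\d)", lambda g: m.group(int(g.group(1))), postag_replace)


def synthesize(lemma, tag_pattern):
    """Return every form of `lemma` whose tag matches `tag_pattern`."""
    return sorted(
        form
        for (lem, tag), form in LEXICON.items()
        if lem == lemma and re.fullmatch(tag_pattern, tag)
    )


# "un hombres": the noun is tagged NCMP000, the determiner's lemma is "uno".
pattern = rewrite_tag("NCMP000", r"N.(..).*", "D..$1.*")
print(pattern, synthesize("uno", pattern))
```

With this toy data, the gender and number captured from the noun (`MP`) end up in the determiner pattern, so the same single rule would cover un, el, este and so on, as long as the lemma is taken from the first token.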

tiagosantos · February 27, 2017, 8:22pm

That would be brilliant. Simplifying agreement rules in Portuguese would be a real improvement, only possible with this.
The new keyword seems good. When testing the possibilities I tried something similar to this:

<suggestion><match no="2" postag="N.(..).*" postag_regexp="yes" postag_replace="D..$1.*">\1</match> <match no="2"/></suggestion>
I have no idea how difficult this solution may be, or if ambiguities between the word and its lemma may produce odd results. Anyway, I leave the idea here as a suggestion.
I look forward to seeing this feature developed.

Best regards.

curon · March 2, 2017, 12:22am

You have been able to work around some of this in the Catalan grammar.xml using unification, which is a nice trick, although this is limited to the situation where there is a binary option, which would not be the case for grammatical person for example.

Maybe allowing a match child element within the match element would work, and would be more consistent with the current ability to set any fixed lemma, such as the following:
<match no="1" postag="N.(..).*" postag_regexp="yes"><match no="2"/></match>

jaumeortola · March 2, 2017, 12:41pm

That’s a possibility. But we will need to do a full synthesis in the nested match element. The word form can be plural or feminine, and we need the lemma, which is usually the masculine singular form. The notation for all this will be quite cumbersome.

The problem with my first proposal (lemma_from_no="1") is that a token can have several lemmas. So if there is no other parameter, you’ll need to try to synthesize with every lemma.
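
A rough Python sketch of that "try every lemma" fallback, just to show what it would involve. Everything here is hypothetical: the readings, the tags, and the function names are toy assumptions, not the real tagger or synthesizer API.

```python
import re

# Toy data (assumptions for illustration only).
# Each token maps to its readings: a list of (lemma, POS tag) pairs.
READINGS = {
    "la": [("el", "DA0FS0"), ("lo", "PP3FSA00"), ("la", "NCMS000")],
}

# (lemma, POS tag) -> word form.
LEXICON = {
    ("el", "DA0FS0"): "la",
    ("el", "DA0FP0"): "las",
    ("lo", "PP3FSA00"): "la",
    ("lo", "PP3FPA00"): "las",
    ("la", "NCMS000"): "la",
}


def synthesize(lemma, tag_pattern):
    """Return the forms of `lemma` whose tag matches `tag_pattern`."""
    return [
        form
        for (lem, tag), form in LEXICON.items()
        if lem == lemma and re.fullmatch(tag_pattern, tag)
    ]


def suggest(token, tag_pattern):
    """Try every lemma of `token` and collect the distinct synthesized forms."""
    suggestions = []
    for lemma, _tag in READINGS.get(token, []):
        for form in synthesize(lemma, tag_pattern):
            if form not in suggestions:
                suggestions.append(form)
    return suggestions


# "la casas": the target pattern D..FP.* comes from the noun's tag.
print(suggest("la", "D..FP.*"))
```

In this toy case the target tag pattern already discards the pronoun and noun lemmas, because they have no determiner forms. Whether that holds in general, or whether an extra parameter is needed to pick the lemma, is exactly the open question.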

I’m not sure what the best way is.

Besides, not being fully familiar with this part of the code, the implementation is not easy for me.