Findings from nlprule development

bminixhofer · January 29, 2021, 10:15am

Hi, as I mentioned, there are a couple of things I noticed while developing nlprule that I think are worth discussing.

These are minor things. In my opinion the rule syntax, format and functionality of LT rules are good.

id="no"

In the Spanish disambiguation.xml there is a rule with id="no" here. I believe this rule is unintentionally turned off because the id is falsy. Enabling it causes some grammar rules to fail and I was not able to trigger it in any way in LT.

Invalid regex

The English rule INCORRECT_POSSESSIVE_FORM_AFTER_A_NUMBER contains this exception:

<exception regexp="yes">^[\p{L}&&[^aeiuo]]+$</exception>. I’m pretty sure this regex does not do what it’s supposed to do due to the &&. I also don’t understand the semantics of this regex besides that error.
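For comparison: in Java's java.util.regex, `&&` inside a character class denotes intersection, so under that reading the exception matches words consisting only of letters other than a, e, i, u, o. Other regex engines may treat `&&` differently or reject it. Below is a minimal Python sketch of the Java reading. Python's `re` has no class intersection, so it uses `str.isalpha` instead, and the function name is made up for illustration.

```python
def matches_exception(word: str) -> bool:
    """Emulate ^[\\p{L}&&[^aeiuo]]+$ under Java's intersection semantics.

    str.isalpha() is true for characters in the Unicode Letter categories,
    which is what \\p{L} covers.
    """
    return len(word) > 0 and all(
        ch.isalpha() and ch not in "aeiuo" for ch in word
    )


print(matches_exception("th"))   # only letters, none of a/e/i/u/o
print(matches_exception("the"))  # contains "e"
print(matches_exception("x1"))   # contains a non-letter
```

Whether that is the intended meaning of the exception is a separate question.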

Invalid match references

For example, in the English rule PRP_PAST_PART there is this suggestion: <suggestion>\1 has \3</suggestion> although the rule has only two tokens. Apparently LT automatically corrects this to use the last token, but in my opinion this should be caught at rule creation time to avoid errors down the line.
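A check of this kind could be sketched roughly as follows. This is a minimal Python illustration, not LT's actual test code: the rule XML is a reduced, hypothetical stand-in for PRP_PAST_PART, and `invalid_match_refs` is a name made up here. It counts the `<token>` elements in the `<pattern>` and reports every backreference in a suggestion whose number exceeds that count.

```python
import re
import xml.etree.ElementTree as ET

# Reduced, hypothetical rule: two tokens, but the suggestion refers to \3.
RULE_XML = r"""
<rule id="PRP_PAST_PART_SKETCH">
  <pattern>
    <token postag="PRP"/>
    <token postag="VBN"/>
  </pattern>
  <message>Did you mean <suggestion>\1 has \3</suggestion>?</message>
</rule>
"""


def invalid_match_refs(rule_xml: str) -> list[int]:
    """Return backreference numbers in suggestions that exceed the token count."""
    rule = ET.fromstring(rule_xml)
    n_tokens = len(rule.findall("./pattern//token"))
    bad = []
    for suggestion in rule.iter("suggestion"):
        text = "".join(suggestion.itertext())
        for ref in re.findall(r"\\(\d+)", text):
            if not 1 <= int(ref) <= n_tokens:
                bad.append(int(ref))
    return bad


print(invalid_match_refs(RULE_XML))  # [3]
```

The sketch ignores `<match no="...">` elements and other reference forms; a real check would need to cover those as well.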

Multiple markers

LT currently allows more than one marker per rule, for example in POS_N in the Spanish disambiguation rules.

Here again, I think LT should prohibit this instead of just using the inner marker.
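A stricter parser could reject such patterns with a check along these lines. This is only a sketch: the pattern below is a reduced, hypothetical example (not the real POS_N rule), and the function names are made up.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical pattern with one <marker> nested inside another.
PATTERN_XML = """
<pattern>
  <marker>
    <unify>
      <feature id="number"/>
      <marker>
        <token postag="N.*" postag_regexp="yes"/>
      </marker>
      <token postag="A.*" postag_regexp="yes"/>
    </unify>
  </marker>
</pattern>
"""


def count_markers(pattern_xml: str) -> int:
    """Count all <marker> elements anywhere inside the pattern."""
    pattern = ET.fromstring(pattern_xml)
    return sum(1 for _ in pattern.iter("marker"))


def check_single_marker(pattern_xml: str) -> None:
    """Raise if the pattern contains more than one <marker>."""
    n = count_markers(pattern_xml)
    if n > 1:
        raise ValueError(f"pattern has {n} <marker> elements, expected at most 1")


print(count_markers(PATTERN_XML))  # 2
```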

The last two are technically not errors but opportunities for LT to be a bit stricter while parsing the rules. There are also two questions I have:

Disambiguation filter does not keep a matching tag

In the English rule BEST_JJS the action is <disambig postag="JJS"></disambig> and there is this example: <example inputform="best[best/NN:U,best/VB,best/VBP,good/JJS,well/JJS,well/RBS]" outputform="best[good/JJS]" type="ambiguous">It's <marker>best</marker> for him.</example>.

Why is well/JJS not kept here? In other rules (e.g. LET_GO in the same file) all of the tags matching the filter are kept.

Order of precedence in synthesizer

I don’t quite understand the order of precedence in the synthesizer. E.g., with <match no="2" postag="VBP"></match> (in English NON3PRS_VERB[4]) there is this example: <example correction="are">You <marker>is</marker> a good engineer.</example>, while with the same match in NON3PRS_VERB[1] there is this example: <example correction="have">I rarely <marker>has</marker> a bad day.</example>.

The tag dump looks roughly like this:

[...]
am	be	VBP
[...]
are	be	VBP
[...]
has	have	VBZ
[...]
have	have	VBP
[...]
haven	have	VBP
[...]
is	be	VBZ
[...]

So how does LT select “have” and “are” here? Considering only the order in the tag dump, I was not able to reproduce this. Is there something else involved?

I hope this is the right place for this discussion, and thanks for any help!

Mike_Unwalla · January 29, 2021, 11:26am

@dnaber, I fixed this error in [en] Improve PRP_PAST_PART · languagetool-org/languagetool@07f9d65 · GitHub.

@dnaber, I agree strongly. Please add a check in testrules.

jaumeortola · January 29, 2021, 12:23pm

Thanks for your comments. Regarding the Spanish rules:

The problem with the rule id="no" was a wrong tag (RG instead of RN). Anyway, other rules take care of this. I couldn’t reproduce any test failure: [es] fix disambiguation rule: no_adv · languagetool-org/languagetool@9a30442 · GitHub
I don’t understand the issue about “Multiple markers” (POS_N disambiguation rule). There can only be one <marker></marker> on each rule with one or more tokens inside.

jaumeortola · January 29, 2021, 12:49pm

Disambiguation filter does not keep a matching tag
English disambiguation.xml, see the difference:
In BEST_JJS: <disambig postag="JJS"/>
In LET_GO: <disambig action="filter" postag="VB.*"/>
The first disambiguation is certainly ambiguous. It shouldn’t be used here. When there is no action defined, the action done seems to be “replace”. I don’t know how the lemma is selected.
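To make the difference concrete, here is a minimal Python sketch of what a "filter" action would do with the readings from the BEST_JJS example. It assumes that filter means "keep every reading whose POS tag fully matches the given regex", which is how the thread describes LET_GO; it is not LT's actual implementation, and the function name is made up.

```python
import re

# Readings of "best" taken from the BEST_JJS example, as (lemma, postag) pairs.
READINGS = [
    ("best", "NN:U"),
    ("best", "VB"),
    ("best", "VBP"),
    ("good", "JJS"),
    ("well", "JJS"),
    ("well", "RBS"),
]


def filter_readings(readings, postag_regex):
    """Keep every reading whose POS tag fully matches the regex (assumed semantics)."""
    return [(lemma, tag) for lemma, tag in readings if re.fullmatch(postag_regex, tag)]


print(filter_readings(READINGS, "JJS"))
# [('good', 'JJS'), ('well', 'JJS')]
```

Under this assumption a filter keeps both good/JJS and well/JJS, whereas the example's expected output best[good/JJS] has a single reading, which fits an action other than filter being applied.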

Order of precedence in synthesizer
Some forms were removed from the synthesis dictionary because they are usually undesired when generating suggestions. Another solution would be to add a special tag just for the form “am”.
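As a rough illustration of the effect, here is a Python sketch that synthesizes forms from the tag dump entries quoted earlier in the thread while skipping excluded forms. The function name is made up, and the set of excluded forms is an assumption: only "am" is mentioned above, and "haven" is a guess added to reproduce the single correction "have".

```python
# Entries from the tag dump quoted earlier in the thread: (form, lemma, postag).
TAG_DUMP = [
    ("am", "be", "VBP"),
    ("are", "be", "VBP"),
    ("has", "have", "VBZ"),
    ("have", "have", "VBP"),
    ("haven", "have", "VBP"),
    ("is", "be", "VBZ"),
]


def synthesize(lemma, postag, excluded):
    """Return all forms for (lemma, postag), skipping forms in `excluded`."""
    return [
        form
        for form, entry_lemma, entry_tag in TAG_DUMP
        if entry_lemma == lemma and entry_tag == postag and form not in excluded
    ]


# Assumption: "am" is removed from synthesis (mentioned above).
print(synthesize("be", "VBP", excluded={"am"}))  # ['are']

# Assumption: "haven" is removed as well (a guess, not confirmed in the thread).
print(synthesize("have", "VBP", excluded={"am", "haven"}))  # ['have']
```

Without any exclusions, the same lookup would return both "am" and "are" for be/VBP, so the dump order alone does not explain the suggestions.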

bminixhofer · January 29, 2021, 2:07pm

There can only be one <marker></marker> on each rule with one or more tokens inside.

Yes, that’s the issue. There are two <marker>s in one <pattern> here and LT allows it because one is in the unify: languagetool/languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/disambiguation.xml at febab7bbc4fbf7636b45831929740e9c4f8443a5 · languagetool-org/languagetool · GitHub

The first disambiguation is certainly ambiguous. It shouldn’t be used here. When there is no action defined, the action done seems to be “replace”. I don’t know how the lemma is selected.

Thanks, I didn’t look closely enough when searching for a counterexample; the issue is indeed how the lemma is selected. It is a bit weird because nlprule only does the wrong thing in this one case (BEST_JJS).

Some forms were removed from the synthesis dictionary because they are usually undesired when generating suggestions. Another solution would be to add a special tag just for the form “am”.

The information from filter-archaic.txt seems to be what I was missing, thanks!

@Mike_Unwalla The one other place where I noticed this issue is in the Spanish EN_TORNO rule. It may happen in other languages as well, so a test is the best solution.

Mike_Unwalla · January 29, 2021, 3:37pm

I will leave the changes to the developers of Spanish. I’ve asked Daniel to put a check in our test routine.

dnaber · January 29, 2021, 9:37pm

Nesting <marker>s now causes an error (commit).