Hi, as I mentioned, there are a couple of things I noticed while developing nlprule that I think are worth discussing.
These are minor things. In my opinion the rule syntax, format and functionality of LT rules are good.
id="no"
In the Spanish `disambiguation.xml` there is a rule with `id="no"`. I believe this rule is unintentionally turned off because the id is falsy. Enabling it causes some grammar rules to fail, and I was not able to trigger it in any way in LT.
Invalid regex
The English rule `INCORRECT_POSSESSIVE_FORM_AFTER_A_NUMBER` contains this exception: `<exception regexp="yes">^[\p{L}&&[^aeiuo]]+$</exception>`. I'm pretty sure this regex does not do what it is supposed to do because of the `&&`. Apart from that error, I also don't understand the intended semantics of this regex.
Invalid match references
For example, in the English rule `PRP_PAST_PART` there is this suggestion: `<suggestion>\1 has \3</suggestion>`, although the rule has only 2 tokens. Apparently LT automatically corrects this to use the last token, but in my opinion this error should be caught at rule creation time to avoid errors down the line.
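To make the kind of check I have in mind concrete, here is a minimal sketch in Python. It is not how LT or nlprule load rules. The rule XML is a simplified, hypothetical stand-in (not the real `PRP_PAST_PART` definition), the helper name `invalid_match_references` is made up, and only `\N` references inside `<suggestion>` are considered (not `<match no="...">` elements).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified rule: two tokens, but the suggestion refers to \3.
EXAMPLE_RULE = """
<rule id="EXAMPLE_RULE">
  <pattern>
    <token postag="PRP"/>
    <token postag="VBN"/>
  </pattern>
  <message>Did you mean <suggestion>\\1 has \\3</suggestion>?</message>
</rule>
"""


def invalid_match_references(rule_xml):
    """Return all \\N references in suggestions that point past the last token."""
    rule = ET.fromstring(rule_xml)
    # Count every <token> below <pattern>, including tokens wrapped in <marker>.
    n_tokens = len(rule.findall("./pattern//token"))
    invalid = []
    for suggestion in rule.iter("suggestion"):
        text = "".join(suggestion.itertext())
        for ref in re.findall(r"\\(\d+)", text):
            if not 1 <= int(ref) <= n_tokens:
                invalid.append(int(ref))
    return invalid


print(invalid_match_references(EXAMPLE_RULE))  # [3]
```

A loader could raise an error whenever this list is non-empty instead of silently falling back to the last token.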
Multiple markers
LT currently allows more than one marker per rule, for example in `POS_N` in the Spanish disambiguation rules.
Here again, I think LT should prohibit this instead of just using the inner marker.
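A stricter parser could reject such rules with a check along these lines. This is only a sketch under assumptions: the XML is a hypothetical rule (not the real `POS_N`), and `count_markers` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule with two <marker> elements in one pattern.
MULTI_MARKER_RULE = """
<rule id="EXAMPLE_MULTI_MARKER">
  <pattern>
    <marker><token>a</token></marker>
    <token>b</token>
    <marker><token>c</token></marker>
  </pattern>
</rule>
"""


def count_markers(rule_xml):
    """Count all <marker> elements below <pattern>, nested ones included."""
    pattern = ET.fromstring(rule_xml).find("pattern")
    if pattern is None:
        return 0
    return sum(1 for _ in pattern.iter("marker"))


n_markers = count_markers(MULTI_MARKER_RULE)
if n_markers > 1:
    print(f"rule has {n_markers} markers, expected at most 1")
```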
The last two are technically not errors but opportunities for LT to be a bit stricter while parsing the rules. I also have two questions:
Disambiguation filter does not keep a matching tag
In the English rule `BEST_JJS` the action is `<disambig postag="JJS"></disambig>` and there is this example: `<example inputform="best[best/NN:U,best/VB,best/VBP,good/JJS,well/JJS,well/RBS]" outputform="best[good/JJS]" type="ambiguous">It's <marker>best</marker> for him.</example>`.
Why is `well/JJS` not kept here? In other rules (e.g. `LET_GO` in the same file) all of the tags matching the filter are kept.
Order of precedence in synthesizer
I don't quite understand the order of precedence in the synthesizer. For example, with `<match no="2" postag="VBP"></match>` (in English `NON3PRS_VERB[4]`) there is this example: `<example correction="are">You <marker>is</marker> a good engineer.</example>`, while with the same match in `NON3PRS_VERB[1]` there is this example: `<example correction="have">I rarely <marker>has</marker> a bad day.</example>`.
The tag dump looks roughly like this:
[...]
am be VBP
[...]
are be VBP
[...]
has have VBZ
[...]
have have VBP
[...]
haven have VBP
[...]
is be VBZ
[...]
So how does LT select "have" and "are" here? Considering only the order in the tag dump, I was not able to reproduce this. Is there something else involved?
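To show why ordering alone does not seem to be enough, here is a small Python sketch that groups the entries quoted above by lemma and POS tag. It only uses the lines from the excerpt (with the elided parts left out) and is not meant to reflect LT's actual synthesizer logic; the function name `candidates_by_lemma_and_tag` is made up.

```python
from collections import defaultdict

# Only the entries quoted in the excerpt above, in the same order.
TAG_DUMP = """\
am be VBP
are be VBP
has have VBZ
have have VBP
haven have VBP
is be VBZ
"""


def candidates_by_lemma_and_tag(dump):
    """Map (lemma, postag) to the word forms in dump order."""
    by_key = defaultdict(list)
    for line in dump.splitlines():
        form, lemma, postag = line.split()
        by_key[(lemma, postag)].append(form)
    return by_key


forms = candidates_by_lemma_and_tag(TAG_DUMP)
print(forms[("be", "VBP")])    # ['am', 'are']
print(forms[("have", "VBP")])  # ['have', 'haven']
```

Taking the first candidate would give "am" and "have", while taking the last would give "are" and "haven", so neither simple strategy produces both "are" and "have".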
I hope this is the right place for this discussion, and thanks for any help!