[en] Disambiguation of incorrect grammar

I think that we need a policy about whether to disambiguate incorrect grammar. Consider the following sentences:

  • Incorrect grammar: Those machines can squashed.
  • Correct grammar: Those machines can the tomatoes very quickly.
  • Correct grammar: Those machines can squashed tomatoes; we sell the unsquashed tomatoes at a premium price.

In all cases, E_NP_VBP[1] gives the postag VBP to the word ‘can’.

For the first sentence, do we parse ‘can’ as VBP, or do we say that the sentence is incorrect and that parsing ‘can’ as VBP therefore makes no sense?

My preference is not to parse incorrect grammar. During the past few months, if I found a disambiguation for text that is incorrect grammar, I have tried to change the disambiguator such that it does not disambiguate the text. In general, is this a sensible strategy?

I wrote ‘in general’, because rulegroup DID_BASEFORM rule 1 finds this incorrect sentence:

  • A proposed northern bypass of Birmingham will designated as I-422.

It finds the sentence because disambiguation WILL_MD gives the POS MD to ‘will’ although the sentence is incorrect grammar.

I have not changed WILL_MD, because when I do, DID_BASEFORM rule 1 no longer finds the incorrect text. Is losing the detection of incorrect text a good reason to leave the disambiguation unchanged?

What I am trying to say (and clarify in my mind) is that if we apply postags to incorrect text, then (despite the counter-example of ‘will designated’) how can we expect the grammar rules that use postags to give a correct analysis of incorrect text?
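To make the dependency between WILL_MD and DID_BASEFORM concrete, here is a toy model in Python. It is only a sketch: the lexicon, the function names, and the `skip_incorrect_grammar` switch are hypothetical stand-ins that I made up for illustration. They are not the real LanguageTool rules or API, and the real rules are certainly more complex.

```python
# Toy model only: hypothetical stand-ins, not LanguageTool code.

# Every reading the (imaginary) tagger could assign to a word.
LEXICON = {
    "will": {"MD", "NN", "VB"},
    "designated": {"VBN", "VBD", "JJ"},
    "be": {"VB"},
    "bypass": {"NN", "VB"},
}


def tag(words):
    """Attach all possible POS tags to each word (ambiguous readings)."""
    return [(w, set(LEXICON.get(w.lower(), {"UNKNOWN"}))) for w in words]


def will_md(tokens, skip_incorrect_grammar):
    """Stand-in for WILL_MD: reduce the readings of 'will' to MD.

    If skip_incorrect_grammar is True, 'will' is disambiguated only when
    the next token has a base-form (VB) reading, that is, only when the
    text looks grammatically correct.
    """
    out = []
    for i, (word, tags) in enumerate(tokens):
        if word.lower() == "will" and "MD" in tags:
            next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
            followed_by_base_form = "VB" in next_tags
            if followed_by_base_form or not skip_incorrect_grammar:
                tags = {"MD"}
        out.append((word, tags))
    return out


def did_baseform(tokens):
    """Stand-in for DID_BASEFORM rule 1.

    Flags an unambiguous modal (MD only) that is followed by a token
    with no base-form reading.
    """
    hits = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1 == {"MD"} and "VB" not in t2:
            hits.append(f"{w1} {w2}")
    return hits


def check(sentence, skip_incorrect_grammar):
    tokens = tag(sentence.split())
    return did_baseform(will_md(tokens, skip_incorrect_grammar))
```

In this sketch, `check("A proposed northern bypass of Birmingham will designated as I-422.", skip_incorrect_grammar=False)` flags ‘will designated’, but the same call with `skip_incorrect_grammar=True` flags nothing, because ‘will’ keeps all its readings and the grammar rule never matches. That is the trade-off I am asking about.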

Just my two cents on this @Mike_Unwalla.

I believe that English disambiguation is mature and well-developed. So much so that the task you have prioritized is now to produce antipatterns for disambiguation errors, since the disambiguator very often over-disambiguates.
Touching disambiguation is tricky, but given the positive regression test results you have been having, I would keep doing what you are doing. It is working so far.
As such, I guess that the way forward is to iteratively improve rules that lose functionality (that is, detection capability) when you improve a related disambiguation rule.

I agree. Have you ever tried the raw_pos attribute?

@tiagosantos, @Jan_Schreiber, thanks for your comments. Good to have confirmation that I am not doing something stupid.

Yes, I have tried raw_pos. It is useful, but not as much as the antipatterns for disambiguation, which are really great. A limit of raw_pos is that it applies to a pattern, not to a token.
