Undetected errors

After 10 years of working (on and off) on LT, I have come to the conclusion that it is not possible to catch all errors, due to the flexibility of the language.
But I think the goal can be approached by a different method than detecting ‘known errors’.

Currently I am experimenting as follows:

  • I pick the most frequent word
  • make a rule to catch that word and the next one
  • make antipatterns to get rid of all allowed constructions (easily found by the number of hits in a corpus)
  • once all correct patterns have been converted into antipatterns for this rule, whatever still matches is an error in the area of the word.
    The patterns with the most hits could be turned into rules detecting that specific error (the old method); whatever remains after that could get a warning ‘weird word pattern; please check thoroughly’.
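
As a rough illustration of the steps above, here is a minimal Python sketch. It is not LT code and does not use LT's rule format: all function names are hypothetical, the corpus is a toy example, and a simple frequency threshold (`min_hits`) stands in for the manual decision about which constructions are allowed. In practice those antipatterns would be written and reviewed by hand.

```python
from collections import Counter


def most_frequent_word(sentences):
    """Return the most frequent token in a list of tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts.most_common(1)[0][0]


def bigrams_after(sentences, word):
    """Count every (word, next_word) pair that starts with `word`."""
    pairs = Counter()
    for sent in sentences:
        for first, second in zip(sent, sent[1:]):
            if first == word:
                pairs[(first, second)] += 1
    return pairs


def allowed_pairs(pairs, min_hits):
    """Assumption: pairs seen at least `min_hits` times count as allowed
    constructions (a stand-in for hand-written antipatterns)."""
    return {pair for pair, hits in pairs.items() if hits >= min_hits}


def suspicious_pairs(sentence, word, allowed):
    """Return the pairs starting with `word` that no antipattern covers."""
    return [
        (first, second)
        for first, second in zip(sentence, sentence[1:])
        if first == word and (first, second) not in allowed
    ]


# Toy corpus, already tokenized; a real run would use a large corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the mat".split(),
    "the cat saw the dog".split(),
    "the the cat ran".split(),
]

word = most_frequent_word(corpus)
allowed = allowed_pairs(bigrams_after(corpus, word), min_hits=2)

for sent in corpus:
    for pair in suspicious_pairs(sent, word, allowed):
        print("weird word pattern; please check thoroughly:", pair)
```

On this toy corpus only the doubled word in the last sentence is left over, because every other combination occurs at least twice. The threshold is the weak point of the sketch: a rare but correct combination would be flagged as well, which is why the leftover hits still need a manual check.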

It is a lot of work, but it detects all kinds of irregular word combinations that would normally not appear on the radar. It also helps to improve the POS tags.

Maybe this way of making rules is valuable for other languages as well.
