XML pattern across sentences

Ruud_Baars · November 5, 2019, 8:31am

Would it be possible to let XML rules match across detected sentence endings?
Example:
Vrij. 21 januari
creates a sentence end after vrij because of the SRX rules. That is okay in itself, since vrij is a normal word and not an abbreviation.
If it were possible to check for
<token>vrij</token><token postag="SENT_END">.</token><token regexp="yes">[0-9]{1,2}</token>
it would be possible to warn that the period after vrij is not needed.

dnaber · November 5, 2019, 9:21am

I think this would be quite a lot of work and might require many internal changes, so I don’t think it’s worth it yet.

Ruud_Baars · November 5, 2019, 9:39am

That is a pity. I will put it on the wish list…

arysin · November 5, 2019, 9:56pm

There are text-level rules that can check things across sentences, but I think they are only available in Java (not XML).
Alternatively, you could prevent such cases from being split into sentences via the SRX file. For example, in Ukrainian a common mistake is putting a period after “млн” (million), so if there’s a digit before and after “млн.” we make an exception and don’t break the sentence. The error is then easy to catch in a rule.
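Applied to the Dutch example from the first post, the workaround might look roughly like the two fragments below. This is only a sketch: the regular expressions, the rule id, the rule name and the message are made up for illustration and not taken from the actual Dutch files, and the no-break rule would have to be placed before the general break rule in the Dutch section of the SRX file, because the first matching rule decides.

```xml
<!-- Fragment 1: hypothetical exception for the SRX file (Dutch languagerule). -->
<!-- Do not break after "vrij." when a 1-2 digit number follows. -->
<rule break="no">
  <beforebreak>\b[Vv]rij\.</beforebreak>
  <afterbreak>\s+[0-9]{1,2}\s</afterbreak>
</rule>

<!-- Fragment 2: hypothetical grammar rule. Because the sentence is no longer -->
<!-- split, the period is an ordinary token and the pattern can match it. -->
<rule id="VRIJ_PUNT_DATUM" name="Period after 'vrij' before a date">
  <pattern>
    <token>vrij</token>
    <token>.</token>
    <token regexp="yes">[0-9]{1,2}</token>
  </pattern>
  <message>The period after 'vrij' is probably not needed here.</message>
  <example><marker>Vrij. 21</marker> januari</example>
</rule>
```

One caveat with this sketch: the exception also suppresses the break in a text such as "Ik ben vrij. 21 mensen kwamen.", so the afterbreak pattern would probably need to be narrowed further.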

Ruud_Baars · November 6, 2019, 7:01am

That would mean making exceptions in the (already hard to understand) SRX file as well as adding a rule: two changes for one error. That does not feel very easy to maintain.

I will just leave it for now.