There seem to be two problems in how we handle ignored characters:
1) With ignored characters, the token positions are wrong:
a) we don't process ignored characters in the first token
b) after we adjust the token positions correctly, we reset them back
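To illustrate the position problem (this is only a sketch of the offset bookkeeping, not the attached fix; `cleanedToOriginal` is a hypothetical helper name): when ignored characters like the soft hyphen are stripped before tagging, every position computed on the cleaned token has to be mapped back to the original text — including in the first token.

```java
import java.util.Arrays;

public class IgnoredCharsOffsets {
    static final char SOFT_HYPHEN = '\u00AD';

    // Hypothetical helper: for each character of the cleaned token,
    // the index it came from in the original (uncleaned) text.
    static int[] cleanedToOriginal(String original) {
        int[] map = new int[original.length()];
        int n = 0;
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) != SOFT_HYPHEN) {
                map[n++] = i;
            }
        }
        return Arrays.copyOf(map, n);
    }

    public static void main(String[] args) {
        // "A\u00ADB" cleans to "AB"; 'B' sits at index 2 of the original,
        // so positions computed on the cleaned text must be shifted.
        System.out.println(Arrays.toString(cleanedToOriginal("A\u00ADB")));
        // → [0, 2]
    }
}
```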
Attached below is a fix and a test case for problem 1)
2) This one is trickier: when we find a token with ignored characters, we tag the token without them, but then add an extra reading with the original token (and null tags).
a) Question: is this really necessary? My only guess is that there could be a rule looking for the original token with hidden characters; this reading would then trigger the match (but if the rule also requires tags or looks for a lemma, it won't work). Also, for most languages the hidden character is the soft hyphen, and I'm not sure anybody looks for words with those; in Ukrainian we also ignore the accent character.
b) When we add this additional reading, if its token is longer (which is the case here), it replaces readings.token with the new reading's token. I could assume that is justified (although I'm not sure how it affects the rules), but the problem is that the XML rule disambiguator (at least in its filter code) then replaces reading.token with this (original) readings.token.
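Here is a minimal stand-in for the behaviour described in b) — not LanguageTool's actual AnalyzedTokenReadings; the Reading class and pickDisplayToken method are invented for illustration. The extra reading carries the longer original token, so any logic that prefers the longest reading token ends up exposing the word with the ignored characters again:

```java
import java.util.ArrayList;
import java.util.List;

public class ExtraReadingDemo {
    // Invented stand-in for an analyzed reading: token text plus POS tag.
    static class Reading {
        final String token;
        final String posTag;
        Reading(String token, String posTag) { this.token = token; this.posTag = posTag; }
    }

    // Mimics the described behaviour: the readings' token is replaced
    // by a newly added reading's token when that one is longer.
    static String pickDisplayToken(List<Reading> readings) {
        String result = "";
        for (Reading r : readings) {
            if (r.token.length() > result.length()) {
                result = r.token;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String original = "A\u00ADB";                   // soft hyphen inside
        String cleaned = original.replace("\u00AD", "");
        List<Reading> readings = new ArrayList<>();
        readings.add(new Reading(cleaned, "NOUN"));     // tagged reading, ignored chars stripped
        readings.add(new Reading(original, null));      // extra reading, null tag
        // The longer original token wins, so the exposed token flips
        // back to the word with the hidden character:
        System.out.println(pickDisplayToken(readings).equals(original)); // true
    }
}
```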
E.g. the text A\u00ADB will be tagged as:
But if this sentence is passed through the disambiguator and this token is filtered, we get this back:
This makes the output for such cases unreliable: readings.token changes depending on whether the token was disambiguated.
I think we should either not change readings.token back to the word with ignored characters when we add the extra reading, or make the disambiguator not revert to the original token.
I suspect the first option is more straightforward, as there may be more code that looks into readings.token (or replaces reading[i].token with readings.token).
lt-ignored-chars-positions-fix.patch (3.9 KB)