Extra tags for AnalyzedToken

arysin · May 20, 2020, 6:55pm

I feel we could improve AnalyzedToken by allowing additional tags that are not POS tags.
I currently put some tags in the POS tag dictionary that are helpful but are not pure POS tags. If we had extra tags, we could put such information there.

The idea is to add a field AnalyzedToken.extraTags, e.g.:
private Map<String,String> extraTags;
The key here is the category and the value is the set of tags, e.g. “pre-disambig: noun:m:…”.
This could be used in several ways:

The disambiguator could put the original (pre-disambig) tags there (if needed).
The tagger could put additional tags there so that the disambiguator and rules could use them later.
The tagger could use that field for some dynamic properties (e.g. the token was tagged dynamically, without the dictionary), or users could create additional dictionaries.

It could then be used in the rules with something like this:
<token postag="..." extra_tag="category1:tag1">
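
To make the rule syntax above a bit more concrete, here is a rough sketch of how a "category:tag" attribute value might be checked against a token's extra tags. Nothing here exists in LanguageTool: ExtraTagMatcher and matches are made-up names, and the Map<String, Set<String>> shape is just an assumption for illustration. Since tags like noun:m contain colons themselves, the sketch splits on the first colon only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
class ExtraTagMatcher {

  // Returns true if the attribute value "category:tag" is present in extraTags.
  // Only the first colon separates category and tag, so "disambig:noun:m"
  // means category "disambig" and tag "noun:m".
  static boolean matches(String extraTagAttr, Map<String, Set<String>> extraTags) {
    int sep = extraTagAttr.indexOf(':');
    if (sep <= 0 || sep == extraTagAttr.length() - 1) {
      throw new IllegalArgumentException("Expected 'category:tag' but got: " + extraTagAttr);
    }
    String category = extraTagAttr.substring(0, sep);
    String tag = extraTagAttr.substring(sep + 1);
    Set<String> tags = extraTags.get(category);
    return tags != null && tags.contains(tag);
  }

  public static void main(String[] args) {
    // Example data; the category and tag names are only illustrative.
    Map<String, Set<String>> extraTags = new HashMap<>();
    extraTags.put("disambig", new HashSet<>(Arrays.asList("noun:m")));

    System.out.println(matches("disambig:noun:m", extraTags));
    System.out.println(matches("semantic:noun:m", extraTags));
  }
}
```

Whether the attribute should also allow regular expressions, as postag does with postag_regexp, is left open here.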

dnaber · May 20, 2020, 7:06pm

It might also be useful for debugging e.g. the German compound splitter.

Ruud_Baars · May 21, 2020, 6:26am

I second this! Could be very useful for all kinds of word markings!

arysin · May 21, 2020, 6:59pm

I think we could add a field and start using it in Java code first, and then later extend it to grammar.xml/disambiguation.xml

The question is what type of field it should be:

A plain string, String extraTag: this has the benefit of being similar to posTag and using the same mechanisms.
Set<String>: unlike posTag, these tags may not be related at all, so this has the benefit of handling tags more independently.
Map<String,Set<String>>: category to set of tags. This allows separating tag sets into different categories, e.g. “dynamicTagging”, “disambig”, “semantic” etc. will each have a different set of tags.

The 3rd is IMHO the most properly designed and scalable, but it may require bigger changes to XML handling.
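
As a rough illustration of the third option, a minimal sketch could look like the following. TokenWithExtraTags is a simplified stand-in and not the real AnalyzedToken, the method names are invented for this example, and the tag values are made up; only the category names come from the list above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for AnalyzedToken.
class TokenWithExtraTags {
  private final String token;
  private final String posTag;
  // category -> set of tags; each category keeps its own independent tag set
  private final Map<String, Set<String>> extraTags = new LinkedHashMap<>();

  TokenWithExtraTags(String token, String posTag) {
    this.token = token;
    this.posTag = posTag;
  }

  String getToken() {
    return token;
  }

  String getPOSTag() {
    return posTag;
  }

  // Adds a tag under the given category, creating the category on first use.
  void addExtraTag(String category, String tag) {
    extraTags.computeIfAbsent(category, k -> new LinkedHashSet<>()).add(tag);
  }

  // Returns a read-only view of the tags in a category (empty if unknown).
  Set<String> getExtraTags(String category) {
    Set<String> tags = extraTags.get(category);
    if (tags == null) {
      return Collections.emptySet();
    }
    return Collections.unmodifiableSet(tags);
  }

  boolean hasExtraTag(String category, String tag) {
    return getExtraTags(category).contains(tag);
  }

  public static void main(String[] args) {
    TokenWithExtraTags t = new TokenWithExtraTags("word", "noun");
    t.addExtraTag("disambig", "noun:m");              // illustrative tag value
    t.addExtraTag("dynamicTagging", "no-dictionary"); // illustrative tag value
    System.out.println(t.getToken() + "/" + t.getPOSTag());
    System.out.println(t.hasExtraTag("disambig", "noun:m"));
    System.out.println(t.getExtraTags("semantic"));
  }
}
```

Keeping the categories separate means a rule or disambiguator only has to look at the tag set it cares about, at the cost of a slightly larger API than a single string.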

dnaber · May 22, 2020, 7:14am

I agree. One could think about using Map<String,List<String>> so there’s order information in the values.

arysin · May 25, 2020, 6:29pm

Good suggestion. Shall we create a branch for this or just add a field in master (maybe marked experimental) and start working with it?

dnaber · May 26, 2020, 7:29am

Yes, I think this is a situation where a branch makes sense.