Extra tags for AnalyzedToken

I feel we could improve AnalyzedToken by allowing extra tags that are not POS tags.
I currently put some tags in the POS tag dictionary that are helpful but are not pure POS tags. If we had extra tags, we could put such information there.

The idea is to add a field AnalyzedToken.extraTags, e.g.:
private Map<String,String> extraTags;
The key here is the category and the value is the set of tags, e.g. “pre-disambig: noun:m:…” or "
This could be used in several ways:

  1. The disambiguator could put the original (pre-disambig) tags there (if needed).
  2. The tagger could put additional tags there so that the disambiguator/rules could use them later.
    The tagger could use that field for dynamic properties (e.g. the token was tagged dynamically, without the dictionary), or users could create additional dictionaries.
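
To make the first use case concrete, here is a minimal sketch of how such a field might look. TaggedToken is a hypothetical stand-in, not the real AnalyzedToken, and the method names and the "pre-disambig" category key are assumptions made for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AnalyzedToken; not the real LanguageTool class.
class TaggedToken {
  private final String token;
  private String posTag;
  // category -> tags, e.g. "pre-disambig" -> the tag the token had before disambiguation
  private final Map<String, String> extraTags = new HashMap<>();

  TaggedToken(String token, String posTag) {
    this.token = token;
    this.posTag = posTag;
  }

  String getToken() { return token; }

  String getPosTag() { return posTag; }

  void setExtraTag(String category, String tags) { extraTags.put(category, tags); }

  // Returns null if nothing is stored under the category.
  String getExtraTag(String category) { return extraTags.get(category); }

  Map<String, String> getExtraTags() { return Collections.unmodifiableMap(extraTags); }

  // What a disambiguator might do: keep the original tag before replacing it.
  // putIfAbsent means a second disambiguation pass does not overwrite the original.
  void disambiguateTo(String newPosTag) {
    extraTags.putIfAbsent("pre-disambig", posTag);
    posTag = newPosTag;
  }
}
```

A tagger could use setExtraTag in the same way for dynamic properties, under whatever category name is agreed on.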

It could then be used in the rules with something like this:
<token postag="..." extra_tag="category1:tag1">
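
On the Java side, matching such an attribute value against the map could be sketched roughly as follows. ExtraTagSpec is a hypothetical helper, and the "category:tag" format is only the syntax proposed here, not an existing grammar.xml feature. One detail worth noting: since tags may themselves contain colons ("noun:m:…"), the sketch splits at the first colon only.

```java
import java.util.Map;

// Hypothetical helper for an extra_tag="category:tag" attribute value.
final class ExtraTagSpec {
  final String category;
  final String tag;

  private ExtraTagSpec(String category, String tag) {
    this.category = category;
    this.tag = tag;
  }

  // Split at the first colon only, because the tag part may contain colons itself.
  static ExtraTagSpec parse(String spec) {
    int idx = spec.indexOf(':');
    if (idx <= 0 || idx == spec.length() - 1) {
      throw new IllegalArgumentException("Expected category:tag, got: " + spec);
    }
    return new ExtraTagSpec(spec.substring(0, idx), spec.substring(idx + 1));
  }

  // True if the token's extra tags hold exactly this tag under this category.
  boolean matches(Map<String, String> extraTags) {
    return tag.equals(extraTags.get(category));
  }
}
```

This only does exact matching; regular expressions, as supported for postag, would need more work.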

It might also be useful for debugging e.g. the German compound splitter.

I second this! Could be very useful for all kinds of word markings!

I think we could add a field and start using it in Java code first, and then later extend it to grammar.xml/disambiguation.xml

The question is what type the field should be:

  1. Plain string, String extraTag - this has the benefit of being similar to posTag and using the same mechanisms
  2. Set<String> (as, unlike posTag, these tags may not be related at all) - this has the benefit of handling tags more independently
  3. Map<String,Set<String>>, category to set of tags - this allows separating tags into different categories, e.g. “dynamicTagging”, “disambig”, “semantic” etc. would each have their own set of tags

The 3rd is IMHO the best designed and most scalable, but it may require bigger changes to the XML handling.

I agree. One could think about using Map<String,List<String>> so there’s order information in the values.
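
For illustration, the category-to-tags variant could be sketched like this. ExtraTags and its method names are hypothetical. Note that the sketch uses a LinkedHashSet rather than the List suggested here: that is a swapped-in alternative which keeps insertion order while still rejecting duplicate tags, and may or may not be what is wanted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical holder for extra tags grouped by category.
class ExtraTags {
  private final Map<String, Set<String>> tagsByCategory = new HashMap<>();

  // Creates the category on first use; LinkedHashSet preserves insertion order.
  void add(String category, String tag) {
    tagsByCategory.computeIfAbsent(category, k -> new LinkedHashSet<>()).add(tag);
  }

  boolean has(String category, String tag) {
    return tagsByCategory.getOrDefault(category, Collections.emptySet()).contains(tag);
  }

  // Read-only view; an unknown category yields an empty set rather than null.
  Set<String> get(String category) {
    return Collections.unmodifiableSet(
        tagsByCategory.getOrDefault(category, Collections.emptySet()));
  }
}
```

Categories such as “dynamicTagging”, “disambig” and “semantic” would then simply be different keys in the same map.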

Good suggestion. Shall we create a branch for this or just add a field in master (maybe marked experimental) and start working with it?

Yes, I think this is a situation where a branch makes sense.