For some languages, such as Russian, Portuguese and French, multiword chunking is implemented. The file multiwords.txt is used to describe the multiword chunks. Phrases included in multiwords.txt are correctly recognized in the disambiguator log. How do I correctly use the multiword tags from multiwords.txt in grammar.xml and disambiguation.xml?
We use it in Ukrainian, at least once in grammar-grammar.xml (look for
and a bit in Java rules, mostly for ignoring some phrases. The problem I found is that when a phrase has more than two words, the middle words do not get any multiword tags; only the first and last words do.
I also think this is a bit annoying. Perhaps we could add some options to the multiword tagger, for example to remove all other tags in the multiword, or to tag the middle words in some way.
I’ve created https://github.com/languagetool-org/languagetool/issues/875 and assigned it to myself.
I’ll try to tackle that when I have some free time, but if somebody else can do it sooner, that’s fine with me.
I looked into improving MultiwordChunker, but I don’t know how all the languages are using this chunker, so it was safer to create another class, MultiwordChunker2, where I implemented the requested features:
- tagging all tokens, not just the first and last
- allowing the tag format to be changed (by default all tokens get )
- allowing all other readings to be removed
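
To make the three options above concrete, here is a minimal, self-contained Python sketch. It is not the LanguageTool code (which is Java) and does not use its API: the `Token` class, the `tag_multiword` function, its parameter names, and the default tag format are all hypothetical, chosen only for illustration. Setting `tag_all=False` reproduces the older behaviour discussed earlier, where only the first and last words are tagged.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A token with its list of readings (hypothetical structure)."""
    text: str
    readings: list = field(default_factory=list)


def tag_multiword(tokens, phrase, tag, tag_all=True,
                  tag_format="{tag}", remove_other_readings=False):
    """Tag every occurrence of `phrase` (a list of words) in `tokens`.

    tag_all=False tags only the first and last token of a match.
    tag_format is an illustrative default, not the real one.
    """
    n = len(phrase)
    if n == 0:
        return tokens
    words = [t.text.lower() for t in tokens]
    target = [w.lower() for w in phrase]
    i = 0
    while i + n <= len(tokens):
        if words[i:i + n] != target:
            i += 1
            continue
        for offset in range(n):
            is_edge = offset in (0, n - 1)
            if not (tag_all or is_edge):
                continue  # middle tokens stay untagged in this mode
            token = tokens[i + offset]
            if remove_other_readings:
                token.readings.clear()
            token.readings.append(tag_format.format(tag=tag))
        i += n  # continue after the matched phrase
    return tokens


if __name__ == "__main__":
    sentence = [Token("as", ["IN"]), Token("well", ["RB"]),
                Token("as", ["IN"]), Token("me", ["PRP"])]
    tag_multiword(sentence, ["as", "well", "as"], "CONJ",
                  remove_other_readings=True)
    for token in sentence:
        print(token.text, token.readings)
```

With `remove_other_readings=True`, the three matched tokens keep only the multiword reading, while the token outside the phrase is left untouched.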
I’ve also added unit tests for both chunker classes. We could merge those two together at some point if there’s agreement on this.
Jaume, please try it out and let me know how it works for you.
Sorry, I missed this message.
I have tried the new class MultiwordChunker2. For me, there is something missing. It seems that you removed the code for dealing with contractions (two tokens that are not separated by whitespace, e.g. “dels = de + els”). This was very useful to avoid errors.
Could we recover that part of the code?
It would also be preferable to avoid code duplication and to have only one version of MultiwordChunker (with options).
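
The contraction case mentioned above (two tokens with no whitespace between them) can be sketched as follows. This is only a guess at the general idea, not a reconstruction of the removed LanguageTool code: the `Tok` class, the `space_before` flag, and both functions are hypothetical. The example splits a word into pieces that concatenate back to the original spelling; a real tokenizer may analyse contractions differently.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """A token plus a flag for preceding whitespace (hypothetical)."""
    text: str
    space_before: bool = True  # False when glued to the previous token


def surface(tokens):
    """Rebuild the surface string, adding a space only where one existed."""
    parts = []
    for index, tok in enumerate(tokens):
        if index > 0 and tok.space_before:
            parts.append(" ")
        parts.append(tok.text)
    return "".join(parts)


def find_multiword(tokens, phrase):
    """Return (start, end) of the first token span whose surface form
    equals `phrase` (case-insensitive), or None if there is no match."""
    wanted = phrase.lower()
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if surface(tokens[start:end]).lower() == wanted:
                return start, end
    return None


if __name__ == "__main__":
    # "cannot" split into two glued tokens, purely for illustration.
    sentence = [Tok("I"), Tok("can"), Tok("not", space_before=False),
                Tok("wait")]
    print(find_multiword(sentence, "cannot wait"))
    print(find_multiword(sentence, "can not"))
```

Because the match is made against the rebuilt surface string, a phrase list entry written with the contracted form can still be found after the tokenizer has split it, while the spaced-out variant does not match.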