How to use tags from multiwords.txt?

Yakov · January 6, 2018, 3:37pm

Multiword chunking is implemented for some languages, such as Russian, Portuguese, and French. The file multiwords.txt describes the multiword chunks. Phrases included in multiwords.txt are correctly recognized in the disambiguator log. How do I correctly use the multiword tags from multiwords.txt in grammar.xml and disambiguation.xml?

arysin · January 7, 2018, 8:44pm

We use it in Ukrainian, at least once in grammar-grammar.xml (look for postag="<(adv|insert)>")
and a bit in Java rules, mostly for ignoring some phrases. The problem I found is that when a phrase has more than two words, the middle words do not get any multiword tags; only the first and last words do.
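
For readers unfamiliar with the syntax, here is a minimal sketch of how multiword tags might be matched in grammar.xml. It assumes the behavior described above (a tag on the first word and one on the last word, nothing in between). The rule id, name, and message are made-up placeholders; only the opening postag value is taken from the Ukrainian rule, and the closing-tag form is assumed by analogy with the Portuguese rules shown later in this thread. The angle brackets have to be escaped inside an XML attribute:

```xml
<!-- Hypothetical rule: id, name, and message are placeholders, not real LT rules. -->
<rule id="EXAMPLE_TWO_WORD_MULTIWORD" name="Example: match a two-word multiword">
  <pattern>
    <!-- First word of a phrase tagged adv or insert in multiwords.txt -->
    <token postag="&lt;(adv|insert)&gt;" postag_regexp="yes"/>
    <!-- Last word of the phrase (closing tag; assumed form) -->
    <token postag="&lt;/(adv|insert)&gt;" postag_regexp="yes"/>
  </pattern>
  <message>Placeholder message.</message>
</rule>
```

With a three-word phrase, the middle word would carry neither tag, so this pattern would not match it. That is the limitation described above.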

jaumeortola · January 8, 2018, 8:48am

I also think this is a bit annoying. Perhaps we could add some options to the multiword tagger, for example to remove all other tags in the multiword or to tag the middle words in some way.

Yakov · January 8, 2018, 4:04pm

Thanks!

arysin · January 8, 2018, 9:46pm

I’ve created issue #875, “Provide consistent multiword tagging for middle words”, in languagetool-org/languagetool on GitHub and assigned it to myself.
I’ll try to tackle it when I have some free time, but if somebody else can do it sooner, that’s OK with me.

arysin · February 8, 2018, 11:17pm

I looked into improving MultiwordChunker, but I don’t know how all the languages are using this chunker, so it was safer to create another class, MultiwordChunker2, where I implemented the requested features:

tagging all tokens, not just first and last
allow to modify how tag is formatted (by default all tokens get )
allow to remove all other readings
I’ve also added unit tests for both chunker classes. We could merge those two together at some point if there’s agreement on this.

Jaume, please try it out and let me know how it works for you.

jaumeortola · June 6, 2018, 3:16pm

Sorry, I missed this message.

I have tried the new class MultiwordChunker2. For me there is something missing. It seems that you removed the code for dealing with contractions (two tokens that are not separated by a white space; e.g. “dels = de + els”). This was very useful to avoid errors.

Could we recover that part of the code?

It would also be preferable to avoid code duplication and to have only one version of MultiwordChunker (with options).

arysin · October 3, 2018, 2:37am

Apologies, I had some issues with mail filtering and never saw this message in my mailbox; besides, I have had very little free time to dedicate to LT lately.
Can you please provide some examples for how contractions are supposed to work in MultiwordChunker?
I looked briefly and found “dels” in CatalanWordTokenizerTest.java and it looks like it should be split by tokenizer so I’d assume MultiwordChunker should not care much.

tiagosantos · October 3, 2018, 3:13pm

This one also slipped by me, but I was really busy at the time.

@Yakov
I have used multiwords as a faster and simpler way of disambiguating words, in the same fashion as in the Catalan module.
It is very verbose, and each time you introduce a new type of postag you have to add a rule in disambiguation.xml, but it works.
E.g., for two-word multiwords tagged with NPMS000_ and NPMP000_, two rules should be created like this:

<rule id="NOMEPROPRIO2PALAVRAS_FILTRA_MS" name="Nome proprio 2 palavras - MS">
  <pattern>
    <token postag="&lt;NPMS.+&gt;" postag_regexp="yes"/>
    <token postag="&lt;/NP.+&gt;" postag_regexp="yes"/>
  </pattern>
  <disambig action="replace"><wd pos="NPMS000_"/><wd pos="NPMS000_"/></disambig>
</rule>
<rule id="NOMEPROPRIO2PALAVRAS_FILTRA_MP" name="Nome proprio 2 palavras - MP">
  <pattern>
    <token postag="&lt;NPMP.+&gt;" postag_regexp="yes"/>
    <token postag="&lt;/NP.+&gt;" postag_regexp="yes"/>
  </pattern>
  <disambig action="replace"><wd pos="NPMP000_"/><wd pos="NPMP000_"/></disambig>
</rule>

The main caveat of this method is that you have to add more rules for derived forms (e.g. masculine plural, feminine singular, etc.) and for longer phrases, such as NPMP000_ phrases with three or four words.
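
As an illustration of the longer-phrase case, a three-word variant could be sketched roughly as follows. This is an untested extrapolation of the two-word rules above, not a rule taken from the Portuguese module: the rule id and name are hypothetical, and it assumes the original MultiwordChunker behavior discussed earlier in the thread, where the middle word carries no multiword tag and therefore has to be matched by an unconstrained token.

```xml
<!-- Hypothetical three-word rule, modeled on the two-word rules above. -->
<rule id="NOMEPROPRIO3PALAVRAS_FILTRA_MP" name="Nome proprio 3 palavras - MP">
  <pattern>
    <!-- First word: carries the opening multiword tag -->
    <token postag="&lt;NPMP.+&gt;" postag_regexp="yes"/>
    <!-- Middle word: assumed to have no multiword tag, so any token is accepted -->
    <token/>
    <!-- Last word: carries the closing multiword tag -->
    <token postag="&lt;/NP.+&gt;" postag_regexp="yes"/>
  </pattern>
  <disambig action="replace"><wd pos="NPMP000_"/><wd pos="NPMP000_"/><wd pos="NPMP000_"/></disambig>
</rule>
```

Because the middle token is unconstrained, a rule like this could in principle also fire on three words that are not one multiword, which is one reason consistent tagging of middle words would help.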