[en] Multiwords.txt for English

Mike_Unwalla · November 12, 2019, 8:36am

Multiwords.txt seems to be used to apply a postag to a multi-word term. Many languages use multiwords.txt. English does not have multiwords.txt.

Is the purpose of multiwords.txt to apply a postag to a multi-word term?
Why doesn’t English have multiwords.txt?
If there is no good reason, could one of the devs please make multiwords.txt available for English?

dnaber · November 12, 2019, 4:25pm

I’ve added and activated en/multiwords.txt but not tested it, please give it a try.

Mike_Unwalla · November 13, 2019, 9:24am

@dnaber, thank you.

Initial tests show that the file/rule does exactly what I want.

Now I have a related question to you and the team about disambiguation:

My test term is The Fat Cat.

The multiword chunker removes DT from The and all the readings except NNP from Cat. I guess it also removes readings from Fat, but I cannot see that.

Disambiguation JJ_NN_JJ applies JJ to Fat. For test purposes, I thought to put an antipattern on the rule to prevent the rule from changing a multi-word proper noun. But, the rule has this example:
<example type="ambiguous" inputform="Canadian[Canadian/JJ,Canadian/NNP]" outputform="Canadian[Canadian/JJ]">The <marker>Canadian</marker> Badlands is nice.</example>

As it happens, Badlands is NNP, thus an exception for a sequence of 2 proper nouns would mean that in this context, Canadian is not disambiguated as JJ. I think that it should not be JJ, but rather, the correct disambiguation is to make Canadian Badlands a proper noun. (Rule NNP_NNS_VBZ_NNP applies NNP to Badlands.)

If a term is a multi-word proper noun, should a disambiguator rule that changes only a single token change the NNP postag?

I don’t plan to change any rules at this stage, but I would like comments/suggestions from the team about what the correct analysis is.

Mike_Unwalla · November 13, 2019, 9:46am

@dnaber,

I added this line:
Taj Mahal NNP # my comment

Neither testrules nor the Maven tests give a warning about the comment. Thus, I guess that multiwords.txt can safely contain a comment on the same line as the term. Can you please confirm that it is OK to add a comment on the same line?

dnaber · November 13, 2019, 9:53am

Looking at the code, a comment at the end of the line would not work. If you need that, let me know.

Mike_Unwalla · November 13, 2019, 9:56am

The ability to add a comment on the same line would be really nice. I could then very easily sort lines alphabetically.

dnaber · November 13, 2019, 12:11pm

This is implemented now.
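
For illustration, the effect of allowing a trailing comment can be sketched in Python. This is not LanguageTool's Java code: the function name `parse_multiword_line` is hypothetical, and the sketch assumes that the term and the postag are separated by a tab and that `#` starts a comment.

```python
def parse_multiword_line(line):
    """Return (term, postag) for one line, or None for blank/comment-only lines."""
    # Assumption: everything from the first '#' to the end of the line is a comment.
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    # Assumption: the term and the postag are separated by a tab.
    term, sep, postag = content.rpartition("\t")
    if not sep:
        raise ValueError(f"expected 'term<TAB>postag', got: {line!r}")
    return term.strip(), postag.strip()


print(parse_multiword_line("Taj Mahal\tNNP # my comment"))
```

With comments stripped in this way, each entry can carry its own note and the file can still be sorted alphabetically by term.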

Mike_Unwalla · November 13, 2019, 3:26pm

Great, thank you. I will start to move NNP data from disambiguation to multiwords.txt tomorrow.

Mike_Unwalla · November 14, 2019, 4:03pm

@dnaber, I removed Yom Kippur (NNP) from disambiguation.xml and added it to multiwords.txt. This is the tagger result:

In the Token column, Yom and Kippur do not have postags NNP.

Is this correct?
If yes, how do I use the NNP postag that MULTIWORD_CHUNKER shows in the Disambiguator Log column?

dnaber · November 14, 2019, 4:46pm

I don’t know, I have just activated multiwords.txt the same way (I think) it works for other languages. Maybe @jaumeortola can comment - ca/multiwords.txt seems to be well-maintained.

jaumeortola · November 21, 2019, 9:47pm

Sorry for the delayed response.

I don’t see all the expected tags in your results.

On the command line, I get:

<S> Yom[Yom Kippur/<NNP>,B-NP-singular] Kippur[</S>Yom Kippur/</NNP>,E-NP-singular]<P/> 
Disambiguator log: 
MULTIWORD_CHUNKER: Yom[Yom/null*,B-NP-singular] -> Yom[Yom Kippur/<NNP>*]
MULTIWORD_CHUNKER: Kippur[Kippur/SENT_END,E-NP-singular] -> Kippur[Kippur/SENT_END,Yom Kippur/</NNP>]

In a multiword of 3 or more tokens, you will get tags only on the first token and the last one; the tokens in between get no multiword tag.

To match these tags you need patterns like <token postag=".*NNP.*" postag_regexp="yes"/> or similar.
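
The tag layout described here can be sketched in Python. This is only an illustration, not the MultiWordChunker code: `chunk_tags` and `matches_postag` are hypothetical helpers, and the sketch assumes that the first token gets `<TAG>`, the last token gets `</TAG>`, and a postag regular expression must match the whole tag.

```python
import re


def chunk_tags(tokens, postag):
    """Give the first token <postag>, the last token </postag>, the rest no tag."""
    if len(tokens) < 2:
        raise ValueError("a multiword needs at least two tokens")
    tags = [None] * len(tokens)
    tags[0] = f"<{postag}>"
    tags[-1] = f"</{postag}>"
    return list(zip(tokens, tags))


def matches_postag(tag, pattern=r".*NNP.*"):
    # Assumption: the regular expression must match the whole tag.
    return tag is not None and re.fullmatch(pattern, tag) is not None


for token, tag in chunk_tags(["The", "Fat", "Cat"], "NNP"):
    print(token, tag, matches_postag(tag))
```

In this sketch, `.*NNP.*` matches the tags on The and Cat, but Fat has no multiword tag to match, which is why a rule that looks only at Fat cannot see that it is part of a proper noun.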

We could add options to the MultiWordChunker to get other tags. Instead of <NNP> </NNP>, you could get just the tags NNP NNP, even on the in-between tokens in multiwords of 3 or more tokens. We could also optionally remove all other tags (the tags that individual tokens already have if they are in the tagger dictionary). These options are not yet implemented.

An existing useful option is to ignore spelling errors in all tagged words. That’s one of the main functions of the multiword list in Catalan.

Mike_Unwalla · November 22, 2019, 9:21am

@jaumeortola, thanks.

Your command-line output shows the result that I expect.

So, I think that the Tagger Result shows incorrect information. If you agree, let me know, and I will open an issue (or you can do it).

My idea was to use multiwords.txt to apply NNP postags to multi-word proper nouns, because the current method of adding the proper nouns to disambiguation is cumbersome. (Also, currently, the rules are at the end of disambiguation.xml. Probably, if they remain in disambiguation, they should be at the top of the document. The postags for 1-word terms are available to the disambiguator, but postags for multi-word terms are not available.)

jaumeortola · November 23, 2019, 9:02am

With these results you will need some extra rules for ignoring spelling. Something like this (not tested):

<rule>
    <pattern>
        <token postag="&lt;NNP&gt;"/>
        <token min="0" max="5"/>
        <token postag="&lt;/NNP&gt;"/>
    </pattern>
    <disambig action="ignore_spelling"/>
</rule>
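
What this pattern is meant to select can be sketched in Python. This is a simplified illustration, not how LanguageTool evaluates patterns: `spans_to_ignore` is a hypothetical helper that works on a plain list of tags, with `None` for tokens that have no multiword tag.

```python
def spans_to_ignore(tags, max_gap=5):
    """Return inclusive (start, end) index pairs that the pattern would cover."""
    spans = []
    for start, tag in enumerate(tags):
        if tag != "<NNP>":
            continue
        # Look for the closing tag with at most max_gap tokens in between.
        last = min(start + max_gap + 1, len(tags) - 1)
        for end in range(start + 1, last + 1):
            if tags[end] == "</NNP>":
                spans.append((start, end))
                break
    return spans


tags = [None, "<NNP>", None, "</NNP>", None]
print(spans_to_ignore(tags))
```

Every token inside a returned span, including the untagged tokens in the middle, would be treated as correctly spelled.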

Then is it just a visualization issue in the stand-alone program?

Mike_Unwalla · November 25, 2019, 8:25am

@jaumeortola, thanks for the example rule.

I guess so. I don’t know; I have not done tests.

Mike_Unwalla · June 9, 2020, 9:56am

Bug. Refer to GitHub issue #3044 in languagetool-org/languagetool: “[en] multiwords.txt does not work correctly”.

Ruud_Baars · June 9, 2020, 1:16pm

How do you achieve that? Multiwords can also be added to spelling.txt, but that seems to have a bad influence on performance if the list is too long. How about this solution for a very long list?

jaumeortola · June 9, 2020, 8:02pm

Do you mean “ignore all tagged words”? It is an option of the MorfologikSpellerRule. In Dutch, it is disabled now. If you want, I can enable it for you.

This issue about better spelling suggestions is somewhat related, but it is not the same thing.

Ruud_Baars · June 10, 2020, 7:00am

It is only of use if adding word groups to multiwords.txt has no significant effect on performance, and since that is a text file, like spelling.txt, I am not too sure about that.
And I am not sure about ignoring all tagged words either.
So thanks, but no thanks, for now.