Content of multiwords.txt file

puramoca021 · January 8, 2018, 8:26pm

What should the content of the file “multiwords.txt” be?

If a language has cases (7 in Serbian), must I list each multiword in all seven cases so that the disambiguator can handle them properly? Or is there a different approach?

Thanks for the help.

puramoca021 · January 11, 2018, 1:59pm

Is anyone willing to help?

jaumeortola · January 11, 2018, 2:19pm

If you want to use multiwords.txt, the answer is yes, you must write 7 lines, one for each case.

Alternatively, you can write a rule in disambiguation.xml. Depending on what you want, you could do it with one or two rules.

Give us an example, and we can try to write the rules here.

puramoca021 · January 11, 2018, 7:57pm

Thank you @jaumeortola for replying. Here is an example:

In Serbian, the word „црвена“ is an adjective meaning “red”, and the word „звезда“ is a common noun meaning “star”. Together, however, they form the personal noun „Црвена Звезда“ (“Red Star”), a football club in Belgrade.

I want „Црвена Звезда“ to be properly tagged as a personal noun in all seven cases of Serbian. Hence my question: must I write something like this in multiwords.txt:

Црвена Звезда personal_noun_nominative_tag
Црвене Звезде personal_noun_genitive_tag
…
Црвеној Звезди personal_noun_locative_tag
?
Thanks again for the help.

jaumeortola · January 11, 2018, 10:43pm

Yes, this seems to be the best solution. Using disambiguation.xml wouldn’t be a good solution.
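If typing every case by hand gets tedious, the lines could be generated from a small table of forms. Here is a minimal sketch in Python. It is only an illustration: `FORMS`, `multiword_lines` and the tag template are hypothetical names, the tags are the placeholders from the post above rather than real tagset entries, and the tab between phrase and tag is an assumption about the file format that should be checked against an existing multiwords.txt.

```python
# Hypothetical helper: builds multiwords.txt lines from a table of inflected forms.
# The tag names are the placeholders from the post above, not real tagset entries.

FORMS = {
    "nominative": "Црвена Звезда",
    "genitive": "Црвене Звезде",
    "locative": "Црвеној Звезди",
    # the remaining cases would be added here in the same way
}


def multiword_lines(forms, tag_template="personal_noun_{case}_tag", sep="\t"):
    """Return one 'phrase<sep>tag' line per case, in insertion order."""
    lines = []
    for case, phrase in forms.items():
        # Collapse any repeated whitespace so each phrase uses single spaces.
        phrase = " ".join(phrase.split())
        lines.append(f"{phrase}{sep}{tag_template.format(case=case)}")
    return lines


if __name__ == "__main__":
    # Print the generated lines; redirect the output into a file to keep it.
    print("\n".join(multiword_lines(FORMS)))
```

The forms still have to be supplied by hand for each expression. The script only saves the repetitive part of writing out the tags and keeps the separator consistent.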

If you need to do this frequently with a lot of expressions, I would consider writing a specialized tagger method in Java.