Hiphenated Compound Extractor 1.0 - EN-GB Example

tiagosantos · November 16, 2016, 6:51pm

This is a spreadsheet that explains how to extract all hyphenated words from a Hunspell dictionary so that they can be used by each language's compound rule, usually through the compounds.txt file read by [Language]CompoundRule.java.
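For anyone who prefers a script to a spreadsheet, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not what the spreadsheet does internally. It assumes the usual Hunspell .dic layout (an entry count on the first line, then one word per line, optionally followed by `/FLAGS` and tab-separated morphological fields); the function name `extract_hyphenated` is made up for this example.

```python
def extract_hyphenated(dic_lines):
    """Return the sorted, unique hyphenated entries found in Hunspell .dic lines."""
    words = set()
    for index, raw in enumerate(dic_lines):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if index == 0 and line.isdigit():
            continue  # the first line of a .dic file is the entry count
        # Drop affix flags after "/" and any fields after the first whitespace.
        parts = line.split("/", 1)[0].split()
        if not parts:
            continue
        word = parts[0]
        # Keep only words with a hyphen strictly inside them (not "-ish" or "pre-").
        if "-" in word[1:-1] and not word.startswith("-") and not word.endswith("-"):
            words.add(word)
    return sorted(words)


# Tiny made-up sample in .dic style; a real run would pass the lines of the
# dictionary file instead (file name and encoding depend on the dictionary).
sample = ["5", "air-cooled/M", "airport/S", "cow-lick", "well-known\tpo:adj", "-ish"]
print(extract_hyphenated(sample))
```

The result is only a candidate list; each term still has to be reviewed before it goes into compounds.txt.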

@Mike_Unwalla
You may want to check it. The list is ready for your language. Cheers!

Hiphenated Compound Extractor 1.0 - EN-GB Example.ods (92.8 KB)

Mike_Unwalla · November 17, 2016, 10:37am

Thanks. Added to my task list.

Mike_Unwalla · December 14, 2016, 1:52pm

Done (https://github.com/languagetool-org/languagetool/commit/ea3c64ba3fbd46b884456c71b079370a584ca7ff). Thanks @tiagosantos.

For English, CompoundRule.java “Checks that compounds (if in the list) are not written as separate words.” The list of terms is in compounds.txt.

Many of the terms in ‘Hiphenated Compound Extractor 1.0 - EN-GB Example.ods’ can be separate. For example, ‘air-cooled’ as an adjective is correct. But, as a noun (air) and a verb (cooled), the words are separate: “After the air cooled, moisture began to form on the sides of the vessel”. I did not add such terms to compounds.txt.
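To illustrate why such terms cause false positives, here is a deliberately naive Python sketch of a "compound written as separate words" check. It is not LanguageTool's implementation; the function name `find_split_compounds` and the simple tokenizer are assumptions made for this example, and it only handles two-part compounds.

```python
import re


def find_split_compounds(text, compounds):
    """Return listed hyphenated compounds that occur in `text` as two separate words."""
    # Tokens are runs of letters; an already hyphenated word stays one token.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower())
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        candidate = f"{first}-{second}"
        if candidate in compounds:
            hits.append(candidate)
    return hits


compounds = {"air-cooled"}
# Correct sentence (noun + verb), yet the naive check flags it:
print(find_split_compounds("After the air cooled, moisture began to form.", compounds))
# Already hyphenated adjective, nothing to flag:
print(find_split_compounds("An air-cooled engine.", compounds))
```

The first sentence is flagged even though it is correct, which is the kind of warning that leaving ‘air-cooled’ out of the list avoids.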

compounds.txt is applicable to all varieties of English. Thus, I did not add terms such as ‘aero-engine’, which can be spelled as ‘aero engine’ in AmE (and possibly other varieties of English).

For the words up to ‘cow-lick’, I often looked at the NOW corpus (English-Corpora: NOW). But, the process was very slow (the best part of 2 days). Thereafter, I did not check the terms as carefully. If I found one counter-example where the term could be used with a space, I did not add the term.
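The "one counter-example is enough" criterion could be approximated by a first-pass filter like the Python sketch below. It is an assumption-laden illustration, not the manual process described above: `keep_unambiguous` is a hypothetical name, the sentences are a stand-in for whatever corpus sample is available, and ‘foo-bar’ is a placeholder term. A script cannot tell a legitimate spaced use from a misspelling in the corpus, so the output still needs a human check.

```python
import re


def keep_unambiguous(candidates, sentences):
    """Keep only candidates whose spaced form never appears in `sentences`."""
    kept = []
    for term in candidates:
        # "air-cooled" -> pattern matching "air cooled" as whole words.
        spaced = r"\s+".join(re.escape(part) for part in term.split("-"))
        pattern = re.compile(rf"\b{spaced}\b", re.IGNORECASE)
        if not any(pattern.search(sentence) for sentence in sentences):
            kept.append(term)
    return kept


# Illustrative data only; "foo-bar" is a placeholder, not a real dictionary entry.
sentences = [
    "After the air cooled, moisture began to form on the sides of the vessel.",
    "The aero engine was replaced.",
]
print(keep_unambiguous(["air-cooled", "aero-engine", "foo-bar"], sentences))
```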

Hiphenated Compound Extractor 1.0 - EN-GB Example-mfu-comments.ods (85.5 KB)

The file ‘Hiphenated Compound Extractor 1.0 - EN-GB Example-mfu-comments.ods’ shows the terms from Column C from ‘Hiphenated Compound Extractor 1.0 - EN-GB Example.ods’, the terms that I put in compounds.txt, and a comment or a counter-example for each term that I did not put in compounds.txt.

I expect that there will be some false-positive warnings in the regression test, and I will correct those errors tomorrow.

Aside. This task took a long time, and so I am unlikely to devote more time to LT until the new year.

tiagosantos · December 14, 2016, 4:56pm

Many thanks for adding this. For a non-native speaker like me, this is a great help.

I know how difficult it is to triage the compounds properly, since I am doing the same. For me it is easier, since I accept some false positives in exchange for better detection rates, so, for now, I guide my corrections by my own documents and the regression tests.
My main principle while designing rules is that I find a false negative and a false positive equally bad, so I start with a ‘greedy’ strategy and polish the rules iteratively. Currently, I am focusing more on revising rules than on creating them.

Thank you again for your attention. Best regards.