Ngrams for compounds/space issues

Has anyone ever tried using the ngrams to check how (e.g.) ‘scot free’ compares in frequency with ‘scot-free’ or ‘scotfree’?
In English, words are mostly written apart, so the few deviations are rather easy to detect. But what about compounding languages such as German and Dutch? In Dutch, compounding errors are very common (and it is sometimes hard to decide whether they are right or wrong).
Examples for Dutch:

  • Ik verkoop heren schoenen : I am selling shoes to gents.
  • Ik verkoop heren( )schoenen : I am selling gents’ shoes.
    In the last case the space is wrong.

The ngram statistics could help decide whether it is wise to report the space in a compound as incorrect. The compounding rule is great, but could generate a lot of false positives when used for all compounds. Adding the ngrams as a validation step (when available) could improve the compounding rule, possibly without even having to specify the compounds!

What about this idea?

It’s a good idea, maybe you can try with several pairs? With the dev JAR, the call should look like this (not tested yet):

java -cp languagetool-dev-4.5-SNAPSHOT-shaded.jar org.languagetool.dev.bigdata.NGramLookup "heren schoenen" /path/to/index

vs.

java -cp languagetool-dev-4.5-SNAPSHOT-shaded.jar org.languagetool.dev.bigdata.NGramLookup "herenschoenen" /path/to/index

[heren, schoenen] -> count:91, 1.881152117585028E-7, coverage=1.0, log:-15.486211233316492
totalP=1.881152117585028E-7
[herenschoenen] -> count:656, 1.3433879796232213E-6, coverage=1.0, log:-13.520315791880922
totalP=1.3433879796232213E-6
[heren-schoenen] -> count:0, 2.044730562592422E-9, coverage=0.0, log:-20.007999810365533
totalP=2.044730562592422E-9

[scheeps, arts] -> count:0, 2.044730562592422E-9, coverage=0.0, log:-20.007999810365533
totalP=2.044730562592422E-9
[scheepsarts] -> count:317, 6.502243189043902E-7, coverage=1.0, log:-14.245948427585356
totalP=6.502243189043902E-7
[scheeps-arts] -> count:0, 2.044730562592422E-9, coverage=0.0, log:-20.007999810365533
totalP=2.044730562592422E-9

[pizza, oven] -> count:136, 2.801280870751618E-7, coverage=1.0, log:-15.088018884537407
totalP=2.801280870751618E-7
[pizzaoven] -> count:285, 5.847929409014327E-7, coverage=1.0, log:-14.352007999545679
totalP=5.847929409014327E-7
[pizza-oven] -> count:144, 2.964859315759012E-7, coverage=1.0, log:-15.031266067944957
totalP=2.964859315759012E-7
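
To compare many pairs without reading the output by eye, the count lines could be parsed with a few lines of script. This is only a rough sketch in Python, assuming the output always has the `[tokens] -> count:N, ...` shape shown above; `parse_counts` and `LINE_RE` are names made up for this example and are not part of LanguageTool.

```python
import re

# Matches count lines such as:
#   [heren, schoenen] -> count:91, 1.88E-7, coverage=1.0, log:-15.48
# The "totalP=..." lines do not match and are skipped.
LINE_RE = re.compile(r"^\[(?P<tokens>[^\]]+)\] -> count:(?P<count>\d+),")


def parse_counts(output: str) -> dict[str, int]:
    """Map each looked-up phrase (tokens joined by a space) to its count."""
    counts = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            tokens = [t.strip() for t in match.group("tokens").split(",")]
            counts[" ".join(tokens)] = int(match.group("count"))
    return counts


sample = """\
[heren, schoenen] -> count:91, 1.881152117585028E-7, coverage=1.0, log:-15.486211233316492
totalP=1.881152117585028E-7
[herenschoenen] -> count:656, 1.3433879796232213E-6, coverage=1.0, log:-13.520315791880922
totalP=1.3433879796232213E-6
"""

counts = parse_counts(sample)
print(counts)  # {'heren schoenen': 91, 'herenschoenen': 656}
```

With the counts in a dictionary, the spaced and joined forms of each compound can be compared directly.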

(For Dutch, an optional hyphen is always allowed at the compound boundary when it makes the word more readable. The ratio between the forms with and without the hyphen would be a good trigger for a different warning: ‘The optional hyphen is not commonly used in this word.’)
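
One possible way to turn these counts into the two warnings is a simple ratio test. The sketch below is in Python and is only an illustration: the function name `compound_warnings` and the thresholds `min_ratio` and `min_count` are assumptions picked for this example, not values from LanguageTool, and they would need tuning on real data. A count of 0 is treated as 1 when taking the ratio, since a zero in the index may only mean the form is rare.

```python
def compound_warnings(spaced: int, joined: int, hyphenated: int,
                      min_ratio: float = 3.0, min_count: int = 50) -> list[str]:
    """Return which warnings the ngram counts would support.

    min_ratio and min_count are illustrative assumptions, not tuned values.
    """
    warnings = []
    if joined < min_count:
        # Too little evidence for the joined form: stay silent.
        return warnings
    # Joined form clearly dominates the spaced form: the space is suspicious.
    if joined >= min_ratio * max(spaced, 1):
        warnings.append("space")
    # Joined form clearly dominates the hyphenated form: hyphen is uncommon.
    if joined >= min_ratio * max(hyphenated, 1):
        warnings.append("hyphen")
    return warnings


# Counts taken from the lookups above: (spaced, joined, hyphenated).
examples = {
    "herenschoenen": (91, 656, 0),
    "scheepsarts": (0, 317, 0),
    "pizzaoven": (136, 285, 144),
}

for word, (spaced, joined, hyphenated) in examples.items():
    print(word, compound_warnings(spaced, joined, hyphenated))
# herenschoenen ['space', 'hyphen']
# scheepsarts ['space', 'hyphen']
# pizzaoven []
```

With these example thresholds, ‘pizza oven’ and ‘pizza-oven’ would not be reported, because all three spellings are about equally common; that kind of case is where the false positives would otherwise come from.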

If useful, I could pass all words in compounds.txt through this routine…

I processed a large list of compounds; the results are attached. A count of 0 does not always actually mean zero, since we thresholded the ngram counts.
compound_analyse.ods (69.3 KB)