Has anyone ever tried using the ngrams to check how (e.g.) ‘scot free’ compares in frequency with ‘scot-free’ or ‘scotfree’?
In English, words are mostly written apart, so the few deviations are rather easy to detect. But what about compounding languages such as German and Dutch? In Dutch, compounding errors are very common (and it is sometimes hard to decide whether a given spelling is right or wrong).
Examples for Dutch:
- Ik verkoop heren schoenen : I am selling shoes to gents.
- Ik verkoop heren( )schoenen : I am selling gents’ shoes.
In the second case the space is wrong: the compound should be written as one word (herenschoenen).
The statistics of the ngrams could help to decide if it is statistically wise to trigger a report for the incorrect space in the compound. The compounding rule is great, but could generate a lot of false positives when used for all compounds. Adding the ngrams as a validation (when available) could improve the compounding rule. Possibly without even having to specify the compounds!
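To make the idea a bit more concrete, here is a minimal sketch in Python of how such a check could work. Everything in it is an assumption for illustration: the function name `should_flag_space`, the thresholds, and the counts are invented, and the lookup is a plain dictionary rather than any real ngram index.

```python
# Hypothetical sketch: compare how often two words occur joined (as a
# unigram) versus written apart (as a bigram), and only report the space
# when the joined form clearly dominates.

def should_flag_space(first, second, unigram_counts, bigram_counts,
                      min_ratio=10.0, min_count=50):
    """Return True if 'first second' is probably a wrongly split compound.

    min_ratio and min_count are made-up thresholds that would need tuning.
    """
    joined_count = unigram_counts.get(first + second, 0)
    spaced_count = bigram_counts.get((first, second), 0)

    # Without enough evidence for the joined form, stay silent
    # to avoid false positives.
    if joined_count < min_count:
        return False

    # Flag only when the joined form is much more frequent than the
    # spaced form (treat a zero spaced count as 1 for the comparison).
    return joined_count >= min_ratio * max(spaced_count, 1)


# Invented counts, for illustration only.
unigram_counts = {"herenschoenen": 1200}
bigram_counts = {("heren", "schoenen"): 40, ("rode", "schoenen"): 5000}

flag_compound = should_flag_space("heren", "schoenen",
                                  unigram_counts, bigram_counts)
flag_adjective = should_flag_space("rode", "schoenen",
                                   unigram_counts, bigram_counts)
```

With these invented numbers the first pair would be reported and the second would not, because "rodeschoenen" never occurs as a single word. Note that this only suggests that the joined form is usual; a sentence like the first example above, where the space is intended, would still be flagged.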
What about this idea?