"Cooking" the n-grams

Hello all,

First, thank you for the great tool! Second, I apologize for my lack of knowledge of LT’s inner workings. This suggestion may already be in use.

I have an idea that would allow LT to offer a smaller n-gram dataset, at the expense of user flexibility. I think LT ought to offer a “cooked” n-gram database that contains only the n-grams with at least one of the words of confusion. This reduced dataset should be significantly smaller. It may even allow the “cooked” database to use 4-grams, should the savings be substantial enough. The trade-off, though, is that the user can no longer effectively add new words of confusion.
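To make the idea concrete, here is a rough sketch of how the “cooking” step could work. It assumes the n-grams are available as plain tab-separated “n-gram, count” lines and the confusion words as a flat word list; the real LT data ships as Lucene indexes, so this only illustrates the filtering idea, not the actual format:

```python
# Hypothetical sketch: keep only n-grams that contain at least one word
# from the confusion word list. The file formats here are assumptions,
# not the actual LT/Lucene layout.

def load_confusion_words(path):
    """Read one confusion word per line (assumed flat-file format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def cook_ngrams(ngram_path, out_path, confusion_words):
    """Copy only the n-gram lines containing at least one confusion word."""
    kept = 0
    with open(ngram_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            ngram, _, _count = line.rstrip("\n").partition("\t")
            if any(tok in confusion_words for tok in ngram.lower().split()):
                dst.write(line)
                kept += 1
    return kept
```

Since the vast majority of n-grams contain none of the confusion words, the filtered output should be a small fraction of the original data.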

In addition to the default n-gram database that can already be downloaded, I think LT ought to offer this “cooked” n-gram database as an alternative. That would let people developing new rules continue to use the large uncooked dataset. Everyone else who simply wants a secure way to use the n-gram capabilities would only need the smaller “cooked” database, and those who do not need a secure means may simply use the form on the LT website.

Hi,

thanks for your idea. I thought about this before, but I think the n-gram database would still be so large that we cannot include it in the default download. Also, we often add confusion words, so we’d need to re-build the database for each release. Then if someone uses an old database with a new release, the new confusion words wouldn’t be found. All in all, I think it’s not worth the effort.

Regards
Daniel

I kind of thought it would still be problematic. I was just trying to save LT some bandwidth. I never thought it should be included by default, again to save bandwidth. Personally, I would have downloaded the smaller (maybe 200 MB) file instead of the 8 GB file.

However, the problems you brought up could be solved fairly easily. Each cooked database would include the list of words of confusion it was built from. If the active list of words of confusion is not a subset of that list, alert the user (a rough sketch of that check is below). Maybe even offer patches to update between versions of cooked n-gram databases. Then, depending on the bandwidth saved, it might be worth it to rebuild the database with each release, especially if it just requires a script and some computer time.
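As a sketch of what that check could look like (the function and data shapes are made up for illustration, not actual LT code):

```python
# Hypothetical sketch: a cooked database would ship with the confusion word
# list it was built from; before using it, verify that every currently
# active confusion word is covered, and warn the user if any are missing.

def check_cooked_db(active_words, cooked_words):
    """Return the active confusion words the cooked database cannot answer for."""
    missing = set(active_words) - set(cooked_words)
    if missing:
        print("Warning: cooked n-gram data lacks these confusion words:",
              ", ".join(sorted(missing)))
    return missing
```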

Anyways, thanks for thinking it over and responding. If it isn’t worth it, then I’ll just drop it here. Best regards.