Is there current N-gram data?

anilx · October 1, 2023, 8:00am

I am using LanguageTool version 6.1 along with an 8GB n-gram dataset, and I’m running the LanguageTool server via Docker.

However, it appears that the n-gram dataset is outdated, as it fails to detect certain grammatical errors. For instance:

On my personal server, it does not suggest any correction for the sentence “I have got pencil,” even though it contains a grammatical mistake.
On languagetool.org, however, the same sentence is flagged as a grammar mistake, with a message suggesting the change “got → (a)”.

I have encountered several sentences with similar issues during my testing. Could you please guide me on how to obtain and install the most up-to-date n-gram dataset?

Thank you.

dnaber · October 1, 2023, 9:44am

This is not related to the ngram data. The match you can currently see on languagetool.org is a rule we’re testing that’s not open source.
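
One way to see which rule produced a match is to query the server's HTTP API and look at the rule ID in each match. The snippet below is a minimal sketch in Python, not an official client. It assumes the self-hosted Docker server listens on http://localhost:8010 (adjust to your port mapping) and that it exposes the standard /v2/check endpoint; the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_check_request(base_url, text, language="en-US"):
    """Build a POST request for the /v2/check endpoint of a LanguageTool server."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(base_url.rstrip("/") + "/v2/check", data=data, method="POST")


def summarize_matches(response_json):
    """Reduce a /v2/check response to rule ID, message, and suggested replacements."""
    summary = []
    for match in response_json.get("matches", []):
        summary.append({
            "rule_id": match.get("rule", {}).get("id"),
            "message": match.get("message"),
            "replacements": [r.get("value") for r in match.get("replacements", [])],
        })
    return summary


def check(base_url, text, language="en-US"):
    """Send the text to the server and return the summarized matches."""
    request = build_check_request(base_url, text, language)
    with urllib.request.urlopen(request, timeout=10) as response:
        return summarize_matches(json.load(response))


if __name__ == "__main__":
    # Assumed address of the self-hosted server; change it to match your setup.
    for match in check("http://localhost:8010", "I have got pencil."):
        print(match)
```

Running the same sentence against both servers and comparing the rule IDs shows whether a difference comes from a rule that only one of them has.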

anilx · October 1, 2023, 10:20am

Thank you for the feedback.

I attempted to create an XML rule, but it failed to identify certain sentences. Is there more comprehensive documentation available for this?

As an example, I made an effort to craft a rule to identify incorrect usage of “having” and “have,” but it proved ineffective.

anilx · October 1, 2023, 10:32am

The XML rules I write are generally sentence-based. What I want is to catch “subject verb agreement” errors.

For example:

The sofas is comfortable.
The sofas are comfortable.

She haven’t any tables in her living room.
She doesn’t have any tables in her living room.

These sentences are only examples. My rule fails to catch some of them, and I have no idea how to extend it further.

dnaber · October 1, 2023, 1:00pm

The docs are at https://dev.languagetool.org/, especially the Development Overview page.
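
As a starting point for the subject-verb agreement case from this thread, here is a rough sketch of a pattern rule that matches by part-of-speech tag instead of by fixed sentence. It is wrapped in a small Python snippet that only checks the XML is well-formed and lists its tokens and examples; it does not run LanguageTool. The rule ID, name, and message are hypothetical, and the rule assumes the English tagger marks plural nouns as NNS. It is deliberately naive and will produce false alarms on sentences such as "One of the sofas is comfortable."

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: a plural noun (POS tag NNS) directly followed by "is".
# The id, name, and message text are illustrative, not taken from LanguageTool.
RULE_XML = """
<rule id="HYPOTHETICAL_PLURAL_NOUN_IS" name="Plural noun followed by 'is' (sketch)">
  <pattern>
    <token postag="NNS"/>
    <marker>
      <token>is</token>
    </marker>
  </pattern>
  <message>The plural subject may need a plural verb: <suggestion>are</suggestion></message>
  <example correction="are">The sofas <marker>is</marker> comfortable.</example>
  <example>The sofas are comfortable.</example>
</rule>
"""


def describe_rule(xml_text):
    """Parse a rule and return its id, its token constraints, and its example sentences."""
    rule = ET.fromstring(xml_text)
    tokens = []
    for token in rule.find("pattern").iter("token"):
        # A token is matched either by POS tag or by its literal text.
        tokens.append(token.get("postag") or (token.text or "").strip())
    examples = ["".join(example.itertext()) for example in rule.findall("example")]
    return rule.get("id"), tokens, examples


if __name__ == "__main__":
    print(describe_rule(RULE_XML))
```

Because the first token matches on the tag and not on the word "sofas", the same rule would cover other plural nouns too. Exceptions for cases like the false alarm above would still have to be added by hand, which the development documentation covers.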