Grammar Checking using Naive Bayes Algorithm [further vs farther]

Hi Daniel,

The ‘confusion sets’ look really good. I haven’t read the n-gram page yet. I’ll take a look at that for sure.

I had more success with n-grams than with word tags, although word tags did show a bit of success too. I’m about to experiment with a new feature where I try to categorize words as relating either to distance or to an abstract concept. This may improve the accuracy of my further vs farther checker. I’ll use a similar method to the one used to extract ‘positive’ and ‘negative’ words from a dictionary.

I’m also going to experiment with a ‘synthetic thesaurus’ to see whether it could be a useful feature. I’ll build the thesaurus from a corpus for the words further and farther and use it as a feature.

(I also checked whether an n-gram of syllable counts would make a good feature. A quick look in Weka didn’t look promising, though. It would have been nice if it had worked, since that feature would be really lightweight.)

I will update my GitHub page with my code, data, and designs: https://github.com/troywatson/Python-Grammar-Checker/tree/master/fartherVsFurther

The 10 GB file sounds heavy, but I’m sure many serious writers would be happy to download it just to have a comprehensive grammar checker.

Cheers,
Troy

Hi Troy,

I just tried “further vs. farther” with the ngram data and this is the result:

precision=0.990, recall=0.470

So 99% of correct usages are found as correct, thus 1% of correct usages will lead to false alarms. 47% of incorrect usages would be detected, thus 53% would be missed. I’m running this on Wikipedia and Tatoeba data, so the input isn’t totally error-free, but these numbers should give an idea about what the ngram rule can do. If you want to reproduce this you can do so with the class ConfusionRuleEvaluator from the languagetool-dev module.

Regards
Daniel

Hi Daniel,

I have been compiling some suitable word pairs for the English n-gram list as I think it’s a great tool. I noticed your discussion above about extending the ngrams to include POS tags too. I think this would be an excellent feature to work with. Do you think the index including POS tags might be available soon?

Thanks,

Nick

Hi Nick,

I have the index here for testing, but it’s split into several parts and it cannot be merged, because that would exceed Lucene’s maximum number of documents. So it cannot be used directly yet. Also, it’s 150 GB in size and needs to be put on an SSD, otherwise using it is quite slow. If you want access to the index(es) anyway for running experiments, please let me know.

Regards
Daniel

Hi Nick, so you have more pairs that work well and that aren’t covered in our confusions_set.txt yet? I’d be interested in those. Note that we also have a list of candidates which are not active yet for a number of reasons (e.g. too low precision).

Regards
Daniel

I have just tested a new method for grammar checking, using a Naive Bayes approach. I’m using Weka to do my analysis because:

  1. it’s made in Java and thus easy to add as a library for LanguageTool.
  2. it’s easy to share with other members.
  3. I can tweak different settings with minimal effort.

I created a model for checking errors in the use of further vs farther and got 100% on my handmade test set (n = 26). Given that success, I then built a model for bad vs badly and got 100% on my handmade test set for that too (n = 10). I will compile a larger test set, but I am very excited about the results so far. I will also add hundreds of extra rules like this; it shouldn’t take long because this is a really quick method.
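
To make the setup concrete, here is a rough sketch of the same kind of model in Python, using scikit-learn’s MultinomialNB as a stand-in for Weka’s NaiveBayesMultinomialText. The sentences, labels, and settings below are made up for illustration and are not my actual ARFF data or Weka configuration:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # Tiny made-up training set: each string is a sentence with the keyword
  # (further/farther) removed, and the label is the keyword that belongs there.
  contexts = [
      "he didn't want to talk about it any",      # further
      "we need to discuss this before deciding",  # further
      "the station is down the road",             # farther
      "she ran than anyone else in the race",     # farther
  ]
  labels = ["further", "further", "farther", "farther"]

  model = make_pipeline(
      CountVectorizer(lowercase=True, ngram_range=(1, 4)),  # word n-grams of length 1-4
      MultinomialNB(),
  )
  model.fit(contexts, labels)

  print(model.predict(["they drove much down the highway"]))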

Here are some screenshots of my experiments. Included with these screenshots are the settings that I used for my Naive Bayes to get optimal results. I can also upload the ARFF files (the badBadly and furtherFarther datasets average around 20 MB each).

Hi Dnaber,

Sorry I didn’t reply sooner. That sounds great. I’ve noticed that n-grams are really good for detecting this grammar rule. However, it also depends on what corpus you use. I originally used Gutenberg as a dataset and got bad results. This is because the farther vs further distinction is actually quite recent: if you look through Gutenberg books you’ll notice the two words are used interchangeably.

Here are some other things I’ve noticed while experimenting on this:

  1. analysing everything in lowercase increases accuracy.
  2. using a corpus to validate a model can be a problem, because even published authors misuse this rule.
  3. dictionary pruning is a great way to speed up processing without affecting accuracy too much.
  4. word n-grams of length 1-4 get great results; anything longer seems prone to over-fitting (see the preprocessing sketch after this list).
  5. using a modern corpus of published authors is essential, because grammar has changed dramatically over the past century.
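
Here is a rough Python sketch of points 1, 3 and 4 above; the function names and the pruning threshold are just illustrative, not the exact settings I used:

  from collections import Counter

  def ngrams(tokens, max_n=4):
      """Return all word n-grams of length 1..max_n."""
      grams = []
      for n in range(1, max_n + 1):
          for i in range(len(tokens) - n + 1):
              grams.append(tuple(tokens[i:i + n]))
      return grams

  def build_vocabulary(sentences, min_count=2):
      """Count n-grams over lowercased sentences, then drop rare ones
      (a simple form of dictionary pruning)."""
      counts = Counter()
      for sentence in sentences:
          counts.update(ngrams(sentence.lower().split()))
      return {gram for gram, count in counts.items() if count >= min_count}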

Troy

Hi Troy, it would be great if you could upload those and other files needed to reproduce this. I’d like to set up Weka and give it a try.

Regards
Daniel

Here is a link to all of the files:

https://drive.google.com/open?id=0B5RQC3uUaNTPMkJ6QVVxMUJIRHM

I wanted to use GitHub but they have a 25MB limit.

You’ll need Weka to use the files. There is a great YouTube MOOC by Ian Witten called WekaMOOC; I recommend it.

This is what I did to use the files:

  1. Open Weka and load the training set (e.g. load furtherFartherTrain.arff)
  2. Go to the classify tab in Weka.
  3. Under Bayes, select NaiveBayesMultinomialText (it’s one of the only classifiers that will work with strings).
  4. Select “supplied test set” by clicking the button “Set”.
  5. Select the testing set (e.g. load furtherFartherTest.arff).
  6. Click on the box where NaiveBayesMultinomialText appears at the top of Weka. You can load the NaiveBayes file to use the options I used (it’s in the furtherFarther folder), or you can just copy my preferences from the picture above.

Troy

Did you need to do anything special to get NaiveBayesMultinomialText? I have NaiveBayesMultinomial, but not NaiveBayesMultinomialText in Weka 3.6.10.

BTW, there’s a problem with using Weka: it’s GPL, so it cannot be combined with LT, which is LGPL. It’s okay to use it for testing and evaluation, but once we have something that’s supposed to become part of LT, it will need to be ported (maybe to Encog or DL4J).

Hmm, maybe I installed it as an extra package.

I recommend using the latest Weka. I use 3.7.13.

You can install extra packages by going to the main screen, clicking Tools > Package manager, and installing the extra package.

I also have Weka 3.6.11 and it doesn’t let me install packages. So you may be better off just getting the latest version.

Weka 3.7.13 may just have the package pre-installed.

Ahhh, fair enough. I guess it may be easy to port the classifier; Naive Bayes is a very simple algorithm. All we need to do is export the frequency table, and that is enough to compute the classifier. I like Naive Bayes for this reason: it’s interpretable by humans.
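
For example, here is a minimal sketch of that idea in Python, assuming a uniform class prior and add-one smoothing; the frequency table below is made up purely for illustration:

  import math

  # Made-up per-class word counts, as they might be exported from a trained model.
  freq = {
      "further": {"discussion": 45, "information": 40, "any": 30, "than": 25},
      "farther": {"down": 40, "road": 35, "miles": 30, "than": 20},
  }

  def classify(tokens, freq, alpha=1.0):
      """Return the class with the highest Naive Bayes log-score
      (uniform prior, add-one smoothing for unseen words)."""
      vocab = {word for table in freq.values() for word in table}
      best_label, best_score = None, float("-inf")
      for label, table in freq.items():
          total = sum(table.values())
          score = sum(
              math.log((table.get(token, 0) + alpha) / (total + alpha * len(vocab)))
              for token in tokens
          )
          if score > best_score:
              best_label, best_score = label, score
      return best_label

  print(classify(["he", "didn't", "want", "to", "talk", "about", "it", "any"], freq))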

Alternatively, we could just save the model and load the model in Java. We may not need to use the Weka library to do this.

But I’m sure there are other programs that could do this (maybe even better). I will check out Encog and DL4J. I’m happy to use any program that is accurate and convenient.

I just checked my Weka 3.6.11 and it doesn’t have a NaiveBayesMultinomialText option. But my Weka 3.7.13 does.

Thanks, using that version I have NaiveBayesMultinomialText now and I can reproduce your result. In your badBadlyTest.arff I see the strings don’t actually contain bad/badly (with one exception: “she did badly on her exam”). Is that on purpose?

Cool. I try to exclude the keywords from the sentences because they will bias the classifier. It was a mistake on my part to include “badly” in the string “she did badly on her exam”. I purposely exclude the keyword from the string because I want to tap into the latent variable of which words usually sit on either side of a keyword. Thus, I believe the n-gram [‘did’, ‘on’, ‘her’] would be more useful than [‘did’, ‘badly’, ‘on’].
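
A small Python sketch of that keyword-exclusion step; the function name and keyword list are just for illustration:

  import string

  def context_without_keyword(sentence, keywords=("further", "farther")):
      """Return the sentence words with the target keyword(s) removed."""
      tokens = [t.strip(string.punctuation) for t in sentence.lower().split()]
      return " ".join(t for t in tokens if t and t not in keywords)

  print(context_without_keyword("He didn't want to talk about it any further."))
  # -> "he didn't want to talk about it any"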

Hi Troy,
when your classifier decides between further and farther successfully, how do you decide where to make the insertion in the sentence? That is, how do you decide the position at which to insert further or farther?

Hi Mility,

I would first go through a document and only apply a model if a keyword appears in a sentence (e.g. the word farther or further needs to be present to use the farther-further model). Then I test the sentence with the model, and if the model suggests farther where I wrote further, the program underlines the word in red to indicate it is wrong. However, if I wrote farther and the model suggests farther, the program does nothing.

I haven’t coded any of this for LanguageTool yet; I’m just working on building models for individual rules. This way the models can be used by LanguageTool or by anyone who wants to create their own grammar checker.

btw, I’m planning on using this approach to create a bigotry checker. The model will underline sexist, racist, and homophobic sentences. I reckon it’ll be very easy to build.

Hi Troy,
Thanks for your explanation.
Maybe my question was not clear.
Take, for example, the sentence below (it’s from your test data):


  "he didn't want to talk about it any", further

Since we know that this sentence should use further, how do we decide the position at which to insert it?
further could go at any position in this sentence, such as:


 he further didn't want to talk about it any
 he didn't further want to talk about it any
 he didn't want further to talk about it any
 ....

How do we determine which of the above sentences is the one we want? This seems like a problem.

Ahh, yeah, that could be a problem. I guess this could be fixed by logging the position of the keyword before putting the sentence through the model. So a program could (1) look through the sentence for a keyword, (2) if it finds one, split the sentence into words and log the position of the keyword, (3) put the words through the model without the keyword, and (4) find out whether the keyword is an error or not.
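
A rough Python sketch of those four steps, assuming a trained model object with a predict() method like the one sketched earlier (the names are illustrative). Because it loops over every token, each keyword occurrence is classified on its own context:

  KEYWORDS = {"further", "farther"}

  def check_sentence(sentence, model):
      """Flag keyword positions where the model disagrees with what was written."""
      tokens = sentence.lower().split()
      flags = []
      for position, token in enumerate(tokens):
          if token in KEYWORDS:                                    # (1) and (2): find keyword, log its position
              context = tokens[:position] + tokens[position + 1:]  # (3): drop the keyword itself
              suggestion = model.predict([" ".join(context)])[0]   # (4): classify the remaining context
              if suggestion != token:
                  flags.append((position, token, suggestion))      # likely error: underline this word
      return flags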

However, it may be difficult if there are two keywords in a sentence, such as:

how can we further run farther at the marathon?

This could be another problem.