Grammar Checking using Naive Bayes Algorithm [further vs farther]

Hi there,

I’ve been practicing using Naive Bayes with Weka and Python, and I decided to try out the algorithm to create a comprehensive grammar checker. I’ve noticed recently that the grammar checkers in MS Word, LibreOffice, Grammarly, Ginger, et cetera actually miss around 80-90% of grammar errors. This seems like a great opportunity to create the best grammar checker available.

So, I decided to use Naive Bayes to build a simple grammar checker for a single rule. I chose ‘further vs farther’ because MS Word was unable to detect when these words were misused.

At first I collected the part-of-speech tags (e.g. NNP, VB, NN) surrounding a keyword, to see if I could tap into patterns, using a Gutenberg corpus of many books (n = 595). I didn’t get great results, so I collected the words on either side of the keyword instead. I had avoided this at first because it takes up a lot of space, but it worked well. Still, my grammar checker was producing a lot of false positives, especially for ‘farther’ sentences.

When I looked at the Gutenberg books, I noticed that many of the authors were misusing ‘further’ and ‘farther’. A bit of research revealed that the distinction between the two words has changed in recent years, so an old corpus will probably be less accurate. I therefore downloaded a modern corpus of new-release books (n = 7976) in the hope of getting better results.

I ran Naive Bayes with each word treated as either present or absent in positions 2-left, 1-left, 1-right, and 2-right of the keyword. I got 90%+ accuracy for the word ‘further’ and 70%+ for the word ‘farther’. Looking at the data, I noticed that some authors were misusing the word ‘farther’. For example, one author wrote: “He’s running trying to reach me but he’s getting further away”. This should be ‘farther’ as it refers to a physical distance. Even in a corpus of ‘published’ authors, such errors appear, and they distort the results a bit. I’m currently trying to fix this by including more books and by correcting errors as they pop up.

Here's my data format:

For the sentence: “You need look no further than our very recent history to see that it has been the Dark Jedi that have sought isolation”.

I extract this info: [look, no, than, our, further]

As you can see, I extract the words surrounding the keyword, and I put the class [further | farther] at the end.
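To make the format concrete, here’s roughly what the extraction could look like in Python (a minimal sketch, not my exact script; the function name and the regex tokenizer are simplifications):

    import re

    def extract_features(sentence, keyword):
        # Grab the words at positions 2-left, 1-left, 1-right, 2-right of
        # the keyword; the keyword itself goes last as the class label.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if keyword not in tokens:
            return None
        i = tokens.index(keyword)
        window = [tokens[j] for j in (i - 2, i - 1, i + 1, i + 2)
                  if 0 <= j < len(tokens)]
        return window + [keyword]

    # extract_features("You need look no further than our very recent ...",
    #                  "further")  ->  ['look', 'no', 'than', 'our', 'further']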

Here are some live examples of testing my Naive Bayes probability file:

I need to run farther than Mary.

0.334873568382 [further], 0.665126531618 [farther]

Without further issue, we must take action.

0.998243358719 [further], 0.0017566412806 [farther]

If you complain further, I’m going to shoot you out of the airlock.

0.9210218447 [further], 0.0789781552999 [farther]

Making people park a little farther away will actually increase their exposure to danger.

0.388846474458 [further], 0.611153525542 [farther]

Amazing, isn’t it! As you can see, it was able to classify all of these sentences correctly.

If this single-rule grammar checker works well, I will iterate over a list of 100-200 word dichotomies to create a fully functional grammar checker that picks up errors that no other grammar checker on the market can catch. I’m using Python to write my scripts, but any language could access my Naive Bayes probability tables.
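Once the probability tables are exported, scoring a sentence is only a few lines in any language. A sketch of the computation in Python (assuming p_word maps each class to its P(word | class) table and p_class holds the priors; the back-off value for unseen words is a placeholder):

    from math import log, exp

    def classify(window, p_word, p_class, unseen=1e-6):
        # Naive Bayes: prior times the product of per-word likelihoods,
        # done in log space, then normalised so the scores sum to 1.
        scores = {}
        for cls in p_class:
            logp = log(p_class[cls])
            for w in window:
                logp += log(p_word[cls].get(w, unseen))
            scores[cls] = logp
        total = sum(exp(s) for s in scores.values())
        return {cls: exp(s) / total for cls, s in scores.items()}

    # classify(['to', 'run', 'than', 'mary'], p_word, p_class)
    # -> paired scores like the [further]/[farther] numbers above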

I have used Weka to load the data, but I don’t like the @attribute tags because they take up a lot of space if I have to add every word used to each tag. I’d rather declare the words used once and reference that list from each @attribute tag. Is this possible? Otherwise, the ARFF format is going to be a problem for document analysis.
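For illustration, here is the difference between what I have and what I’d prefer (a sketch; the relation and attribute names are mine):

    % Now: every word in the vocabulary enumerated per attribute
    %   @attribute word1 {look, no, than, our, run, ...}

    % Compact alternative: a single string attribute plus the class
    @relation furtherFarther

    @attribute text string
    @attribute class {further, farther}

    @data
    'look no than our', further
    'to run than mary', farther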

Troy


Hi Troy,

Thanks for the report. Are you aware of this? Finding errors using n-gram data - LanguageTool Wiki

We have about 280 word pairs so far; further/farther isn’t among them. You can see the full list at languagetool/confusion_sets.txt at master · languagetool-org/languagetool · GitHub

Our index doesn’t contain part-of-speech tags yet, but I’m in the process of preparing that. The index will probably be huge (>100 GB) and thus only be suitable for use on the server side.

Regards
Daniel

…and as we store our data in a Lucene index, you can maybe somehow access it from Python.

Hi Daniel,

The ‘confusion sets’ look really good. I haven’t read the n-gram page yet. I’ll take a look at that for sure.

I had more success with n-grams than with word tags, though word tags showed a bit of promise too. I’m about to experiment with a new feature where I try to categorize words as relating either to physical distance or to an abstract concept. This may add more accuracy to my further-vs-farther checker. I’ll use a method similar to the one used to extract ‘positive’ and ‘negative’ words from a dictionary.
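As a sketch, the feature could be as simple as checking the context window against small seed lexicons (the word lists below are made-up placeholders; the real lists would be extracted from a corpus or dictionary):

    DISTANCE_WORDS = {"away", "miles", "feet", "walk", "run", "road"}
    ABSTRACT_WORDS = {"ado", "delay", "discussion", "notice", "issue"}

    def distance_feature(window):
        # Tag a context window as physical distance, abstract, or neutral.
        if any(w in DISTANCE_WORDS for w in window):
            return "distance"
        if any(w in ABSTRACT_WORDS for w in window):
            return "abstract"
        return "neutral"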

I’m also going to experiment with a ‘synthetic thesaurus’ to check whether it could be a useful feature. I’ll build the thesaurus from a corpus for the words further and farther and use it as a feature.

(I also checked whether an n-gram of syllable counts would make a good feature. I briefly looked at it in Weka and it didn’t look promising. It would have been nice if it had, since it would be really lightweight.)

I will update my GitHub page with my code, data, and designs: https://github.com/troywatson/Python-Grammar-Checker/tree/master/fartherVsFurther

The 100 GB+ index sounds heavy, but I’m sure many serious writers would be happy to download it just to have a comprehensive grammar checker.

Cheers,
Troy

Hi Troy,

I just tried “further vs. farther” with the ngram data and this is the result:

precision=0.990, recall=0.470

So 99% of the alarms the rule raises are justified, and thus 1% of its alarms are false alarms; 47% of incorrect usages would be detected, thus 53% would be missed. I’m running this on Wikipedia and Tatoeba data, so the input isn’t totally error-free, but these numbers should give an idea of what the ngram rule can do. If you want to reproduce this, you can do so with the class ConfusionRuleEvaluator from the languagetool-dev module.
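In terms of raw counts, these two numbers come from the standard definitions (a Python sketch; the example counts are made-up round figures, not from the actual run):

    def precision_recall(tp, fp, fn):
        # precision: share of raised alarms that are justified
        # recall: share of real errors that are actually caught
        return tp / (tp + fp), tp / (tp + fn)

    # e.g. 470 errors caught, 5 false alarms, 530 errors missed:
    # precision_recall(470, 5, 530) -> (~0.989, 0.470)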

Regards
Daniel

Hi Daniel,

I have been compiling some suitable word pairs for the English n-gram list as I think it’s a great tool. I noticed your discussion above about extending the ngrams to include POS tags too. I think this would be an excellent feature to work with. Do you think the index including POS tags might be available soon?

Thanks,

Nick

Hi Nick,

I have the index here for testing, but it’s split into several parts, and it cannot be merged because that would exceed Lucene’s maximum number of documents. So it cannot be used directly yet. Also, it’s 150 GB and needs to be put on an SSD, otherwise using it is quite slow. If you want access to the index(es) anyway for running experiments, please let me know.

Regards
Daniel

Hi Nick, so you have more pairs that we don’t have covered in our confusion_sets.txt yet and that work well? I’d be interested in those. Note that we also have a list of candidates that are not active yet for a number of reasons (e.g. too low precision).

Regards
Daniel

I have just tested a new method for grammar checking, using a Naive Bayes approach. I’m using Weka to do my analysis because:

  1. it’s made in Java and thus easy to add as a library for LanguageTool.
  2. it’s easy to share with other members.
  3. I can tweak different settings with minimal effort.

I created a model for checking errors in the use of Further vs Farther and got 100% on my handmade test set (n = 26). Given that success, I then built a model for Bad vs Badly and got 100% on my handmade test set for that too (n = 10). I will compile a larger test set, but I am very excited about the results so far. I will also add hundreds of extra rules like this; it shouldn’t take long because this is a really quick method.

Here are some screenshots of my experiments. Included with the screenshots are the settings I used for my Naive Bayes to get optimal results. I can also upload the ARFF files (the badBadly and furtherFarther datasets average around 20 MB each).

Hi Dnaber,

Sorry I didn’t reply sooner. That sounds great. I’ve noticed that n-grams are really good for detecting this grammar rule. However, it also depends on what corpus you use. I originally used Gutenberg as a dataset and got bad results, because Farther vs Further is actually a new rule: if you look through Gutenberg books you’ll notice the two words are used interchangeably.

Here are some other things I’ve noticed while experimenting on this:

  1. Analysing everything in lowercase increases accuracy.
  2. Using a corpus to validate a model can be a problem, because even published authors misuse this rule.
  3. Dictionary pruning is a great way to speed up processing without affecting accuracy too much (see the sketch after this list).
  4. N-grams of length 1-4 get great results; anything longer seems prone to over-fitting.
  5. Using a modern corpus of published authors is essential, because grammar has changed dramatically over the past century.
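On point 3, dictionary pruning can be as simple as dropping rare words before building the probability tables. A minimal sketch (the function name and cutoff are placeholders):

    from collections import Counter

    def prune_vocabulary(windows, min_count=5):
        # Keep only words seen at least min_count times across all context
        # windows; rare words inflate the tables but add little signal.
        counts = Counter(w for window in windows for w in window)
        return {w for w, c in counts.items() if c >= min_count}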

Troy

Hi Troy, it would be great if you could upload those and other files needed to reproduce this. I’d like to set up Weka and give it a try.

Regards
Daniel

Here is a link to all of the files:

https://drive.google.com/open?id=0B5RQC3uUaNTPMkJ6QVVxMUJIRHM

I wanted to use GitHub but they have a 25MB limit.

You’ll need Weka to use the files. There is a great YouTube MOOC by Ian Witten called WekaMOOC. I recommend it.

This is what I did to use the files:

  1. Open Weka and load the training set (e.g. furtherFartherTrain.arff).
  2. Go to the Classify tab in Weka.
  3. Under Bayes, select NaiveBayesMultinomialText (it’s one of the only classifiers that will work with strings).
  4. Select “Supplied test set” by clicking the “Set” button.
  5. Load the test set (e.g. furtherFartherTest.arff).
  6. Click the box where NaiveBayesMultinomialText appears at the top of Weka. You can load the NaiveBayes options file to use the options I used (it’s in the furtherFarther folder), or just copy my preferences from the picture above.

Troy

Did you need to do anything special to get NaiveBayesMultinomialText? I have NaiveBayesMultinomial, but not NaiveBayesMultinomialText in Weka 3.6.10.

BTW, there’s a problem with using Weka: it’s GPL, so it cannot be combined with LT, which is LGPL. It’s okay to use it for testing and evaluation, but once we have something that’s supposed to become part of LT, it will need to be ported (maybe to encog, or DL4j).

Hmm, maybe I installed it as an extra package.

I recommend using the latest Weka. I use 3.7.13.

You can install extra packages by going to the main screen, clicking Tools > Package manager, and installing the package from there.

I also have Weka 3.6.11 and it doesn’t let me install packages. So you may be better off just getting the latest version.

Weka 3.7.13 may just have the package pre-installed.

Ahhh, fair enough. I guess it may be easy to port the classifier; Naive Bayes is a very simple algorithm. All we need to do is export the frequency table, and that is enough to compute the classifier. I like Naive Bayes for this reason: it’s interpretable by humans.

Alternatively, we could just save the model and load it in Java. We may not need to use the Weka library to do this.

But I’m sure there are other programs that could do this (maybe even better). I will check out encog and DL4j. I’m happy to use any program that is accurate and convenient.

I just checked my Weka 3.6.11 and it doesn’t have a NaiveBayesMultinomialText option, but my Weka 3.7.13 does.

Thanks, using that version I have NaiveBayesMultinomialText now and can reproduce your result. In your badBadlyTest.arff I see the strings don’t actually contain bad/badly (with one exception, “she did badly on her exam”). Is that on purpose?

Cool. I try to exclude the keywords from the sentences because they would bias the classifier; it was a mistake on my part to include “badly” in the string “she did badly on her exam”. I purposely exclude the keyword from the string because I want to tap into the latent pattern of which words usually sit on either side of a keyword. Thus, I believe the n-gram [‘did’, ‘on’, ‘her’] is more useful than [‘did’, ‘badly’, ‘on’].

Hi Troy,
once your classifier successfully decides between further and farther, how do you decide where in the sentence to make the correction? That is, how do you decide the position at which to insert further or farther?