[GSoC] Idea - Grammar error detection using deterministic automata

Hi there, this is Changhao Liu, a sophomore majoring in Computer Science and Mathematics at The University of North Carolina at Chapel Hill(US). Last night, I tried to detect some grammar errors of some sentences in English and in Chinese and I found LT is not smart enough. As a native speaker of Chinese and fluent speaker of English, I wonder if I can make some contributions to English and Chinese grammar error detection. Here are some of my ideas.

My first idea:
Detecting grammatical errors by using some finite automata designed for a language’s grammar rules might be a good idea. If a sentence could not go to the final state, it might have some grammatical errors. For an automate, we first should have a (or several) starting state, an error state(grammar errors) and an accepting state(no grammar errors). First, we can grab different components of a sentence by the space between the components. For example, “I just had breakfast” has four components, including ‘I’,‘just’,‘had’ and ‘breakfast’. According to these words’ original order, we feed those components to the automate of English grammar rules to see whether they can be accepted. If they could be accepted(at the final state), this sentence does not have any grammar errors. However, “I just had breakfast lunch” is less likely to be accepted since it may go to an error state.
According to general English grammar rules as I can find online, I can make some automata and use them to detect grammar errors. I draw a simple indeterministic automate to illustrate part of my idea. (Double circles are accepting states and ‘trash can’ is an error state and everything starts from ‘0’ state, which is the starting state)

1.I had breakfast—Subject+Verb+Object. My automate would go 0-1-2-3(accepted–no grammar errors)
2.I just had breakfast—Subject+Adverb+Verb+Object. My automate would go 0-1-4-2-3(accepted–no grammar errors)
3. Tom Amy eat breakfast----Subject+Subject+Verb+Object. 0-1-trashState(not accepted-grammar errors)
4. I eat have breakfast—Subject+Verb+Verb+Object. 0-1-2-trashState(not accepted–grammar errors)
5…etc
As this automate is not complete nor perfect, I could improve it by putting all the grammar rules together to make it comprehensively detect grammar errors. I could also make a similar model to other languages as well if I know the rules.

Let me know if you have any further questions. Thank you.

Hi, thanks for your idea. I think there’s a problem with it though: This works fine for simple example sentences, but it easily breaks for sentences in the real world. You’d need a complete grammar before this would be useful. Until you have that, most real-world sentences will always end up in error state. Another thing is that the information that a sentence has an error is not very useful for the user - they need to know exactly where and what the error is.

Got it! Thank you for your insights.

I put some sentences with simple grammar errors in LanguageTool and it turned out it couldn’t detect them. For example, “I mom did sport.”(Two nouns without an adjunction before a verb ) “I eat drink water.”(Two verbs without an adjunction between them) If LanguageTool uses a finite state machine, it can easily detect those errors.

For your second question stating that “Another thing is that the information that a sentence has an error is not very useful for the user - they need to know exactly where and what the error is.”, it could be an easy fix. My deterministic automate is a state machine. It has a starting state, several accepting states, and an error state. According to different inputs like verbs, adjunctions, nouns, adverbs etc, it goes from current state to another state. Once one part of the sentence causes a grammar error, the automation would go to the “error state”. If the automation goes to the error state, there must be an arrow(depends on which input we put) leading the automation to that error state. We can save that input and tell the users which words cause grammar error and what type of grammar error that is. Let me show you an example in order to illustrate my idea(I am going to use the graph that I drew last night)

  1. " I eat have water." can be separated into four parts, meaning there are 4 inputs that cause my automation to move to different states. ‘I’ is a noun, which leads the automation from starting state 0 to state 1. ‘eat’ is a verb, which leads the automation from state 1 to state 2. ‘have’ is a verb, which leads the automation from state 2 to error state.Therefore, it is “have” causes the grammar error. The program would underline “have” and notify users there a verb next to another verb without a conjunction between them. Some people may question that “I eat water” is not a proper sentence, but this kind of problem is more related to word usage. My method, currently, considers only the sentence structure.

I should clarify that currently, I consider using a state machine to detect grammar errors that cause sentence structural problems, like “I mom did sports”, “I eat drink water”, “I eat dinner mom doesn’t eat dinner(no adjunction connecting them)”. Two nouns without an conjunction, two verbs without a condjunction and two sentences without a conjunction would lead the automation to the error state, meaning it successfully detects the grammar error. Also, the automaton I drew is not complete, it doesn’t include all the potential combinations, but I can make a complete one. First, we should have a starting state, the number of inputs plus an error states decides how many other other states we should have

noun n. student
pronoun pron. you
adjective adj. happy
adverb adv. quickly
verb v. cut
numeral num. three
article art. a
preposition prep. at
conjunction conj. and

We need 10 other states(some of them could be accepting states), including error state. We then draw 10 arrows from starting stating state to other states.
As for state names, we can call them “Verb State”, “Noun State”, etc. For each state, there are another 9 arrows pointing to other states.According to the grammar rules, we can let some inputs pointing to error states according to the current state. For example, ‘I’+‘eat’+‘breakfast’ ‘I’ leads the automate to “Noun State”. Before input ‘eat’, our current state is “Noun State”. Then Inputting ‘eat’ leads the automate to “Verb State”. Then inputting ‘breakfast’ leads to “Noun State”, which is accepting state. No grammar errors. As for the first two components of “I eat drink breakfast”, it will go from current state—“Noun State”----“Verb State”, but for the third input “drink”, it will lead the automate from “Verb State” to error state since it violates one of the grammar rules.
I think you meant if the sentences get longer, my automation will easily fail. I should argue that it is us who, according to the grammar rules, to build the automate. If there is no error violating the error, it will not go to the error state. If our sentences follow the rules, all the inputs will go around to different states except the “Error State”. I just need to complete the automate and trial error to perfect it. e.g. “This morning I just had an omelet for breakfast and I really enjoyed it”(it will just go around among several states and finally be accepted) VS “This morning I just had an omelet for breakfast I really enjoyed it”(‘I’ would lead the automate to an error state because of two nouns )
Can you collaborate on “most real-world sentences will always end up in error state” or give me some examples that make you concerned? Let me know if you have any questions. Thank you for your time.

Here are some resourses that might be helpful:

However, “drink” can also be a noun, so you cannot be sure that “drink water” isn’t a noun phrase and the sentence says that someone eats “drink water”, which isn’t a grammatical error. As you have a lot of ambiguity in English, this will happen a lot.

I see, I hadn’t considered that. So when a sentence doesn’t parse at all, we’d just assume it’s too complex, but not consider it incorrect, right?

Thank you so much for your feedback. As for ambigiouty, Markov Chains might be used to solve that problem, but I haven’t considered how to exactly do it.

I feel my first goal is to detect some general grammar errors in some sentences and notify users. After that, I am going to run a large number of cases to see whether it works or not. As for some edge cases and special situations, I could improve the system by writing codes to take consideration of those situations.

For this one, I am not sure about your question. Can you explain that again, please and I will give you an explanation? Thank you!

I am not sure when you said: “a sentence doesn’t parse at all” and “it’s too complex.” Can you elaborate on those?
If my understanding is right, I assumed you meant long sentences with some complex structures, not just the simple one like “I just had breakfast”. What makes a sentence more complex and longer? There could be some adjectives, adjunctions and etc, but those inputs will still lead the automate to different states, including accepting states, intermittent states or accepting states according to the grammar rules. My automation can include all the grammar rules, if one part of a long and complex sentence doesn’t follow the grammar rule, it will lead the automate to the error state. As for a longer sentence like “On a lovely sunny Saturday morning, I not only made a delicious breakfast for my sleepy mom but also cleaned the messy dining room”, it will go from the starting states to other states and then finally accepted, since it doesn’t violate any grammar rules(if my sentence is right). If it does violate any grammar rules that are programmed in our automation, the automation would go to the error state and tell the user about which potential input leads to a grammar error and what potential error it could be.

If the automation is designed well and each word is categorized well, a sentence like "Meeting the goal of 5,000 Model 3s a week by the end of June is critical to generating enough cash to sustain operations without having to raise more capital(WSJ)) could be accepted.

Also, I read some academic resources about detecting grammar errors in English or other languages. Do you think that Parse Trees and Context-free Grammars could be used for LanguageTool?

http://delivery.acm.org/10.1145/2600000/2591440/5020a153.pdf?ip=152.23.233.37&id=2591440&acc=ACTIVE%20SERVICE&key=AA86BE8B6928DDC7.B2ED415011FB783D.4D4702B0C3E38B35.4D4702B0C3E38B35&__acm__=1521174532_31919fcb2372f9ab1df9931c30b7d480

http://www.cs.uccs.edu/~jkalita/work/cs589/2010/12Grammars.pdf
http://www.hlt.utdallas.edu/~sanda/courses/NLP/Lecture08.pdf

But then we’re back to: until your grammar is 100% correct and complete, people will get error messages for sentences not because the sentence is incorrect, but because your grammar is not complete yet.

Yea, I can start with a basic automation then add more details. I wonder how LT detects grammar errors in English just by checking the rules that written in XML?