Common db of examples between LT and grammalecte - regression testing

artofit · February 4, 2014, 11:42am

I would be happy to contribute to the English and French grammar checkers, starting by populating a database (a Unicode CSV file, for instance) of erroneous and corrected sentences, like the English Error Collection on the LanguageTool Wiki.

I just discovered both LT and Grammalecte (http://www.dicollecte.org/grammalecte/). I understand that the developers have different views on implementation (much like gcc and clang, for instance), but there is still one common objective: checking the grammatical rules of a given language.

So is there a common database / repository shared by both projects?

I think it would also be very useful for regression testing, to measure how the numbers of false positives, true positives and missed errors change between versions.

Each DB record could be composed as follows:
error sentence
correct sentence
source_language : e.g. fr_ch, fr, etc.
linguist verified flag: pending, true, rejected
LT_version_problem: nill=OK, version nb
LT_commitfix
Grammalecte_version_problem
Grammalecte_commitfix
reference_links

Example1: LT misses, Grammalecte finds
une zone géographique dans laquelle je vie
une zone géographique dans laquelle je vis
fr
pending
web_2014_02_04
new
nill
nill

Example2: LT finds, Grammalecte misses
Suite à notre entretient téléphonique
Suite à notre entretien téléphonique
fr
pending
nill
nill
0.3.7
pending
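
The record layout above could be stored as CSV along these lines. This is only a sketch: the column names are an assumed spelling of the proposed fields, "nill" is kept as the marker used in the examples, and reference_links is left empty because the examples do not give one.

```python
import csv
import io

# Column names follow the fields proposed above; the exact spellings are an assumption.
FIELDS = [
    "error_sentence",
    "correct_sentence",
    "source_language",
    "linguist_verified",
    "LT_version_problem",
    "LT_commitfix",
    "Grammalecte_version_problem",
    "Grammalecte_commitfix",
    "reference_links",
]

# The two examples above, one row each. "nill" is kept as the post's marker;
# reference_links was not given, so it is left empty.
ROWS = [
    ["une zone géographique dans laquelle je vie",
     "une zone géographique dans laquelle je vis",
     "fr", "pending", "web_2014_02_04", "new", "nill", "nill", ""],
    ["Suite à notre entretient téléphonique",
     "Suite à notre entretien téléphonique",
     "fr", "pending", "nill", "nill", "0.3.7", "pending", ""],
]


def to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


records = from_csv(to_csv(ROWS))
print(records[0]["error_sentence"])
```

A plain CSV file keeps the database easy to diff and to edit by hand, which matters if linguists are expected to review the rows.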

Thanks.

dnaber · February 4, 2014, 1:44pm

There’s no common database, but LT has these examples in its XML file:

https://raw2.github.com/languagetool-org/languagetool/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml

So what you can do already is to extract the incorrect sentences from that file and see if Grammalecte can find the errors, too.

artofit · February 4, 2014, 2:51pm

[quote=“dnaber”]what you can do already is to extract the incorrect sentences from that file and see if Grammalecte can find the errors, too.[/quote]

1/ I started a conversation with the Grammalecte project.
At present Grammalecte depends on LibreOffice (or Apache OpenOffice), i.e. it has no standalone UI or CLI
=> all tests are manual.
The developer is working on automating this.
re: http://www.dicollecte.org/thread.php?prj=fr&t=387

2/ Since the Grammalecte developer has a test file, wouldn’t it be more meaningful to test with his?

3/ “you can do already is to extract the incorrect sentences”
I’m ready to give it a go. I suppose you do not have a parser? If not, I can try with Perl regexes.

4/ How do you want to exchange/share info & files? On the forum?

dnaber · February 4, 2014, 7:11pm

Thanks for taking care of this.

3/ You could use the Java API if you’re a Java developer. Alternatively, you could use an XSLT transformation or, as you said, a regex.

4/ You can post here, but for any advanced use or development of LT, the mailing list is a better place (languagetool-devel List Signup and Options)

I didn’t quite get your second point…

Dominique_PELLE · February 4, 2014, 9:16pm

[quote=“artofit”]
I would be happy to contribute to the English and French grammar checkers, starting by populating a database (a Unicode CSV file, for instance) of erroneous and corrected sentences, like the English Error Collection on the LanguageTool Wiki.[/quote]

Excellent! I welcome help on the French version of LanguageTool, especially since I now have less time to contribute.

[quote=“artofit”]I just discovered both LT and Grammalecte (http://www.dicollecte.org/grammalecte/). I understand that the developers have different views on implementation (much like gcc and clang, for instance), but there is still one common objective: checking the grammatical rules of a given language.

So is there a common database / repository of both project?[/quote]

LT has tests but Grammalecte does not yet, as explained here: http://www.dicollecte.org/thread.php?prj=fr&t=387.
LT and Grammalecte are different, but there is healthy cooperation. LT uses the dictionary from Dicollecte (transformed for LT). Both checkers have borrowed some rule ideas from each other.

[quote=“artofit”]I think it would also be very useful for regression testing to measure differences between versions of false positive, true positive and unmatched.[/quote]

That can certainly be useful.

[quote=“artofit”]3/ “you can do already is to extract the incorrect sentences”
I’m ready to give it a go. I suppose you do not have a parser? If not, I can try with Perl regexes.[/quote]

A simple grep in the French grammar.xml will give you all the examples:

$ grep '<example' ./languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml
…snip…
Il était sensé l’accompagner.
Il était censé l’accompagner.
Jimmy Hendrix est né à Seattle.
Jimi Hendrix est né à Seattle.
…snip…
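
If the rule id and the correct/incorrect type are wanted alongside each sentence, an XML parser may be more convenient than grep. The sketch below uses Python's standard xml.etree.ElementTree on a small inline sample; the sample is a simplified stand-in, and the surrounding rules/category/rule nesting is an assumption about the file layout rather than a copy of the real grammar.xml. Rules that carry no id attribute themselves would come out with an id of None.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for grammar.xml; the real file has many more elements.
# The wrapping <rules>/<category>/<rule> structure is assumed for illustration.
SAMPLE = """<rules>
  <category name="demo">
    <rule id="DESSINER_UN_DESSIN">
      <example type="incorrect"><marker>dessiner un dessin</marker></example>
      <example type="correct"><marker>faire un dessin</marker></example>
    </rule>
  </category>
</rules>"""


def extract_examples(xml_text):
    """Yield (rule_id, example_type, sentence) for every <example> under a <rule>."""
    root = ET.fromstring(xml_text)
    for rule in root.iter("rule"):
        for example in rule.iter("example"):
            # itertext() joins the text inside and around <marker>.
            sentence = "".join(example.itertext()).strip()
            yield rule.get("id"), example.get("type"), sentence


examples = list(extract_examples(SAMPLE))
for row in examples:
    print(row)
```

For the real file, ET.parse(path).getroot() would replace ET.fromstring, provided the parser can resolve any entities the file declares.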

However, be aware that each example is meant to test a single rule only. In fact, some “correct” examples may trigger errors in other rules. That’s not desirable, but it can happen. For example, I see this example for testing rule DESSINER_UN_DESSIN:

  <example type="incorrect"><marker>dessiner un dessin</marker></example>
  <example type="correct"><marker>faire un dessin</marker></example>

This example is good enough to test rule DESSINER_UN_DESSIN. However, it would trigger an error if we used it to test all other rules (missing uppercase at the beginning of the sentence). Ideally, no example should trigger errors in any other rule, but that is unfortunately not the case for many old rules such as the one above. It would be nice to improve those examples so that they are correct when checking all rules.
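
To find examples that are likely to trip unrelated rules, a rough pre-filter could flag those that do not look like complete sentences. The function below is a hypothetical heuristic, not how LT itself decides: it only checks for an initial uppercase letter and a closing punctuation mark.

```python
def looks_like_full_sentence(example):
    """Rough heuristic: starts with an uppercase letter and ends with . ! ? or …"""
    text = example.strip()
    return bool(text) and text[0].isupper() and text[-1] in ".!?…"


# Two examples taken from this thread.
candidates = [
    "dessiner un dessin",
    "Il était censé l’accompagner.",
]

# Examples that would probably need rewriting before being checked against all rules.
needs_rewrite = [s for s in candidates if not looks_like_full_sentence(s)]
print(needs_rewrite)
```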

[quote=“artofit”]4/ How do you want to exchange/share info & files? On the forum?[/quote]
Here, or on the mailing list. I personally prefer the mailing list, but feel free to do it however you prefer.

Dominique_PELLE · February 4, 2014, 9:33pm

Which version of LT did you use?
The latest version finds the errors in both incorrect sentences and does not report any error in the correct sentences above.
See this screenshot of LT running inside Vim with your sentences:

artofit · February 4, 2014, 9:41pm

[quote=“dnaber”]I didn’t quite get your second point…[/quote]
The Grammalecte developer uses a test file to test Grammalecte.
I was suggesting we throw it at LT and look at the true positives, the false positives and the missed errors.
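
The bookkeeping for such a comparison could look like the sketch below. The sentence ids and the flagged/not-flagged values are placeholders invented for illustration, not results from LT or Grammalecte, and the function names are hypothetical.

```python
from collections import Counter


def classify(has_error, flagged):
    """Name the outcome for one sentence and one checker run."""
    if has_error:
        return "true positive" if flagged else "missed"
    return "false positive" if flagged else "true negative"


def summarise(expected, flagged):
    """Count outcomes; both arguments map sentence id -> bool."""
    return Counter(classify(expected[k], flagged[k]) for k in expected)


def regressions(expected, old_run, new_run):
    """Sentence ids handled correctly by old_run but not by new_run."""
    good = {"true positive", "true negative"}
    return sorted(
        k for k in expected
        if classify(expected[k], old_run[k]) in good
        and classify(expected[k], new_run[k]) not in good
    )


# Placeholder data, invented for illustration only.
expected = {"s1": True, "s2": True, "s3": False}   # does the sentence contain an error?
old_run = {"s1": True, "s2": False, "s3": False}   # did the older version flag it?
new_run = {"s1": False, "s2": True, "s3": False}   # did the newer version flag it?

print(summarise(expected, new_run))
print(regressions(expected, old_run, new_run))
```

The same functions work whether the two runs are two versions of one checker or two different checkers run on the same test file.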

[quote=“dnaber”]3/ You could use the Java API if you’re a Java developer. Or alternatively, an XSLT transformation, or, as you said, a regex.[/quote]
First, I’ll “clean” the XML with regexes.
As for Java, the last time I used it was to report bugs in compiler version 1.0 or 1.1, a bit of history…

@Dominique (if I may):
I used the web version (https://www.languagetool.org/), as of today, 2014-02-04.

dnaber · February 4, 2014, 9:55pm

In this case you need the dot at the end of the sentence to trigger the rule.

Dominique_PELLE · February 4, 2014, 9:57pm

I see. I had put a dot at the end of the sentence and you did not.
With a dot or another word following “vie”, LT correctly gives the error.
I can easily improve this rule to detect the error even without an extra word following “vie”.
I’ll do that tomorrow.

artofit · February 4, 2014, 10:07pm

Jesus, you’re right.

But that’s odd: it finds the missing plural without the dot in the following:
LES BASES THÉORIQUE EN ANALYSE DU DISCOURS
Les camions étaient bleu

By the way, in the following it doesn’t notice that the noun is feminine while the article is masculine:
Le discours, essai d’un définition.

[quote=“Dominique_PELLE”]I’ll do that tomorrow.[/quote]
Nice, that’s fast.

Dominique_PELLE · February 4, 2014, 10:59pm

That’s specific to a few rules. It’s because the last token uses…

In other words, it has an exception when the last token has a POS which
is not a noun. But the last token has an extra POS for end of sentence (SENT_END).
SENT_END does not match the N.* regexp, so the rule does not match. I can easily
fix it, but I’ll do that tomorrow. The regexp should simply be "N.*|SENT_END"
instead of "N.*".
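
The effect of the proposed change can be illustrated outside LT with an ordinary regex check. The sketch assumes the pattern has to match the whole POS tag, which is why re.fullmatch is used; the tags other than SENT_END are made up for illustration and need not match the tag set LT actually uses.

```python
import re


def tag_matches(pattern, pos_tag):
    """True if the pattern matches the whole POS tag (assumed semantics)."""
    return re.fullmatch(pattern, pos_tag) is not None


# Compare the original pattern with the proposed one on a few tags.
for pattern in ("N.*", "N.*|SENT_END"):
    for tag in ("N m s", "SENT_END", "V ind pres 1 s"):
        print(f"{pattern!r:16} {tag!r:18} {tag_matches(pattern, tag)}")
```

Only the SENT_END row changes between the two patterns, which is the point of the fix: noun tags still match, other tags still do not.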

I think it’s because there is an exception for “d’”.
Those singular/plural and masculine/feminine agreement rules
are surprisingly tricky to get right without too many false positives
and false negatives, but clearly the rule can be improved here. I’ll
have a look, hopefully soon.

artofit · February 5, 2014, 7:12am

I’ll take a deeper look at rule construction this weekend.

I started to “clean” grammar.xml to build a regression DB. I thought there was an issue with the rule id=“ENTRAIN”, but I was wrong: it does find the error in:
Qui était entrain de manger?
Ce dernier était entrain de manger.

FYI: