Correction of rules

artofit · February 10, 2014, 5:13pm

As I start to look into grammar.xml, I see a couple of issues.

For instance in the rule:

…

<rule>    
        <pattern>
          <token inflected="yes" regexp="yes">être|situer|tenir</token>
          <token regexp="yes">pas|plus|moins|très|peu|toujours</token>
          <marker><token regexp="yes">prêts?</token></marker>
          <token regexp="yes">de?</token>
        </pattern>
        <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
        <example type="incorrect">Nous ne sommes pas <marker>prêts</marker> d’arriver.</example>
        <example type="correct">Nous ne sommes pas près d’arriver.</example>
      </rule>

It obviously fails to detect errors such as:
Es-tu plus prêts d’elle?
Mets-toi plus prêts de la lumière!
Assieds-toi prêts du feux.

Oddly (as far as I understand), it does find the error in:
Il est pas prêts d’y arriver

If I understand the grammar rules correctly, a better solution would be

prêts?
de?|du|d’elle?|d’eux|d’y

Possibly
de[a-z]*|du|d’[a-z]+

I’m looking for the equivalent in French of:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/tagset.txt

These are mostly the tags of the Penn Treebank tagset as used by LanguageTool,
with examples. See "new tag" for tags introduced by LanguageTool.
For more details, also see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

CC    Coordinating conjunction: and, or, either, if, as, since, once, neither, less
CD    Cardinal number: one, two, twenty-four
DT    Determiner: a, an, all, many, much, any, some, this
EX    Existential there: there (no other words)
FW    Foreign word: infinitum, ipso
IN    Preposition/subordinate conjunction: except, inside, across, on, through, beyond, with, without
JJ    Adjective: beautiful, large, inspectable
JJR   Adjective, comparative: larger, quicker
JJS   Adjective, superlative: largest, quickest
LS    List item marker: not used by LanguageTool
MD    Modal: should, can, need, must, will, would
NN    Noun, singular count noun: bicycle, earthquake, zipper
NNS   Noun, plural: bicycles, earthquakes, zippers
NN:U  Nouns that are always uncountable		#new tag - deviation from Penn, examples: admiration, Afrikaans
NN:UN Nouns that might be used in the plural form and with an indefinite article, depending on their meaning	#new tag - deviation from Penn, examples: establishment, wax, afternoon
NNP   Proper noun, singular: Denver, DORAN, Alexandra


Questions:

I. regarding: Development Overview - LanguageTool Wiki
element pattern, sub element marker: What part of the original text should be marked as an error. If all tokens are part of the error you can omit this element.

Could you phrase this another way? I do not understand it.
Example:

<rule id="VOIRE_MEME">    
      <pattern>
        <token regexp="yes" skip="1">voire? même</token>
      </pattern>
      <message>« Voire même » est un pléonasme. Employez <suggestion>voire</suggestion>, <suggestion>même</suggestion>.</message>
      <mistake>voire même</mistake>
      <correct>voire</correct>
    </rule>

II. How do I know which rule leads to a given suggestion in the Java standalone environment?

Thanks

dnaber · February 11, 2014, 6:45pm

The French tag set is described here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/tagset.LT.txt

The <marker> can be used to set which parts of the pattern are underlined. If there’s no marker, all the tokens of the pattern will be underlined for the user.

There’s currently no way to get the matching rule from the stand-alone GUI. I’ve added that to my TODO list.

Dominique_PELLE · February 11, 2014, 9:11pm

If you use the command line version of LanguageTool, the -v option will tell you which rule triggers errors as well as which disambiguation rules match (disambiguation rules tweak POS tags).

Example:

$ echo "Un erreur." | java -jar languagetool-standalone/target/LanguageTool-2.5-SNAPSHOT/LanguageTool-2.5-SNAPSHOT/languagetool-commandline.jar -l fr -v
Expected text language: French
Working on STDIN...
2396 rules activated for language French
<S> Un[un/D m s,] erreur[erreur/N f s,].[./M fin,</S>,]<P/> 
Disambiguator log: 

un:1 Un[un/N m s*,un/D m s*] -> Un[un/D m s*]

1.) Line 1, column 1, Rule ID: ACCORD_GENRE[1]
Message: « Un » et « erreur » ne semblent pas bien accordés en genre.
Un erreur. 
^^^^^^^^^  
Time: 750ms for 1 sentences (1.3 sentences/sec)

Dominique_PELLE · February 11, 2014, 9:31pm

That’s not quite right. You can’t put something like “d’elle” in a single token, because it is made of 3 different tokens.
LT splits a sentence into tokens (i.e. words) and matches token by token. So what is in between <token>…</token>
can only be a single word and not something like “d’elle”.

What is between <marker>…</marker> is what should be highlighted or underlined
when showing the mistake to the user. Often that may be the same as all tokens in the
pattern (in which case <marker>…</marker> is optional or implicit), but sometimes only
a few tokens in the pattern should be highlighted, so indicating them with <marker>…</marker>
is then useful. It’s better to highlight as few words as possible, so that several errors have less chance of overlapping.

Here is a simple example of a rule containing 3 tokens, but LT only highlights one of the words
as the error when there is a match:

<rulegroup id="PRES_PRET" name="près et prêt">    
          <rule>
            <pattern>
              <token>à</token>
              <token>peu</token>
              <marker><token>prêt</token></marker>
            </pattern>
            <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
            <example type="incorrect">C’est à peu <marker>prêt</marker> la même chose.</example>
            <example type="correct">C’est à peu <marker>près</marker> la même chose.</example>
          </rule>
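
Following the tokenization described above, a sequence such as “prêts d’elle” would need one <token> per piece. Here is a minimal sketch, assuming the apostrophe form really splits into three tokens (“d”, the apostrophe, “elle”) as stated above. The rule id, name, pronoun list, and example sentences are made up for illustration, and the rule has not been run through LanguageTool's rule tests:

```xml
<!-- Hypothetical rule, for illustration only: not taken from grammar.xml. -->
<rule id="PRETS_D_PRONOM_SKETCH" name="prêts d’elle (sketch)">
  <pattern>
    <!-- only this token is underlined for the user -->
    <marker><token regexp="yes">prêts?</token></marker>
    <!-- "d’elle" is assumed to be three tokens: d + apostrophe + pronoun -->
    <token>d</token>
    <token regexp="yes">['’]</token>
    <token regexp="yes">elle|eux|y</token>
  </pattern>
  <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
  <example type="incorrect">Ils se tiennent <marker>prêts</marker> d’elle.</example>
  <example type="correct">Ils se tiennent près d’elle.</example>
</rule>
```

Each <token> holds exactly one word or punctuation mark, which is why the alternatives go inside a regexp on the last token rather than in one combined “d’elle|d’eux” expression.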

artofit:

I do not understand.
Example:

<rule id="VOIRE_MEME">    
      <pattern>
        <token regexp="yes" skip="1">voire? même</token>
      </pattern>
      <message>« Voire même » est un pléonasme. Employez <suggestion>voire</suggestion>, <suggestion>même</suggestion>.</message>
      <mistake>voire même</mistake>
      <correct>voire</correct>
    </rule>

That’s not correct, because you’ve again put several words within <token>…</token>.

The actual rule in languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml is like this:

<rule id="VOIRE_MEME" name="voire même">    
          <pattern>
            <token regexp="yes" skip="1">voire?</token>
            <token>même</token>
          </pattern>
          <message>« Voire même » est un pléonasme. Employez <suggestion>voire</suggestion>, <suggestion>même</suggestion>.</message>
          <example type="incorrect"><marker>voire même</marker></example>
          <example type="correct"><marker>voire</marker></example>
        </rule>
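
A side note on the skip="1" attribute used in the rule above: as far as I understand the pattern syntax, it lets the matcher pass over up to that many tokens before the next token has to match. The two fragments below are only a sketch of the difference; the comments describe the expected behaviour and are not tested output:

```xml
<pattern>
  <!-- skip="1": at most one arbitrary token may stand between the two,
       so both "voire même" and "voire X même" are candidates -->
  <token regexp="yes" skip="1">voire?</token>
  <token>même</token>
</pattern>

<pattern>
  <!-- no skip attribute: "même" must directly follow "voire" or "voir" -->
  <token regexp="yes">voire?</token>
  <token>même</token>
</pattern>
```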

I hope that helps
Dominique

Dominique_PELLE · February 11, 2014, 10:18pm

artofit:

As I start to look into grammar.xml, I see a couple of issues.

For instance in the rule:

…

<rule>    
        <pattern>
          <token inflected="yes" regexp="yes">être|situer|tenir</token>
          <token regexp="yes">pas|plus|moins|très|peu|toujours</token>
          <marker><token regexp="yes">prêts?</token></marker>
          <token regexp="yes">de?</token>
        </pattern>
        <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
        <example type="incorrect">Nous ne sommes pas <marker>prêts</marker> d’arriver.</example>
        <example type="correct">Nous ne sommes pas près d’arriver.</example>
      </rule>

It obviously fails to detect errors such as:
Es-tu plus prêts d’elle?
Mets-toi plus prêts de la lumière!
Assieds-toi prêts du feux.

The problem is that interrogative verbs like « Es-tu » or things like « Mets-toi » are single tokens which have
currently no POS tag in LanguageTool. Putting a tag and giving infinitive of the verb as lemma would make
it more convenient to write rules to match them. That would be a nice improvement. I’ll think further about
what’s the best way to do it.

That’s not odd (regarding « Il est pas prêts d’y arriver »); it matches this rule:

<rule>    
            <pattern>
              <token inflected="yes" regexp="yes">être|situer|tenir|tout|assez|bien</token>
              <marker><token regexp="yes">prêts?</token></marker>
              <token regexp="yes">d[eu]?|des</token>
            </pattern>
            <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
            <example type="incorrect">Nous sommes <marker>prêts</marker> d’arriver.</example>
            <example type="correct">Nous sommes près d’arriver.</example>
          </rule>
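
Until tokens such as « Es-tu » carry a POS tag and a lemma, one possible workaround is to match their surface form with a regular expression. This is only a sketch under that assumption: the rule id and name are invented, the verb and pronoun lists are deliberately incomplete, it relies on « Es-tu » being a single token as described above, and it assumes token matching is case-insensitive by default:

```xml
<!-- Hypothetical rule, for illustration only: not taken from grammar.xml. -->
<rule id="PRETS_INTERROGATIF_SKETCH" name="prêts de après verbe interrogatif (sketch)">
  <pattern>
    <!-- inflected="yes" cannot help here because the token has no lemma,
         so the surface forms are listed explicitly -->
    <token regexp="yes">(es|est|sommes|êtes|sont)-(tu|il|elle|on|nous|vous|ils|elles)</token>
    <token regexp="yes">pas|plus|moins|très|peu|toujours</token>
    <marker><token regexp="yes">prêts?</token></marker>
    <token regexp="yes">d[eu]?|des</token>
  </pattern>
  <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
  <example type="incorrect">Es-tu plus <marker>prêts</marker> d’elle ?</example>
  <example type="correct">Es-tu plus près d’elle ?</example>
</rule>
```

Listing surface forms does not scale to every verb, which is why tagging these tokens with the infinitive as lemma, as suggested above, would be the cleaner fix.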

artofit · February 12, 2014, 9:36am

dnaber:

The <marker> can be used to set which parts of the pattern are underlined. If there’s no marker, all the tokens of the pattern will be underlined for the user.

Well, I do understand, but why ?

dnaber:

There’s currently no way to get the matching rule from the stand-alone GUI. I’ve added that to my TODO list.

Great, I believe this would be useful for debugging and regression testing.
Additionally, I suggest auto-numbering the lines in the text windows (as Notepad++ does, for instance) for fast reference, since the result window indicates positions such as line: 196, column 30.

Dominique_PELLE:

echo "Un erreur." | java -jar languagetool-standalone/target/LanguageTool-2.5-SNAPSHOT/LanguageTool-2.5-SNAPSHOT/languagetool-commandline.jar -l fr -v

Unless I am missing the point, this does not work in 2.4.1:
echo "Un erreur." | java -jar languagetool.jar -l fr -v
I’ll think about it, Index of /snapshots/

About tokens:
So “d’y” is split into 3 tokens, whilst, quoting Dominique_PELLE, “« Es-tu » or things like « Mets-toi » are single tokens which have
currently no POS tag in LanguageTool. Putting a tag and giving infinitive of the verb as lemma would make
it more convenient to write rules to match them. That would be a nice improvement. I’ll think further about
what’s the best way to do it.”
This would be worthwhile to put, for instance, into the Development Overview - LanguageTool Wiki.
I draw your attention to compound words such as “porte-avions”, where “porte” is a verb; hence the importance of enforcing that what follows the “-” is a pronoun.

dnaber · February 12, 2014, 2:59pm

I meant the text matched by , not the pattern itself.

Dominique_PELLE · February 12, 2014, 10:25pm

Yes it does work with LanguageTool-2.4.1 or even older.
The -v option of the command line version of LanguageTool has been available for a long time.