
Correction of rules


(artofit) #1

As I start to look into grammar.xml, I see a couple of issues.

For instance in the rule:

....

<rule>

        <pattern>
          <token inflected="yes" regexp="yes">être|situer|tenir</token>
          <token regexp="yes">pas|plus|moins|très|peu|toujours</token>
          <marker><token regexp="yes">prêts?</token></marker>
          <token regexp="yes">de?</token>
        </pattern>
        <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
        <example type="incorrect">Nous ne sommes pas <marker>prêts</marker> d’arriver.</example>
        <example type="correct">Nous ne sommes pas près d’arriver.</example>
      </rule>

It obviously fails to detect errors such as:
Es-tu plus prêts d'elle?
Mets-toi plus prêts de la lumière!
Assieds-toi prêts du feux.

Oddly (as far as I understand), it does find the error in:
Il est pas prêts d'y arriver

If I understand the grammar rules correctly, a better solution would be

prêts?
de?|du|d'elle?|d'eux|d'y

Possibly
de[a-z]*|du|d'[a-z]+

I'm looking for the equivalent in French of:

Questions:

I. regarding: http://wiki.languagetool.org/development-overview#toc0
element pattern, sub element marker: What part of the original text should be marked as an error. If all tokens are part of the error you can omit this element.

Could you phrase this another way? I do not understand it.
Example:

<rule id="VOIRE_MEME">    
      <pattern>
        <token regexp="yes" skip="1">voire? même</token>
      </pattern>
      <message>« Voire même » est un pléonasme. Employez <suggestion>voire</suggestion>, <suggestion>même</suggestion>.</message>
      <mistake>voire même</mistake>
      <correct>voire</correct>
    </rule>

II. How do I know which rule leads to a given suggestion in the Java stand-alone environment?

Thanks


(Daniel Naber) #2

The French tag set is described here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/tagset.LT.txt

The <marker> element can be used to set which parts of the pattern are underlined. If there's no marker, all the tokens of the pattern will be underlined for the user.

There's currently no way to get the matching rule from the stand-alone GUI. I've added that to my TODO list.


(Dominique PELLÉ) #3

If you use the command-line version of LanguageTool, the -v option will tell you which rule triggers each error as well as which disambiguation rules match (disambiguation rules tweak POS tags).

Example:



$ echo "Un erreur." | java -jar languagetool-standalone/target/LanguageTool-2.5-SNAPSHOT/LanguageTool-2.5-SNAPSHOT/languagetool-commandline.jar -l fr -v
Expected text language: French
Working on STDIN...
2396 rules activated for language French
<S> Un[un/D m s,] erreur[erreur/N f s,].[./M fin,</S>,]<P/> 
Disambiguator log: 

un:1 Un[un/N m s*,un/D m s*] -> Un[un/D m s*]

1.) Line 1, column 1, Rule ID: ACCORD_GENRE[1]
Message: « Un » et « erreur » ne semblent pas bien accordés en genre.
Un erreur. 
^^^^^^^^^  
Time: 750ms for 1 sentences (1.3 sentences/sec)
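
For regression testing, the rule IDs can be pulled out of that verbose output with a few lines of script. The following is only a sketch: it assumes the match lines keep the "N.) Line X, column Y, Rule ID: ..." shape shown above, and the names extract_rule_matches and SAMPLE are made up for this illustration.

```python
import re

# Matches lines such as "1.) Line 1, column 1, Rule ID: ACCORD_GENRE[1]"
# (format assumed from the -v output shown above).
MATCH_LINE = re.compile(r"^\d+\.\) Line (\d+), column (\d+), Rule ID: (\S+)")


def extract_rule_matches(output):
    """Return (line, column, rule_id) tuples found in LanguageTool output."""
    matches = []
    for line in output.splitlines():
        m = MATCH_LINE.match(line.strip())
        if m:
            matches.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return matches


# Excerpt of the output shown above, used as sample input.
SAMPLE = """\
Disambiguator log:

un:1 Un[un/N m s*,un/D m s*] -> Un[un/D m s*]

1.) Line 1, column 1, Rule ID: ACCORD_GENRE[1]
Message: « Un » et « erreur » ne semblent pas bien accordés en genre.
Un erreur.
^^^^^^^^^
"""

print(extract_rule_matches(SAMPLE))  # [(1, 1, 'ACCORD_GENRE[1]')]
```

In practice the text would come from the command-line jar's standard output rather than from a hard-coded string.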

(Dominique PELLÉ) #4

That's not quite right. You can't put something like "d'elle" in a single token because it is made of 3 different tokens.
LT splits a sentence into tokens (i.e. words) and matches token by token. So what is in between <token>...</token>
can only be a single word and not something like "d'elle".

What is between <marker>...</marker> is what should be highlighted or underlined
when showing the mistake to the user. Often that may be the same as all tokens in the
pattern (in which case <marker>...</marker> is optional or implicit), but sometimes only
a few tokens in the pattern should be highlighted, so indicating them with <marker>...</marker>
is then useful. It's better to highlight as few words as possible, so that several errors have fewer chances to overlap.

Here is a simple example of a rule containing 3 tokens, but LT only highlights one of the words
as the error when there is a match:




<rulegroup id="PRES_PRET" name="près et prêt">    
          <rule>
            <pattern>
              <token>à</token>
              <token>peu</token>
              <marker><token>prêt</token></marker>
            </pattern>
            <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
            <example type="incorrect">C’est à peu <marker>prêt</marker> la même chose.</example>
            <example type="correct">C’est à peu <marker>près</marker> la même chose.</example>
          </rule>
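
To make the token-by-token matching and the role of the marker more concrete, here is a toy model in Python. It is not LanguageTool's actual matcher: the function match_pattern is hypothetical, the token lists are written by hand, and the way "C’est" and "d'elle" are split is an assumption based on the explanation above (whitespace is left out).

```python
import re


def match_pattern(tokens, pattern):
    """Return (start, marked_tokens) for every position where the pattern fits.

    `pattern` is a list of (regex, marked) pairs. Each regex has to match
    exactly one whole token, which is why "d'elle" cannot be one element.
    """
    results = []
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(re.fullmatch(rx, tok) for (rx, _), tok in zip(pattern, window)):
            marked = [tok for (_, is_marked), tok in zip(pattern, window) if is_marked]
            results.append((start, marked))
    return results


# The PRES_PRET rule above: three tokens, only the last one inside the marker.
pres_pret = [("à", False), ("peu", False), ("prêt", True)]
sentence = ["C", "’", "est", "à", "peu", "prêt", "la", "même", "chose", "."]
print(match_pattern(sentence, pres_pret))  # [(3, ['prêt'])]

# Assumed tokenization of "prêts d'elle": the last word is three tokens.
tokens = ["prêts", "d", "'", "elle"]
several_words_in_one = [("prêts?", True), ("d'elle", False)]
one_word_per_token = [("prêts?", True), ("d", False), ("'", False), ("elle", False)]
print(match_pattern(tokens, several_words_in_one))  # []
print(match_pattern(tokens, one_word_per_token))    # [(0, ['prêts'])]
```

All three tokens of PRES_PRET must match, but only the marked one is reported, which is what keeps the underlined region small.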

That's not correct, because you've again put several words within a single <token>...</token>.

The actual rule in languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml is like this:




<rule id="VOIRE_MEME" name="voire même">    
          <pattern>
            <token regexp="yes" skip="1">voire?</token>
            <token>même</token>
          </pattern>
          <message>« Voire même » est un pléonasme. Employez <suggestion>voire</suggestion>, <suggestion>même</suggestion>.</message>
          <example type="incorrect"><marker>voire même</marker></example>
          <example type="correct"><marker>voire</marker></example>
        </rule>

I hope that helps
Dominique


(Dominique PELLÉ) #5

The problem is that interrogative verbs like « Es-tu » or things like « Mets-toi » are single tokens which
currently have no POS tag in LanguageTool. Putting a tag and giving the infinitive of the verb as lemma would make
it more convenient to write rules to match them. That would be a nice improvement. I'll think further about
what's the best way to do it.
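
A rough sketch of why such tokens slip through a token with inflected="yes": that attribute compares the token's lemma, not its surface form, so a token without a lemma can never match. The tiny LEMMAS table and the function matches_inflected below are invented for illustration; LanguageTool takes lemmas from its own tagger dictionary.

```python
import re

# Hypothetical mini-lexicon: word form -> lemma. Not LanguageTool data.
LEMMAS = {"es": "être", "est": "être", "sommes": "être", "mets": "mettre"}


def matches_inflected(token, lemma_regex):
    """True if the token has a known lemma and that lemma matches in full."""
    lemma = LEMMAS.get(token.lower())
    return lemma is not None and re.fullmatch(lemma_regex, lemma) is not None


print(matches_inflected("sommes", "être|situer|tenir"))  # True
print(matches_inflected("Es-tu", "être|situer|tenir"))   # False: no lemma known
```

If « Es-tu » carried the lemma « être », the second call would succeed as well, which is the improvement suggested above.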

That's not odd, it matches this rule:




<rule>    
            <pattern>
              <token inflected="yes" regexp="yes">être|situer|tenir|tout|assez|bien</token>
              <marker><token regexp="yes">prêts?</token></marker>
              <token regexp="yes">d[eu]?|des</token>
            </pattern>
            <message>Voulez-vous écrire <suggestion>près</suggestion> ?</message>
            <example type="incorrect">Nous sommes <marker>prêts</marker> d’arriver.</example>
            <example type="correct">Nous sommes près d’arriver.</example>
          </rule>

(artofit) #6

[quote]The <marker> element can be used to set which parts of the pattern are underlined. If there's no marker, all the tokens of the pattern will be underlined for the user.[/quote]

Well, I do understand, but why?

[quote]There's currently no way to get the matching rule from the stand-alone GUI. I've added that to my TODO list.[/quote]
Great, I believe this would be useful for debugging and regression testing.
Additionally, I suggest auto-numbering the lines in the text window (as Notepad++ does, for instance) for fast reference, since the result window indicates positions such as line: 196, column 30.

[quote]echo "Un erreur." | java -jar languagetool-standalone/target/LanguageTool-2.5-SNAPSHOT/LanguageTool-2.5-SNAPSHOT/languagetool-commandline.jar -l fr -v[/quote]
Unless I am missing the point, this does not work in 2.4.1:
echo "Un erreur." | java -jar languagetool.jar -l fr -v
I'll think about it: https://www.languagetool.org/download/snapshots/

About tokens:
So "d'y" is split into 3 tokens, whilst
[quote]« Es-tu » or things like « Mets-toi » are single tokens which have
currently no POS tag in LanguageTool. Putting a tag and giving infinitive of the verb as lemma would make
it more convenient to write rules to match them. That would be a nice improvement. I'll think further about
what's the best way to do it.[/quote]
This would be worth adding to, for instance, http://wiki.languagetool.org/development-overview#toc0
I draw your attention to compound words such as "porte-avions", where "porte" is a verb; hence the importance of enforcing that what follows the "-" is a pronoun.


(Daniel Naber) #7

[quote]Well, I do understand, but why?[/quote]

I meant the text matched by <marker>, not the pattern itself.


(Dominique PELLÉ) #8

[quote]Except if I miss the point, does not function in 2.4.1[/quote]
Yes, it does work with LanguageTool-2.4.1 or even older.
The -v option of the command-line version of LanguageTool has been available for a long time.