How does "disambiguation.xml" work?

pecstef · June 28, 2016, 4:02pm

Hello,
I created the dictionary file for the French language (using mvn package and building a new LT snapshot) and it seems to work.
Unfortunately, there is one thing I don’t understand: even though my dictionary contains the word “aujourd’hui”, the spell checker flags it as a mistake when it appears in my list of words to be checked.

I then created a rule in disambiguation.xml to ignore this word during spell checking, but nothing changed.

Where am I making a mistake?

dnaber · June 28, 2016, 4:18pm

Could you post the XML you’ve added to disambiguation.xml?

pecstef · June 28, 2016, 4:23pm

I tried two different ways:

    <rule name="aujourdhui" id="aujourdhui">
        <pattern>
            <marker>
                <token>aujourd'hui</token>
            </marker>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

    <rule name="aujourdhui" id="aujourdhui">
        <pattern>
            <marker>
                <token>aujourd'</token>
                <token>hui</token>
            </marker>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

jaumeortola · June 28, 2016, 4:30pm

As you can see on the LanguageTool Text Analysis page, this word is split into three tokens. You need:

    <rule name="aujourdhui" id="aujourdhui">
        <pattern>
            <marker>
                <token>aujourd</token>
                <token spacebefore="no">'</token>
                <token spacebefore="no">hui</token>
            </marker>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

You can also try adding the word to the file “multiwords.txt” in the French resource folder. I’m not sure whether that will work for French.
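
To illustrate why the one-token and two-token patterns could not match, here is a toy sketch in Python. It assumes a tokenizer that splits off punctuation such as the apostrophe, as the three-token rule suggests. The helpers `toy_tokenize` and `matches` are hypothetical and only mimic the idea; they are not LanguageTool's actual tokenizer or API.

```python
import re


def toy_tokenize(text):
    """Split text into runs of word characters, runs of whitespace,
    and single punctuation characters.

    Toy model only (an assumption for illustration), not LanguageTool's tokenizer.
    """
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def matches(pattern, tokens):
    """Return True if the list `pattern` occurs as a contiguous run in `tokens`."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))


tokens = toy_tokenize("C'est aujourd'hui")

# The apostrophe becomes its own token, so the word yields three tokens.
print(toy_tokenize("aujourd'hui"))

# First attempt: one token containing the apostrophe.
print(matches(["aujourd'hui"], tokens))
# Second attempt: apostrophe attached to the first token.
print(matches(["aujourd'", "hui"], tokens))
# Working rule: three separate tokens.
print(matches(["aujourd", "'", "hui"], tokens))
```

Under this assumption, only the three-token pattern lines up with what the tokenizer produces. The `spacebefore="no"` attributes in the rule additionally require that the three tokens are not separated by whitespace, which this sketch gets implicitly because whitespace would show up as its own token.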

pecstef · June 28, 2016, 4:48pm

OK, thanks. It works!
I also noticed that when I run the command that starts the spell checker, the output file changes from UTF-8 to ANSI, even though the console code page is set to 65001 (Unicode) and I specify the encoding when launching the spell checker. Any suggestions?