Create a dictionary for French language and use it for spell checking

Perfect! Are the Java libraries used for spell checking all up to date, or have some of them changed or been deprecated?

Using the hunspell native code (which comes with LT) is kind of deprecated, although I don’t see it being removed any time soon. If you stick to the morfologik code (which is pure Java) as described in the Wiki, everything should be fine.

Can’t find / load main class org.languagetool.dev.SpellDictionaryBuilder

I’ve updated the Wiki page with the latest class name. The parameters might be different, but the class will show its usage when you call it without parameters.

I looked through my directories and found org/languagetool, but there is no “dev” directory in it. Maybe that is why it doesn’t work.

The classes are in the JAR files, so you won’t usually find them directly as files in the file system. The class as given in the wiki should work now.

I’m not a Java expert, but this is what happens:

C:\Users\KP\Desktop\LanguageTool-3.3>java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder fr_FR C:/Users/KP/Desktop/LanguageTool-3.3/french_dict.txt C:/Users/KP/Desktop/LanguageTool-3.3/org/languagetool/resource/fr/french.info - -o C:/Users/KP/Desktop/LanguageTool-3.3/output.dict
Error: could not find or load main class org.languagetool.tools.SpellDictionaryBuilder

You also need languagetool-tools.jar in your classpath, something like
java -cp languagetool.jar;libs/morfologik-tools.jar …
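A side note on the classpath: the `;` separator above is the Windows form, while Linux and macOS use `:`. The following is only a minimal sketch of how the full command could be assembled portably from Java. The class name `BuilderCommand` and the jar and file names in `main` are placeholders of mine; the main class and the `-i`, `-info` and `-o` options are the ones used in the commands in this thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: assemble the SpellDictionaryBuilder command with a portable classpath. */
final class BuilderCommand {

    // Main class name as used in this thread.
    static final String MAIN_CLASS = "org.languagetool.tools.SpellDictionaryBuilder";

    static List<String> build(List<String> jars, String langCode,
                              String wordList, String infoFile, String output) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        // File.pathSeparator is ";" on Windows and ":" on Linux/macOS.
        cmd.add(String.join(File.pathSeparator, jars));
        cmd.add(MAIN_CLASS);
        cmd.add(langCode);
        cmd.add("-i");
        cmd.add(wordList);
        cmd.add("-info");
        cmd.add(infoFile);
        cmd.add("-o");
        cmd.add(output);
        return cmd;
    }

    public static void main(String[] args) {
        // Jar and file names below are placeholders; adjust them to your installation.
        List<String> cmd = build(
                Arrays.asList("languagetool.jar", "libs/morfologik-tools.jar"),
                "fr_FR", "french_dict.txt", "french.info", "output.dict");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be printed and pasted into a terminal, or passed to `new ProcessBuilder(cmd)` if you want to launch the builder from Java.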

Please try a recent snapshot from Index of /snapshots/.

It runs now, but I still can’t create the dictionary file. I think I need to add an option and a parameter, but where?

C:\Users\KP\Desktop\LanguageTool>java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder fr_FR -i C:/Users/KP/Desktop/LanguageTool/french_dict.txt -info org/languagetool/resource/fr/french.info -o C:/Users/KP/Desktop/LanguageTool/output.dict
Running Morfologik FSACompile.main with these options: [--exit, false, -i, C:\Users\KP\AppData\Local\Temp\SpellDictionaryBuilder257816875475185246.txt, -o, C:\Users\KP\Desktop\LanguageTool\output.dict, -f, CFSA2, --overwrite]
Invalid argument: Unknown option: --overwrite

Usage: fsa_compile [options]
  Options:
    --accept-bom
       Accept leading BOM bytes (UTF-8).
       Default: false
    --accept-cr
       Accept CR bytes in input sequences (\r).
       Default: false
    -f, --format
       Automaton serialization format.
       Default: FSA5
       Possible Values: [FSA5, CFSA2]
    --ignore-empty
       Ignore empty lines in the input.
       Default: false
  * -i, --input
       The input sequences (one sequence per \n-delimited line).
  * -o, --output
       The output automaton file.
**Done. The binary dictionary has been written to C:\Users\KP\Desktop\LanguageTool\output.dict**

@jaumeortola This looks like a bug. Is this something you could fix?

It looks like the dictionary was actually created. I see this message when I create my dictionary and it succeeds.
But I agree we need to remove the message. :slight_smile:

Yes, it looks as if the dictionary was created, but it wasn’t. I ran the command on Windows 7 and on Ubuntu with Java 8, and the result was the same.
Maybe the .txt file with the word list should have a different layout?
In my file, which is UTF-8 encoded, the words are listed like this:

a
b
c
d
...etc 

It looks like FSACompile doesn’t work.

Invalid argument: Unknown option: --overwrite

Moreover, I noticed that the file C:\Users\KP\AppData\Local\Temp\SpellDictionaryBuilder257816875475185246.txt, which I suppose LT creates, disappears a few seconds after the command is executed. I thought my antivirus was responsible and disabled it, but nothing changed.

Any suggestions?

It was a bug that should be fixed now. The fix will be in the next daily build, to be published later tonight at Index of /snapshots/

Now it works! Many thanks!
Just one last question: to finish creating the French dictionary for the spell checker, should I follow the instructions in the “For Developers” section of the wiki?

If you already have a word list, you can do what’s described at “For Users” (Spell check - LanguageTool Wiki). The script mentioned in the “For Developers” section does almost the same, only that it first creates a list of words from the hunspell dictionary.

Hello again,
Maybe I was too optimistic!
I noticed that the dictionary I created is only 1 KB, whereas my original dictionary.txt is 3.646 KB. Is this normal?

Then I tried to convert the newly created .dict file back into a word list, following the instructions in the wiki, to see whether its content matches my word list. But it fails, as you can see in the output below.

I copied the required “french.info” file from org\languagetool\resource\fr. Maybe I need a different .info file?

I also noticed that org\languagetool\resource\fr already contains a French .dict file, and running the same export on it returned a correct word list.

Here is what I ran on the command line. First I created my dictionary from my word list:
> C:\Users\HP\Desktop\LT>java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder fr_FR -i C:/Users/HP/Desktop/LT/french.txt -info C:/Users/HP/Desktop/LT/french.info -o C:/Users/HP/Desktop/LT/output.dict
> Running Morfologik FSACompile.main with these options: [--exit, false, -i, C:\Users\HP\AppData\Local\Temp\SpellDictionaryBuilder781820826994572409.txt, -o, C:\Users\HP\Desktop\LT\output.dict, -f, CFSA2]
> Done. The binary dictionary has been written to C:\Users\HP\Desktop\LT\output.dict

Then I tried to convert the dictionary back into a word list to see its content:

> C:\Users\HP\Desktop\LT>java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i C:/Users/HP/Desktop/LT/output.dict -info C:/Users/HP/Desktop/LT/french.info -o C:/Users/HP/Desktop/LT/test.dict
> Running Morfologik DictDecompile.main with these options: [--exit, false, -i, C:\Users\HP\Desktop\LT\output.dict, -o, C:\Users\HP\AppData\Local\Temp\DictionaryExporter_separator1455210922122033918.txt, --overwrite]
> An unhandled exception occurred. Stack trace below.
> java.nio.file.NoSuchFileException: C:\Users\HP\Desktop\LT\output.info
>     at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
>     at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
>     at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
>     at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(Unknown Source)
>     at java.nio.file.Files.newByteChannel(Unknown Source)
>     at java.nio.file.Files.newByteChannel(Unknown Source)
>     at java.nio.file.spi.FileSystemProvider.newInputStream(Unknown Source)
>     at java.nio.file.Files.newInputStream(Unknown Source)
>     at morfologik.stemming.Dictionary.read(Dictionary.java:64)
>     at morfologik.tools.DictDecompile.call(DictDecompile.java:62)
>     at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
>     at morfologik.tools.CliTool.main(CliTool.java:133)
>     at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
>     at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:80)
>     at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
> Done. The dictionary export has been written to C:/Users/HP/Desktop/LT/test.dict
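A note on the trace above: the export was given french.info, yet the exception names output.info. This suggests that Morfologik looks for a metadata file with the same base name as the .dict file, in the same folder. That is an inference from this one trace, not something confirmed from the Morfologik documentation. Under that assumption, a small sketch like the following (the class name `InfoFileCheck` is made up for illustration) shows which .info path would be expected for a given .dict file and whether it exists:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: find the .info file that is expected to sit next to a .dict file. */
final class InfoFileCheck {

    /** Returns the sibling path with the same base name and the extension ".info". */
    static Path expectedInfoPath(Path dictFile) {
        String name = dictFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return dictFile.resolveSibling(base + ".info");
    }

    public static void main(String[] args) {
        // Default file name is a placeholder; pass your own .dict path as the first argument.
        Path dict = Paths.get(args.length > 0 ? args[0] : "output.dict");
        Path info = expectedInfoPath(dict);
        if (Files.isRegularFile(info)) {
            System.out.println("Found " + info);
        } else {
            System.out.println("Missing " + info
                    + " (copy or rename your .info file to this name)");
        }
    }
}
```

If the file is reported as missing, copying french.info to output.info next to output.dict would be the first thing to try before running the export again.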

Update

I changed the encoding of my word list and created a new .info file. After I ran the command again, it produced a .dict file of 205 KB.
I tried to open this file, but its content is not human-readable.
How can I check whether the .dict file was created correctly?

Theoretically the fastest way to check if your dictionary was created right is by using morfologik tools:

But for that you’ll need to clone and build that tool.

Now that I have it working, I would like to post step-by-step instructions here on how to create the dictionary and how to use it, for people who need them later.

Hello, this is my experience of creating a dictionary for spell checking with LanguageTool! I hope you find it useful.

Part 1: How to create the dictionary

You need:

• A .txt file containing the word list

• An .info file that tells LT how to build the output file (one may already be present in the LT directory)

• The LanguageTool standalone version

• Java 8

At the end of this section, you will have:

• a .dict file, i.e. your dictionary in a form that LT can read

  1. Install the latest version of LT: Index of /snapshots/
  2. Make sure your .txt file has the right format (a) and encoding (b):
    a. one word per line
    b. UTF-8 encoding
  3. On the command line, run:
    a. java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder fr_FR -i <path of the word list> -info <path of the .info file> -o <path of the output file>

where:

i. fr_FR is the code of the dictionary’s language

ii. -i is the option for the input file (your .txt)

iii. -info is the option for the .info file that belongs to the dictionary. You can create it by following these instructions (Spell check - LanguageTool Wiki, “Configuring the dictionary” section) or use the .info file already present, if there is one, in \org\languagetool\resource\yourlanguage

iv. -o is the option that specifies where to save the .dict output file
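Before running the command, it can help to check the word list itself. The fsa_compile usage shown earlier in this thread lists --accept-bom, --accept-cr and --ignore-empty, all defaulting to false, so a BOM, Windows line endings or empty lines in the input might cause trouble. SpellDictionaryBuilder writes its own temporary file first and may already clean some of this up, so treat the following as a precautionary sketch only. The class name `WordListCheck` and its messages are my own invention.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch: report BOM, invalid UTF-8, CR characters and empty lines in a word list. */
final class WordListCheck {

    static List<String> findProblems(byte[] data) {
        List<String> problems = new ArrayList<>();
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            problems.add("file starts with a UTF-8 BOM");
        }
        String text;
        try {
            // Strict decoding: malformed bytes raise an exception instead of being replaced.
            text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            problems.add("file is not valid UTF-8");
            return problems;
        }
        if (text.indexOf('\r') >= 0) {
            problems.add("file contains CR characters (Windows line endings)");
        }
        String[] lines = text.split("\n", -1);
        int count = lines.length;
        // A final newline produces one empty trailing element; it is not an empty line.
        if (count > 0 && lines[count - 1].isEmpty()) {
            count--;
        }
        for (int i = 0; i < count; i++) {
            if (lines[i].replace("\r", "").isEmpty()) {
                problems.add("line " + (i + 1) + " is empty");
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Default file name is a placeholder; pass your word list as the first argument.
        String file = args.length > 0 ? args[0] : "french_dict.txt";
        List<String> problems = findProblems(Files.readAllBytes(Paths.get(file)));
        if (problems.isEmpty()) {
            System.out.println("No problems found in " + file);
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

The check works on raw bytes on purpose: reading the file with a lenient decoder would silently replace bad bytes and hide exactly the encoding problem described in the update above.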


Part 2: How to integrate the dictionary on LT for spell checking

You need:

• JDK 1.8 (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

• Maven (Maven – Download Apache Maven)

• A Java IDE (IntelliJ IDEA, Eclipse, etc.)

• .info file + .dict file (see Part 1)

• GitHub LanguageTool project (GitHub - languagetool-org/languagetool: Style and Grammar Checker for 25+ Languages)

  1. Add the JDK and Maven bin directories to your PATH (more info: Maven – Installing Apache Maven)
  2. Copy the .info and .dict files created in Part 1 to \languagetool-master\languagetool-language-modules\YourLanguage\src\main\resources\org\languagetool\resource\YourLanguage\hunspell
  3. In your IDE, open the Java file named after the language of your dictionary (e.g. French.java):

a. In YourLanguage.java, replace HunspellNoSuggestionRule with MorfologikYourLanguageSpellerRule:

  @Override
  public List<Rule> getRelevantRules(ResourceBundle messages) throws IOException {
    return Arrays.asList(
        new CommaWhitespaceRule(messages),
        new DoublePunctuationRule(messages),
        new GenericUnpairedBracketsRule(messages,
            Arrays.asList("[", "(", "{" /*"«", "‘"*/),
            Arrays.asList("]", ")", "}"
                /*"»", French dialog can contain multiple sentences. */
                /*"’" used in "d’arm" and many other words */)),
        // this line replaces the former HunspellNoSuggestionRule entry
        new MorfologikYourLanguageSpellerRule(messages, this),
        new UppercaseSentenceStartRule(messages, this),
        new MultipleWhitespaceRule(messages, this),
        new SentenceWhitespaceRule(messages),
        // specific to French:
        new CompoundRule(messages),
        new QuestionWhitespaceRule(messages)
    );
  }

b. Create the new MorfologikYourLanguageSpellerRule.java in \languagetool-master\languagetool-language-modules\YourLanguage\src\main\java\org\languagetool\rules\YourLanguage :

/* LanguageTool, a natural language style checker
 * Copyright (C) 2012 Marcin Miłkowski (http://www.languagetool.org)
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301
 * USA
 */

package org.languagetool.rules.fr;

import java.io.IOException;
import java.util.ResourceBundle;

import org.languagetool.Language;
import org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule;

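public final class MorfologikYourLanguageSpellerRule extends MorfologikSpellerRule {

    // Replace the suffix with your language code, e.g. MORFOLOGIK_RULE_FR_FR for French.
    public static final String RULE_ID = "MORFOLOGIK_RULE_CODEOFYOURLANGUAGE";

    // Location of the .dict file you copied in step 2.
    private static final String RESOURCE_FILENAME = "PATH TO YOUR .DICT FILE";

    // The constructor name must match the class name, otherwise the file does not compile.
    public MorfologikYourLanguageSpellerRule(ResourceBundle messages,
                                             Language language) throws IOException {
        super(messages, language);
    }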

    @Override
    public String getFileName() {
        return RESOURCE_FILENAME;
    }

    @Override
    public String getId() {
        return RULE_ID;
    }
}

c. Go to \languagetool-master\ on the command line and run: mvn package

d. Find the result in \languagetool-master\languagetool-standalone\target\LanguageTool-3.4-SNAPSHOT\LanguageTool-3.4-SNAPSHOT.
