Update en_GB speller

Hello!

Could someone update the GB speller to my version 2.40?

Thanks!

Kind regards,
Marco A.G.Pinto

@marcoagpinto, @danielnaber, I am happy to learn how to do this task, but I have not got a clue about where to start. Daniel, is the task applicable to me in my role as English maintainer, or is the task something that must be done by one of the development team?

The dictionary hasn’t been updated for years - the process is documented at Spell check - LanguageTool Wiki but I’m not sure whether it will actually still work that way. Feel free to give it a try.

“To create a morfologik dictionary under Linux, you can use create_dict.sh.”

I use Windows. Is there a Windows equivalent of create_dict.sh?

I don’t think so. It doesn’t do much that’s specific to a Linux shell, so if you know bash and Windows scripting it can probably be ported (but I know nothing about Windows scripting).

@danielnaber, thanks. @marcoagpinto, unfortunately, I cannot help you with the update of the dictionary. I know nothing about bash or Windows scripting.

@danielnaber @Yakov @Mike_Unwalla

Can’t Yakov or Daniel update it?

The current version in LanguageTool has tons of words missing.

My GB speller has around 25K new words.

Three months ago it was suggested that it be added, but so far no one has taken care of it.

Is the task just to add words to the spellcheck dictionary from
https://github.com/marcoagpinto/aoo-mozilla-en-dict/blob/master/en_GB%20(Marco%20Pinto)/wordlist_marcoagpinto_20160901_160880w.txt
or do we also need to use en-GB.aff and en-GB.dic?

@Yakov
It is a Hunspell dictionary.

The files needed are the .DIC + .AFF + README.

:slight_smile:

I built the dictionary with the script languagetool/make_en_gb_dict.sh at master · languagetool-org/languagetool · GitHub,
but it failed the tests.

A possible problem is wrong frequency word data in

Or some words from the tests are present in the frequency word data but missing from the Hunspell dictionary.
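A quick way to surface that second case, as a sketch with placeholder file names (assuming both files are plain one-word-per-line lists, which the real data may not be):

```shell
# Toy inputs standing in for the real lists (names are placeholders).
printf 'colour\nflavour\nzedword\n' > frequency_words.txt
printf 'colour\nflavour\n' > hunspell_words.txt

# comm -23 keeps lines unique to the first (sorted) file, i.e. words
# present in the frequency data but missing from the dictionary.
sort -u frequency_words.txt > freq.sorted
sort -u hunspell_words.txt  > dict.sorted
comm -23 freq.sorted dict.sorted > missing_from_dict.txt
cat missing_from_dict.txt    # -> zedword
```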

What does that mean?

Can it be fixed?

I can:

  1. Use another version of en_gb_wordlist.xml
  2. Correct and review LT tests.

I’ll try these solutions today.

I’m taking a look. So far I can build the dictionary. There seems to be no problem with the frequency word list.

There are some tests that fail:

  • “Transexual” is included in the new dictionary (before it wasn’t), and there is a test that expects it to be “transsexual”.
  • “Doesn’t” gives a spelling error. That’s because the words from the new dictionary need to be properly tokenized.

These problems can be fixed. Anyway, I think it is not advisable to update the dictionary now, a few days ahead of the release. We need more time and some more tests.

“Definition of transsexual in English:transsexual (also transexual)”

It can be written both ways.

Anyway, we spoke about a dictionary update three months ago.

I hope it will be available (updated) in the next three months (LT 3.6) since the old speller has thousands of words missing.

I fixed it.
I added “Ph” (for Ph.D.) and “doesn” (for doesn’t) to the dictionary,
removed “transexual” from the dictionary,
and tuned the frequency for some words in en_gb_wordlist.xml.

Thanks, Yakov!

We now have the ultimate British speller!!!

:stuck_out_tongue:

Yakov,

I think this is too hasty.

There are a lot of things that probably need to be fixed (before the release). For example, we now have spelling errors in: ain’t, aren’t, couldn’t… (similar to doesn’t).

Besides, the new dictionary contains a lot of unnecessary genitives: Aachen’s/Aachen, Aarhus’s/Aarhus, Aaron’s/Aaron… (~30,000 words).

We should have tokenized the words properly before building the new dictionary.
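A minimal sketch of that pre-tokenization for the genitive case (placeholder data and file names; the real step in make_en_gb_dict.sh may work differently): strip the trailing “’s” so only base forms go into the dictionary, since the tokenizer splits the possessive off at check time anyway.

```shell
# Toy word list including redundant genitives (placeholder data,
# straight apostrophes for simplicity).
printf "Aachen\nAachen's\nAaron\nAaron's\n" > wordlist.txt

# Drop trailing "'s" and deduplicate, keeping only base forms.
sed "s/'s$//" wordlist.txt | sort -u > wordlist.tokenized.txt
cat wordlist.tokenized.txt    # -> Aachen, Aaron
```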

Remember that we avoid committing new versions of the dictionaries often, because they take a lot of space in the GitHub repository.

And we need to remove the dot from words like “etc.”

You need to do a full tokenization. The tokenization will take care of the problems with “doesn’t”, “etc.”, genitives…

In English I think you can do it with a simple script, using the tokenizing characters (here and here).
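As a sketch of that “simple script” idea (the exact tokenizing character set here is an assumption; LT’s real set is in the linked files): split each entry on the tokenizing characters and keep every non-empty piece as its own line.

```shell
# Toy list (placeholder data) with the problem cases from this thread.
printf "doesn't\netc.\nDVD+RW\n" > words.txt

# Replace apostrophe, dot and plus with newlines, drop empty pieces,
# and deduplicate; each remaining line is a dictionary token.
sed -E "s/['.+]/\n/g" words.txt | sed '/^$/d' | sort -u
```

Note that GNU sed is assumed here (it interprets `\n` in the replacement as a newline).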

Or you can do it with the LT tokenizer like in this file (now deleted).

The separator character (+) should be changed in en_GB.info (for example, to “_”), because there is a word that contains it (DVD+RW), or that word should be removed.
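A hedged sketch of the second option, with placeholder data and file names (I believe the relevant .info property in morfologik dictionaries is `fsa.dict.separator`, but treat that as an assumption): find the conflicting entries, then drop them before the build.

```shell
# Toy list containing a word with the "+" separator character.
printf 'DVD+RW\ncolour\n' > wordlist.txt

# Option 1: list entries that clash with the separator.
grep -F '+' wordlist.txt              # -> DVD+RW

# Option 2: drop them before building the dictionary.
grep -Fv '+' wordlist.txt > wordlist.clean.txt
cat wordlist.clean.txt                # -> colour
```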

Anyway, with 25000 new words in the dictionary, many unexpected issues can arise…

I fixed it with the script make_en_gb_dict.sh.
This script uses the LT tokenizer.
The new dictionary checks words like “doesn’t”, “etc.”, “Aachen’s”, “ain’t”, “aren’t”, “couldn’t”, and “DVD+RW” correctly.