Update en_GB speller

Hello!

Could someone update the GB speller to my version 2.40?

Thanks!

Kind regards,
Marco A.G.Pinto

@marcoagpinto, @danielnaber, I am happy to learn how to do this task, but I have not got a clue about where to start. Daniel, is the task applicable to me in my role as English maintainer, or is the task something that must be done by one of the development team?

The dictionary hasn’t been updated for years - the process is documented at Spell check - LanguageTool Wiki but I’m not sure whether it will actually still work that way. Feel free to give it a try.

“To create a morfologik dictionary under Linux, you can use create_dict.sh.”

I use Windows. Is there a Windows equivalent of create_dict.sh?

I don’t think so. It doesn’t do much that’s specific to a Linux shell, so if you know bash and Windows scripting it can probably be ported (but I know nothing about Windows scripting).

@danielnaber, thanks. @marcoagpinto, unfortunately, I cannot help you with the update of the dictionary. I know nothing about bash or Windows scripting.

@danielnaber @Yakov @Mike_Unwalla

Can’t Yakov or Daniel update it?

The current version in LanguageTool has tons of words missing.

My GB speller has around 25K new words.

Three months ago it was suggested that it be added, but so far no one has taken care of it.

Is the task just to add words to the spellcheck dictionary from
https://github.com/marcoagpinto/aoo-mozilla-en-dict/blob/master/en_GB%20(Marco%20Pinto)/wordlist_marcoagpinto_20160901_160880w.txt
or do we also need to use en-GB.aff and en-GB.dic?

@Yakov
It is a Hunspell dictionary.

The files needed are the .DIC + .AFF + README.

:slight_smile:

I built the dictionary with the script languagetool/make_en_gb_dict.sh at master · languagetool-org/languagetool · GitHub,
but it failed the tests.

A possible problem is wrong frequency word data in

Or some words from the tests are present in the frequency word data but missing from the Hunspell dictionary.
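A quick way to surface that second case, as a sketch with placeholder file names (assuming both files are plain one-word-per-line lists, which the real data may not be):

```shell
# Toy inputs standing in for the real lists (names are placeholders).
printf 'colour\nflavour\nzedword\n' > frequency_words.txt
printf 'colour\nflavour\n' > hunspell_words.txt

# comm -23 keeps lines unique to the first (sorted) file, i.e. words
# present in the frequency data but missing from the dictionary.
sort -u frequency_words.txt > freq.sorted
sort -u hunspell_words.txt  > dict.sorted
comm -23 freq.sorted dict.sorted > missing_from_dict.txt
cat missing_from_dict.txt    # -> zedword
```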

What does that mean?

Can it be fixed?

I can:

  1. Use another version of en_gb_wordlist.xml
  2. Correct and review LT tests.

I’ll try these solutions today.

I’m taking a look. So far I can build the dictionary. There seems to be no problem with the frequency word list.

There are some tests that fail:

  • “Transexual” is included in the new dictionary (before it wasn’t), and there is a test that expects it to be “transsexual”.
  • “Doesn’t” gives a spelling error. That’s because the words from the new dictionary need to be properly tokenized.

These problems can be fixed. Anyway, I think it is not advisable to update the dictionary now, a few days ahead of the release. We need more time and some more tests.

“Definition of transsexual in English:transsexual (also transexual)”

It can be written both ways.

Anyway, we spoke about a dictionary update three months ago.

I hope it will be available (updated) in the next three months (LT 3.6) since the old speller has thousands of words missing.

I fixed it.
I added “Ph” (for Ph.D.) and “doesn” (for doesn’t) to the dictionary,
removed “transexual” from the dictionary,
and tuned the frequency for some words in en_gb_wordlist.xml.

Thanks, Yakov!

We now have the ultimate British speller!!!

:stuck_out_tongue:

Yakov,

I think this is too hasty.

There are a lot of things that probably need to be fixed (before the release). For example, we now have spelling errors in: ain’t, aren’t, couldn’t… (similar to doesn’t).

Besides, the new dictionary contains a lot of unnecessary genitives: Aachen’s/Aachen, Aarhus’s/Aarhus, Aaron’s/Aaron… (~30,000 words).

We should have tokenized the words properly before building the new dictionary.
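A minimal sketch of that pre-tokenization for the genitive case (placeholder data and file names; the real step in make_en_gb_dict.sh may work differently): strip the trailing “’s” so only base forms go into the dictionary, since the tokenizer splits the possessive off at check time anyway.

```shell
# Toy word list including redundant genitives (placeholder data,
# straight apostrophes for simplicity).
printf "Aachen\nAachen's\nAaron\nAaron's\n" > wordlist.txt

# Drop trailing "'s" and deduplicate, keeping only base forms.
sed "s/'s$//" wordlist.txt | sort -u > wordlist.tokenized.txt
cat wordlist.tokenized.txt    # -> Aachen, Aaron
```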

Remember that we avoid committing new versions of the dictionaries often, because they take a lot of space in the GitHub repository.

And we need to remove the dot from words like “etc.”

You need to do a full tokenization. The tokenization will take care of the problems with “doesn’t”, “etc.”, genitives…

In English I think you can do it with a simple script, using the tokenizing characters (here and here).
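As a sketch of that “simple script” idea (the exact tokenizing character set here is an assumption; LT’s real set is in the linked files): split each entry on the tokenizing characters and keep every non-empty piece as its own line.

```shell
# Toy list (placeholder data) with the problem cases from this thread.
printf "doesn't\netc.\nDVD+RW\n" > words.txt

# Replace apostrophe, dot and plus with newlines, drop empty pieces,
# and deduplicate; each remaining line is a dictionary token.
sed -E "s/['.+]/\n/g" words.txt | sed '/^$/d' | sort -u
```

Note that GNU sed is assumed here (it interprets `\n` in the replacement as a newline).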

Or you can do it with the LT tokenizer like in this file (now deleted).

The separator character (+) should be changed in en_GB.info (for example, to “_”), because there is a word that contains it (DVD+RW), or that word should be removed.
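A hedged sketch of the second option, with placeholder data and file names (I believe the relevant .info property in morfologik dictionaries is `fsa.dict.separator`, but treat that as an assumption): find the conflicting entries, then drop them before the build.

```shell
# Toy list containing a word with the "+" separator character.
printf 'DVD+RW\ncolour\n' > wordlist.txt

# Option 1: list entries that clash with the separator.
grep -F '+' wordlist.txt              # -> DVD+RW

# Option 2: drop them before building the dictionary.
grep -Fv '+' wordlist.txt > wordlist.clean.txt
cat wordlist.clean.txt                # -> colour
```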

Anyway, with 25000 new words in the dictionary, many unexpected issues can arise…

I fixed it with the script make_en_gb_dict.sh.
This script uses the LT tokenizer.
The new dictionary checks words like “doesn’t”, “etc.”, “Aachen’s”, “ain’t”, “aren’t”, “couldn’t”, and “DVD+RW” correctly.