
Brazilian contribution (GSoC 2018)


#1

Hi everyone! :slight_smile:

I’m Brazilian and I’m looking forward to helping improve the Tool, especially making it better for my mother language. I have a Java background and some experience with NLP. How can I help?


(Daniel Naber) #2

Hi, welcome to LanguageTool! Maybe @tiagosantos can share his ideas, he is working a lot on European Portuguese. Adding rules in grammar.xml is often the easiest approach. For GSoC, we have to keep in mind that it’s about programming, so just writing rules in XML would not be enough.


(Tiago F. Santos) #3

Hi @Ferch42,

You’ve caught me in a particularly busy period, but I will try to assist as much as possible, though probably not every day. I believe what the Portuguese section needs most is a bit of polish.

There are many typos that need to be fixed, URLs that need a review, rule name inconsistencies, imprecisions, etc. These would be the very easy tasks, though not necessarily about coding. Still, these useful tasks can introduce you to the metalanguage used by LT without wasting time on futile exercises.
Then there is the need to give Brazilian Portuguese the same coverage as European Portuguese. The vast majority of the rules Marco and I added work for all Portuguese variants, but there were some language-specific rules that I added only to European Portuguese. I have only converted a couple of those dialect-related rules to pt-BR. This would be another easy task, but a great way to see if you master LT logic and the metalanguage.
Then there are a few ideas that apply specifically to Portuguese. For example, making a simple <message> word ‘translator’, so that strings in the main grammar.xml (shared by all variants) can be used with pt-BR, pt-AO and other variants, without having to copy the rules from the main branch just to change a word in the message. This would need Java knowledge to interface with the existing API.
The other short-term project, which I have already started but which, given my availability, is taking a while, is adding a decent list of confusion pairs to the word2vec models.

This is a more difficult task: create a word extractor that uses hunspell logic to find words similar to dictionary words. To make the idea easier to understand: create a list with all hunspell words, consider each word in the list a ‘misspelled’ word, and ask which word would be suggested first as a replacement. That is the confusion pair. The biggest issue is doing so with the criteria already given in the affix file, and testing all combinations in an efficient manner, since there are too many combinations to test naively.
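To make the idea concrete, here is a minimal sketch of that nearest-neighbour loop, assuming a plain Levenshtein distance as a stand-in for hunspell’s real suggestion logic (which also uses the affix file and REP table); the function names and the toy word list are my own:

```python
# Sketch of the confusion-pair idea: treat every dictionary word as a
# "misspelling" and find the closest *other* dictionary word; that
# nearest neighbour is the confusion-pair candidate.

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, standing in for hunspell's suggestion logic."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def confusion_pairs(words, max_distance=2):
    """For each word, return its nearest neighbour in the word list."""
    pairs = []
    for w in words:
        dist, best = min((levenshtein(w, other), other)
                         for other in words if other != w)
        if dist <= max_distance:
            pairs.append((w, best))
    return pairs

# Toy Portuguese word list; a real run would use all words expanded
# from the hunspell dictionary, which is where efficiency becomes the issue.
words = ["sessão", "seção", "cessão", "caçar", "cassar"]
for a, b in confusion_pairs(words):
    print(a, "->", b)
```

The quadratic pairwise comparison here is exactly the naive approach the paragraph above warns about; the real work is pruning the candidate space.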

This development tool will be useful for all languages.
Also for all languages you can check this:

There are some good suggestions from various committers, including one from me (hopefully not totally unreasonable).
I reaffirm the need to create a metalanguage for chunkers that can mimic Freeling but has to be developed independently, for licensing reasons. This would be the most time-consuming task, and you would need to use your NLP skills in addition to Java. You wouldn’t have to worry that much about the chunker itself, just about the mechanism to group syntactic units into chunks through a simple text metalanguage.

Other important tasks, but potentially frustrating to work on:

  • make the phrase system work as intended (it messes up rules and doesn’t work in all logical situations),
  • fix multiple-rule creation with and/or tags,
  • multiplatform GUI improvements,
  • port all metalanguage options used in grammar.xml to disambiguation.xml.

Hope this gives some food for thought.
Best regards,

Tiago Santos

P.S. - Oh! And hopefully, welcome to the project!


#4

Many thanks for the welcome, the overview and the suggestions! I feel like starting simple and getting to the complicated stuff afterwards. I will start by reviewing rules, and later on I would like to take a shot at this word translator.
Thanks everyone!


#5

Hi @tiagosantos

Could you please elaborate on that word extractor? I would like to take a shot at it. :slight_smile:

Thanks


(Tiago F. Santos) #6

Hi @Ferch42,

That is great, and it connects to the other project you have shown interest in, the word2vec models.
There are scripts to extract all possible words from a hunspell dictionary. For example:

https://sourceforge.net/p/hunspell/patches/55/

Then you can have a sorting/connections script like the ones drafted here:


The base script is in here:

This script is limited to working with the REP table (common phonetic confusions/replacements). Keyboard distance is also useful.
Don’t forget to ask Konfekt for the license and permission if you base your work on this.
There is also the very useful script from @dnaber here:


It can be used as a second pass to filter the resulting list, but it wouldn’t provide the best possible results on its own; those are achieved by combining keyboard distance and Levenshtein distance in the suggestion sorting algorithm.
Since you are applying to GSoC, and this is a coding exercise, you may wish to merge both functions into a single program and add a few more tuning features. Remember that the list should be comprehensive, to be trimmed later on, since each word2vec pair takes at least 4 minutes to process, depending on the rig/GPU you have.
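One way to read the “combine keyboard distance and Levenshtein” suggestion is a weighted edit distance where substituting keyboard-adjacent keys costs less, so typo-like pairs sort ahead of other same-length pairs. A hypothetical sketch (the tiny QWERTY adjacency slice and the 0.5 cost are my own assumptions, not LT code):

```python
# Levenshtein variant where substituting keyboard-adjacent keys is
# cheaper, so finger-slip pairs rank higher (lower cost).

# Tiny slice of a QWERTY adjacency map; a full per-layout map would be
# supplied for each language/keyboard.
ADJACENT = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr",
    "o": "iklp", "i": "ujko",
}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    # Neighbouring keys are a likely slip of the finger: half cost.
    return 0.5 if b in ADJACENT.get(a, "") else 1.0

def weighted_distance(a: str, b: str) -> float:
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

# "casa" -> "cada": s and d are adjacent keys, so this pair scores
# lower (better) than "casa" -> "cana", where s and n are not.
print(weighted_distance("casa", "cada"))  # 0.5
print(weighted_distance("casa", "cana"))  # 1.0
```

Sorting candidate pairs by this score would put keyboard-plausible confusions at the top of the list before trimming.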


#7

Hi there

I have built a word extractor for confusion pairs :slight_smile: . I tried to extend the idea of using a REP table to build possible words, so I separated the work into two functions, which I called PhonemeSubstitution and VowelSubstitution. PhonemeSubstitution changes consonant sounds that sound alike; VowelSubstitution changes the vowels. Using this approach, I tried to shorten the REP table, limiting it to consonant sounds only. Furthermore, I added a function that creates possible permutations of the word, but only within a narrow window of 2 characters that may shift within the word, and a function that returns variations of the word one character shorter. The window size can be changed, and I believe that combinations of these functions may create possible words of second order.
I tried to keep the program simple in order for it to be applicable to any language, so vowel substitution is quite limited by default, and the phoneme substitution table (REP table) must come from an input file. I would like to highlight that preprocessing of the dictionary is essential, as it increases the speed considerably.
I tried to apply it to a hunspell dictionary for pt-BR, and even though I got some results, I believe the list I got is far too big and too noisy, as it was not filtered for plurals and verb inflections. I would like to ask for feedback, please :slight_smile:
My work can be found at https://github.com/Ferch42/Word-Extractor
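For readers who don’t want to open the repository, the VowelSubstitution idea described above boils down to something like this minimal sketch (my own hypothetical re-implementation, not the repository code): swap each vowel for every other vowel and keep only the variants that are themselves dictionary words.

```python
# Generate vowel-swap variants of a word and keep the ones that are
# real dictionary words — those become confusion-pair candidates.

VOWELS = "aeiou"

def vowel_variants(word: str):
    """Yield every word obtained by replacing one vowel with another."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            for v in VOWELS:
                if v != ch:
                    yield word[:i] + v + word[i + 1:]

def vowel_pairs(word: str, dictionary: set):
    """Keep only variants that are real words."""
    return sorted(set(vowel_variants(word)) & dictionary)

# Toy pt-BR dictionary for illustration.
dictionary = {"pelo", "pela", "pilha", "polo"}
print(vowel_pairs("pelo", dictionary))  # ['pela', 'polo']
```

This also makes the plural/inflection noise visible: since Portuguese inflections often differ by exactly one vowel (pelo/pela), the dictionary intersection admits them, which is why a morphological filter is needed afterwards.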


(Daniel Naber) #8

That sounds like a good approach! Can you maybe post a part of the result, so we can have a look without running it ourselves?


(Tiago F. Santos) #9

This. Since I can’t install compilers at the moment, I tried your code on https://repl.it/languages/Python3, without success. Limit the script to 500-1000 pairs for the sake of brevity; this will make it easier to see the results and suggest improvements.

What I gather from this algorithm looks good so far. The pair list is necessarily big, maybe as big as the dictionary itself. That is why providing the best matches per pair and filtering the results is extremely important for this task.

This approach has pros and cons. You will be able to create useful lists straight away, and their accuracy will increase as you add more comparison functions and filters. The issue is that each language will have to adapt the code to create a suitable list, i.e. each list/language will need, at least, a new phoneme list, a non-matching filter for plurals and declensions, and a character map table. Though trivial once done, the difficulty barrier may be too high for some.

The other approach is the one I suggested in the other thread: fork and tweak hunspell so it uses its own matching function to produce this list. It would be able to parse, interpret and use the existing dictionaries, so it would save you time looking for, or coding, each relevant matching function. Its being written mostly in C++ may also be an issue.

Anyway, I’m looking forward to seeing where this leads.


(Ruud Baars) #10

I once learned about an algorithm from the OCR developments at Tilburg University. In short, it does this:

  • give every letter a number (its ASCII value)
  • get only the distinct letters from the word
  • compute the sum of the 5th power of all unique letters
    This number stands for a lot of varieties. It can be adapted further by having the most common letter-combination mistakes (REP) as a number value to add to or subtract from this number.
    The results could be filtered by the Levenshtein distance, in relation to the length of the word, to rate the alternatives; maybe frequency is an option too.

#11

Thanks everyone for the feedback

I will limit the size of the list to a few hundred pairs, and I have added a sample of the output file I got to my repo. It is matching a lot of plurals and verb inflections, mainly because in Portuguese these often arise just by changing a vowel.
I agree that the preprocessing is quite demanding because of the need to filter plurals and verb inflections, and I liked the suggestion of tweaking hunspell. I will try to create a script based on hunspell that either does this filtering or extracts the words directly. I don’t know that much C++, though, but I will try.

Maybe that won’t work, because running the code requires installing the python-Levenshtein package.

@Ruud_Baars If possible, could you point me to more information regarding your approach?


(Tiago F. Santos) #12

I was not “blaming” the code. I was just stating a limitation.

I’ve seen the list. The sorting of pairs put quite rare words at the beginning, but with multiple suggestions the filtering can be effective to some degree. Anyway, the hunspell route is probably much more efficient, and if you are going to try it, this can be disregarded.