Neural Network Rules

tiagosantos · October 14, 2017, 4:01pm

That would be a great opportunity to learn a bit more about how that works. Regarding neural networks, I only know the concept, and it really interests me. I also prefer using only certain newspapers for rule validation and creation, and I can point you to proper sources. I would stay away from Wikipedia and Tatoeba when creating the model, because they mix 4 different Portuguese standards, which would lower the quality of the model.

Google doesn’t have a publicly available Portuguese n-gram data set. Even so, I prepared Portuguese support to accept n-gram data if the user builds it or buys it online. Some disabled confusion pairs can be found here:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/confusion_set_candidates.txt

# Portuguese confusion sets, mostly homophones
# This file is not used by LanguageTool - to activate a pair, move it to confusion_sets.txt
# This list is a conversion of the paronyms list created by Konfekt, 
# licensed under WTFPL (https://en.wikipedia.org/wiki/WTFPL)
# https://gist.github.com/Konfekt/141b085ec247d8001058e4924954b402
#
# Factor 10000 produces on average good results without adjustments. 
# Users are welcome to test and report improved values for mainstream inclusion.
# Run ConfusionRuleEvaluator to obtain missing values.
#
# abade; abate; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abafo; abalo; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abalo; abano; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abano; abono; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abano; ébano; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# aberta; alerta; 10000;                           # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abeto; afeto; 10000;                             # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abjeção; objeção; 10000;                         # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# abordo; aborto; 10000;                           # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY
# aborto; acordo; 10000;                           # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY


NOTE: The header says "# English confusion sets," because I forgot to change that when creating the dummy file. The pairs are Portuguese.
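For anyone who wants to experiment with the candidates file, here is a rough sketch of how its lines could be read. It only assumes the `word1; word2; factor;` layout visible in the excerpt above; `ConfusionPair` and `parse_line` are names I made up for illustration and are not part of LanguageTool.

```python
import re
from typing import NamedTuple, Optional


class ConfusionPair(NamedTuple):
    """One entry of the candidates file (hypothetical helper type)."""
    word1: str
    word2: str
    factor: int
    active: bool  # False when the line is commented out with a leading '#'


# Matches "word1; word2; factor;" with an optional leading '#'.
# Anything after the third ';' (e.g. the "# p=x.xxx, ..." note) is ignored.
_PAIR_RE = re.compile(r"^\s*(#?)\s*([^;#\s]+)\s*;\s*([^;#\s]+)\s*;\s*(\d+)\s*;")


def parse_line(line: str) -> Optional[ConfusionPair]:
    """Return the pair on this line, or None for plain comments and blank lines."""
    m = _PAIR_RE.match(line)
    if m is None:
        return None
    hash_mark, word1, word2, factor = m.groups()
    return ConfusionPair(word1, word2, int(factor), active=(hash_mark == ""))


if __name__ == "__main__":
    sample = [
        "# Portuguese confusion sets, mostly homophones",
        "# abade; abate; 10000;      # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY",
        "# abano; ébano; 10000;      # p=x.xxx, r=y.yyy, zz+ZZ, Ngrams, DD-MM-YYYY",
    ]
    for pair in filter(None, map(parse_line, sample)):
        print(pair)
```

Single-word entries only; if the real file ever contains multi-word entries, the pattern would need adjusting.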

The best pairs, in my opinion, are not there. The e->é and por->pôr pairs are best suited for this rule, since pattern rules are not good at detecting these confusions.

The best free corpus I know of for this task is the DCEP: Digital Corpus of the European Parliament.
It is available for all major European languages, and even some minor ones.
You can find it here: DCEP: Digital Corpus of the European Parliament - European Commission

If you guide me through the process, I can do some grunt work, and I will definitely give feedback on the documentation if you wish.