[pt] Ricardo Joseh Lima – Linguistic advisor

marcoagpinto · April 17, 2022, 6:48pm

Hello Daniel and core team,

I was wondering whether Ricardo could officially become part of the volunteer team, since he is an asset to the project.

He is a linguistics professor at a Brazilian university, which adds value to LanguageTool.

He has been helping me revise rules and create new ones.

Thanks!

dnaber · April 18, 2022, 8:44am

@rjlima Would you like to become a committer for LanguageTool? We don’t really have the concept of an “official part of the team”, other than people having commit permissions on the GitHub repo.

rjlima · April 18, 2022, 1:50pm

That would be great! I have no experience with committing and only know a few things about it, but maybe there’s a tutorial I can study to get started. With a little help from @marcoagpinto, even better!

dnaber · April 18, 2022, 2:06pm

Great! I suggest you submit your first changes as pull requests. Once that works, I can give you write permissions on the repo. You can find some tips and links at Tips for new Committers | dev.languagetool.org.

rjlima · April 18, 2022, 2:09pm

Great! I will look at some cases and think of a simple change or rule, but it will take some time.

marcoagpinto · April 19, 2022, 4:04am

@rjlima

After checking for morphological information in Priberam or Infopédia, I add missing words to:
spelling.txt

and POS information to:
added.txt

Note that spelling.txt shouldn’t contain variant-specific words. In other words, the words added there must be common to all variants (remember that they are added to LibreOffice’s internal dictionary).

Note that added.txt has a separate part for pt-BR POS data.

Entries consisting of more than one word must be added to multiwords.txt.

Basically, you can find all the important files by searching in the repository for:

barbarisms-pt.txt
portuguese.info  (added.txt folder)

For years I searched for the folders the hard way, but searching for the two filenames above leads you to everything.

I have been compiling a list of POS tags. Note that years ago Priberam used the term “substantivo” for what it now calls “nome comum”, which is why you will still see “substantivo” here (I was too lazy to change the wording).

################################################
################################################
################################################
################################################

adj. 2 g.
informacional | adj. 2 g.
AQ0CS0
masc. e fem. pl. de informacional
AQ0CP0


adj. 2 g. 2 núm.
unissexo | adj. 2 g. 2 núm.
AQ0CN0


adj. 2 g. s. 2 g.
budista | adj. 2 g. s. 2 g.
AQ0CS0
NCCS000
budistas | adj. 2 g. s. 2 g.
AQ0CP0
NCCP000


adj. s. f.
tomadora | adj. s. f.
AQ0FS0
NCFS000
tomadoras | fem. pl. de tomador
AQ0FP0
NCFP000


adj. s. m.
tomador | adj. s. m.
AQ0MS0
NCMS000
tomadores | masc. pl. de tomador
AQ0MP0
NCMP000


adv.
sintaticamente | adv.
RG


gerúndio de verbo transitivo/intransitivo/pronominal
bebendo | gerúndio de beber
VMG0000


fem. sing. part. pass transitivo e intransitivo
bebida | singular
VMP00SF
bebidas | plural
VMP00PF


masc. sing. part. pass transitivo e intransitivo
bebido | singular
VMP00SM
bebidos | plural
VMP00PM


prep.
por | prep.
SPS00


pron. pess. 2 g.
você | singular
PP3CS000
vocês | plural
PP3CP000


s. 2 g.
agente | s. 2 g.
NCCS000
agentes | masc. e fem. pl. de agente
NCCP000 


s. f.
garrafa | s. f.
NCFS000
garrafas | fem. pl. de garrafa
NCFP000


s. f. | s. 2 g.
segurança | s. f. | s. 2 g.
NCFS000
NCMS000
seguranças | masc. e fem. pl. de segurança
NCFP000
NCMP000


s. m.
frasco | s. m.
NCMS000
frascos | masc. pl. de frasco
NCMP000


s. m. 2 núm.
NCMN000


v. tr.
beber | v. tr.
VMN0000
VMN01S0
VMN03S0
VMSF1S0
VMSF3S0


v. tr. e intr. | v. tr.
violar | v. tr. e intr. | v. tr.
VMN0000
VMSF1S0
VMSF3S0


v. tr. | v. intr. | v. pron.
desdizer | v. tr. | v. intr. | v. pron.
VMN0000
VMN01S0
VMN03S0 


v. tr. | v. pron.
reduzir | v. tr. | v. pron.
VMN0000
VMSF1S0
VMSF3S0


################################################
################################################
################################################
################################################
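
As an illustration of how the noun and adjective entries in the list above could be used, here is a minimal Python sketch. The lookup table is copied only from the examples in the list; anything not listed there is simply absent. The function name `added_lines` is hypothetical, and the tab-separated "form, lemma, tag" layout for added.txt is an assumption that should be checked against the existing entries in the file before use.

```python
# Lookup built only from the noun/adjective examples listed above.
# Keys are the Priberam labels; values are the tags for singular and plural forms.
POS_TAGS = {
    "adj. 2 g.": {"sing": ["AQ0CS0"], "pl": ["AQ0CP0"]},
    "adj. 2 g. s. 2 g.": {"sing": ["AQ0CS0", "NCCS000"], "pl": ["AQ0CP0", "NCCP000"]},
    "adj. s. f.": {"sing": ["AQ0FS0", "NCFS000"], "pl": ["AQ0FP0", "NCFP000"]},
    "adj. s. m.": {"sing": ["AQ0MS0", "NCMS000"], "pl": ["AQ0MP0", "NCMP000"]},
    "s. 2 g.": {"sing": ["NCCS000"], "pl": ["NCCP000"]},
    "s. f.": {"sing": ["NCFS000"], "pl": ["NCFP000"]},
    "s. m.": {"sing": ["NCMS000"], "pl": ["NCMP000"]},
}


def added_lines(form, lemma, label, number):
    """Return candidate added.txt lines for one word form.

    ASSUMPTION: added.txt uses one "form<TAB>lemma<TAB>tag" entry per line.
    `number` is "sing" or "pl"; `label` is a Priberam label from POS_TAGS.
    """
    return [f"{form}\t{lemma}\t{tag}" for tag in POS_TAGS[label][number]]


if __name__ == "__main__":
    lines = added_lines("garrafa", "garrafa", "s. f.", "sing")
    lines += added_lines("garrafas", "garrafa", "s. f.", "pl")
    for line in lines:
        print(line)
```

Labels that carry two readings (for example "adj. 2 g. s. 2 g.") yield two lines per form, one for the adjective tag and one for the noun tag, matching the pairs shown in the list.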

marcoagpinto · April 19, 2022, 4:06am

I should write a user guide

marcoagpinto · April 19, 2022, 4:14am

Also, here is how I check a rule against a corpus of 600,000 sentences:

java -Dfile.encoding=UTF-8 -Xmx4500M -jar languagetool-wikipedia.jar check-data -l pt-PT -r CONFUSÃO_AONDE_ONDE -f pt-PT.txt -f tatoeba-pt.txt --max-sentences 600000 --context-size 100 >12.txt

Edit: this must be run in a shell using the Wikipedia tool (languagetool-wikipedia.jar), and you must have the two files passed with -f (pt-PT.txt and tatoeba-pt.txt).

I start with 0.txt as the output file and increment the number every time I change the patterns or antipatterns: 1.txt, then 2.txt, and so on.
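
The numbering routine described above could be scripted along these lines. This is only a sketch: the helper names `next_output`, `build_command` and `run_check` are hypothetical, the Java options and tool flags are copied from the command above, and it assumes that `java`, `languagetool-wikipedia.jar` and the corpus files are available in the working directory.

```python
import subprocess
from pathlib import Path


def next_output(folder="."):
    """Return the first N.txt (0.txt, 1.txt, ...) that does not exist yet."""
    n = 0
    while (Path(folder) / f"{n}.txt").exists():
        n += 1
    return Path(folder) / f"{n}.txt"


def build_command(rule_id, corpus_files, lang="pt-PT",
                  max_sentences=600000, context_size=100):
    """Build the check-data call, mirroring the command line shown above."""
    cmd = ["java", "-Dfile.encoding=UTF-8", "-Xmx4500M",
           "-jar", "languagetool-wikipedia.jar",
           "check-data", "-l", lang, "-r", rule_id]
    for name in corpus_files:
        cmd += ["-f", name]
    cmd += ["--max-sentences", str(max_sentences),
            "--context-size", str(context_size)]
    return cmd


def run_check(rule_id, corpus_files, folder="."):
    """Run the check and write the tool's output to the next free N.txt."""
    out_path = next_output(folder)
    with open(out_path, "w", encoding="utf-8") as out:
        subprocess.run(build_command(rule_id, corpus_files),
                       stdout=out, check=True)
    return out_path


if __name__ == "__main__":
    # Only prints the command; call run_check(...) to actually execute it.
    print(" ".join(build_command("CONFUSÃO_AONDE_ONDE",
                                 ["pt-PT.txt", "tatoeba-pt.txt"])))
```

Keeping every run in its own numbered file makes it easy to compare the matches before and after each change to a rule.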

Also, remember to run:

testrules pt

or

testrules pt-br
(if you change the specific grammar.xml for pt-BR)

Basically, before committing rules, open a shell in the stand-alone tool’s folder and type:

cls
testrules pt

My advice is that you use TortoiseSVN for the commits and Notepad++ for editing files.

You can also use my tool “Proofing Tool GUI” to sort wordlists or to generate multiple multiwords.txt entries of the same kind.

rjlima · April 19, 2022, 11:17am

Thanks! Next week I will look carefully at this information and give you feedback.