Brazilian contribution (GSoC 2018)

tiagosantos · February 15, 2018, 8:07pm

You caught me in a particularly busy period, but I will try to assist as much as possible, though probably not every day. I believe that what the Portuguese section needs most is a bit of polish.

There are many typos that need to be fixed, URLs that need a review, rule name inconsistencies, imprecisions, etc. These would be the very easy tasks, though not necessarily coding ones. However, they are useful tasks that can introduce you to the metalanguage used by LT, without wasting time on futile exercises.
Then there is the need to give Brazilian Portuguese the same coverage as European Portuguese. The vast majority of the rules Marco and I added work for all Portuguese variants, but there are some variant-specific rules that I added only to European Portuguese. I have converted only a couple of those dialect-related rules to pt-BR. This would be another easy task, and a great way to see whether you have mastered LT logic and the metalanguage.
Then there are a few ideas that apply specifically to Portuguese. For example, making a simple word ‘translator’ for <message> strings, so that the messages in the main grammar.xml (shared by all variants) can be used with pt-BR, pt-AO and other variants, without having to copy the rules from the main branch just to change a word in the message. This would need Java knowledge to interface with the existing API.
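
To make the ‘translator’ idea a bit more concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: the class name `MessageWordTranslator` and the word table are hypothetical, and it does not touch the LT API at all. It simply swaps whole words in a message string using a per-variant table, which is the part that would later have to be hooked into wherever LT builds the rule message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LanguageTool: rewrites single words in a
// rule message so one shared message can serve several language variants.
final class MessageWordTranslator {
    private final Map<String, String> replacements;
    private final Pattern pattern; // null when there is nothing to replace

    MessageWordTranslator(Map<String, String> replacements) {
        this.replacements = new LinkedHashMap<>(replacements);
        if (this.replacements.isEmpty()) {
            this.pattern = null;
            return;
        }
        StringBuilder alternation = new StringBuilder();
        for (String word : this.replacements.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word));
        }
        // Match whole words only: no letter or digit directly before or after.
        this.pattern = Pattern.compile(
            "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])");
    }

    // Returns the message with every table word replaced (case-sensitive).
    String translate(String message) {
        if (pattern == null) {
            return message;
        }
        Matcher matcher = pattern.matcher(message);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = replacements.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

With a table mapping "ficheiro" to "arquivo" and "facto" to "fato", `translate("Verifique o facto no ficheiro.")` gives "Verifique o fato no arquivo.", while "ficheiros" is left alone because only whole words match. A real version would also need to handle capitalisation and inflected forms, which this sketch ignores.
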
The other short-term project, which I have already started but which, given my availability, is tending to take long, is to add a decent list of confusion pairs to the word2vec models.

This is a more difficult task: create a word extractor that uses hunspell logic to find words similar to dictionary words. To make my idea easier to understand: create a list with all hunspell words. Consider each word in the list a ‘misspelled’ word. Which would be the first word suggested as a replacement? That is the confusion pair. The biggest issue is doing so with the criteria already given in the affix file, and doing all the combinations in an efficient manner, since there are too many combinations to test.

This development tool will be useful for all languages.
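
A rough sketch of the extractor idea, in Java, under loud assumptions: it does not use hunspell or the affix file at all. Plain Levenshtein edit distance stands in for hunspell's suggestion ranking, and the loop is a brute-force comparison of every word with every other word, which is exactly the efficiency problem mentioned above. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: treats each dictionary word as if it were misspelled
// and pairs it with its closest *other* dictionary word.
final class ConfusionPairSketch {

    // Classic two-row Levenshtein edit distance (stand-in for hunspell ranking).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Maps each word to its nearest neighbour within maxDistance edits.
    // Words with no neighbour that close get no entry. Ties keep the first hit.
    static Map<String, String> confusionPairs(List<String> words, int maxDistance) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String word : words) {
            String best = null;
            int bestDistance = maxDistance + 1;
            for (String candidate : words) {
                if (candidate.equals(word)) {
                    continue; // a word is not its own confusion pair
                }
                // The length difference is a lower bound on the edit distance,
                // so candidates that cannot beat the current best are skipped.
                if (Math.abs(candidate.length() - word.length()) >= bestDistance) {
                    continue;
                }
                int distance = levenshtein(word, candidate);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = candidate;
                }
            }
            if (best != null) {
                pairs.put(word, best);
            }
        }
        return pairs;
    }
}
```

For the list "casa", "caso", "computador" with `maxDistance` 1, this pairs "casa" with "caso" (and the reverse) and leaves "computador" unpaired. The quadratic loop will not scale to a full dictionary; an index over candidate edits, or the replacement tables from the affix file, would be the natural next step.
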
Also for all languages you can check this:

There are some good suggestions from various committers, including one from me (hopefully not totally unreasonable).
I reaffirm the need to create a metalanguage for chunkers that can mimic Freeling, but that has to be developed independently, for licensing reasons. This would be the most time-consuming task, and you would need to use your NLP skills in addition to Java. You wouldn’t have to worry that much about the chunker itself, just about the mechanism to combine syntactic groups into chunks through a simple text metalanguage.
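
As an illustration of what a simple text metalanguage for chunks might look like, here is a small Java sketch. Everything in it is an assumption made up for the example: the rule syntax (`NAME: TAG TAG? TAG* TAG+`), the POS tag names, the B-/I-/O output labels and the class name. It is not Freeling's rule format and it does not use the LT chunker API. Each rule line is compiled to a regular expression over the space-joined tag sequence, and rules are tried in order at each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a text-based chunk rule, e.g. "NP: DET? ADJ* NOUN".
final class ChunkRuleSketch {
    final String name;
    final Pattern pattern;

    private ChunkRuleSketch(String name, Pattern pattern) {
        this.name = name;
        this.pattern = pattern;
    }

    // Parses one rule line; each element is a tag with an optional ?, * or +.
    static ChunkRuleSketch parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Expected 'NAME: pattern': " + line);
        }
        StringBuilder regex = new StringBuilder();
        for (String element : line.substring(colon + 1).trim().split("\\s+")) {
            String tag = element;
            String quantifier = "";
            if (tag.endsWith("?") || tag.endsWith("*") || tag.endsWith("+")) {
                quantifier = tag.substring(tag.length() - 1);
                tag = tag.substring(0, tag.length() - 1);
            }
            if (tag.isEmpty()) {
                throw new IllegalArgumentException("Empty tag in rule: " + line);
            }
            // Every tag is matched together with its trailing separator space.
            regex.append("(?:").append(Pattern.quote(tag)).append(" )").append(quantifier);
        }
        return new ChunkRuleSketch(line.substring(0, colon).trim(),
                                   Pattern.compile(regex.toString()));
    }

    // Labels each token B-NAME / I-NAME, or O when no rule matches there.
    static List<String> chunk(List<String> posTags, List<ChunkRuleSketch> rules) {
        StringBuilder builder = new StringBuilder();
        int[] offsets = new int[posTags.size() + 1]; // char offset of each token
        for (int i = 0; i < posTags.size(); i++) {
            offsets[i] = builder.length();
            builder.append(posTags.get(i)).append(' ');
        }
        offsets[posTags.size()] = builder.length();
        String joined = builder.toString();

        List<String> labels = new ArrayList<>();
        int i = 0;
        while (i < posTags.size()) {
            int end = i;
            String chunkName = null;
            for (ChunkRuleSketch rule : rules) {
                Matcher matcher = rule.pattern.matcher(joined);
                matcher.region(offsets[i], joined.length());
                if (matcher.lookingAt() && matcher.end() > offsets[i]) {
                    // Convert the match end back to a token index.
                    while (offsets[end] < matcher.end()) {
                        end++;
                    }
                    chunkName = rule.name;
                    break; // first matching rule wins
                }
            }
            if (chunkName == null) {
                labels.add("O");
                i++;
            } else {
                for (int k = i; k < end; k++) {
                    labels.add((k == i ? "B-" : "I-") + chunkName);
                }
                i = end;
            }
        }
        return labels;
    }
}
```

With the rules `NP: DET? ADJ* NOUN` and `VP: VERB+`, the tag sequence DET ADJ NOUN VERB DET NOUN ADV is labelled B-NP I-NP I-NP B-VP B-NP I-NP O. Compiling to a regex keeps the sketch short and gets backtracking for free; a real metalanguage would probably also need to match on lemmas and word forms, not just tags.
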

Other important tasks, but potentially frustrating to work on:
make the phrase system work as intended (it messes up rules and doesn’t work in all logical situations),
fix multiple rule creation with and/or tags,
multiplatform GUI improvements,
port all metalanguage code options used in grammar.xml to disambiguation.xml.

Hope this gives some food for thought.
Best regards,

Tiago Santos

P.S. - Oh! And hopefully, welcome to the project!