[pt] Some false positives in my thesis - 2016-12-16

marcoagpinto · December 16, 2016, 11:05pm

Hello @tiagosantos

Only now have I had the chance to run some “real” tests on my thesis, using tonight's nightly build.

Here are some improvements that could be made.

Please note that these come only from the Roman-numbered pages. On Monday I will try to continue checking from chapter 1 onward.

Thanks!

#1
“e o Nuno Leitão que é meu amigo no Facebook e Skype.”
LT suggests “que é o meu amigo”.

#2
“À comunidade Mozilla (Firefox, Thunderbird e SeaMonkey)”

#3
“À Eduarda Guimarães dos CTT pelas dicas,”

#4
“, e as suas dicas foram preciosas.”

#5
“Ao Coronel Desidério Manuel Vilas Leitão,”
There seem to be some issues with proper names.
Maybe I should add them to morphologic as names? Is that a possible solution? How do I do it?

#6
“Por eles nutro um grande respeito”

#7
“É o acto ou a intenção com o objectivo de causar danos a Estados soberanos, a forças armadas,”

#8
“O uso sistemático da agressão ou da contra agressão por parte de”

#9
“sistemas informáticos, redes e a respectiva Informação neles armazenada”

tiagosantos · December 17, 2016, 4:44pm

Hi Marco,

This is very good.

I have seen these cases before. I have not addressed them yet, since they are a small minority and I have not figured out a good antipattern to deal with them. I will try a simple exception for all inflected forms of ‘ser’ in the previous token, but that may not be a solid exception.
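To make the idea of a previous-token exception concrete, here is a toy sketch in Python. It is not LanguageTool code: it assumes the rule behind #1 flags a possessive that has no article before it, and the word lists and the function name `missing_article_hits` are hypothetical and deliberately incomplete.

```python
# Toy illustration only: not LanguageTool code, and the word lists are incomplete.
SER_FORMS = {"sou", "és", "é", "somos", "sois", "são", "era", "eram",
             "foi", "foram", "ser", "sendo", "sido"}
POSSESSIVES = {"meu", "minha", "meus", "minhas", "teu", "tua", "seu", "sua"}
ARTICLES = {"o", "a", "os", "as"}


def missing_article_hits(tokens, skip_after_ser=True):
    """Return the indexes of possessives that have no article before them.

    With skip_after_ser=True, a possessive whose previous token is a form
    of 'ser' is not reported (the exception discussed above).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in POSSESSIVES:
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        if prev in ARTICLES:
            continue  # article present, nothing to report
        if skip_after_ser and prev in SER_FORMS:
            continue  # exception: "é meu amigo" is left alone
        hits.append(i)
    return hits


sentence = "e o Nuno Leitão que é meu amigo no Facebook e Skype".split()
print(missing_article_hits(sentence, skip_after_ser=False))  # reports "meu"
print(missing_article_hits(sentence))                        # exception applies
```

The weakness mentioned above shows up here too: the exception silences every possessive after ‘ser’, including cases where the article really is wanted.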

The morphological dictionary needs those names added/fixed. You can add them to added.txt, or if you feel confident in changing the binary, go ahead. That is a much-needed project. Please create a repository for it, so we can track changes and review what is missing or wrong.
You need to rebase first on:

If you take on this task, you may want to fork it. It is the project that holds the source morphological dictionary, with many updates over the last 4 years. It has not been massively changed, but some issues have already been fixed there, and many more words are recognized, especially words that changed in AO90.
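For the added.txt route mentioned above, a small helper can generate the entries for proper names. This is only a sketch under two assumptions that should be checked against the real files: that added.txt takes one tab-separated entry per line (inflected form, lemma, POS tag), and that `NP00000` is the proper-noun tag in the tagset in use. The function name is hypothetical.

```python
# Assumption: added.txt takes one entry per line as
# "inflected form<TAB>lemma<TAB>POS tag"; check the header of the real file.
# Assumption: "NP00000" is the proper-noun tag; verify against the tagset in use.
PROPER_NOUN_TAG = "NP00000"


def added_txt_lines(names, tag=PROPER_NOUN_TAG):
    """Build sorted, de-duplicated tagger entries for proper names."""
    unique = sorted({n.strip() for n in names if n.strip()})
    # For a proper name, the inflected form and the lemma are the same string.
    return [f"{name}\t{name}\t{tag}" for name in unique]


# Names taken from the sentences reported at the top of this thread.
names = ["Desidério", "Vilas", "Leitão", "Guimarães", "Leitão"]
for line in added_txt_lines(names):
    print(line)
```

The printed lines can then be reviewed by hand before being appended to added.txt.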

A simple update will fix most of these errors.

You will have to run a conversion script from another language, for example:
./languagetool-language-modules/gl/src/main/resources/org/languagetool/resource/gl

and then turn it into a binary via Developing a tagger dictionary - LanguageTool Wiki

Before submitting the new binary to the repo, please share the source link and allow a review, so we can avoid breaking existing rules or introducing needless regressions.

All those are important, and they will need more exceptions added to the specific rules. Fortunately, they are easy to solve.

This weekend I will not be working on this, but these issues will be resolved ASAP.

tiagosantos · December 18, 2016, 9:55pm

Pushed some fixes to git.

I had to review one of your rules to fix the false negative in “O uso sistemático da agressão ou da contra agressão por parte de”. I wish to improve many of the rules that you created before the release. I have left some comments in the code.

Since you seem to be time-constrained lately, can I review and improve them? I will not add credits to the base rules you create.

I also need to know ASAP if you are going to update the morphological dictionary as I discussed in the previous post. I prefer to avoid any binary related tasks, but if you do not do it, I will have to do it, before release date.

marcoagpinto · December 19, 2016, 12:05am

Tiago, I am not sure when I will be able to do it myself, feel free to change/improve as you wish.

Kind regards,

dnaber · December 19, 2016, 8:12am

Please also consider that there’s a feature freeze from tomorrow, i.e. no bigger changes should be committed after that to make sure the release is stable.

tiagosantos · December 19, 2016, 3:41pm

@marcoagpinto
Many thanks. I will see what I can do in the meantime. If you see any controversial change, just tell me.

@dnaber
I am aware of that date. I had planned to do review work on the XML parts during that period. That is one of the reasons I have been making massive commits lately.
I plan to standardize the rules display, group rules (so the options are easier to navigate and make more sense), make some rule name improvements, standardize message and marker tag usage, and improve categorization. If I see ways of improving existing rules by reducing false positives, I believe this would fit the tasks for this period, so I would also do it.
I was also planning on updating the morphological dictionary, while maintaining the content of the correction files added.txt and removed.txt. Since I will diff the version in use against the new version from Freeling, it will be possible to evaluate the impact of this change. Is that viable?
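The comparison described here can be sketched as a set difference between two plain-text dumps. This is a minimal illustration, assuming each dump has one entry per line; the function name `diff_dumps` and the sample entries are made up for the example and are not taken from the real dictionaries.

```python
def diff_dumps(old_lines, new_lines):
    """Compare two dictionary dumps given as iterables of text lines.

    Returns (added, removed): entries only in the new dump and entries
    only in the old dump, each as a sorted list. Blank lines are ignored.
    """
    old = {line.strip() for line in old_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    return sorted(new - old), sorted(old - new)


# Made-up sample entries in a "form<TAB>lemma<TAB>tag" layout.
old_dump = ["acto\tacto\tNCMS000", "casa\tcasa\tNCFS000"]
new_dump = ["ato\tato\tNCMS000", "casa\tcasa\tNCFS000"]

added, removed = diff_dumps(old_dump, new_dump)
print("only in new dump:", added)
print("only in old dump:", removed)
```

With real dumps, the two arguments could simply be open file objects. Entries that show up as removed are the ones most likely to affect existing rules, so they are worth reviewing first.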

This should be enough.

dnaber · December 19, 2016, 4:05pm

I leave it up to you to decide what’s a bug fix or low risk change. Just keep in mind that we don’t need to put everything in this release, the next release will already be in three months. And we should really avoid bad bugs, because making a bug fix release like 3.6.1 is quite some work for me, I’d like to avoid it.

tiagosantos · December 19, 2016, 5:47pm

Thank you for the vote of confidence.
I consider all these changes just the “polishing” part. I am not expecting regressions, but during this period I will be more rigorous with the changes I push. The “strategy” I used for this release was:

1. make big changes first;
2. allow testing, complaints or bug reports to arise;
3. adjust accordingly;
4. and finally, tighten everything and polish loose ends. This is the part I reserved for the ‘feature freeze’, since, literally, no new feature will be implemented.

I will stop making any type of changes 2 days before release. This should safeguard against any odd regression, like tabs instead of spaces, a lost signal that blocks a section, changes in automated tests, etc.

tiagosantos · December 20, 2016, 12:52am

@marcoagpinto

I have finished the Freeling fork and updated the binaries. I began testing, and so far all seems good. If you have the opportunity, test the files in FreeLing/LanguageTool/pt at master · TiagoSantos81/FreeLing · GitHub
You can see the history to review the changes and source file manipulations.

The readable data list used for the new POS dictionary and Synthesizer is in portuguese.dict.txt. You can use the gitk DAG function to view the diffs between commits, since I made a base dump of the dictionary used in LT for comparison.

Unless you raise any relevant issues with this new file or amendments are needed, I will push this version on Saturday. Note that I have adapted the Freeling tags so that we do not have to change LT rules. All build and rule tests pass when dict files from commit af6711d are used in LT.

marcoagpinto · December 20, 2016, 10:25am

Good work, @tiagosantos

I can only test in the field after I have a nightly with the changes, so that I can open my thesis again and this time scroll through the whole 291 pages.

tiagosantos · December 20, 2016, 8:58pm

Ok. Then I will push the changes tomorrow. I already pushed string and rule group improvements today, and I need to confirm whether there are any regressions. This change sets the deadline for the morphological dictionary review to Saturday the 24th.

dnaber · December 20, 2016, 9:08pm

Off topic: I suggest using the LT browser add-on, it would have spotted this error (them/then)

tiagosantos · December 20, 2016, 9:59pm

It would. The plugin is really awesome! Added to the TODO list.

tiagosantos · December 21, 2016, 12:37am

@marcoagpinto

The dictionary update and required changes to LT have been pushed. Next nightly should have the changes for you to test. I recommend making a dump of the dictionary, so that you can double-check its contents.

marcoagpinto · December 27, 2016, 8:56am

@tiagosantos

The following words give a false positive:
“UNIVERSIDADE TRÁS-OS-MONTES E ALTO DOURO”

It suggests changing “trás” to “traz”.

tiagosantos · December 27, 2016, 3:15pm

Fixed. In these cases, the best way is to add a restricted antipattern to the affected rule. For one example, see:

marcoagpinto · December 27, 2016, 10:24pm

@tiagosantos
@dnaber

LanguageTool complains that “DE” doesn’t start in uppercase (see screenshot).

Can it be fixed before the official release date?

Thanks!

dnaber · December 27, 2016, 10:48pm

The release will be tomorrow morning, i.e. in a few hours, so please stop changing stuff…

tiagosantos · December 27, 2016, 11:12pm

This is related to the generic upper case rule, so this false positive has been there for at least the last couple of years.
Anyway, as Daniel mentioned, it is already too late to fix anything, unless it is something that breaks the build. In that sense, everything seems perfect and ready for release.

tiagosantos · December 28, 2016, 12:39am

Odd. I tested the latest daily build, and I cannot reproduce that issue.
No false positives for any capitalized words, including that specific title. Everything seems to be working as intended in that rule.
Please upload a document with that false positive for analysis, and register this issue in the GitHub bug tracker as a reminder for later work.