Issues with the software to check the rules and UTF-8

marcoagpinto · May 18, 2024, 4:46am

Hello,

Weeks ago, I opened a ticket because when I am trying to use the testing software, it corrupts the accents of the rules.

This is a recent issue.

java -Dfile.encoding=UTF-8 -Xmx4500M -jar languagetool-wikipedia.jar check-data -l pt-PT -r PÔR_FIM_À_VIDA -f pt-BR.txt --max-sentences 900000 --context-size 100 >0.txt

This is what happens:

Activating CONFUSÃO_CAIXA_EMBALAGEM[1], which is default=‘temp_off’
Activating PRAZER_EM_CONVIDAR[1], which is default=‘temp_off’
Activating PÔR_FIM_À_VIDA[1], which is default=‘temp_off’
WARNING: Could not find rule ‘PÃ”R_FIM_Ã€_VIDA’
Only these rules are enabled: [PÃ”R_FIM_Ã€_VIDA]
Working on: pt-BR.txt
Sentence limit: 900000
Context size: 100
Error limit: no limit
Skip: 0

Is there a way to fix it?

I am using Windows 11.

Thanks!

jaumeortola · May 18, 2024, 6:03am

This is probably an issue with the character encoding in your terminal. To allow diacritics, make sure that it is configured for utf-8.

Some ideas for Windows here:

You can search for other pages on the internet.

marcoagpinto · May 18, 2024, 9:37pm

I tried the chcp 65001 and it still corrupts the rule name.

Has nothing been changed in the command to check the hits in the wordlists?

I spent a few months without committing rules while the new tag system was being implemented for PT.

And now I can’t test rules with accents.

I can’t remember when I upgraded to Windows 11, if it was before or after that.