Suggestions for 16-bit (wide, utf8, unicode) files?

philgoetz · May 12, 2012, 6:15pm

Most “text” files nowadays are actually unicode, 16-bit files (“wide” characters).
Running these through LanguageTool produces garbage, like

Message: This sentence does not start with an uppercase letter

Suggestion: Â€œNo

… â€œNo.â€? …
^^^^^^

Do you know any command-line tools for doing the most common conversions (e.g., left and right quotation marks to plain quotation marks, long dash to hyphen, the “…” character to “...”, stupid stuff like that)?

philgoetz · May 12, 2012, 7:08pm

The Perl module Text::Unidecode is supposed to do this, but unfortunately it does not handle converting left-quote or right-quote marks to quote marks, which is about 95% of what I need to do.

philgoetz · May 12, 2012, 7:12pm

This seems to do most of what I need, but I don’t like having those weird characters in my source code. I would like to figure out the escaped way of writing them.

filename: unidecode.pl

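use strict;
use warnings;
use Text::Unidecode;

# Replace the garbled (mis-decoded) forms of curly quotes with ASCII quotes,
# then let unidecode() transliterate whatever non-ASCII characters remain.
while (my $line = <>) {
    $line =~ s/â€œ/"/g;   # left double quote
    $line =~ s/â€™/'/g;   # right single quote
    $line =~ s/â€/"/g;    # right double quote; a prefix of the two patterns above, so it runs last
    print unidecode($line);
}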

usage: perl unidecode.pl < input > output
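
On the "escaped way" of writing those characters: a minimal sketch of the same normalization in Java (the language LanguageTool is written in), using Unicode escapes so the source file stays pure ASCII. It assumes the input has already been decoded with the correct encoding, so the text contains real curly quotes rather than the garbled three-character sequences. The class and method names are made up for illustration. In Perl, the same characters can be written as \x{201C}, \x{201D}, \x{2018}, \x{2019}, and \x{2026}.

```java
final class QuoteNormalizer {
    // Map common typographic characters to ASCII equivalents.
    // Every non-ASCII character is written as a Unicode escape.
    static String toAscii(String s) {
        return s
            .replace('\u201C', '"')     // left double quotation mark
            .replace('\u201D', '"')     // right double quotation mark
            .replace('\u2018', '\'')    // left single quotation mark
            .replace('\u2019', '\'')    // right single quotation mark
            .replace("\u2014", "-")     // em dash
            .replace("\u2013", "-")     // en dash
            .replace("\u2026", "...");  // horizontal ellipsis
    }

    public static void main(String[] args) {
        // Example input: curly-quoted "No." followed by an ellipsis character.
        System.out.println(toAscii("\u201CNo.\u201D \u2026"));
    }
}
```

The list of characters is only the handful mentioned in this thread; anything else passes through unchanged.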

dnaber · May 12, 2012, 7:20pm

On Saturday, 12 May 2012, you wrote:

Most “text” files nowadays are actually unicode, 16-bit files (“wide”
characters).
Running these through LanguageTool produces garbage, like

Using the -c or --encoding option you can specify the encoding of the
input files.
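
To illustrate why the encoding matters, here is a small Java sketch of how the garbage above can arise: UTF-8 bytes decoded with a different single-byte charset. That windows-1252 was the wrong charset in this particular case is an assumption, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode `text` as UTF-8, then decode those bytes with `assumed`,
    // as a tool would if it guessed the file's encoding.
    static String misdecode(String text, Charset assumed) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, assumed);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is the wrong guess; it ships with standard JDKs.
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "\u201CNo.\u201D"; // "No." wrapped in curly double quotes

        System.out.println(misdecode(original, cp1252));                 // garbled
        System.out.println(misdecode(original, StandardCharsets.UTF_8)); // intact
    }
}
```

Each curly quote is three bytes in UTF-8, so a single-byte charset turns it into three unrelated characters.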

Regards
Daniel

--
http://www.danielnaber.de

philgoetz · May 13, 2012, 4:23am

Thanks! Sorry. I should have seen that.

DV1 · August 13, 2014, 9:51am

What are the available values for -c or --encoding, and how can I list them?
Also, why is there no autodetection of UTF-8 with BOM and UTF-16 LE/BE with BOM? They can easily be detected by the leading “BOM” bytes: EF BB BF for UTF-8, FF FE for UTF-16 LE, and FE FF for UTF-16 BE. There are also FF FE 00 00 for UTF-32 LE and 00 00 FE FF for UTF-32 BE (though personally I have not seen any UTF-32 file yet).
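
A minimal Java sketch of the check described above, using only the byte sequences listed in this post. The class and method names are made up for illustration, and this is not how LanguageTool itself handles input. The UTF-32 marks are tested first because the UTF-32 LE mark begins with the UTF-16 LE mark.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class BomSniffer {
    // Charset names paired index-by-index with the marks below.
    private static final String[] NAMES =
        {"UTF-32LE", "UTF-32BE", "UTF-8", "UTF-16LE", "UTF-16BE"};
    private static final int[][] MARKS = {
        {0xFF, 0xFE, 0x00, 0x00},  // must be checked before UTF-16 LE
        {0x00, 0x00, 0xFE, 0xFF},
        {0xEF, 0xBB, 0xBF},
        {0xFF, 0xFE},
        {0xFE, 0xFF},
    };

    // Returns the charset name for the BOM at the start of `head`,
    // or null if no known BOM is present.
    static String detect(byte[] head) {
        for (int i = 0; i < MARKS.length; i++) {
            if (startsWith(head, MARKS[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int[] mark) {
        if (data.length < mark.length) {
            return false;
        }
        for (int j = 0; j < mark.length; j++) {
            if ((data[j] & 0xFF) != mark[j]) {  // compare as unsigned bytes
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Read at most the first four bytes of the file named on the command line.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] head = new byte[4];
            int n = in.readNBytes(head, 0, head.length);  // requires Java 9+
            System.out.println(detect(Arrays.copyOf(head, n)));
        }
    }
}
```

One ambiguity remains: a UTF-16 LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32 LE BOM, and this sketch would report it as UTF-32 LE.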

dnaber · August 13, 2014, 11:48am

You cannot list them, but we support all encodings that Java supports. That should include all common encodings.
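
For reference, the JVM itself can print the charsets it supports. This is a rough sketch using the standard java.nio.charset.Charset API. It assumes the value given to -c is resolved through the JVM's normal charset lookup, which this thread does not confirm, and the class name is made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class ListEncodings {
    // Canonical names of every charset available in the running JVM.
    static List<String> supportedNames() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : supportedNames()) {
            System.out.println(name);
        }
        // A single name (or alias) can also be checked directly.
        System.out.println(Charset.isSupported("UTF-16LE"));
    }
}
```

The exact list depends on the Java runtime that is installed, so it can differ between machines.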

There is no autodetection because nobody has programmed it yet; patches are welcome.