Most “text” files nowadays are actually Unicode, often 16-bit (“wide” character) files.
Running these through LanguageTool produces garbage, like
Message: This sentence does not start with an uppercase letter
Suggestion: €œNo
… “No.â€? …
^^^^^^
Do you know any command-line tools for doing the most common conversions (e.g., left and right curly quotation marks to straight quotation marks, em dash to hyphen, the “…” character to "...", stupid stuff like that)?
The Perl module Text::Unidecode is supposed to do this, but unfortunately it doesn’t handle converting left or right curly quotes to straight quotes, which is about 95% of what I need.
This seems to do most of what I need… but I don’t like having those weird characters as raw literals in my source code; I’d rather write them the escaped way.
filename: unidecode.pl
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));  # decode input and encode output as UTF-8
use Text::Unidecode;

while (my $line = <>) {
    $line =~ s/\x{201C}/"/g;   # left double quotation mark
    $line =~ s/\x{201D}/"/g;   # right double quotation mark
    $line =~ s/\x{2019}/'/g;   # right single quotation mark / apostrophe
    print unidecode($line);    # transliterate whatever non-ASCII remains
}
What are the available values for -c or --encoding, and how can I list them?
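LanguageTool is a Java application, so --encoding presumably accepts Java charset names (e.g. utf-8, utf-16le, iso-8859-1); I don’t know of a LanguageTool flag that prints them. As a rough local reference, Perl’s Encode module can list the encodings it knows about — this is Perl’s list, not Java’s, so it only approximately overlaps with what LanguageTool will accept:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Print every encoding name Perl's Encode module knows about,
# loading all encoding modules first via the ":all" tag.
print "$_\n" for sort( Encode->encodings(":all") );
```
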
Also, why is there no autodetection of UTF-8 with BOM and UTF-16 LE/BE with BOM? They can easily be detected by the leading BOM bytes: EF BB BF for UTF-8, FF FE for UTF-16 LE, and FE FF for UTF-16 BE. There are also FF FE 00 00 for UTF-32 LE and 00 00 FE FF for UTF-32 BE (though I have personally never seen a file in UTF-32 yet).
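The detection described above can be sketched in a few lines of Perl (the detect_bom name and the standalone-script framing here are my own, not part of LanguageTool or any module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sniff the first bytes of a file and return the encoding its BOM
# indicates, or undef if no BOM is present.
sub detect_bom {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    read $fh, my $head, 4;
    close $fh;
    # Check the longer BOMs first, so a UTF-32 LE file (FF FE 00 00)
    # is not mistaken for UTF-16 LE (FF FE).
    return 'UTF-32BE' if $head =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $head =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return undef;
}

print( (detect_bom($ARGV[0]) // 'no BOM'), "\n" ) if @ARGV;
```
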