
Suggestions for 16-bit (wide, utf8, unicode) files?


(philgoetz) #1

Most "text" files nowadays are actually unicode, 16-bit files ("wide" characters).
Running these through LanguageTool produces garbage, like

Message: This sentence does not start with an uppercase letter

Suggestion: €œNo

... “No.� ...
^^^^^^

Do you know any command-line tools for doing the most common conversions (e.g., left and right quotation marks to plain quotation marks, long dash to dash, the "…" character to "...", stupid stuff like that)?


(philgoetz) #2

The Perl module Text::Unidecode is supposed to do this, but unfortunately it does not handle converting left-quote or right-quote marks to quote marks, which is about 95% of what I need to do.


(philgoetz) #3

This seems to do most of what I need, but I don't like having those weird characters in my source code. I sure would like to figure out the escaped way of writing them.

filename: unidecode.pl

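use strict;
use warnings;
use utf8;                              # the quote characters below are literal UTF-8 in this source
use open qw(:std :encoding(UTF-8));    # decode STDIN and encode STDOUT as UTF-8
use Text::Unidecode;

while (my $line = <>) {
    $line =~ s/“/"/g;    # left double quotation mark
    $line =~ s/”/"/g;    # right double quotation mark
    $line =~ s/’/'/g;    # right single quotation mark (apostrophe)
    print unidecode($line);
}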

usage: perl unidecode.pl < input > output


(Daniel Naber) #4

On Saturday, 12 May 2012, you wrote:

> Most "text" files nowadays are actually unicode, 16-bit files ("wide"
> characters).
> Running these through LanguageTool produces garbage, like

Using the -c or --encoding option you can specify the encoding of the
input files.

Regards
Daniel

--
http://www.danielnaber.de
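
As a rough illustration of the advice above, here are two ways to deal with a UTF-16 input file. The jar name, language code, and file names are assumptions, not taken from this thread; only the -c / --encoding option is. The iconv route is an alternative that converts the file before checking, and to_utf8 is a hypothetical helper name.

```shell
# Option 1: tell LanguageTool the input encoding directly.
# (Jar name, -l value and input.txt are assumed for illustration.)
#   java -jar languagetool-commandline.jar -l en-US -c UTF-16 input.txt

# Option 2: convert to UTF-8 first with iconv, then check the result.
to_utf8() {
    # $1: source encoding name, e.g. UTF-16 or UTF-16LE
    # Reads from stdin and writes UTF-8 to stdout.
    iconv -f "$1" -t UTF-8
}

# Usage: to_utf8 UTF-16 < input.txt > input-utf8.txt
```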


(philgoetz) #5

Thanks! Sorry. I should have seen that.


(DV) #6

What are the available values for -c or --encoding, and how can I list them?
Also, why is there no autodetection of UTF-8 with BOM and UTF-16 LE/BE with BOM? They can be easily detected by the leading "BOM" bytes: EF BB BF for UTF-8, FF FE for UTF-16 LE and FE FF for UTF-16 BE. There are also FF FE 00 00 for UTF-32 LE and 00 00 FE FF for UTF-32 BE (though personally I have not seen any file in UTF-32 yet).
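
The BOM check described above could be sketched like this on the command line. This is not how LanguageTool works; detect_bom is a hypothetical helper, and it only recognizes files that actually start with a BOM.

```shell
# Hypothetical helper: guess an encoding from a file's byte-order mark.
detect_bom() {
    # First four bytes of the file as lowercase hex, without separators.
    bom=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$bom" in
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;  # checked before UTF-16LE: both start with FF FE
        efbbbf*)  echo "UTF-8" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "unknown" ;;
    esac
}

# Usage: detect_bom input.txt
```

The order of the patterns matters: the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark, so the four-byte patterns are tested first. A UTF-16 LE file whose first character after the BOM is U+0000 would be misreported as UTF-32 LE, which should be rare in real text.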


(Daniel Naber) #7

You cannot list them, but we support all encodings that Java supports. That should include all common encodings.

There is no autodetection because nobody has programmed it yet. Patches welcome.