Detect zeros used instead of "o" in English

PashaTurok · June 19, 2017, 9:25am

Hi all,
I have the following sentence:

I don’t like th0se books dfdfdfdf

As you can see, in the word “th0se” there is a zero instead of an “o”. However, the check tool reports an error only in the last word, “dfdfdfdf”. I have a lot of these zeros after OCR and want LanguageTool to detect them. Is that possible?

jaumeortola · June 19, 2017, 9:58am

The current configuration in English ignores every word containing a digit. This behavior has pros and cons… To change this behavior you need to add this line:

fsa.dict.speller.ignore-numbers=false

in this file (for American English):

languagetool-org/languagetool/blob/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/en_US.info

#
# Dictionary properties.
#

fsa.dict.separator=+
fsa.dict.encoding=utf-8

fsa.dict.encoder=SUFFIX

fsa.dict.speller.locale=en_US
fsa.dict.speller.ignore-diacritics=true
fsa.dict.speller.replacement-pairs=ninties 1990s, a ei, ei a, a ey, ey a, ai ie, ie ai, are air, are ear, are eir, air are, air ere, ere air, ere ear, ere eir, ear are, ear air, ear ere, eir are, eir ere, f ph, ph f, gh f, f gh, kw qu, Bordo Bordeaux, bato bateau, bocoup beaucoup, buro bureau, bo beau, oo ew, ew oo, ew ui, ui ew, oo ui, ui oo, uff ough, oo ieu, ieu oo, ier ear, ear ier, air ear, shun tion, shun sion, shun cion, yersa years, phoby phobia
fsa.dict.frequency-included=true
fsa.dict.speller.ignore-all-uppercase=false
fsa.dict.speller.ignore-camel-case=false

I think you can solve your problem easily using a text editor. Replace any ‘0’ (zero) adjacent to a letter with an ‘o’. Use a regular expression like this: replace “(\w)0” with “$1o”, and replace “0(\w)” with “o$1”.
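For anyone who prefers a script to a text editor, here is a minimal sketch of the same idea in Python. It is not exactly the pattern above: it only treats ASCII letters as neighbors, because `\w` also matches digits and would turn "100" into "1o0". The function name `fix_ocr_zeros` is made up for this example, and it always inserts a lowercase "o".

```python
import re

# A zero that directly follows or directly precedes an ASCII letter.
# Assumption: [A-Za-z] is used instead of \w so plain numbers stay untouched.
_ZERO_NEXT_TO_LETTER = re.compile(r"(?<=[A-Za-z])0|0(?=[A-Za-z])")


def fix_ocr_zeros(text: str) -> str:
    """Replace every '0' that touches a letter with a lowercase 'o'."""
    return _ZERO_NEXT_TO_LETTER.sub("o", text)


print(fix_ocr_zeros("I don't like th0se books"))  # prints: I don't like those books
```

The lookbehind and lookahead do both replacements in one pass, so a word like "b00ks" is handled without running two separate substitutions.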

pep.bofarull · June 20, 2017, 6:49am

Good idea, Jaume. After replacing the words in a text editor, we can check the text again with LT. I don’t know whether OCR software can use an advanced corrector, only a dictionary, or nothing at all.

SkyCharger001 · June 20, 2017, 11:01am

Nitpicks:
A. What about zeros in a subscript-less rendition of chemical formulas? I once saw a formula that had an element ten times per molecule.

B. Try the following sentence: “The system has only f0e bytes of RAM in total.” Treat it as a typo and you’ll get a nonsense phrase, but treat it as hex (3854 in decimal) and it makes a lot of sense.

jaumeortola · June 20, 2017, 11:46am

You’ll need to supervise the corrections anyway. You cannot do it automatically. Either with the regexp strategy or with the spell-checker (properly configured), you have to oversee the changes. There is no magic bullet.
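One way to keep a human in the loop with the regexp strategy is to list the candidate changes instead of applying them. The following is only a sketch under that assumption: `zero_candidates` is a hypothetical helper, it splits on whitespace (so punctuation stays attached to the token), and it uses a letters-only pattern rather than `\w`.

```python
import re

_TOKEN = re.compile(r"\S+")
# Assumption: only zeros touching an ASCII letter are considered suspicious.
_ZERO_NEXT_TO_LETTER = re.compile(r"(?<=[A-Za-z])0|0(?=[A-Za-z])")


def zero_candidates(text):
    """Return (original, suggestion) pairs for tokens the pattern would change."""
    pairs = []
    for match in _TOKEN.finditer(text):
        token = match.group()
        suggestion = _ZERO_NEXT_TO_LETTER.sub("o", token)
        if suggestion != token:
            pairs.append((token, suggestion))
    return pairs


# Nothing is modified; the pairs are printed for a person to accept or reject.
for original, suggestion in zero_candidates("only f0e bytes, not th0se 100"):
    print(f"{original} -> {suggestion}")
```

Here "f0e" is flagged along with "th0se", which is exactly the kind of case (a hex value) that a reviewer has to reject by hand.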

SkyCharger001 · June 20, 2017, 11:48am

That’s one of the reasons why I call them nitpicks.