Italian rules and dictionary

Mauro · August 31, 2012, 4:01pm

Hi,
I’m implementing an Eclipse-based document editor and I’m currently testing LT as its spelling engine.

I’m Italian, so my principal focus is on Italian and English.
With the defaults (all rules enabled), LT reports a lot of false positives, to the point that every sentence in my test document had some error.

These come essentially from:

Missing words (this is strange because OO/LO hunspell dict does include those words!)
Missing words (again, but this time LT is right: these are names not in the vocabulary)
Check verb tenses (lots of them, all false positives)
Common misspellings (false positives)

I know I can completely disable the “faulty” rules, but I’m interested in working together with Your Italian developer to refine them, if possible.

I did not find a way to set “ignore” for some words or to add them to some kind of “user dictionary”.
Can someone point me in the right direction?

TiA
Mauro

dnaber · August 31, 2012, 4:42pm

Hi Mauro,

thanks for your feedback!

These come essentially from:

Missing words (this is strange because OO/LO hunspell dict does include
those words!)

Could you post some examples?

I know I can completely disable the “faulty” rules, but I’m interested in
working together with Your Italian developer to refine them, if
possible.

That’s great! The best place for development discussion is our mailing
list, I suggest you subscribe there:

I did not find a way to set “ignore” for some words or to add them to some
kind of “user dictionary”.

In the development version (1.9-dev, to be released as 1.9 in 4 weeks), you
can add those words to the file resources/it/hunspell/ignore.txt. In the
long term, it’s probably a good idea to also send them to the dictionary
maintainer.
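
As a rough illustration of how such an ignore list could be consumed, here is a minimal sketch. It is not LT's actual code: the class name IgnoreList, the one-word-per-line layout and the '#' comment convention are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of LT): filters reported words against an ignore list. */
class IgnoreList {
    private final Set<String> words = new LinkedHashSet<>();

    /**
     * Parses the text of an ignore file.
     * Assumed format: one word per line; blank lines and lines starting with '#' are skipped.
     */
    static IgnoreList parse(String content) {
        IgnoreList list = new IgnoreList();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                list.words.add(word);
            }
        }
        return list;
    }

    boolean contains(String word) {
        return words.contains(word);
    }

    /** Returns the reported words that are not on the ignore list, in their original order. */
    List<String> filter(List<String> reported) {
        List<String> remaining = new ArrayList<>();
        for (String word : reported) {
            if (!contains(word)) {
                remaining.add(word);
            }
        }
        return remaining;
    }
}
```

A caller would read the file content into a string, call IgnoreList.parse(...) once, and pass the words flagged by the speller through filter(...) before showing them to the user.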

Regards
Daniel

–
http://www.danielnaber.de

Mauro · August 31, 2012, 6:22pm

I’m on the list.

I will not be at home for the next two weeks (3-15 Sept.), but after that I will be available.

Missing words: “avea”, “farfalletta” (your Italian expert will know where they came from).

As for the “ignore” thing:
I do agree it would be good to upstream changes (if You can point me to the right place), but only for missing words.

There are two other kinds of “misspelling” that do not belong upstream:

Uncommon or non-Italian names. Should really be ignored. It would be good to have some kind of API so it would be possible for the user to say “ignore all” and/or “add to dictionary” while spell-checking.
Strange combinations of non-word characters (e.g.: “—”). These should be filtered out by the LT tokenizer.

Regards
Mauro

dnaber · August 31, 2012, 11:17pm

On Fri 31.08.2012, 11:22:19 you wrote:

Missing words: “avea”, “farfalletta” (your Italian expert will know
where they came from).

The word list used in LT is this one from Andrea Pescetti:
Version 3.3.1, 24-Mar-2011

At “Italian dictionary, thesaurus, hyphenation patterns” on Apache OpenOffice Extensions a more recent
version is available - could this be the reason the words are not included
in LT? Then we could just update the word list.

Uncommon or non-Italian names. Should really be ignored. It would be
good to have some kind of API so it would be possible for the user to
say “ignore all” and/or “add to dictionary” while spell-checking.

Yes, it’s on my TODO list but I cannot promise this will make it into the
next version.


Mauro · September 1, 2012, 12:10pm

Hi Daniel,
comments below.

At “Italian dictionary, thesaurus, hyphenation patterns” on Apache OpenOffice Extensions a more recent
version is available - could this be the reason the words are not included
in LT? Then we could just update the word list.

I downloaded the new version from the link you gave me, but it seems to be in a different format.
I found some references to “fsa” and I will follow them up, but… if You could point me to the right place…

Yes, it’s on my TODO list but I cannot promise this will make it into the
next version.

It is unclear to me if LT actually uses hunspell “under the hood”; if this is the case, I would suggest providing a thin wrapper around hunspell functions.
This would have (at least) three immediate benefits:

usage of multiple dictionaries; my long-term plan is to support three dictionaries for a single language: standard hunspell, application/user and local to document.
you could defer computation of “suggested replacements” to the moment when the spelling dialog on a single word is actually opened; this would speed up spell checking of large documents a lot, avoiding computation of data the user might never use.
You would have a thesaurus “for free”.

The thin wrapper could be general enough to accommodate other spelling engines.
In my spare time (almost a joke!) I could try to help.
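
To make the three-dictionary idea a bit more concrete, here is a minimal sketch of what I have in mind. All names (WordSource, MutableWordSource, LayeredDictionary) are hypothetical; they are neither LT nor hunspell APIs, and a real “standard” layer would wrap the actual spelling engine instead of an in-memory set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical minimal dictionary abstraction; not an LT or hunspell API. */
interface WordSource {
    boolean contains(String word);
}

/** Mutable in-memory dictionary, usable for the application/user and per-document layers. */
class MutableWordSource implements WordSource {
    private final Set<String> words = new HashSet<>();

    /** What an "add to dictionary" action in the UI would call. */
    void add(String word) {
        words.add(word);
    }

    @Override
    public boolean contains(String word) {
        return words.contains(word);
    }
}

/** Accepts a word as correctly spelled if any of its layers knows it. */
class LayeredDictionary implements WordSource {
    private final List<WordSource> layers;

    LayeredDictionary(WordSource... layers) {
        this.layers = Arrays.asList(layers);
    }

    @Override
    public boolean contains(String word) {
        for (WordSource layer : layers) {
            if (layer.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, “add to dictionary” only touches the user or document layer, so nothing the user adds ever leaks into the standard dictionary.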

Regards
Mauro

dnaber · September 1, 2012, 12:43pm

On Sat 01.09.2012, 05:10:40 you wrote:

Hi Mauro,

I downloaded the new version from the link you gave me, but it seems to
be in a different format.

Yes, it’s the original hunspell format. You can try to use the unmunch tool
(should be part of hunspell) to expand the *.dic and *.aff files to a plain
word list.

It is unclear to me if LT actually uses hunspell “under the hood”;

It depends on the language - for German we use Hunspell, for most other
languages we don’t. This is because hunspell is too slow for our use case
when it creates the suggestions for misspellings. (For German we need
support for compounds so the plain list approach doesn’t work)

you could defer computation of “suggested replacements” to the moment
when the spelling dialog on a single word is actually opened; this would
speed up spell checking of large documents a lot, avoiding computation of
data the user might never use.

Well, this is a question of the user interface… LT can be used as a tool
with a graphical user interface, but it is also an API whose users want a
fast and complete response, including suggestions. Thus deferring the
results would also need to happen for the API, extending it so another call
can start fetching the suggestions.

Regards
Daniel


Mauro · September 1, 2012, 2:06pm

Hi Daniel,
comments below

It is unclear to me if LT actually uses hunspell “under the hood”;

It depends on the language - for German we use Hunspell, for most other
languages we don’t. This is because hunspell is too slow for our use case
when it creates the suggestions for misspellings. (For German we need
support for compounds so the plain list approach doesn’t work)

I take this to mean hunspell is not used for Italian, correct?
Anyways, if You accept the suggestion below, this could be made configurable, possibly even with runtime (Preference) configuration.

you could defer computation of “suggested replacements” to the moment
when the spelling dialog on a single word is actually opened; this would
speed up spell checking of large documents a lot, avoiding computation of
data the user might never use.

Well, this is a question of the user interface… LT can be used as a tool
with a graphical user interface, but it is also an API whose users want a
fast and complete response, including suggestions. Thus deferring the
results would also need to happen for the API, extending it so another call
can start fetching the suggestions.

This is exactly what I’m driving at.
AFAIK there is no API in LT to request checking without generating suggestions, or to request suggestions for a single word (well, this could be forced by doing a check(wrongWord), but I didn’t try).
In this condition there’s nothing the UI can do. It is all-or-nothing.

Unless there’s some hardwired reason why this is impractical, I would suggest unbundling check(document) into review(document) and suggest(word). We could also have a wrapping function, reviewAndSuggest(document) (== check(document)), that computes the errors and then loops over the misspellings.
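
Something along these lines, as a toy sketch only. The class SplitChecker and its methods are hypothetical and do not exist in LT; the word lookup and the “one letter differs” suggestion logic are deliberately naive stand-ins for the real, expensive computation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed split; none of these names exist in LT. */
class SplitChecker {
    private final Set<String> known;
    int suggestCalls = 0; // counts how often the expensive step actually runs

    SplitChecker(String... knownWords) {
        known = new HashSet<>(Arrays.asList(knownWords));
    }

    /** Cheap pass: returns the unknown words without computing any suggestions. */
    List<String> review(String document) {
        List<String> unknown = new ArrayList<>();
        for (String token : document.split("\\s+")) {
            if (!token.isEmpty() && !known.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    /** Expensive pass for a single word (toy version: same length, exactly one letter differs). */
    List<String> suggest(String word) {
        suggestCalls++;
        List<String> result = new ArrayList<>();
        for (String candidate : known) {
            if (differsByOneLetter(word, candidate)) {
                result.add(candidate);
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Equivalent of an all-in-one check: review first, then suggest for every finding. */
    Map<String, List<String>> reviewAndSuggest(String document) {
        Map<String, List<String>> findings = new LinkedHashMap<>();
        for (String word : review(document)) {
            findings.put(word, suggest(word));
        }
        return findings;
    }

    private static boolean differsByOneLetter(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff == 1;
    }
}
```

A UI would call review(document) while the user types and call suggest(word) only when the spelling dialog for that word is opened; API users who want everything in one response would keep calling reviewAndSuggest(document).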

Regards
Mauro

dnaber · September 1, 2012, 4:44pm

On Sat 01.09.2012, 07:06:04 you wrote:

Hi Mauro,

I take this to mean hunspell is not used for Italian, correct?

that’s right.

Anyways, if You accept the suggestion below, this could be made
configurable, possibly even with runtime (Preference) configuration.

I’m not a fan of too many configurations… instead we should find the best
solution that works okay for everybody.

I suggest that you send your ideas to the mailing list, as more developers
will be involved then.

Regards
Daniel
