Finding dictionary entries in a mixed language scenario

garry · October 11, 2016, 4:25pm

Hi,

Imagine a fragment of mixed-language text such as:

Mes teuren freunde, ajuda me con this!

For each word I would like to find a dictionary entry in one of the bundled French, Spanish, German or English dictionary resources, or nothing at all. A close, if not exact, match to any inflected form (“flexion”) of an entry would be good enough.

With LanguageTool I can easily find misspelled words for a given language. Is it also the right choice for the complementary case, that is, given a text, finding which words belong to a language? How would you address this?

Thx

dnaber · October 11, 2016, 6:29pm

LT doesn’t really support this use case. I suggest using a spell checker directly. If you still want to use LT for convenience, you could turn off all rules except spell checking and check the text word by word, with one language after the other. This would be very slow I guess.
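The word-by-word, language-by-language check described above can be sketched roughly as follows. This is only an illustration in plain Python, not LanguageTool or Hunspell code: the tiny word sets in `DICTIONARIES` and the function name `languages_per_word` are made up for the example, and a real setup would query an actual spell-checking dictionary per language. It also does exact, case-insensitive lookups only, so inflected forms match only if they are listed.

```python
import re

# Hypothetical stand-ins for real spell-checking dictionaries; a real setup
# would load full word lists (or query a spell checker) per language.
DICTIONARIES = {
    "fr": {"mes", "amis", "avec"},
    "es": {"con", "me", "sol"},
    "de": {"teuren", "freunde", "mit"},
    "en": {"this", "me", "friends"},
}


def languages_per_word(text, dictionaries=DICTIONARIES):
    """Return a list of (word, [languages whose dictionary contains it])."""
    # Runs of Unicode letters only; punctuation and digits are skipped.
    words = re.findall(r"[^\W\d_]+", text)
    result = []
    for word in words:
        key = word.lower()
        # Try one language after the other, in dictionary order.
        hits = [lang for lang, vocab in dictionaries.items() if key in vocab]
        result.append((word, hits))
    return result


for word, langs in languages_per_word("Mes teuren freunde, ajuda me con this!"):
    print(word, langs or "no entry")
```

A word can legitimately belong to several languages ("me" is in both the Spanish and the English toy set here), so the sketch returns a list of languages per word rather than a single answer.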

jaumeortola · October 16, 2016, 8:05am

Hi,

I am also interested in this feature and I’m doing some experiments right now. I would like to remove sentences in other languages from a corpus.

Depending on your needs, you can use a language detection library, a spell-checking dictionary, or a combination of both.

A language detection library like CLD2 is fast, but it is not good enough for short sentences.

garry · October 16, 2016, 10:14am

Thank you Daniel. Performance is not an issue yet. I ended up here because I was thinking of using Hunspell, since its dictionaries are readily available, and I was hoping to find a Java port instead of JNI/JNA bindings.

While I am at it, may I ask another question:

Using the check API call against Spanish with the word “NANOFIT”, I get an empty list of suggestions. I conclude that this word is not in the dictionary and is too far from any plausible suggestion. Fair.

Taking the highly frequent Spanish word “sol” instead, I receive no list of suggestions at all. I conclude that the word is in the dictionary. Fair.

However, taking “GymManager”, “Foam240” or “Me:You” (clearly not Spanish words), I receive no list of suggestions either. This confuses me, as I had expected an empty list indicating that these words do not exist in the Spanish dictionary.

Are these words treated as proper names, on the assumption that the unusual spelling (CamelCase etc.) is intentional? Is this done by Hunspell or by LT? Is this behaviour configurable? Actually, in my case I would like “Gym”, “Manager”, “Foam” and “240” to be dealt with as individual tokens.
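To make the desired tokenization concrete, here is a minimal sketch in plain Python of splitting such strings before any dictionary lookup. It is independent of LT and Hunspell and says nothing about how either of them tokenizes internally; the function names `split_camel` and `tokenize` are made up for the example, and the only rules assumed are "separate letters from digits" and "split where a lowercase letter is followed by an uppercase one".

```python
import re


def split_camel(run):
    """Split a run of letters at each lowercase-to-uppercase boundary."""
    parts, start = [], 0
    for i in range(1, len(run)):
        # Boundary such as "m|M" in "GymManager".
        if run[i - 1].islower() and run[i].isupper():
            parts.append(run[start:i])
            start = i
    parts.append(run[start:])
    return parts


def tokenize(text):
    """Return letter tokens (split on CamelCase) and digit tokens in order."""
    tokens = []
    # Runs of Unicode letters or runs of digits; ":" and other
    # punctuation act as separators and are dropped.
    for run in re.findall(r"[^\W\d_]+|\d+", text):
        tokens.extend(split_camel(run) if run.isalpha() else [run])
    return tokens


print(tokenize("GymManager Foam240 Me:You"))
```

All-caps words such as “NANOFIT” contain no lowercase-to-uppercase boundary and therefore stay in one piece under these rules.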

Again thanks for the help!

dnaber · October 16, 2016, 10:45am

The configuration is documented at Spell check - LanguageTool Wiki

garry · October 16, 2016, 10:46am

Thx Jaume. I only look at individual, very short sentences that contain words from multiple languages. Maybe an alternative would be to index a “wiktionary” dump with some Lucene derivative (one index per language). However, I have no idea how good wiktionary’s coverage is, or how much I would need to compensate for low coverage with rules and analyzers for languages I do not even speak. Hunspell has about the right accuracy (precision/recall) for me…

garry · October 16, 2016, 11:03am

Oops, yes, thx, this is exactly what I was looking for. Another newbie question then, please: I fetched the core and the language resources via Maven, so configuration such as the info files you pointed to sits inside jars in my local Maven repository. Is there a way to override the configuration from within the linking application? Or must I download the source and build my own resource jars? Any tips? Thx

dnaber · October 16, 2016, 1:37pm

Unless there’s a way to override values via the API (I’m not sure), I think you’ll need to get and build the source.

garry · October 21, 2016, 12:51pm

Thx for the help! Appreciated