Autocorrecting misspelled words: compound words based on a dictionary

LeonardBodurii · October 2, 2019, 5:23pm

I am working on autocorrection software (for the Albanian language) based on a dictionary.

From the electronic dictionary database we have built tables that hold all the automatically generated paradigms of the words that change form through inflection and conjugation.

A single word such as worker (Albanian: punëtor) can have many forms according to its declension as a noun (feminine/masculine, singular/plural, etc.), e.g. punëtori (singular masculine), punëtorja (singular feminine), punëtorët (pl. masc.), punëtoret (pl. fem.).

The word coworker (bashkëpunëtor) is a compound formed by joining “co” (bashkë) and worker (punëtor), and it exists in each of the forms above.

Each word has its forms stored in a one-to-many relation.

At the same time, one word can be joined with more than one other word. For example, bashkë also combines with the verb work, which itself has many forms due to inflection (Albanian: bashkë + punoj, punon).

The list of compound words is unpredictable: it has no known size and cannot be generated by a formula.

A compound word can also be a word plus another compound word:

Zëvendës+kryeministër = zëvendëskryeministër (Deputy Prime Minister)

Kryeministër = krye+ministër (Prime Minister)
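To make the nesting concrete, here is a minimal Python sketch of recursive compound splitting. It is only an illustration: `LEXICON`, `MIN_PART` and `split_compound` are hypothetical names, the lexicon is a toy sample, and a real implementation would read the parts from the dictionary tables and would also need rules for which parts are allowed to combine.

```python
from functools import lru_cache

# Hypothetical mini-lexicon; real data would come from the dictionary tables.
LEXICON = frozenset({"zëvendës", "krye", "ministër", "bashkë", "punëtor"})

MIN_PART = 2  # assumed minimum length of a compound part


@lru_cache(maxsize=None)
def split_compound(word):
    """Return every way to split `word` into lexicon entries, as tuples of parts."""
    results = []
    if word in LEXICON:
        results.append((word,))
    # Try each split point; the tail may itself be a compound (recursion).
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head, tail = word[:i], word[i:]
        if head in LEXICON:
            for rest in split_compound(tail):
                results.append((head,) + rest)
    return tuple(results)


print(split_compound("zëvendëskryeministër"))
```

With this toy lexicon, zëvendëskryeministër is recognized as zëvendës + krye + ministër without the compound itself being stored.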

We already have the list of words in a dictionary and their forms in one-to-many relation.
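As a rough sketch of how that one-to-many relation could be used without storing every compound form, the Python below builds a reverse index from form to lemma and then checks a word as "head + known form". The names `PARADIGMS`, `COMPOUND_HEADS`, `FORM_TO_LEMMA` and `analyze` are hypothetical, and the data is just the example from this post.

```python
# Hypothetical data: lemma -> inflected forms (the one-to-many relation).
PARADIGMS = {
    "punëtor": ["punëtor", "punëtori", "punëtorja", "punëtorët", "punëtoret"],
}

# Hypothetical list of elements that may start a compound.
COMPOUND_HEADS = {"bashkë"}

# Reverse index built once: inflected form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in PARADIGMS.items() for form in forms
}


def analyze(word):
    """Return (head, lemma) for a known form or head + known form, else None."""
    if word in FORM_TO_LEMMA:
        return (None, FORM_TO_LEMMA[word])
    for head in COMPOUND_HEADS:
        tail = word[len(head):]
        if word.startswith(head) and tail in FORM_TO_LEMMA:
            return (head, FORM_TO_LEMMA[tail])
    return None


print(analyze("bashkëpunëtorët"))
```

Each lookup is a hash access per candidate head, so the compound forms never have to be materialized in a join.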

We know how to structure this scenario via table joins, but since the combinations result in a huge list, the performance of the decision-making/suggestion algorithm is unacceptably slow.

There are exceptional cases in which the resulting compound behaves differently from words A and B. In Albanian there are words that always need a particle in front of them, while the compound built from them does not. E.g. i verdhë (yellow) must always take a particle (i/e/të/së), but when joined with sy (eye), the resulting word syverdhë changes its form and takes no particle.

Syverdhë = someone who has yellow eyes

Every word has a one-to-many relation to its forms based on a template, and each form is tagged so that words with the same behavior form a group. In this case, the resulting compound must change its behavior and be grouped under a different tag.

The complexity is O(N*M), where:
N is the number of words in the text
M is the number of forms in Elasticsearch (inflection plus conjugation)
As a result we would have a Cartesian product, which should be optimized to check only P words out of the M in total (those that have the length of the word).
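The length-based filtering could look roughly like the Python sketch below: forms are bucketed by length once, and a misspelled word is compared only against buckets whose length is within the allowed edit distance. The function names and the `max_dist` parameter are assumptions for illustration, and the edit distance is plain Levenshtein, not anything specific to Elasticsearch.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_length_index(forms):
    """Group forms by length so only nearby lengths are compared later."""
    index = defaultdict(list)
    for form in forms:
        index[len(form)].append(form)
    return index


def suggest(word, index, max_dist=1):
    """Return forms within max_dist edits, closest first."""
    hits = []
    for length in range(len(word) - max_dist, len(word) + max_dist + 1):
        for cand in index.get(length, ()):
            dist = edit_distance(word, cand)
            if dist <= max_dist:
                hits.append((dist, cand))
    return [cand for _, cand in sorted(hits)]


forms = ["punëtor", "punëtori", "punëtorja", "punëtorët", "punëtoret"]
index = build_length_index(forms)
print(suggest("punëtorr", index))
```

This keeps the per-word cost proportional to P (the forms of similar length) rather than M, at the price of one pass over the forms to build the index.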

Is there any tool that can identify and suggest the correct form of compound words (with or without adding the compound forms to the dictionary)?

Thanks!

Ruud_Baars · October 3, 2019, 6:02am

Please have a look at the Hunspell functions for compounding and inflection. The documentation is compact and sometimes rather hard to follow, but Hunspell works like a charm once the .dic and .aff files are well tuned. I did that about 10 years ago for Dutch and might be able to help you a bit, though I do not know even one word of Albanian.

It may be worth not trying to be complete, since no word list is ever complete, and some compounds might be wrong forms of others. Keep an eye on word frequencies too.
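As a small illustration of the frequency point, suggestions can be ordered by how often each form occurs in a corpus. This Python sketch is only an assumption about how that might be wired in: `rank_by_frequency` is a hypothetical helper and the counts are made up for the example.

```python
def rank_by_frequency(candidates, freq):
    """Order candidates by descending corpus frequency, then alphabetically."""
    return sorted(candidates, key=lambda w: (-freq.get(w, 0), w))


# Made-up counts, for illustration only.
freq = {"punëtor": 120, "punëtori": 45}
print(rank_by_frequency(["punëtori", "punëtoret", "punëtor"], freq))
```

Forms missing from the frequency table simply sort last, so an incomplete list still gives a usable ordering.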

You can contact me at info at taaltik.nl