I am working on dictionary-based auto-correction software for the Albanian language.
From the electronic dictionary database we have built tables that hold the automatically generated paradigms of all words that change form through inflection and conjugation.
A word such as worker (Albanian: punëtor) can take many forms depending on its declension as a noun (feminine/masculine, singular/plural, etc.), e.g. punëtori (singular masculine), punëtorja (singular feminine), punëtorët (plural masculine), punëtoret (plural feminine).
The word coworker (bashkëpunëtor) is a compound of the word "co" (bashkë) and worker (punëtor), and it exists in each of the forms above.
Each word is linked to its forms in a one-to-many relation.
At the same time, one word can be combined with more than one other word, and the words it combines with can themselves have many forms due to inflection, as with the verb work (Albanian: bashkë + punoj, punon).
The list of compound words is unpredictable: its size is not known in advance and it cannot be generated by a formula.
A compound word can also be a word plus another compound word:
Zëvendës + kryeministër = zëvendëskryeministër (Deputy Prime Minister)
Kryeministër = krye + ministër (Prime Minister)
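To make the nesting concrete, here is a minimal sketch of splitting a compound into dictionary entries recursively. It is only an illustration under assumptions: the `LEXICON` set, the `split_compound` function and the `min_len` parameter are hypothetical names, and a real lexicon would come from the dictionary tables rather than a hard-coded set.

```python
from functools import lru_cache

# Hypothetical toy lexicon; the real one would be loaded from the dictionary tables.
LEXICON = {"zëvendës", "krye", "ministër", "bashkë", "punëtor"}


def split_compound(word, lexicon=LEXICON, min_len=2):
    """Return every way to split `word` into consecutive lexicon entries."""

    @lru_cache(maxsize=None)
    def splits(start):
        # Reaching the end of the word means the parts chosen so far are valid.
        if start == len(word):
            return [[]]
        results = []
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                # Recurse on the remainder; memoization avoids repeated work.
                for rest in splits(end):
                    results.append([part] + rest)
        return results

    return splits(0)


print(split_compound("zëvendëskryeministër"))
print(split_compound("bashkëpunëtor"))
```

With this approach the compound forms do not need to be stored: a word is accepted if at least one split exists. An empty result means the word cannot be built from known entries.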
We already have the list of words in a dictionary, with their forms in a one-to-many relation.
We know how to model this scenario with table joins, but because the combinations produce a huge list, the performance of the decision-making/suggestion algorithm is unacceptably slow.
There are also exceptions in which the compound behaves differently from its components A and B. In Albanian, some words always require a particle in front of them, but the compound built from them does not. For example, i verdhë (yellow) must always be preceded by a particle (i/e/të/së), but when combined with sy (eye), the resulting word syverdhë drops the particle.
Syverdhë: someone who has yellow eyes.
Every word has a one-to-many relation to its forms, which are generated from a template. Each form is tagged, so that words with the same behavior form a group. In the exception case above, the compound must change its behavior and be grouped under a different tag.
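One way to picture the tag change is an override table that is consulted before a default rule. This is a sketch under assumptions: the tag names, the `FORM_TAGS` and `COMPOUND_TAG_OVERRIDES` tables, and the default rule that a compound inherits the tag of its last component are all hypothetical, not our actual schema.

```python
# Hypothetical tags for base words (the real tags come from the templates).
FORM_TAGS = {
    "verdhë": "ADJ_WITH_PARTICLE",
    "sy": "NOUN_MASC",
    "punëtor": "NOUN_MASC",
}

# Hypothetical exception table: compounds whose tag differs from the default.
COMPOUND_TAG_OVERRIDES = {
    ("sy", "verdhë"): "ADJ_NO_PARTICLE",
}


def compound_tag(parts):
    """Return the tag of a compound made of `parts`."""
    key = tuple(parts)
    # Exceptions win over the default rule.
    if key in COMPOUND_TAG_OVERRIDES:
        return COMPOUND_TAG_OVERRIDES[key]
    # Assumed default: the compound behaves like its last component.
    return FORM_TAGS[parts[-1]]


print(compound_tag(["sy", "verdhë"]))
print(compound_tag(["bashkë", "punëtor"]))
```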
The complexity is O(N*M), where:
N is the number of words in the text
M is the number of forms stored in Elasticsearch (inflection plus conjugation)
This amounts to a Cartesian product. It should be optimized so that each word is checked against only P of the M forms, namely those that have the same length as the word.
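The length restriction could be sketched roughly as follows, as an in-memory illustration rather than an Elasticsearch query. The names `build_length_index`, `candidates` and `max_len_diff` are hypothetical, and the tolerance of one character is an assumption.

```python
from collections import defaultdict


def build_length_index(forms):
    """Group all known forms by their length (built once, reused per text)."""
    index = defaultdict(set)
    for form in forms:
        index[len(form)].add(form)
    return index


def candidates(word, index, max_len_diff=1):
    """Return the forms worth comparing against `word`."""
    # An exact hit is an average O(1) set lookup, so correct words cost almost nothing.
    if word in index.get(len(word), ()):
        return {word}
    # Otherwise only forms of similar length (P of the M forms) are considered.
    pool = set()
    for length in range(len(word) - max_len_diff, len(word) + max_len_diff + 1):
        pool |= index.get(length, set())
    return pool


forms = ["punëtori", "punëtorja", "punëtorët", "punëtoret", "sy"]
index = build_length_index(forms)
print(candidates("punëtori", index))   # known form, exact hit
print(candidates("punetori", index))   # misspelling, similar-length forms only
```

The more expensive similarity measure (for example an edit distance) would then run only on the returned pool instead of on all M forms.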
Is there any tool that can identify and suggest the correct form of compound words (with or without adding the compound forms to the dictionary)?