Gaia files


(Ruud Baars) #1

The Gaia file with frequencies is extremely limited compared with my own frequency data.
Of course I can generate a file a bit like that, but does anyone know what the numbers in the file mean? They cannot be plain counts, since the value for 'de' is very low, unless the corpus was extremely small.

Would it be a problem for the dictionary creation if I generated a file with the real word count in it, up to one million or more?


(Daniel Naber) #2

I'm not sure; this would need to be tested. I assume the Gaia files use a logarithmic scale. But I'm not sure whether a larger dictionary with more frequencies will actually improve the suggestions, so I suggest testing that carefully.


(Ruud Baars) #3

I assume the same. But I could not find specs, just references to references.
I am quite sure, though, that more words will improve the suggestion order, if only because of new popular words like 'selfie'; the current list is 4 years old!
Nevertheless, I will test it.


(Ruud Baars) #4

How are the numbers in the Gaia file transformed into the letters in the dictionary?


(Daniel Naber) #5

Looks like the input number is linearly mapped to A-Z:
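
As a rough sketch of what a linear mapping to A-Z can look like: the 0..255 input range and the name `freq_to_letter` are assumptions made for illustration, not taken from the LanguageTool code.

```python
import string


def freq_to_letter(value, max_in=255):
    """Map an integer in 0..max_in linearly onto the letters A..Z.

    max_in=255 is an assumed input range, chosen only for this example.
    """
    if not 0 <= value <= max_in:
        raise ValueError("value must be between 0 and %d" % max_in)
    # 26 letters means output indexes 0..25; integer division keeps it linear.
    index = value * 25 // max_in
    return string.ascii_uppercase[index]


for sample in (0, 128, 255):
    print(sample, freq_to_letter(sample))
```

With this assumed range, the lowest value maps to A and the highest to Z, so 26 letters is all the resolution that survives in the dictionary.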


(Ruud Baars) #6

Great, thanks. I will test with the log base 2 of the word count, multiplied by 100 and then converted to an integer.
The factor of 100 makes the numbers span a large enough range that not too much is lost when truncating them to integers, which is what the current file appears to use.
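
A minimal sketch of that transformation, assuming raw counts of at least 1; the function name `scaled_log_frequency` is hypothetical:

```python
import math


def scaled_log_frequency(count):
    """Return int(100 * log2(count)) for a raw word count >= 1."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return int(100 * math.log2(count))


# Without the factor of 100, quite different counts collapse to the same integer.
for count in (600_000, 1_000_000):
    print(count, int(math.log2(count)), scaled_log_frequency(count))
```

Here 600,000 and 1,000,000 both truncate to 19 when the plain logarithm is used, but remain distinct after scaling by 100.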


(Ruud Baars) #7

This routine gives 43945797 frequency values applied in 1973988 word forms for me. But that is not correct; it actually reports the number of frequencies in the list, not the number applied!
I know, it is cosmetic. But what is reported should be correct, right?


(Daniel Naber) #8

That means that your frequency list has 43,945,797 values and 1,973,988 of those have been "applied", i.e. 1,973,988 of those are actually in the spelling dictionary. If that's not correct, could you provide the files you're using?


(Ruud Baars) #9

Yes, I understand that that is the essence. Nevertheless, in English, 'apply' can only be done with something to something, not with something to nothing. Having 1000 frequencies and 800 words, of which 700 are in common, would mean that 700 have been applied and 100 words have none applied.
The message could say that x word frequency values have been read from the Gaia file, and y words from the word list.

It is just wording. Nothing essential.
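
To make the counting concrete, here is a small sketch of the numbers such a message could report. The function name `frequency_report` and the key names are made up for illustration and do not reflect how the dictionary builder actually counts:

```python
def frequency_report(frequencies, words):
    """Count frequencies read, words read, and how many words got a frequency.

    frequencies: dict mapping word -> frequency value
    words: set of word forms in the spelling dictionary
    """
    applied = sum(1 for word in words if word in frequencies)
    return {
        "frequencies_read": len(frequencies),
        "words_read": len(words),
        "applied": applied,
        "words_without_frequency": len(words) - applied,
        "frequencies_unused": len(frequencies) - applied,
    }


# The example from the post: 1000 frequencies, 800 words, 700 in common.
frequencies = {"w%d" % i: 1 for i in range(1000)}
words = {"w%d" % i for i in range(700)} | {"x%d" % i for i in range(100)}
print(frequency_report(frequencies, words))
```

For these inputs, 700 frequencies are applied, 100 words are left without a frequency, and 300 frequency values go unused.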