Gaia files

Ruud_Baars · September 26, 2017, 11:03am

The Gaia file with frequencies is extremely limited compared with my own frequency data.
Of course I can generate a file a bit like that, but does anyone know what the numbers mean in the file? It cannot be plain counts, since the count of ‘de’ is very low, or the corpus must have been extremely small.

Would it be a problem for the dictionary creation if I generated a file with the real word count in it, up to one million or more?

dnaber · September 26, 2017, 11:20am

I’m not sure; this would need to be tested. I assume the Gaia files use a logarithmic scale. But I’m not sure whether a larger dictionary with more frequencies will actually improve the suggestions, so I suggest testing that carefully.

Ruud_Baars · September 26, 2017, 11:31am

I assume the same. But I could not find any specs, just references to other references.
I am quite sure, though, that more words will improve the suggestion order, if only for new popular words like ‘selfie’; the current list is 4 years old!
Nevertheless, I will test it.

Ruud_Baars · September 26, 2017, 11:44am

How are the numbers in the Gaia file transformed into the letters in the dictionary?

dnaber · September 26, 2017, 11:49am

Looks like the input number is scaled to 0-255 (logarithmically when the maximum frequency exceeds 255) and then mapped linearly to A-Z:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-tools/src/main/java/org/languagetool/tools/DictionaryBuilder.java#L177
      
          String key = m.group(1);
          if (freqList.containsKey(key)) {
            freq = freqList.get(key);
            freqValuesApplied++;
          }
          int normalizedFreq = freq;
          if (freq > 0 && maxFreq > 255) {
            double freqZeroToOne = Math.log(freq) / maxFreqLog;  // spread number better over the range
            normalizedFreq = (int) (freqZeroToOne * (FREQ_RANGES_IN-1));  // 0 to 255
          }
          if (normalizedFreq < 0 || normalizedFreq > 255) {
            throw new RuntimeException("Frequency out of range (0-255): " + normalizedFreq + " in word " + key);
          }
          // Convert integers 0-255 to ranges A-Z, and write output 
          String freqChar = Character.toString((char) (FIRST_RANGE_CODE + normalizedFreq*FREQ_RANGES_OUT/FREQ_RANGES_IN));
          //add separator only in speller dictionaries
          if (useSeparator) { 
            bw.write(line + separator + freqChar + "\n");  
          } else {
            bw.write(line + freqChar + "\n");
          }
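
The mapping in the snippet above can be tried in isolation. This is only a minimal sketch, not the actual DictionaryBuilder code: the constant values (`FIRST_RANGE_CODE = 'A'`, `FREQ_RANGES_IN = 256`, `FREQ_RANGES_OUT = 26`) are assumptions inferred from the comments "0 to 255" and "ranges A-Z", and `maxFreqLog` is assumed to be `Math.log(maxFreq)`. The class and method names are made up for illustration.

```java
// Hypothetical stand-alone sketch of the frequency-to-letter mapping.
class FreqMappingSketch {
    static final int FIRST_RANGE_CODE = 'A'; // assumed: first output letter
    static final int FREQ_RANGES_IN = 256;   // assumed: input values 0-255
    static final int FREQ_RANGES_OUT = 26;   // assumed: output letters A-Z

    static char toFreqChar(int freq, int maxFreq) {
        int normalizedFreq = freq;
        if (freq > 0 && maxFreq > 255) {
            // Log scaling: 1 maps to 0.0, maxFreq maps to 1.0
            double freqZeroToOne = Math.log(freq) / Math.log(maxFreq);
            normalizedFreq = (int) (freqZeroToOne * (FREQ_RANGES_IN - 1)); // 0 to 255
        }
        if (normalizedFreq < 0 || normalizedFreq > 255) {
            throw new IllegalArgumentException("Frequency out of range (0-255): " + normalizedFreq);
        }
        // Integer arithmetic squeezes 0-255 into 26 buckets
        return (char) (FIRST_RANGE_CODE + normalizedFreq * FREQ_RANGES_OUT / FREQ_RANGES_IN);
    }

    public static void main(String[] args) {
        int maxFreq = 1_000_000;
        for (int freq : new int[] {1, 1000, 1_000_000}) {
            System.out.println(freq + " -> " + toFreqChar(freq, maxFreq));
        }
    }
}
```

Under these assumptions, with a maximum of 1,000,000 the values 1, 1000 and 1,000,000 come out as A, M and Z.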

Ruud_Baars · September 26, 2017, 11:54am

Great, thanks. I will test with the base-2 logarithm of the word count, multiplied by 100 and then turned into an integer.
The factor of 100 makes the numbers span a large enough range that not too much is lost when they are cut down to integers, which is what the current file appears to use.
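
A rough sketch of that conversion, for anyone who wants to try the same thing. The class and method names are hypothetical, and it rounds to the nearest integer rather than truncating, which is a choice made here to avoid floating-point results such as 9.999... turning into 999.

```java
// Hypothetical helper: raw corpus count -> log2(count) * 100 as an integer.
class LogCountSketch {
    static int scaledLog2(long count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be >= 1");
        }
        double log2 = Math.log(count) / Math.log(2); // change of base
        return (int) Math.round(log2 * 100);
    }

    public static void main(String[] args) {
        for (long count : new long[] {1, 2, 1024, 1_000_000}) {
            System.out.println(count + " -> " + scaledLog2(count));
        }
    }
}
```

A count of 1,000,000 gives 1993 with this scheme, so the values stay within a few thousand even for very frequent words.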

Ruud_Baars · September 26, 2017, 12:04pm

This routine gives 43945797 frequency values applied in 1973988 word forms for me. But that is not correct; it actually reports the number of frequencies in the list, not the number applied!
I know it is cosmetic, but what is reported should be correct, right?

dnaber · September 26, 2017, 5:20pm

That means that your frequency list has 43,945,797 values and 1,973,988 of those have been “applied”, i.e. 1,973,988 of those are actually in the spelling dictionary. If that’s not correct, could you provide the files you’re using?

Ruud_Baars · September 26, 2017, 6:49pm

Yes, I understand that that is the essence. Nevertheless, in English, ‘apply’ can only be done with something on something, not with something on nothing. Having 1000 frequencies and 800 words, of which 700 are common to both, would mean 700 frequencies have been applied and 100 words got none.
The message could say that x word frequency values have been read from the Gaia file, and y words from the word list.

It is just wording. Nothing essential.
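
To make the distinction concrete, here is a small sketch that keeps the counts apart: frequencies read, words read, frequencies applied, and words left without one. It is an illustration only, with made-up names and data, and does not reflect how DictionaryBuilder produces its message.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical counter separating "read" from "applied".
class AppliedCountSketch {
    /** Returns {frequenciesRead, wordsRead, applied, wordsWithoutFrequency}. */
    static int[] count(Map<String, Integer> freqList, Collection<String> words) {
        int applied = 0;
        for (String word : words) {
            if (freqList.containsKey(word)) {
                applied++; // a frequency exists for this dictionary word
            }
        }
        return new int[] {freqList.size(), words.size(), applied, words.size() - applied};
    }

    public static void main(String[] args) {
        Map<String, Integer> freqList = new HashMap<>();
        freqList.put("de", 900);
        freqList.put("het", 800);
        freqList.put("selfie", 50);
        freqList.put("xyzzy", 1); // not in the word list
        Collection<String> words = Arrays.asList("de", "het", "selfie", "fiets");

        int[] c = count(freqList, words);
        System.out.println(c[0] + " frequency values read, " + c[1] + " words read, "
            + c[2] + " applied, " + c[3] + " words without a frequency");
    }
}
```

With the sample data this reports 4 frequency values read, 4 words read, 3 applied and 1 word without a frequency.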