Spellchecker, words status (again)

Ruud_Baars · August 27, 2017, 9:13am

I really feel spell checking is more than right/wrong. As far as a single word is concerned, statuses might be:

certainly wrong
officially correct
and lots of variants in between, which might require some kind of info to explain what the matter is. There might be several categories here, like ‘considered insulting’, ‘considered very formal’, ‘considered very informal’, ‘considered archaic’, ‘can be correct, but often a mistake for ####’.

Am I the only one thinking along these lines?

tiagosantos · August 27, 2017, 1:27pm

Remember the priority system, created by Jaume, that I told you about?
It does that.

Look at the example.
HUNSPELL_RULE - the spellchecker class that [pt] uses - only reports if no other rule catches the word before it and provides more fitting advice. The best thing about it is that you can disable that advice, or add extra advice/analysis, by enabling/disabling the related rules.

You can get fairly decent lists from wikipedia and wiktionary.

  @Override
  public int getPriorityForId(String id) {
    switch (id) {
...
      case "PROFANITY":                 return -6;
      case "PT_MULTI_REPLACE":          return -10;
      case "PT_PT_SIMPLE_REPLACE":      return -11;
      case "PT_REDUNDANCY_REPLACE":     return -12;
      case "PT_WORDINESS_REPLACE":      return -13;
      case "PT_CLICHE_REPLACE":         return -17;
      case "CHILDISH_LANGUAGE":         return -25;
      case "ARCHAISMS":                 return -26;
      case "INFORMALITIES":             return -27;
      case "PUFFERY":                   return -30;
      case "BIASED_OPINION_WORDS":      return -31;
      case "WEAK_WORDS":                return -32;
      case "PT_AGREEMENT_REPLACE":      return -35;
      case "HUNSPELL_RULE":             return -50;
...
    }
    return 0;
  }
}

Don’t be intimidated: you will not have to code anything more complex than what you already do in Java. This is really just an encapsulated item list.
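
To make the effect of those numbers concrete, here is a minimal sketch of how a priority lookup could decide which rule gets to report a word that several rules flag. It only reuses a few of the IDs and values quoted above; the class `PrioritySketch` and the method `pickWinner` are hypothetical names for illustration and are not LanguageTool's actual classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not LanguageTool's actual implementation.
class PrioritySketch {

  // Subset of the priorities quoted above; unknown IDs default to 0.
  static int getPriorityForId(String id) {
    switch (id) {
      case "PROFANITY":     return -6;
      case "ARCHAISMS":     return -26;
      case "INFORMALITIES": return -27;
      case "HUNSPELL_RULE": return -50;
      default:              return 0;
    }
  }

  // Among the rules that flagged the same word, keep the one with the highest priority.
  static String pickWinner(List<String> matchingRuleIds) {
    String winner = null;
    for (String id : matchingRuleIds) {
      if (winner == null || getPriorityForId(id) > getPriorityForId(winner)) {
        winner = id;
      }
    }
    return winner;
  }

  public static void main(String[] args) {
    // The archaism rule outranks the generic spelling rule.
    System.out.println(pickWinner(Arrays.asList("HUNSPELL_RULE", "ARCHAISMS")));
    // With no competing rule, the spelling rule reports.
    System.out.println(pickWinner(Arrays.asList("HUNSPELL_RULE")));
  }
}
```

Because HUNSPELL_RULE has the lowest value in the list, it only wins when nothing more specific matched the same word.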

Ruud_Baars · August 28, 2017, 6:20am

I guess this is not about spellchecking. I was not talking about priorities, just about different statuses for spellchecking. Just black-and-white is not good enough. There are at least three: known to be correctly spelled (e.g. speed), known to be incorrectly spelled (spead), or just not known. This might require at least two lists: one of words correctly spelled, one of words known to be incorrect; words in the latter and not in both could be used to generate suggestions. But there might also be advice for known words.
So I myself would plead for just one list, with statuses correct and incorrect, either of them with remarks (and/or suggestions).
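
A single list with statuses, as proposed here, could be sketched roughly as follows. Everything in it is an assumption made for illustration: the class `WordStatusList`, the `Status` enum, and the remark text are hypothetical names and data, and do not describe any existing LanguageTool or Hunspell structure. Words that are not in the list come back as "unknown" instead of "wrong".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of one word list with statuses; all names are illustrative.
class WordStatusList {

  enum Status { CORRECT, INCORRECT, UNKNOWN }

  static final class Entry {
    final Status status;
    final String remark;            // free-text advice, may be empty
    final List<String> suggestions; // may be empty

    Entry(Status status, String remark, List<String> suggestions) {
      this.status = status;
      this.remark = remark;
      this.suggestions = suggestions;
    }
  }

  // Returned for every word that is not in the list.
  private static final Entry UNKNOWN_ENTRY =
      new Entry(Status.UNKNOWN, "", Collections.<String>emptyList());

  private final Map<String, Entry> entries = new HashMap<>();

  void add(String word, Status status, String remark, String... suggestions) {
    entries.put(word, new Entry(status, remark, Arrays.asList(suggestions)));
  }

  Entry lookup(String word) {
    Entry entry = entries.get(word);
    return entry != null ? entry : UNKNOWN_ENTRY;
  }

  public static void main(String[] args) {
    WordStatusList list = new WordStatusList();
    list.add("speed", Status.CORRECT, "");
    list.add("spead", Status.INCORRECT, "known misspelling", "speed");

    for (String word : new String[] {"speed", "spead", "zorp"}) {
      Entry e = list.lookup(word);
      System.out.println(word + ": " + e.status + " " + e.suggestions);
    }
  }
}
```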

Discostu · August 28, 2017, 7:11am

For German, we do this with rules in grammar.xml. This way users get a special message for colloquialisms, gender neutrality, and words that have more than one correct spelling, one of which is preferred by the Duden dictionary.

But of course writing a whole rule for every word isn’t very efficient and something like this would be much more handy:

Entered word…suggested word(s)…category
Hard Disk…Harddisk…Duden suggestion
geil…gut, großartig, toll…Colloquialisms

And then a pre-defined message for each category.
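
A rough sketch of what such a list could look like in code, using the two example rows above. It is not an existing LanguageTool feature: the class `CategoryListSketch`, the method `advise`, and the message wording are invented for illustration, and the sketch assumes a semicolon as the column separator instead of the "…" used in the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class, method, and message texts are made up for illustration.
class CategoryListSketch {

  // One pre-defined message per category (wording invented here).
  static final Map<String, String> MESSAGES = new HashMap<>();
  static {
    MESSAGES.put("Duden suggestion", "The Duden dictionary prefers another spelling.");
    MESSAGES.put("Colloquialisms", "This word is colloquial.");
  }

  // Assumed line format: entered word;suggested word(s);category
  static final String[] LINES = {
    "Hard Disk;Harddisk;Duden suggestion",
    "geil;gut, großartig, toll;Colloquialisms"
  };

  // Returns the advice for a listed word, or null if the word is not in the list.
  static String advise(String word) {
    for (String line : LINES) {
      String[] cols = line.split(";");
      if (cols.length == 3 && cols[0].equals(word)) {
        String message = MESSAGES.get(cols[2]);
        if (message == null) {
          message = "Category: " + cols[2] + "."; // fallback for categories without a message
        }
        return message + " Suggestions: " + cols[1];
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(advise("Hard Disk"));
    System.out.println(advise("geil"));
    System.out.println(advise("Haus")); // not listed
  }
}
```

With this shape, adding a word is one line of data and adding a category is one message, which is the efficiency gain over writing a whole rule per word.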

Knorr · August 28, 2017, 4:13pm

From a more abstract level, a word is either correct (= contained in the dictionary) or wrong (= not in the dictionary).
For style issues, gender neutrality etc. one has to write one or more rules (and each rule can have a different priority). There are good examples of very generic rules which do not require a one-rule-per-word approach (e.g., gender neutrality in German).

If you do not want to provide a suggestion you could also write another rule where you collect all inappropriate words and add a very generic message (e.g., “The word \1 is not acceptable in formal writing”)

@Discostu: For your example “Hard Disk” → “Harddisk” you might add it to compounds.txt which is a generic rule to handle (non-)hyphenation of words.

Ruud_Baars · August 28, 2017, 4:24pm

I completely disagree. Every list has a certain number of words, the known words. But every language is flexible enough to generate words when needed. Every day new words appear. This means there is a dictionary with words that are correct, maybe a list for the ones that are wrong, and a huge list of words without a known status. And for Dutch, there is an official spelling, but that one only applies to government and education. Others may use other spellings, and they do.

tiagosantos · August 28, 2017, 4:53pm

I am not sure I understand what you mean, or if there is a clear concept, because it seems to keep changing.

Anyway:

… you want a list of all (~a great deal of) compounds that are valid for your language, following common affixation rules. A fast and hackish way to do it:

Get the dict, for example, here:
nl_NL - libreoffice/dictionaries - main, development dictionaries repository
unmunch nl_NL.dic nl_NL.aff or…
*** Add a hunspell rule with all base forms, using a .dic file in this format: word/[all affix letters]\n and use the same .aff
In the hunspell rule, name the error string something such as “Not a possible Dutch word”.
And then set the Morphologic priority higher than this general hunspell rule.
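
The .dic step above could be sketched as follows. This is only an illustration under stated assumptions: the class `DicFileSketch`, the method `buildDic`, the example words, and the flag letters "AB" are all hypothetical, and the sketch assumes Hunspell's default single-character flags. A dictionary whose .aff declares another flag type (for example `FLAG long` or `FLAG num`) would need the flags written differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; words and flag letters are placeholders, not taken from nl_NL.
class DicFileSketch {

  // Builds the text of a Hunspell .dic file: the first line holds the entry count,
  // then one "word/FLAGS" entry per line, every base form carrying all affix flags.
  static String buildDic(List<String> baseForms, String allAffixFlags) {
    StringBuilder sb = new StringBuilder();
    sb.append(baseForms.size()).append('\n');
    for (String word : baseForms) {
      sb.append(word).append('/').append(allAffixFlags).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // "AB" stands in for the full set of affix flags defined in the .aff file.
    System.out.print(buildDic(Arrays.asList("fiets", "huis"), "AB"));
  }
}
```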

Anyway, you will need a rule to make that list. And I believe that the size of that list is not usable, so maybe the solution you want is just to check each word against that list. It may be feasible, but I am not sure if it is a good compromise in relation to the resources it takes.

And it could be done with other rules… and priorities.

Discostu · August 28, 2017, 5:15pm

Yes, but then it would be marked as incorrect, which it isn’t.

Ruud_Baars · August 29, 2017, 7:27am

That is what we did already (by the way, unmunch does not work because of the use of advanced features, but we worked around that smartly). I am also part of the team maintaining the Hunspell dictionary, and I maintain a lot more words for my own business.

Still there is more to spellchecking than only correct or incorrect. But I will stop this discussion.