Spellchecker, words status (again)

I really feel spell checking is more than right/wrong. As far as a single word is concerned, statuses might be:

  • certainly wrong
  • officially correct
  • and lots of variants in between, which might require some kind of info to explain what the matter is. There might be several categories here, like ‘considered insulting’, ‘considered very formal’, ‘considered very informal’, ‘considered archaic’, ‘can be correct, but often a mistake for ####’.

Am I the only one thinking along these lines?

Remember the priority system, created by Jaume, that I told you about?
It does that.

Look at the example.
HUNSPELL_RULE, the spellchecker class that [pt] uses, only reports a word if no other rule catches it first and provides more fitting advice. The best thing about it is that you can disable that advice or add extra advice/analysis by enabling or disabling the related rules.

You can get fairly decent lists from wikipedia and wiktionary.

  @Override
  public int getPriorityForId(String id) {
    switch (id) {
...
      case "PROFANITY":                 return -6;
      case "PT_MULTI_REPLACE":          return -10;
      case "PT_PT_SIMPLE_REPLACE":      return -11;
      case "PT_REDUNDANCY_REPLACE":     return -12;
      case "PT_WORDINESS_REPLACE":      return -13;
      case "PT_CLICHE_REPLACE":         return -17;
      case "CHILDISH_LANGUAGE":         return -25;
      case "ARCHAISMS":                 return -26;
      case "INFORMALITIES":             return -27;
      case "PUFFERY":                   return -30;
      case "BIASED_OPINION_WORDS":      return -31;
      case "WEAK_WORDS":                return -32;
      case "PT_AGREEMENT_REPLACE":      return -35;
      case "HUNSPELL_RULE":             return -50;
...
    }
    return 0;
  }
}
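To illustrate the idea only: a minimal sketch of how such priorities could decide which rule gets to report a word. This is not the actual LanguageTool implementation; the class PrioritySketch and the methods priorityOf and winner are hypothetical names, and only four of the values from the excerpt above are copied in.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class PrioritySketch {
    // Hypothetical table; the numbers are taken from the excerpt above.
    static final Map<String, Integer> PRIORITY = Map.of(
        "PROFANITY", -6,
        "ARCHAISMS", -26,
        "INFORMALITIES", -27,
        "HUNSPELL_RULE", -50);

    // Rules that are not listed fall back to 0, like the default return in the excerpt.
    static int priorityOf(String ruleId) {
        return PRIORITY.getOrDefault(ruleId, 0);
    }

    // Of all rules that matched the same word, keep the one with the highest priority.
    static Optional<String> winner(List<String> matchingRuleIds) {
        return matchingRuleIds.stream()
            .max(Comparator.comparingInt(PrioritySketch::priorityOf));
    }

    public static void main(String[] args) {
        // An archaic word that is also missing from the dictionary:
        // the more specific rule wins over the generic spelling rule.
        System.out.println(winner(List.of("HUNSPELL_RULE", "ARCHAISMS")).orElse("none"));
    }
}
```

With this ordering the generic spelling rule only speaks up when nothing more specific matched, which is the behaviour described above.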

Don’t be intimidated because you will not have to code anything more complex than what you already do in Java. This is really just an encapsulated item list.

I guess this is not about spellchecking. I was not talking about priorities, just about different statuses for spellchecking. Just black and white is not good enough. There are at least three: known to be correctly spelled (e.g. speed), known to be incorrectly spelled (spead), or just not known. This might require at least two lists: one of words known to be correctly spelled and one of words known to be incorrect; words that are only in the latter could be used to generate suggestions. But there might also be advice for known words.
So I myself would plead for just one list, with the statuses correct and incorrect, either of them with remarks (and/or suggestions).
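A rough sketch of what such a single list could look like, assuming a simple in-memory map. The class WordStatusList, the Status enum and the Entry record are hypothetical names made up for this illustration; nothing here is existing LanguageTool or Hunspell API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordStatusList {
    // Three states: a word that is in the list as correct, in the list as incorrect,
    // or not in the list at all.
    enum Status { CORRECT, INCORRECT, UNKNOWN }

    // One entry per word: its status plus an optional remark and suggestions.
    record Entry(Status status, String remark, List<String> suggestions) {}

    private final Map<String, Entry> entries = new HashMap<>();

    void add(String word, Status status, String remark, List<String> suggestions) {
        entries.put(word, new Entry(status, remark, List.copyOf(suggestions)));
    }

    // Words that are not listed are reported as UNKNOWN rather than as wrong.
    Entry lookup(String word) {
        return entries.getOrDefault(word, new Entry(Status.UNKNOWN, "", List.of()));
    }

    public static void main(String[] args) {
        WordStatusList list = new WordStatusList();
        list.add("speed", Status.CORRECT, "", List.of());
        list.add("spead", Status.INCORRECT, "common misspelling", List.of("speed"));
        System.out.println(list.lookup("spead"));
        System.out.println(list.lookup("speeed").status());
    }
}
```

The point of the third state is that a word missing from the list is not automatically flagged as an error.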

For German, we do this with rules in grammar.xml. This way users get a special message for colloquialisms, gender neutrality, and words that have more than one correct spelling, one of which is preferred by the Duden dictionary.

But of course writing a whole rule for every word isn’t very efficient and something like this would be much more handy:

Entered word | Suggested word(s)    | Category
Hard Disk    | Harddisk             | Duden suggestion
geil         | gut, großartig, toll | Colloquialisms

And then a pre-defined message for each category.
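A minimal sketch of how such a table plus one message per category might work. The class CategoryTable, the record Row and the method advise are hypothetical, and the message texts are invented for this example; only the two table rows come from the post above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class CategoryTable {
    // One table row: the suggested word(s) and the category of the entered word.
    record Row(List<String> suggestions, String category) {}

    // One pre-defined message per category (wording is made up for this sketch).
    static final Map<String, String> MESSAGES = Map.of(
        "Duden suggestion", "The Duden prefers a different spelling.",
        "Colloquialisms", "This word is colloquial.");

    // The table itself, keyed by the entered word.
    static final Map<String, Row> ROWS = Map.of(
        "Hard Disk", new Row(List.of("Harddisk"), "Duden suggestion"),
        "geil", new Row(List.of("gut", "großartig", "toll"), "Colloquialisms"));

    // Returns the category message followed by the suggestions, or empty if the word is not listed.
    static Optional<String> advise(String entered) {
        Row row = ROWS.get(entered);
        if (row == null) {
            return Optional.empty();
        }
        return Optional.of(MESSAGES.get(row.category())
            + " Suggestions: " + String.join(", ", row.suggestions()));
    }

    public static void main(String[] args) {
        System.out.println(advise("Hard Disk").orElse("no advice"));
    }
}
```

Adding a word then means adding one row, not writing a whole rule.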

From a more abstract level, a word is either correct (= contained in the dictionary) or wrong (= not in the dictionary).
For style issues, gender neutrality etc. one has to write one or more rules (and each rule can have a different priority). There are good examples of very generic rules which do not require a one-rule-per-word approach (e.g., gender neutrality in German).

If you do not want to provide a suggestion you could also write another rule where you collect all inappropriate words and add a very generic message (e.g., “The word \1 is not acceptable in formal writing”)

@Discostu: For your example “Hard Disk” → “Harddisk” you might add it to compounds.txt which is a generic rule to handle (non-)hyphenation of words.

I completely disagree. Every list has a certain number of words, the known words. But every language is flexible enough to generate words when needed. Every day new words appear. This means there is a dictionary with words that are correct, maybe a list for the ones that are wrong, and a huge list of words without a known status. And for Dutch, there is an official spelling, but that one only applies to government and education. Others may use other spellings, and they do.

I am not sure I understand what you mean, or if there is a clear concept, because it seems to keep changing.

Anyway:

… you want a list of all (~a great deal of) compounds that are valid for your language, following common affixation rules. A fast and hackish way to do it:

  • Get the dict, for example, here:
    nl_NL - libreoffice/dictionaries - main, development dictionaries repository
  • unmunch nl_NL.dic nl_NL.aff or…
    add a hunspell rule whose .dic file lists every base form in the format word/[all affix letters], one per line, and use the same .aff
  • In the hunspell rule, set the error message to something such as “Not a possible Dutch word”.
  • Then set the Morphologic priority higher than that of this general hunspell rule.

Anyway, you will need a rule to make that list. And I believe that the size of that list is not usable, so maybe the solution you want is just to check each word against that list. It may be feasible, but I am not sure if it is a good compromise in relation to the resources it takes.

And it could be done with other rules… and priorities.

Yes, but then it would be marked as incorrect, which it isn’t.

That is what we did already (by the way, unmunch does not work because of the use of advanced features, but we worked around that smartly). I am also one of the maintainers of the Hunspell dictionary, and I maintain a lot more words for my own business.

Still there is more to spellchecking than only correct or incorrect. But I will stop this discussion.