I’m using LanguageTool to perform some spell checking of Swedish text, but I’ve noticed that matches include adjacent punctuation, e.g. “Pnkt.” will list the match as 5 characters long.
With English, however, the punctuation is not included, e.g. “Dott.” is listed as 4 characters long.
I’m quite new to LanguageTool, so I have no idea where to start looking, but I am a developer, so with a little help I hope to be able to fix it.
This is caused by WORDCHARS -0123456789.: in sv_SE.aff. I'm not sure what side effects you get if you remove the dot from that line. It doesn’t happen in English because LT uses a different approach there, namely a binary (Morfologik) dictionary instead of the plain-text hunspell dictionaries used for Swedish.
You could try removing the dot and see if the tests (mvn test) still work.
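To illustrate why the dot in WORDCHARS changes the match length, here is a minimal Python sketch. It is a simplified model and not hunspell's or LanguageTool's actual tokenizer: it just assumes that a word is a run of letters plus whatever extra characters WORDCHARS lists. The function name first_token is made up for this example.

```python
import re


def first_token(text: str, wordchars: str) -> str:
    """Return the first 'word' in text under a simplified WORDCHARS model.

    Assumption: a word is a run of \\w characters plus any extra characters
    listed in `wordchars`. This only approximates how a WORDCHARS line in an
    .aff file widens what counts as part of a word.
    """
    pattern = re.compile(rf"[\w{re.escape(wordchars)}]+")
    match = pattern.search(text)
    return match.group(0) if match else ""


# WORDCHARS as in the thread, with the dot: the trailing dot joins the word.
with_dot = first_token("Pnkt.", "-0123456789.:")

# Same line with the dot removed: the word stops before the punctuation.
without_dot = first_token("Pnkt.", "-0123456789:")

print(repr(with_dot), len(with_dot))
print(repr(without_dot), len(without_dot))
```

Under this model the first call yields a 5-character token and the second a 4-character one, which matches the difference reported above between Swedish and English.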
Thanks for the suggestion - it seems to work great! All the tests pass, and I did some manual testing that seemed to work. I created a pull request on GitHub with the change.
Do you know why this was included to start with? At first I thought that it might be to allow for . in abbreviations such as “t.ex.” but that doesn’t seem to work even before the change. That is, the parser thinks that the word after the abbreviation is the start of a new sentence, which it isn’t always.
I tried to use the rule editor, but couldn’t get it to parse the abbreviation as a word in the sentence and not the end of the sentence. But perhaps I should open up another issue regarding that?
It’s just the original .aff file we’re using here. Other languages like German include the dot because there are abbreviations like bzw. where the dot is part of the word. Here are some words from Swedish that end with a dot; do they still work as expected?
Yeah, that’s what I thought. But as I was trying (poorly) to explain - it didn’t seem to work properly even when the dot was there. The parser does include the dot if I, for example, misspell t.ex. as t.exx., but it still suggests that the word after it should start with a capital letter. It thinks it is the start of a new sentence because it is preceded by a dot.
This is the output from Show analysis on the rule editor page:
Even when removing the . from WORDCHARS the dot is still included in the match. But the problem with the sentence ending also remains.
That’s because spell checking is independent of almost everything else in LT. To change sentence segmentation, this file needs to be adapted. It would need to know about the abbreviations. However, a typo would still confuse the system, so the user needs to correct the spelling errors first in order to make the other errors more sensible.
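The idea of teaching the segmenter about abbreviations can be sketched roughly as follows. This is only an illustration in Python, not LanguageTool's segmentation code or its rule format: the function split_sentences and the abbreviation list are hypothetical, and the real rules live in the file mentioned above.

```python
import re

# Illustrative list only; a real rule set would need many more entries.
NO_BREAK_ABBREVIATIONS = frozenset({"t.ex.", "bl.a."})


def split_sentences(text, abbreviations=NO_BREAK_ABBREVIATIONS):
    """Split on a dot followed by whitespace, except after known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        dot_end = match.start() + 1
        # The last whitespace-delimited token that ends at this dot.
        token = text[start:dot_end].split()[-1]
        if token.lower() in abbreviations:
            continue  # "no break" exception: the dot belongs to the abbreviation
        sentences.append(text[start:dot_end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Jag gillar frukt, t.ex. äpplen. De är goda."
print(split_sentences(sample))                       # abbreviation-aware
print(split_sentences(sample, abbreviations=set()))  # naive: breaks after "t.ex."
```

Note that a misspelt abbreviation such as t.exx. is not in the list, so it would still trigger a sentence break, which is why the spelling needs to be corrected first.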
Ah, ok, that makes sense. I’ll try and update that file then, as the error is there even when there is no spelling mistake.
Should I make the rule for Swedish only, i.e. languagerulename="Swedish"?
And is it ok to change sv from languagerulename="Generic" to languagerulename="Swedish"? I assume that the generic rules will still apply?
What is the <okpsrx:sample ...> used for - and should I update that as well? From language="pl_two" I assume it is supposed to be in Polish, but it looks like it also contains English and Dutch. Is that to test the generic cases perhaps?
and create a new section <languagerule languagerulename="Swedish">. I’ve never used okpsrx:sample I think, so it can probably just be ignored. I think it’s used if you use the GUI that can edit that file (see here).