Testcase: German sentence begin error detection false positives

In response to the answer of Daniel in the bug report (LanguageTool / Bugs / #185 Sentence begin error detection not working in some cases)

Using this testcase I can reproduce the problem:

–TestLanguageToolBug.java–

import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.List;

import org.junit.Test;
import org.languagetool.JLanguageTool;
import org.languagetool.language.German;
import org.languagetool.language.GermanyGerman;
import org.languagetool.rules.RuleMatch;

public class TestLanguageToolBug {

    @Test
    public void testLanguageToolBug() throws IOException {
        // Text as produced by the html2text transformation
        String txt = " Wir heiraten am 19.09.12 und würden uns freuen, wenn Ihr diesen"
                + " ganz besonderen Tag mit uns feiern würdet. Zur kirchlichen Trauung"
                + " in der St. Wolfgangkirche in Regensburg laden wir um 11:00 Uhr ein."
                + " Danach werden wir gemeinsam im Gasthaus „Zur Post“ auf diesen Tag anstoßen und"
                + " beim gemeinsamen Essen feiern. Bitte gebt uns bis zum 02.09.12"
                + " Bescheid, ob Ihr kommen könnt. Wir freuen uns schon sehr auf Euch! ";
        German language = new GermanyGerman();
        JLanguageTool langTool = new JLanguageTool(language);
        langTool.activateDefaultPatternRules();
        // The input comes from a WYSIWYG editor, so whitespace issues are ignored
        langTool.disableRule("WHITESPACE_RULE");

        // Only the unknown word "Wolfgangkirche" should be reported
        List<RuleMatch> result = langTool.check(txt);
        assertEquals(1, result.size());
        assertEquals("Möglicher Rechtschreibfehler gefunden", result.get(0).getMessage());
    }

}

–End TestLanguageToolBug.java–

I expect to get one rule match, because “Wolfgangkirche” is unknown. Instead I get three errors.

The text which I feed into the spellchecker is the result of an html2text transformation, as the user can enter the text in a WYSIWYG editor in my application. Because of this I ignore the whitespace rule.

I have these Maven dependencies in my pom:

	<dependency>
		<groupId>org.languagetool</groupId>
		<artifactId>language-all</artifactId>
		<version>2.2</version>
	</dependency>
	<dependency>
		<groupId>org.languagetool</groupId>
		<artifactId>languagetool-server</artifactId>
		<version>2.2</version>
	</dependency>

Can you reproduce the problem using the testcase?

Thank you.

BTW: I did not get a notification that you wrote an answer to the bug report … do I have to

On 15.07.2013 17:05, Emmeran Seehuber [via LanguageTool User
Forum] wrote:

Using this testcase I can reproduce the problem:

That test works fine for me, i.e. I only get one error.

Could you debug this? The errors you don’t want are probably generated
in CaseRule.potentiallyAddUppercaseMatch().

It shouldn’t make a difference, but what version of Java are you using?
Does it also happen if you strip down the text to the minimum, e.g.
remove the superfluous whitespace?

BTW: I did not get a notification that you wrote an answer to the bug
report … do I have to

I don’t know, maybe Sourceforge doesn’t support that yet in their new
interface. In this forum, there’s a checkbox below the text area where
you can activate email notifications.

Regards
Daniel

I’m using JDK 1.7. I just tried the test with JDK 1.6 and got the same error. (I’m working on Linux with LANG=de_DE.UTF-8, but this should not make any difference, should it?)

Stripping the superfluous whitespace does not help here.

I already tried to debug this problem. I think the problem is that the sentences are not tokenized correctly. I even used some reflection tricks to override the sentenceTokenizer in JLanguageTool with a SentenceTokenizer instead of the (default) SRXSentenceTokenizer. It did not make a difference :(

After tokenizing, the first sentence (sentences.get(0)) is:

"Wir heiraten am 19.09.12 und würden uns freuen, wenn Ihr diesen ganz besonderen Tag mit uns feiern würdet. Zur kirchlichen Trauung in der St. Wolfgangkirche in Regensburg laden wir um 11:00 Uhr ein. "

So the CaseRule is right to report an error here. But why was the sentence tokenized incorrectly? As far as I understand the tokenizer, this is done by some fine-tuned regular expressions. The only rule that matches in SrxTextIterator#initMatchers() is "[.!?…][\u0002|’|”|«|)|]|}]?\s+" - but it matches too much.

AHH, I found the problem: the space after “würdet.” is a non-breaking space (Unicode 00A0, UTF-8 C2A0). Mac users can type this non-breaking space quite easily, and &nbsp; in HTML also translates to it. When posting in this forum, the non-breaking space is replaced by a normal space, so you cannot reproduce the bug from the posted text.
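For anyone hunting a similar invisible character, here is a minimal sketch of how such spaces can be located in a string. It is not part of LanguageTool; the class and method names are made up for illustration. It relies on the fact that Character.isSpaceChar() accepts the non-breaking spaces (U+00A0, U+2007, U+202F) while Character.isWhitespace() explicitly rejects them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
public class SpaceInspector {

    /**
     * Returns the indices of all characters that are Unicode space
     * separators but are not whitespace according to
     * Character.isWhitespace(), i.e. the non-breaking spaces.
     */
    public static List<Integer> findNonBreakingSpaces(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isSpaceChar(c) && !Character.isWhitespace(c)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // "wuerdet." with u-umlaut, followed by a non-breaking space
        String sample = "w\u00FCrdet.\u00A0Zur kirchlichen Trauung";
        for (int pos : findNonBreakingSpaces(sample)) {
            System.out.printf("index %d: U+%04X%n", pos, (int) sample.charAt(pos));
        }
    }
}
```

Running this over the text before it is passed to langTool.check() shows whether the editor output contains such characters.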

=> I can now work around this problem on my side.
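The workaround is not spelled out in this thread; one possible version, as a sketch only, is to replace the non-breaking space with a normal space before the text reaches the checker. The class and method names below are hypothetical.

```java
// Hypothetical pre-processing step, not part of LanguageTool.
public class NbspNormalizer {

    /**
     * Replaces every non-breaking space (U+00A0) with a normal space.
     * The replacement is one character for one character, so the
     * string length and all character offsets stay the same.
     */
    public static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String raw = "feiern w\u00FCrdet.\u00A0Zur kirchlichen Trauung";
        String cleaned = normalizeSpaces(raw);
        System.out.println(cleaned.indexOf('\u00A0')); // -1: no non-breaking space left
    }
}
```

In the test case above this would mean calling langTool.check(NbspNormalizer.normalizeSpaces(txt)). Because the length is unchanged, the positions reported in the rule matches still line up with the original text.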

It is documented that non-breaking spaces do not match \s (see Character#isWhitespace). But is it the right behavior from a spellchecker’s point of view?
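The \s behavior can be checked with plain java.util.regex, independent of LanguageTool. The sketch below compares the default \s with two alternatives: the UNICODE_CHARACTER_CLASS flag (only available from Java 7 on) and a character class that lists U+00A0 explicitly. Whether either would be suitable for the sentence-splitting rules is an open question; the class name is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of LanguageTool.
public class WhitespaceMatchDemo {

    // Default \s covers only [ \t\n\x0B\f\r], so U+00A0 is not matched.
    public static final Pattern DEFAULT_WS = Pattern.compile("\\s");

    // Java 7+: with this flag, \s follows the Unicode White_Space property.
    public static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // Also works on older JDKs: name the non-breaking space explicitly.
    public static final Pattern EXPLICIT_WS = Pattern.compile("[\\s\\u00A0]");

    /** Returns true if the whole input matches the given pattern. */
    public static boolean isMatch(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println("default:      " + isMatch(DEFAULT_WS, nbsp));
        System.out.println("unicode flag: " + isMatch(UNICODE_WS, nbsp));
        System.out.println("explicit:     " + isMatch(EXPLICIT_WS, nbsp));
        System.out.println("isWhitespace: " + Character.isWhitespace('\u00A0'));
    }
}
```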

AHH, I found the problem: the space after “würdet.” is a non-breaking
space (Unicode 00A0, UTF-8 C2A0).

Thanks for keeping us up-to-date… other people might run into the
same issue.

That non-breaking spaces do not match on \s is documented
(Character#isWhitespace). But is it the right behavior from a
spellchecker point of view?

I’m not sure. I also asked on the mailing list but nobody seems to have
an opinion about this.

Regards
Daniel