SRX rule for "FRITZ!Box"

martin.von.wittich · December 11, 2016, 2:34am

Hi,

I’m currently trying to spellcheck our documentation, and I noticed that LanguageTool doesn’t like “FRITZ!Box” (the name of a common German DSL router):

www0.iserv.eu ~ # echo 'FRITZ!Box' | java -jar /root/LanguageTool-3.5/languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 7, Rule ID: DE_SENTENCE_WHITESPACE
Message: Fügen Sie zwischen Sätzen ein Leerzeichen ein
Suggestion: Box
FRITZ!Box
      ^^^
Time: 655ms for 2 sentences (3.1 sentences/sec)

That wasn’t as easy to fix as the other things I’ve stumbled over because LanguageTool is actually splitting the string into two sentences:

www0.iserv.eu ~ # echo 'FRITZ!Box' | java -jar /root/LanguageTool-3.5/languagetool-commandline.jar -l de-DE -t
Expected text language: German (Germany)
Working on STDIN...
<S> FRITZ[FRITZ/null,O]![</S>!/PKT,O]
<S> Box[Box/SUB:AKK:SIN:FEM,Box/SUB:DAT:SIN:FEM,Box/SUB:GEN:SIN:FEM,Box/SUB:NOM:SIN:FEM,</S>,B-NP|NPS]

Fortunately, I had just learned from Daniel’s commit in response to another of my bug reports that a file called segment.srx is responsible for splitting sentences. I extracted it from languagetool-core.jar and was able to figure out a rule that solves this:

<rule break="no">
  <beforebreak>(?i)FRITZ!</beforebreak>
  <afterbreak>(?i)Box</afterbreak>
</rule>
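For readers unfamiliar with SRX, here is a rough Python sketch of how such an exception rule interacts with a break rule. It assumes the usual SRX behaviour that rules are tried in order at each candidate position and the first rule whose beforebreak and afterbreak patterns both match decides whether to split. The `GENERIC_BREAK` rule is a hypothetical stand-in, not the actual rule from LanguageTool's segment.srx, and `segment` and `ends_at` are illustrative helpers rather than LanguageTool code.

```python
import re

# Each rule is (is_break, beforebreak pattern, afterbreak pattern).
# The exception rule from this thread:
NO_BREAK_FRITZ = (False, re.compile(r"(?i)FRITZ!"), re.compile(r"(?i)Box"))
# Hypothetical, simplified break rule (NOT taken from segment.srx):
GENERIC_BREAK = (True, re.compile(r"[.!?]"), re.compile(r"[A-ZÄÖÜ]"))


def ends_at(pattern, text, pos):
    """Return True if some substring of text ending exactly at pos matches."""
    return any(pattern.fullmatch(text, i, pos) for i in range(pos + 1))


def segment(text, rules):
    """Split text wherever the first matching rule at a position says break."""
    sentences, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if ends_at(before, text, pos) and after.match(text, pos):
                if is_break:
                    sentences.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; skip the rest
    sentences.append(text[start:])
    return sentences


if __name__ == "__main__":
    print(segment("FRITZ!Box", [GENERIC_BREAK]))
    print(segment("FRITZ!Box", [NO_BREAK_FRITZ, GENERIC_BREAK]))
```

Under this model the exception only helps if it is evaluated before the break rule that would otherwise fire, which is why the order of rules matters in the sketch.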

dnaber · December 11, 2016, 12:41pm

Thanks, I’ve added this to our segment.srx.