Segment.srx: new rule for Jones v. Smith

To prevent LT from splitting a sentence on a word that contains a full stop, we use segment.srx (Customizing Sentence Segmentation In SRX Rules - LanguageTool Wiki).

  1. The file \languagetool-core\src\main\resources\org\languagetool\resource\segment.srx is in the GitHub clone, but it is not in the daily snapshot. If I add a rule to segment.srx, how do I test it to make sure that the new rule is correct?

  2. I see that the Ratel editor lets me “Apply the rules on input files for testing” (Ratel - Okapi Framework). Is that testing the reason for using an SRX editor?

  3. LT has incorrect segmentation for the second sentence:
    Correct: It’s a case of cats v. dogs.
    Incorrect: It’s a case of Jones v. Smith.

Is this rule correct to prevent a break on the ‘v.’ in Jones v. Smith? (At this stage, I do not want to spend time downloading and learning how to use an SRX editor.)

<rule break="no">
<beforebreak>\b[A-Z][a-z]+\sv\.\s[A-Z][a-z]+</beforebreak>
<afterbreak></afterbreak>
</rule>
  4. The SRX rules are cascading. Where should the rule go in the English section of segment.srx? At the end of the ‘no’ rules?

You can test the segmentation by running mvn test if you have a full developer setup. If unsure, you can also send the suggested change to this forum.

I must admit I have never used the SRX editor but always just edited the file directly.

v. is just another abbreviation, like others that should already be in the file, so the easiest approach is usually to find an existing abbreviation and add the new one to that list (the “list” is just a regexp).
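
As a rough illustration of "adding to the list", here is a small Python sketch. The shortened abbreviation list is hypothetical (the real one in segment.srx is much longer), and the sketch only checks the ‘beforebreak’ regexp, not the ‘afterbreak’ part or the order of the rules.

```python
import re

# Hypothetical, shortened abbreviation list; not copied from segment.srx.
before_original = r"\b(vs|approx|dept)\.\s"
# The same list with "v" added as one more alternative.
before_extended = r"\b(vs|approx|dept|v)\.\s"

sample = "It's a case of Jones v. Smith."

# The original list does not cover "v.", the extended one does.
match_original = re.search(before_original, sample)
match_extended = re.search(before_extended, sample)

print(match_original is None)
print(match_extended.group(0) if match_extended else None)
```

If the extended pattern matches where the original did not, the new abbreviation is at least covered by the regexp; whether LT then keeps the sentence together still has to be checked with mvn test or in the GUI.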

Some example sentences with wrong segmentation in LT:

  • This is Thm. 1.
  • This is Lem. 1.
  • This is Prop. 1.
  • This is Def. 1.
  • This is Thm. (1).
  • This is Eq. (1).
  • It can be seen loc. cit.
  • It can be seen ibid. some text.
  • It can be seen idem. some text.

@dnaber, thanks. I will try later today.

@BebraiPuola, thanks. I will see what I can do. No guarantees; as you can see from my question, I am not an expert.

@BebraiPuola, I corrected some of the errors ([en] Improve segmentation · languagetool-org/languagetool@e76a31e · GitHub).

@dnaber,

OK, I did this. But, how does this work if there is no example text against which to test the new segmentation rule?

(I used mvn package -DskipTests to make a GUI. Then, I pasted the examples into the GUI to verify that LT did not split the sentences.)

I didn’t find one, so I made this new rule:

<rule break="no">
<beforebreak>\p{Lu}\p{L}+\sv\.\s\p{Lu}\p{L}+</beforebreak>
<afterbreak></afterbreak>
</rule>

The rule does not give the expected result. As best I can tell, the last rule for English splits the text on the full stop. The wiki tells me “No-break rules should precede the break rules in the file”. So, is it possible to prevent sentence splitting on text such as “Jones v. Smith”? If yes, how?


I meant entries like this, which could - I think - just be extended:

<beforebreak>\b(pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl?|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|dia|lbs|\d+-(:?oz|kc|in|h[rp]|ml)|M?sec)\.\s</beforebreak>

In a “no break” rule, the ‘afterbreak’ section has to match the text that comes after the point where the split would happen without that rule; the ‘beforebreak’ section only sees the text up to that point. So, to make it work as you expect, you would have to transform your rule into this:

<rule break="no">
<beforebreak>\p{Lu}\p{L}+\sv\.\s</beforebreak>
<afterbreak>\p{Lu}\p{L}+</afterbreak>
</rule>
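
To show why the split matters, here is a minimal Python sketch of how a rule is checked at one candidate position, assuming that ‘beforebreak’ must match text ending at that position and ‘afterbreak’ must match text starting there. The helper rule_matches is hypothetical and is not LT code. Python's re module does not support \p{Lu} and \p{L}, so the ASCII classes [A-Z] and [a-z] stand in for them.

```python
import re

def rule_matches(before, after, text, pos):
    """Hypothetical helper: True if `before` matches text ending at `pos`
    and `after` matches text starting at `pos`."""
    before_ok = re.search(f"(?:{before})$", text[:pos]) is not None
    after_ok = re.match(after, text[pos:]) is not None
    return before_ok and after_ok

text = "It's a case of Jones v. Smith."
pos = text.index("Smith")  # candidate split point, right after "v. "

# Whole name in beforebreak: "Smith" lies after the position, so no match.
whole_in_before = rule_matches(r"[A-Z][a-z]+\sv\.\s[A-Z][a-z]+", r"", text, pos)

# Name split across beforebreak and afterbreak: matches at the position.
split_rule = rule_matches(r"[A-Z][a-z]+\sv\.\s", r"[A-Z][a-z]+", text, pos)

print(whole_in_before, split_rule)
```

Under these assumptions only the second form applies at the position after “v. ”, which is consistent with the first rule having no effect.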

@tiagosantos, thanks. Done ([en] Improve segmentation · languagetool-org/languagetool@7d2b50d · GitHub).
