Segment.srx: new rule for Jones v. Smith


(Mike Unwalla) #1

To prevent LT from splitting a sentence on a word that contains a full stop, we use segment.srx (http://wiki.languagetool.org/customizing-sentence-segmentation-in-srx-rules).

  1. The file \languagetool-core\src\main\resources\org\languagetool\resource\segment.srx is in the GitHub clone, but it is not in the daily snapshot. If I add a rule to segment.srx, how do I test it to make sure that the new rule is correct?

  2. I see that the Ratel editor lets me “Apply the rules on input files for testing” (http://okapiframework.org/wiki/index.php?title=Ratel). Is that testing the reason for using an SRX editor?

  3. LT has incorrect segmentation for the second sentence:
    Correct: It’s a case of cats v. dogs.
    Incorrect: It’s a case of Jones v. Smith.

Is this rule correct to prevent a break on the ‘v.’ in Jones v. Smith? (At this stage, I do not want to spend time downloading and learning how to use an SRX editor.)

<rule break="no">
<beforebreak>\b[A-Z][a-z]+\sv\.\s[A-Z][a-z]+</beforebreak>
<afterbreak></afterbreak>
</rule>

  4. The SRX rules are cascading. Where should the rule go in the English section of segment.srx? At the end of the ‘no’ rules?

(Daniel Naber) #2

You can test the segmentation by running mvn test if you have a full developer setup. If unsure, you can also send the suggested change to this forum.

I must admit I have never used the SRX editor but have always just edited the file directly.

v. is just another abbreviation like others which should be in the file already, so the easiest approach is usually to find an existing abbreviation and add the new one to that list (the “list” is just a regexp).


#3

Some example sentences with wrong segmentation in LT:

  • This is Thm. 1.
  • This is Lem. 1.
  • This is Prop. 1.
  • This is Def. 1.
  • This is Thm. (1).
  • This is Eq. (1).
  • It can be seen loc. cit.
  • It can be seen ibid. some text.
  • It can be seen idem. some text.

(Mike Unwalla) #4

@dnaber, thanks. I will try later today.

@BebraiPuola, thanks. I will see what I can do. No guarantees; as you can see from my question, I am not an expert.


(Mike Unwalla) #5

@BebraiPuola, I corrected some of the errors (https://github.com/languagetool-org/languagetool/commit/e76a31e7f80683a715b1d2fa573d01da67349bcb).

@dnaber,

OK, I did this. But, how does this work if there is no example text against which to test the new segmentation rule?

(I used mvn package -DskipTests to make a GUI. Then, I pasted the examples into the GUI to verify that LT did not split the sentences.)

I didn’t find one, so I made this new rule:

<rule break="no">
<beforebreak>\p{Lu}\p{L}+\sv\.\s\p{Lu}\p{L}+</beforebreak>
<afterbreak></afterbreak>
</rule>

The rule does not give the expected result. As best I can tell, the last rule for English splits the text on the full stop. The wiki tells me “No-break rules should precede the break rules in the file”. So, is it possible to prevent sentence splitting on text such as “Jones v. Smith”? If yes, how?
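
To illustrate why rule order matters, here is a minimal Python sketch of a cascading segmenter in which the first rule that matches at a position decides whether to break there. It is not LanguageTool's implementation. The rule list, the generic break rule, and the helper names (matches_at, split) are assumptions made for the example, and [A-Z][a-z] stands in for \p{Lu}\p{L} because Python's re module does not support those classes.

```python
import re

# Each rule: (is_break, beforebreak regex, afterbreak regex).
# Both rules are illustrative and are not copied from segment.srx.
RULES = [
    (False, r"\b[A-Z][a-z]+\sv\.\s", r"[A-Z][a-z]+"),  # no-break: "Jones v. Smith"
    (True, r"[.?!]+\s", r"[A-Z]"),                      # break: generic sentence end
]

def matches_at(text, pos, before, after):
    """True if `before` matches text ending at pos and `after` matches text starting at pos."""
    return (re.search(f"(?:{before})\\Z", text[:pos]) is not None
            and re.match(after, text[pos:]) is not None)

def split(text, rules):
    """Split text at every position whose first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if matches_at(text, pos, before, after):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    segments.append(text[start:])
    return segments

text = "It's a case of Jones v. Smith. The court agreed."
print(split(text, RULES))                  # no-break rule listed first
print(split(text, list(reversed(RULES))))  # break rule listed first
```

With the no-break rule first, the text is split only before “The”. With the order reversed, the generic break rule wins at “v. ” and the name is split.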


(Daniel Naber) #6

I meant entries like this, which could - I think - just be extended:

<beforebreak>\b(pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl?|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|dia|lbs|\d+-(:?oz|kc|in|h[rp]|ml)|M?sec)\.\s</beforebreak>
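
As a rough illustration of extending such a list, the Python sketch below builds a beforebreak-style pattern from a shortened, made-up subset of abbreviations and checks whether the text before a candidate break ends with one of them. The subset and the helper names (build_beforebreak, ends_with_abbreviation) are assumptions for the example only. They are not the actual entry in segment.srx.

```python
import re

# Shortened, illustrative subset of an abbreviation alternation (not the real rule).
ABBREVIATIONS = ["pp", "[Ff]ig", "vs", "[Aa]pprox"]

def build_beforebreak(abbreviations):
    """Join the abbreviations into one pattern: word boundary, abbreviation, full stop, space."""
    return r"\b(" + "|".join(abbreviations) + r")\.\s"

def ends_with_abbreviation(text_before, abbreviations):
    """True if the text before the candidate break ends with a listed abbreviation."""
    pattern = build_beforebreak(abbreviations) + r"\Z"
    return re.search(pattern, text_before) is not None

before = "It's a case of Jones v. "
print(ends_with_abbreviation(before, ABBREVIATIONS))          # list without "v"
print(ends_with_abbreviation(before, ABBREVIATIONS + ["v"]))  # list extended with "v"
```

In this sketch, adding one alternative to the list is enough to make the pattern match the text before “Smith”.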

(Tiago F. Santos) #7

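A “no break” rule is tested at the position where the break would otherwise occur: ‘beforebreak’ must match the text immediately before that position, and ‘afterbreak’ must match the text immediately after it. In your rule, the part that follows ‘v. ’ therefore belongs in ‘afterbreak’, not in ‘beforebreak’. To make it work as you expect, you would have to transform your rule into this: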

<rule break="no">
<beforebreak>\p{Lu}\p{L}+\sv\.\s</beforebreak>
<afterbreak>\p{Lu}\p{L}+</afterbreak>
</rule>
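
The difference between the two versions of the rule can be sketched in Python. This is only an approximation of how a rule is checked at a candidate position, not LanguageTool's code. The function name no_break_applies is made up for the example, and [A-Z][a-z] replaces \p{Lu}\p{L} because Python's re module does not support those classes.

```python
import re

def no_break_applies(text, pos, beforebreak, afterbreak):
    """True if beforebreak matches text ending at pos and afterbreak matches text starting at pos."""
    before_ok = re.search(f"(?:{beforebreak})\\Z", text[:pos]) is not None
    after_ok = re.match(afterbreak, text[pos:]) is not None
    return before_ok and after_ok

text = "It's a case of Jones v. Smith."
pos = text.index("Smith")  # candidate break position, just after "v. "

# Whole pattern in beforebreak, empty afterbreak (the first attempt).
original = (r"[A-Z][a-z]+\sv\.\s[A-Z][a-z]+", r"")
# Pattern divided at the candidate break position (the revised rule).
revised = (r"[A-Z][a-z]+\sv\.\s", r"[A-Z][a-z]+")

print(no_break_applies(text, pos, *original))  # False: the text before pos ends with "v. "
print(no_break_applies(text, pos, *revised))   # True: "Jones v. " before, "Smith" after
```

In this sketch, the first attempt can match only at the position after “Smith”, where no break was going to happen anyway.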

(Mike Unwalla) #8

@tiagosantos, thanks. Done (https://github.com/languagetool-org/languagetool/commit/7d2b50d95f0acd5ac62b91569cf5df94184de893).