
Addition to segment.srx needed for Dutch, or more languages


(Ruud Baars) #1

Look at this:

echo "Dr. A. Janssen" | ./segment/bin/segment -l nl -s ./segment.srx 
Dr. A. 
Janssen

The output shows that the name is split between the initial and the last name.
This break should be suppressed, not only after single uppercase initials but also after Th. and Ph. (Theodore and Philip). The second rule below does that:

<rule break="no">
<beforebreak>\b(prov|pseud|red|ref|resp|soc|st|tab|tel|tk)\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>
<rule break="no">
<beforebreak>\b([A-Z]|Th|Ph)\.\s</beforebreak>
<afterbreak>[A-Z]</afterbreak>
</rule>
<rule break="no">
<beforebreak>\b(uitsl|ver|vgl|vnl|vnw|voorz|ww|zat|zg)\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>
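To check the effect of the middle rule without running the segmenter, here is a minimal Python sketch. It is not how LanguageTool or the segment tool works internally: the `segment` function, the catch-all break pattern, and the `Dr.` rule (a stand-in for whatever existing rule already keeps "Dr. A." together in the output above) are assumptions made for illustration. Only the initials rule is copied from the SRX above.

```python
import re

# Stand-in for an existing rule in segment.srx that already protects "Dr."
# (assumption: the real file handles this in some other rule).
EXISTING_RULES = [
    (r"\bDr\.\s", r""),
]

# The proposed no-break rule, copied from the SRX snippet above.
INITIALS_RULE = (r"\b([A-Z]|Th|Ph)\.\s", r"[A-Z]")

# Hypothetical catch-all break: ., ! or ? followed by whitespace.
BREAK_CANDIDATE = r"[.!?]\s"


def segment(text, no_break_rules):
    """Split at every candidate break that no no-break rule blocks."""
    segments = []
    start = 0
    for m in re.finditer(BREAK_CANDIDATE, text):
        pos = m.end()
        before, after = text[:pos], text[pos:]
        # A rule blocks the break when its beforebreak pattern matches the
        # text ending at pos and its afterbreak pattern matches at pos.
        blocked = any(
            re.search("(?:" + b + r")\Z", before) and re.match(a, after)
            for b, a in no_break_rules
        )
        if not blocked and pos < len(text):
            segments.append(text[start:pos])
            start = pos
    if start < len(text):
        segments.append(text[start:])
    return segments


without_rule = segment("Dr. A. Janssen", EXISTING_RULES)
with_rule = segment("Dr. A. Janssen", EXISTING_RULES + [INITIALS_RULE])
print(without_rule)  # split after the initial, as in the output above
print(with_rule)     # name kept in one segment
```

One trade-off visible in this sketch: because the rule matches any single uppercase letter followed by a capitalized word, a sentence that really ends in a single capital (for example "Zie bijlage A. Daarna ...") would no longer be split either.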

Another addition is needed for St., the abbreviation for Saint. It could be combined with the rule above or added as a separate rule, whichever is more useful. Not all languages use St. as the abbreviation for Saint (e.g. Spanish, Italian and Portuguese do not, I think).

I tried to implement this for Dutch, but it might apply more generally: https://github.com/languagetool-org/languagetool/commit/2979ad40c6a4d74b4721a1bb1f1b51941dded550