False positive with PUNCTUATION_PARAGRAPH_END in text mode

When using LanguageTool-4.3-SNAPSHOT on the command line (actually via the Vim plugin, which calls LT on the command line), I see many false positives from the rule PUNCTUATION_PARAGRAPH_END, which treats every end of line as a paragraph separator. See this screenshot:

My understanding is that LT on the command line should not treat a single end of line as a paragraph separator; only empty lines should separate paragraphs. A single end of line should be treated as a paragraph separator only when the -b flag is given. See the usage of LT on the command line:

Usage: java -jar languagetool-commandline.jar [OPTION]... FILE
...snip...
 -b       assume that a single line break marks the end of a paragraph
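
To make the expected behaviour concrete, here is a minimal, self-contained sketch of the two paragraph modes described above. It is not LanguageTool code: the class name, the method name, and the boolean parameter are made up for illustration, and the real splitting logic in LT may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from LanguageTool.
final class ParagraphSplitSketch {

    // Splits text into paragraphs.
    // false: only blank lines separate paragraphs; single line breaks are joined with a space.
    // true:  every line break ends a paragraph (the behaviour expected with -b).
    static List<String> paragraphs(String text, boolean singleLineBreakMarksParagraph) {
        // Normalize Windows and old Mac line endings to \n first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        String separator = singleLineBreakMarksParagraph ? "\n+" : "\n[ \t]*\n+";
        List<String> result = new ArrayList<>();
        for (String chunk : normalized.split(separator)) {
            String joined = chunk.replace('\n', ' ').trim();
            if (!joined.isEmpty()) {
                result.add(joined);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "A sentence. This is\nan example.\n";
        System.out.println(paragraphs(text, false)); // [A sentence. This is an example.]
        System.out.println(paragraphs(text, true));  // [A sentence. This is, an example.]
    }
}
```

With example.txt from this report, the first mode yields one paragraph that already ends in a full stop, so no paragraph-end warning would be expected; only the second mode produces a paragraph ending in "is".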

Could you post an example directly using languagetool-commandline.jar? I cannot reproduce this problem yet.

Sure. Here is an example:

$ cat example.txt 
A sentence. This is
an example.

$ java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-4.3-SNAPSHOT/LanguageTool-4.3-SNAPSHOT/languagetool-commandline.jar -l en example.txt
1.) Line 1, column 17, Rule ID: PUNCTUATION_PARAGRAPH_END
Message: Please add a punctuation mark at the end of paragraph
Suggestion: is.; is!; is?; is:; is,; is;
A sentence. This is an example. 
                 ^^             
Time: 1223ms for 2 sentences (1.6 sentences/sec)

$ java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-4.3-SNAPSHOT/LanguageTool-4.3-SNAPSHOT/languagetool-commandline.jar --version
LanguageTool version 4.3-SNAPSHOT (2018-08-23 06:08)

Indeed the rule seems to come with its own paragraph logic, while at least the existing sentence splitting should be used. @Fred.Kruse can you maybe have a look?

Indeed, the rule implements its own paragraph logic (\n, \n\r, \r\n). It was tested with the standalone version and the office extension. I didn’t find a more general paragraph logic. @dnaber: Is there any?

AnalyzedTokenReadings has an isParagraphEnd() method, can you maybe use that?

I remember I used this method before but discarded it. I just ran some tests, and it doesn’t work. In the standalone version, only the last token is tagged as end of paragraph. In the LO extension, no paragraph is detected at all, and the same goes for the command-line tool.
I agree that using this method would be the best solution, but then the functionality has to be available. I’m not familiar with how tagging works inside LT and have very little time to work on the problem in the next few weeks. Is there anyone who could make the isParagraphEnd() method work? After that I will use it in the rules concerned.

Update: I’m working on this now.

This should be fixed now, please give it a try. I have only tested it in the context of a unit test.

Hi Daniel, I tested the change. I’m sorry, but it does not solve the problem. Your change tests for the end of a sentence, but I think a paragraph break (\n etc.) is not detected as the end of a sentence.
This is why some rules that deal with paragraph ends seem so complicated. I had to test every token to see whether it is a paragraph break. A paragraph break appears inside a sentence only if there is no ending punctuation mark.

Could you give a concrete example text where it doesn’t work as needed?

First I made a very simple change to the PunctuationMarkAtParagraphEnd rule. Change the function isParaBreak to:
return token.isParagraphEnd();
// return "\n".equals(token.getToken()) || "\r\n".equals(token.getToken()) || "\n\r".equals(token.getToken());
and compile.
Then I typed a few short sentences into, for example, the standalone application (this test can also be done with the LO extension):

Dies ist ein Test. Mal sehen, ob er funktioniert
Hier steht noch ein Satz. Und hier steht ein zweiter

‘funktioniert’ should be underlined because it stands at the end of a paragraph without a punctuation mark, but it is not. ‘zweiter’ is underlined because it is at the end of the text, which is also the end of the last sentence, so the end of the paragraph is found.
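
To spell out the expected result for this test text, here is a rough standalone sketch. It assumes that every line is its own paragraph (as in the LO extension) and takes the set of closing marks from the rule's suggestions (. ! ? : , ;). The class and method names are hypothetical, and the real rule works on analyzed tokens, not on raw strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification; the real PunctuationMarkAtParagraphEnd rule is token-based.
final class MissingEndPunctuationSketch {

    // Returns the last word of every non-empty line that does not end
    // with one of the punctuation marks the rule suggests.
    static List<String> unterminatedLastWords(String text) {
        List<String> flagged = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches any line break sequence
            String paragraph = line.trim();
            if (paragraph.isEmpty()) {
                continue;
            }
            char last = paragraph.charAt(paragraph.length() - 1);
            if (".!?:,;".indexOf(last) < 0) {
                flagged.add(paragraph.substring(paragraph.lastIndexOf(' ') + 1));
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String text = "Dies ist ein Test. Mal sehen, ob er funktioniert\n"
                + "Hier steht noch ein Satz. Und hier steht ein zweiter";
        System.out.println(unterminatedLastWords(text)); // [funktioniert, zweiter]
    }
}
```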

The issue was that the paragraph marker was on the whitespace, not on the last (non-whitespace) token. I think that’s fixed now. I have also adapted isParaBreak().
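
As an illustration of that fix, not the actual patch: the sketch below walks a plain list of token strings and reports the index of the last non-whitespace token before each line break, so a trailing space no longer carries the marker. The class and method names are invented, and it assumes a single line break ends a paragraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; LanguageTool works on AnalyzedTokenReadings, not plain strings.
final class ParagraphEndMarkerSketch {

    // Returns the index of the last non-whitespace token of each paragraph.
    static List<Integer> paragraphEndIndices(List<String> tokens) {
        List<Integer> ends = new ArrayList<>();
        int lastNonWhitespace = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            boolean isBreak = "\n".equals(token) || "\r\n".equals(token) || "\n\r".equals(token);
            if (isBreak) {
                // Mark the last real token, not the whitespace in front of the break.
                if (lastNonWhitespace >= 0) {
                    ends.add(lastNonWhitespace);
                    lastNonWhitespace = -1;
                }
            } else if (!token.trim().isEmpty()) {
                lastNonWhitespace = i;
            }
        }
        if (lastNonWhitespace >= 0) {
            ends.add(lastNonWhitespace); // end of text also ends a paragraph
        }
        return ends;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("funktioniert", " ", "\n", "Hier", " ", "zweiter");
        System.out.println(paragraphEndIndices(tokens)); // [0, 5]
    }
}
```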

As far as I can see, this does not fix the whole problem. When a sentence at the end of a paragraph does not end with a punctuation mark, the end of the sentence is not detected, and the sentence continues on the next line. For singleLineBreaksMarksPara == false this is right. For singleLineBreaksMarksPara == true, an end of sentence should be detected, I think.
isParagraphEnd() tests for the end of a sentence. A paragraph break in the middle of a sentence isn’t found.
An alternative could be to test for singleLineBreaksMarksPara and take it into account in functions like isParaBreak().

Please ignore my last comment!
I added setSingleLineBreaksMarksParagraph(true) to the LO extension, and now the isParagraphEnd() method works. The problem was that the sentence tokenizer didn’t recognize the paragraph breaks.
I think the same should be done for the standalone application. Do you agree?

Should not \n\n be in this check as well, for Linux text files? And maybe \r\r for Mac (not sure about that…)

These changes seem to have an undesired effect in rules that have an empty token with an exception in the last token like this. When the error is at the end of the sentence and the sentence has no ending punctuation, the rule doesn’t work.

Can this be fixed? If not, the exception should be rewritten as an antipattern (perhaps not always possible).

As I don’t have time now to deal with the exceptions, I will now roll back my change but make sure that the rule still considers the line break setting. Please give it a try once it’s pushed.

As far as I can see, the rule works correctly with the line break settings.
But this leads me back to the question I asked before. Inside the standalone application, the rule works only for a paragraph break followed by an empty line.
I think the standalone application should also be set to singleLineBreak, because if you copy text out of any office product (LibreOffice, Microsoft, …) into the standalone version, a paragraph break is a single line break.
I could add setSingleLineBreaksMarksParagraph(true) to the application, but would like to hear your opinion first.

Sounds sensible, but I don’t really have an opinion on that. What do others think?