How to mark "non-words"

koalillo · January 30, 2023, 5:07pm

I am writing a small tool that verifies HTML files (mainly so that I can convert AsciiDoc files to HTML and verify those).

Consider the following piece of text.

A combination of the foo, bar, and baz items.

Foo, bar, and baz are enclosed in <pre>. I am already using annotations because the bits in <pre> may not be correct English. However, that makes LanguageTool see:

A combination of the, and items.

So I get:

Articles like ‘the’ are rarely followed by punctuation. A word may be missing after ‘the’, or the punctuation mark may not be necessary.

and various spacing errors. Can I tell the API that something is a non-word (so it shouldn’t trigger spellchecking) but still have it count grammatically? (E.g., I think LanguageTool should “think” there’s an adjective there.)

koalillo · January 30, 2023, 5:15pm

(Ugh, I see I had already asked this here: Draft AsciiDoc integration - #2 by koalillo, but the initial post in that thread was flagged because… it was spam? I wanted to post a link to a project I’m starting that I think can be useful, but now I cannot link to GitHub?)

dnaber · January 30, 2023, 5:29pm

Does sending the text as JSON via the data parameter, i.e. JSON with text and markup entries, help? It’s documented at LanguageTool HTTP API
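For reference, a minimal sketch of how such a request body might be built in Python. The endpoint URL (the public server) and the helper name build_check_request are assumptions for illustration; the data parameter carrying a JSON object with an annotation list of text/markup entries is what the HTTP API documentation describes.

```python
import json
from urllib.parse import urlencode

# Assumption: the public LanguageTool server; a self-hosted instance would use another URL.
API_URL = "https://api.languagetool.org/v2/check"


def build_check_request(annotation, language="en-US"):
    """Return the form-encoded POST body for /v2/check (hypothetical helper)."""
    # The annotated text goes into the `data` parameter as a JSON string.
    data = json.dumps({"annotation": annotation})
    return urlencode({"language": language, "data": data})


annotation = [
    {"text": "A combination of the "},
    {"markup": "<pre>foo</pre>"},  # markup is skipped by the checker
    {"text": " items."},
]
body = build_check_request(annotation)

# To actually send it (needs network access):
# import urllib.request
# with urllib.request.urlopen(API_URL, data=body.encode()) as resp:
#     matches = json.load(resp)["matches"]
```

The network call is left commented out so the snippet runs offline; only the body construction is shown.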

dnaber · January 30, 2023, 5:29pm

Not sure why that happened, I have restored the post.

koalillo · January 30, 2023, 5:40pm

So with this sentence:

The foo, bar, and baz tokens.

, this is the annotation I’m sending:

{'text': 'The '},
{'markup': 'foo', 'interpretAs': ' '},
{'text': ', '},
{'markup': 'bar', 'interpretAs': ' '},
{'text': ', and '},
{'markup': 'baz', 'interpretAs': ' '},
{'text': ' tokens.'},

and that results in:

...ootnotes,#footer{padding:0}}        The foo, bar, and baz tokens.     Last updated 2...
                                           ~~~~
Put a space after the comma, but not before the comma.
________________________________________________________________________________
...tes,#footer{padding:0}}        The foo, bar, and baz tokens.     Last updated 2023-0...
                                           ~~~~
Two consecutive commas
________________________________________________________________________________

I think I could solve the problem by sending “green” instead of a space in interpretAs, but that seems like an ugly hack.
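To see why the space placeholder leads to those matches, one can reconstruct the text the checker presumably works on by concatenating every text entry with every interpretAs value. This is only a rough sketch: checked_text and annotate are hypothetical helpers, and it assumes that markup without interpretAs contributes nothing to the checked text.

```python
def checked_text(annotation):
    """Approximate the plain text that gets checked (hypothetical helper)."""
    parts = []
    for item in annotation:
        if "text" in item:
            parts.append(item["text"])
        else:
            # Assumption: markup is replaced by interpretAs, or dropped if absent.
            parts.append(item.get("interpretAs", ""))
    return "".join(parts)


def annotate(placeholder):
    """Build the annotation from the post with a given interpretAs value."""
    return [
        {"text": "The "},
        {"markup": "foo", "interpretAs": placeholder},
        {"text": ", "},
        {"markup": "bar", "interpretAs": placeholder},
        {"text": ", and "},
        {"markup": "baz", "interpretAs": placeholder},
        {"text": " tokens."},
    ]


with_space = checked_text(annotate(" "))
with_word = checked_text(annotate("green"))
print(repr(with_space))  # 'The  ,  , and   tokens.'
print(repr(with_word))   # 'The green, green, and green tokens.'
```

With a space as the placeholder, each comma ends up preceded by whitespace and separated from the next comma only by whitespace, which would plausibly explain both reported matches. With a real word the text reads as an ordinary sentence, though whether other rules (for example word repetition) then fire is not verified here.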