A way to indicate headlines/paragraphs/bullet points to the API

roundrobin · March 10, 2018, 2:46pm

I’ve noticed that when converting an HTML document to raw text in order to send it to the API, we lose important meta information, for instance whether a piece of text was a headline, a paragraph, or a bullet point. This formatting information matters, because without it certain grammar rules fire when they should not.

For instance: if a line starts with a bullet point (•), the checker should treat it as the start of a sentence and not complain about the capital letter at the beginning of the line.

Or if a text was a headline (e.g. “Das sind wir”) and was followed by a text that also reads “Das sind wir”, the conversion produces “Das sind wir\nDas sind wir”, which would be detected as a repetition of words.

Any thoughts on this matter?

Here is an example:

dnaber · March 10, 2018, 3:05pm

This should be easy to solve: just add two line breaks (\n\n) after each item when converting the text.
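As a rough illustration of that suggestion, here is a minimal sketch in Python that pulls the text out of block-level elements and joins the blocks with \n\n. The names `BlockTextExtractor`, `html_to_text` and the `BLOCK_TAGS` set are made up for this example, and the list of tags is an assumption; a real page may need more of them.

```python
from html.parser import HTMLParser

# Assumption: these are the elements we treat as separate "items".
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}


class BlockTextExtractor(HTMLParser):
    """Collects the text of each block-level element as a separate item."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished block texts, in document order
        self._current = None  # text pieces of the open block, or None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.blocks.append(text)
            self._current = None


def html_to_text(html):
    """Return the block texts joined by two line breaks."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

With this, a headline followed by an identical paragraph ends up as two separate blocks with a blank line between them, instead of two directly adjacent lines.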

roundrobin · March 10, 2018, 3:13pm

Hm, but how should I convert it back if I want to highlight the matches in my visual frontend?

I need a rule for converting it back, in a stateless way.
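One possible way to do this without storing anything is sketched below. It assumes the text was built by joining the blocks with a fixed separator, and that the checker reports each match as a character offset into that joined text which counts the same way as a Python string index. Because the conversion is deterministic, the block boundaries can be recomputed from the same list of blocks whenever a response comes in. `SEPARATOR`, `join_blocks` and `locate` are hypothetical names for this example.

```python
SEPARATOR = "\n\n"


def join_blocks(blocks):
    """Build the text that is sent to the checker."""
    return SEPARATOR.join(blocks)


def locate(blocks, offset):
    """Map an offset in the joined text to (block_index, offset_in_block).

    Returns None if the offset falls inside a separator or past the end.
    """
    start = 0
    for index, block in enumerate(blocks):
        end = start + len(block)
        if start <= offset < end:
            return index, offset - start
        # The next block starts after this block and the separator.
        start = end + len(SEPARATOR)
    return None
```

For example, with `blocks = ["Das sind wir", "Das sind wir"]`, an offset of 14 in the joined text maps back to the first character of the second block. The frontend can then highlight inside the element that produced that block.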