I’ve noticed that when converting an HTML document to raw text in order to send it to the API, we lose important meta information, for instance whether a piece of text was a headline, a paragraph, or a bullet point. This formatting information matters because, without it, certain grammar rules fire when they shouldn’t.
For instance: if a line starts with a bullet point (•), the checker should treat it as the start of a sentence and not complain about the capital letter at the beginning of the line.
Or if a headline (e.g. “Das sind wir”) is followed by text that also reads “Das sind wir”, the conversion produces “Das sind wir\nDas sind wir”, which is then detected as a word repetition.
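One workaround I could imagine is to keep each block element (headline, paragraph, list item) as its own unit during the conversion and join the units with a blank line instead of a single newline. This is only a rough sketch: I'm assuming the checker treats a blank line as a paragraph break, which I haven't verified, and the names `BLOCK_TAGS`, `BlockTextExtractor`, `html_to_blocks` and `blocks_to_text` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical set of block-level tags; extend as needed.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "div"}


class BlockTextExtractor(HTMLParser):
    """Collects the text of each block-level element as a separate entry."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (tag, text) pairs, in document order
        self._tag = None   # block tag currently open, if any
        self._parts = []   # text fragments collected for the open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text outside any known block tag is ignored in this sketch.
        if self._tag is not None:
            self._parts.append(data)

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has text.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((self._tag, text))
        self._tag = None
        self._parts = []


def html_to_blocks(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a block that was never closed
    return parser.blocks


def blocks_to_text(blocks):
    # Assumption: a blank line between blocks marks a paragraph boundary.
    return "\n\n".join(text for _tag, text in blocks)
```

Because the tag is kept next to each text, the caller could also decide per block what to do, for example skip or separately check headlines.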
Any thoughts on this matter?
Here is an example: