WikiCheck gets confused by wiki and HTML markup

AxelBoldt · July 28, 2014, 10:45am

If you try a long article in WikiCheck, most reported errors are due to HTML entities like &nbsp; or templates like {{convert|10|mi}}. See for instance https://tools.wmflabs.org/languagetool/pageCheck/index?url=Paris&lang=en

I would suggest that WikiCheck use some sort of screen scraping technique to get rid of all markup before feeding the text to LanguageTool.

dnaber · July 28, 2014, 4:40pm

The “&nbsp;” entities are actually part of the page source, so we must not remove them, and the same goes for the templates…

AxelBoldt · July 28, 2014, 6:29pm

Yes, the templates and HTML entities are part of the page source, but they are not what the reader actually sees and therefore not what LanguageTool should check. If you use wget to download the Wikipedia pages, all templates will have been expanded, and LanguageTool could deal with the remaining HTML markup.
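
A minimal sketch of that idea, assuming the rendered HTML has already been downloaded (for example with wget) and using only the Python standard library. The names `TextExtractor` and `html_to_text` are made up for this illustration and are not part of WikiCheck or LanguageTool:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from rendered HTML, skipping script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text a reader would see, with markup and entities removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Treat the non-breaking space as an ordinary space for checking purposes.
    return "".join(parser.parts).replace("\xa0", " ")


sample = "<p>10&nbsp;mi <b>north</b> of Paris</p><style>p{}</style>"
print(html_to_text(sample))
```

The resulting plain text is what would be handed to the checker. Note that this throws away the link between the text and the original wiki markup.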

dnaber · July 28, 2014, 6:36pm

We can’t just deal with the HTML: we need to write changes back to the page, so we must not lose anything compared to the original markup. If you’d like to help, the proper solution is probably to use Parsoid; see the Parsoid page on MediaWiki.
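
To illustrate why the original markup matters, here is a rough sketch of a much simpler technique than Parsoid: masking markup with spaces of equal length, so that every character offset in the checked text still points at the same place in the page source. This is not how WikiCheck works and not a substitute for Parsoid. The regular expression is an assumption that only covers non-nested templates and simple entities, and `mask_markup` is a hypothetical helper name:

```python
import re

# Assumption: only non-nested {{...}} templates and entities such as &nbsp; or &#160;
MARKUP = re.compile(r"\{\{[^{}]*\}\}|&#?\w+;")


def mask_markup(wikitext):
    """Replace markup with spaces of equal length so character offsets are unchanged."""
    return MARKUP.sub(lambda m: " " * len(m.group()), wikitext)


source = "It is {{convert|10|mi}} away&nbsp;from teh centre."
masked = mask_markup(source)

# An error position found in the masked text is also valid in the source,
# so a correction can be applied without losing any markup.
start = masked.index("teh")
corrected = source[:start] + "the" + source[start + 3:]
print(corrected)
```

The masked runs of spaces would themselves trigger whitespace warnings, and nested templates are not handled, which is part of why a real parser is the better route.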