
WikiCheck gets confused by wiki and HTML markup


(AxelBoldt) #1

If you try a long article in WikiCheck, most reported errors are due to HTML entities like &nbsp; or templates like {{convert|10|mi}}. See for instance https://tools.wmflabs.org/languagetool/pageCheck/index?url=Paris&lang=en

I would suggest that WikiCheck use some sort of screen scraping technique to get rid of all markup before feeding the text to LanguageTool.


(Daniel Naber) #2

The "&nbsp;" entities are actually part of the page source, so we may not remove them; the same goes for the templates...


(AxelBoldt) #3

Yes, the templates and HTML entities are part of the page source, but they are not what the reader actually sees and therefore not what LanguageTool should check. If you use wget to download the Wikipedia pages, all templates will have been expanded, and LanguageTool could deal with the remaining HTML markup.
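
A minimal sketch of the idea in this post, assuming the rendered HTML of a page has already been downloaded into a string. It uses only Python's standard `html.parser` module; the class name `TextExtractor` and the function `visible_text` are made up for illustration and are not part of WikiCheck or LanguageTool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text a reader would see, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the reader-visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Turn non-breaking spaces into ordinary spaces for the checker.
    return "".join(parser.parts).replace("\u00a0", " ")


if __name__ == "__main__":
    print(visible_text("<p>Paris is 10&nbsp;mi away.</p>"))
```

Note that the output of this approach carries no link back to the wiki source, so it is only suitable for read-only checking.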


(Daniel Naber) #4

We can't just work from the rendered HTML: we need to make changes to the page, so nothing may be lost compared to the original markup. If you'd like to help, the proper solution is probably to use Parsoid: http://www.mediawiki.org/wiki/Parsoid
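
To illustrate the constraint in this post, here is a rough sketch of stripping markup while remembering where each character of the cleaned text came from, so that a position reported on the clean text can be mapped back to the wiki source. This is not Parsoid and not how WikiCheck works; the regular expression is a deliberately simplified assumption (non-nested `{{...}}` templates and HTML entities only), and the names `strip_with_spans` and `to_source_span` are hypothetical.

```python
import html
import re

# Simplified assumption: non-nested templates and named/numeric HTML entities.
MARKUP = re.compile(r"\{\{[^{}]*\}\}|&#?\w+;")


def strip_with_spans(source):
    """Return (plain, spans); spans[i] is the (start, end) range in
    `source` that produced the character plain[i]."""
    plain = []
    spans = []
    pos = 0
    for match in MARKUP.finditer(source):
        # Copy ordinary text unchanged, one source character per plain character.
        for i in range(pos, match.start()):
            plain.append(source[i])
            spans.append((i, i + 1))
        if match.group().startswith("&"):
            # Decode the entity; each decoded character maps to the whole entity.
            for ch in html.unescape(match.group()):
                plain.append(ch)
                spans.append((match.start(), match.end()))
        # Templates are simply dropped from the text to be checked.
        pos = match.end()
    for i in range(pos, len(source)):
        plain.append(source[i])
        spans.append((i, i + 1))
    return "".join(plain), spans


def to_source_span(spans, start, end):
    """Map a non-empty [start, end) range in the plain text to the source."""
    return spans[start][0], spans[end - 1][1]


if __name__ == "__main__":
    source = "It is 10&nbsp;mi from teh centre."
    plain, spans = strip_with_spans(source)
    start = plain.index("teh")
    s, e = to_source_span(spans, start, start + len("teh"))
    # Apply a correction to the original source, leaving the entity intact.
    print(source[:s] + "the" + source[e:])
```

Dropping templates outright changes the sentence the checker sees, and nested templates are not handled at all, which is part of why a real parser is the more robust route.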