
[En] Extracting Wikipedia Snapshot into a readable text format


I’m pretty sure that our Wikipedia snapshot contains a large corpus. Can I extract the entire database into a readable text format? If yes, then please let me know.

(Andriy) #2

Yes, you can fetch the wikidump (e.g. at …) and then parse the XML.
There are various Wikipedia XML parsing libraries for different programming languages.
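
As a rough illustration of the "parse the XML" step, here is a minimal sketch in Python using only the standard library. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>` element); the function names `iter_pages` and `dump_to_text` and any file paths you pass in are made up for this example, not part of any library.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree reports tags as "{namespace}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a filename or a binary file object. If a page has several
    revisions, the text of the last one read is returned.
    """
    title, text = None, None
    # iterparse streams the file, so the whole dump is never loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


def dump_to_text(dump_path, out_path):
    # Hypothetical helper: writes every page to one UTF-8 text file.
    opener = bz2.open if dump_path.endswith(".bz2") else open
    with opener(dump_path, "rb") as src, open(out_path, "w", encoding="utf-8") as out:
        for title, text in iter_pages(src):
            out.write(f"== {title} ==\n{text}\n\n")
```

Note that what comes out is still wiki markup (links, templates and so on), so getting clean plain text needs a further markup-stripping step, which is where one of the dedicated Wikipedia parsing libraries would likely help.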