[En] Extracting Wikipedia Snapshot into a readable text format

RuleFreak · February 1, 2017, 4:09pm

I’m pretty sure that our Wikipedia snapshot contains a large corpus. Is it possible to extract the entire database into a readable text format? If so, please let me know.

arysin · February 1, 2017, 4:43pm

Yes, you can download the Wikipedia dump (e.g. from http://dumps.wikimedia.org/enwiki/latest/) and then parse the XML.
There are Wikipedia XML parsing libraries available for various programming languages.
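
For illustration, here is a minimal sketch of the parsing step using only Python's standard library. It assumes the dump is a bz2-compressed MediaWiki XML export in which each `<page>` holds a `<title>` and a `<revision>` with a `<text>` element. The function names (`iter_pages`, `dump_to_text`) and the output layout are made up for this example. Note that it writes raw wikitext; stripping the wiki markup itself would need one of the dedicated parsing libraries.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumes each <page> has a <title> and a <revision>/<text> element.
    """
    title, text = None, None
    # iterparse streams the file, so the dump is never loaded whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children to limit memory use


def dump_to_text(dump_path, out_path):
    """Write every page of a .xml.bz2 dump to one UTF-8 text file."""
    with bz2.open(dump_path, "rb") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for title, text in iter_pages(src):
            # Hypothetical layout: a title line followed by the wikitext.
            dst.write(f"== {title} ==\n{text}\n\n")
```

Calling `dump_to_text("dump.xml.bz2", "wiki.txt")` would then produce one large text file, where `dump.xml.bz2` stands for whichever articles file you downloaded from the dumps directory.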