
[En] Extracting Wikipedia Snapshot into a readable text format


#1

I'm pretty sure that our Wikipedia snapshot contains a large corpus. Can I extract the entire database into a readable text format? If so, please let me know how.


(Andriy) #2

Yes, you can download the Wikipedia dump (e.g. from http://dumps.wikimedia.org/enwiki/latest/) and then parse the XML.
There are various Wikipedia XML parsing libraries for different programming languages.
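
As a rough illustration of the parsing step, here is a minimal Python sketch that uses only the standard library. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>` element) and that it is bzip2-compressed. The function names `iter_pages` and `dump_to_text` are made up for this example, and the output is still raw wikitext, so stripping the wiki markup would need a further step.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    The stream is parsed incrementally, so the whole dump never has to
    fit in memory at once.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""  # empty <text/> has no content
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # drop pages already processed to free memory


def dump_to_text(dump_path, out_path):
    """Write every page of a .xml.bz2 dump to one UTF-8 text file.

    Both paths are supplied by the caller; the heading format written
    here is an arbitrary choice for this sketch.
    """
    with bz2.open(dump_path, "rb") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for title, wikitext in iter_pages(src):
            dst.write(f"== {title} ==\n{wikitext}\n\n")
```

Calling something like `dump_to_text("enwiki-latest-pages-articles.xml.bz2", "enwiki.txt")` (file names assumed here, check the dump directory for the actual ones) would then produce one large text file. The full English dump is very large, so expect this to run for a long time.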