Hey there, I work as a volunteer for Mozilla's Common Voice project. The project needs enormous numbers of sentences that people can record to create a dataset for speech recognition. I want to import several sentence corpora into the project, each containing hundreds of thousands, sometimes millions, of sentences. Manual review is not possible, so I thought it might be a good idea to write a script using LanguageTool that checks every line of a file and deletes any line that contains a (red) error.
Would this be possible with the LanguageTool API? I basically need two things:
- Capacity for mass checks of hundreds of thousands of sentences in an acceptable time (maybe one hour or so)
- A way to know only whether a line contains an error; where the error is or what kind it is does not matter.
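To make the second point concrete, here is a rough sketch of the kind of script I mean. It assumes a self-hosted LanguageTool server reachable at `http://localhost:8081/v2/check` (the URL and port are my assumption, not something I have set up yet) and treats any reported match as a reason to drop the line. The function names `count_matches` and `keep_clean_lines` are made up for illustration, and I have not measured how fast this would run.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator

# Assumption: a self-hosted LanguageTool server listening on this address.
LT_URL = "http://localhost:8081/v2/check"


def count_matches(sentence: str, language: str = "en-US", url: str = LT_URL) -> int:
    """Return how many rule matches LanguageTool reports for one sentence."""
    # The /v2/check endpoint takes form-encoded "text" and "language" fields.
    data = urllib.parse.urlencode({"text": sentence, "language": language}).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=30) as response:
        result = json.load(response)
    # The JSON response carries a "matches" list; only its length matters here.
    return len(result.get("matches", []))


def keep_clean_lines(
    lines: Iterable[str],
    count_fn: Callable[[str], int] = count_matches,
) -> Iterator[str]:
    """Yield only the non-empty lines for which count_fn reports zero matches."""
    for line in lines:
        sentence = line.strip()
        if sentence and count_fn(sentence) == 0:
            yield sentence
```

Running it over a file would then just be a loop that writes the output of `keep_clean_lines(open("sentences.txt", encoding="utf-8"))` to a new file (the file name is a placeholder). Sending one request per sentence is probably the slow part, so batching or parallel requests might be needed to get anywhere near the one-hour goal.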
I am just starting to understand the API, so maybe I can answer this myself in a few days, but I would like to hear your thoughts about this.