LanguageTool scripting to find lines containing errors

Hey there, I work as a volunteer for Mozilla's Common Voice project. This project needs enormous numbers of sentences that people can record to create a dataset for speech recognition. I want to import various sentence corpora into this project, containing hundreds of thousands, sometimes millions, of sentences. Manual review is not possible, so I thought it might be a good idea to write a script that uses LanguageTool to check every line of a file and delete the line entirely if it contains a (red) error.

Would this be possible with the LanguageTool API? I basically need two things:

  • Capacity for mass checks of hundreds of thousands of sentences in an acceptable time (maybe an hour or so)
  • A way to know whether a line contains an error at all; where it is or what kind of error it is doesn't matter.

I am just starting to understand the API, so maybe I can answer this myself in a few days, but I would like to hear your thoughts about this.

Hi, I think the LT API can do what you need. But the public HTTP API has limitations, so if you need to check thousands of sentences you should install LT locally (HTTP Server - LanguageTool Wiki). Also, make sure to install the n-gram data so that all error detection rules are active (Finding errors using n-gram data - LanguageTool Wiki). Performance depends on the language; 20 ms per sentence might be a good value for estimating the total time.
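For reference, a local server can be started roughly like this (the jar name, port and n-gram path are just examples, adjust them to your setup):

    # Start a local LanguageTool HTTP server; --languageModel is only
    # needed if you want the n-gram rules and have downloaded the data.
    java -cp languagetool-server.jar org.languagetool.server.HTTPServer \
         --port 8081 \
         --languageModel /path/to/ngram-data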


Stefan,

I wrote a little Java application based on LanguageTool for the same purpose: filtering Catalan Wikipedia sentences for the Common Voice project. We take into account errors detected by LanguageTool as well as other conditions, such as sentence length. See: GitHub - Softcatala/filter-wiki-corpus-lt: Extract sentences from Wikipedia using LanguageTool

Unfortunately, the Common Voice team rejected this approach. Because of licensing issues, they required that they run the Wikipedia filtering themselves. We had to use the tools made by Mozilla, and we got lower-quality results. So make sure that your work is going to be accepted before you start working.


That’s great! Thanks for your link.

I know this process and it really isn't ideal. But you can delete sentences after the import; I did that for German once (I deleted sentences containing non-German letters).

I want to use this tool for two things: preparing sentences for the Sentence Collector and analysing big sentence corpora like the Europarl corpus.

So here is a first little version of a bash script. Right now it is a dirty hack that only checks whether LanguageTool reports anything at all about a sentence; I will check more details in the future.
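The following is a minimal sketch of that idea (not necessarily the exact script), assuming a local LanguageTool server on port 8081 and jq for parsing the JSON response; file names and the language code are placeholders:

    #!/usr/bin/env bash
    # Keep only lines for which LanguageTool reports no matches at all.
    # Assumes a local server at http://localhost:8081 and jq installed.
    INPUT="sentences.txt"
    OUTPUT="clean_sentences.txt"
    LANG_CODE="eo"    # Esperanto, used for testing below

    > "$OUTPUT"
    while IFS= read -r line; do
        # We only care whether the "matches" array is empty,
        # not where the error is or what kind it is.
        matches=$(curl -s \
            --data-urlencode "language=$LANG_CODE" \
            --data-urlencode "text=$line" \
            http://localhost:8081/v2/check | jq '.matches | length')
        if [ "$matches" -eq 0 ]; then
            echo "$line" >> "$OUTPUT"
        fi
    done < "$INPUT"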

I've chosen Esperanto for testing because it has fewer rules than German, so there are fewer false positives that are just style comments. But in the future this script will work for any language, and I will ignore some kinds of errors.
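When that time comes, the check endpoint's disabledRules and disabledCategories parameters could be one way to skip certain findings, roughly like this (the category ID is only an example of what might be skipped):

    # Example: ask the server to ignore a whole rule category
    # ("STYLE" is just a placeholder here).
    curl -s \
        --data-urlencode "language=de-DE" \
        --data-urlencode "text=Ein Beispielsatz." \
        --data-urlencode "disabledCategories=STYLE" \
        http://localhost:8081/v2/check | jq '.matches | length'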