Getting most out of a corpus

Ruud_Baars · October 31, 2022, 6:58am

To get the most out of my corpus, I would like to be able to split it into two parts: one that has only sentences with a mistake, and one that has only sentences without a mistake.
The purpose of this is to find more mistakes that are not detected yet.

Trying to get this done using the local web server is too slow. Running 8 concurrent instances of the PHP program that feeds sentences to the server only gets the server up to 5 CPUs, and results in just a few MB processed per day. And there is 20GB to do.

LT’s command line is much faster, but does not output full sentences, nor is it able to split into 2 parts.
Is anyone able to create such a utility?
Maybe it is even possible to store the output of the ones having a mistake into files per rule/subrule, or in just one file, putting rule and subrule in front of the sentence?

dnaber · October 31, 2022, 8:21am

Maybe this isn’t too difficult by changing SentenceSourceChecker, I’ll have a look.

dnaber · October 31, 2022, 8:49am

I have just added a --csv option to SentenceSourceChecker. You can call it using options like this:

--csv --language nl --file input.txt

The input needs to be a file with one sentence per line. The output will be printed to the console, so you’ll need to redirect it to a file, which you can then filter for matching (MATCH) and non-matching (NOMATCH) sentences.

This uses as many cores as your CPU has, so it shouldn’t be slow (but 20 GB is a lot, so it will probably still take ages).
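The filtering step described above could be sketched roughly as follows. This is only a sketch: it assumes each output line is tab-separated and that its first field is either MATCH or NOMATCH, which the thread does not spell out. The function name `split_matches` and the sample rule ID are made up for illustration.

```python
import io


def split_matches(lines, match_out, nomatch_out):
    """Route each output line to one of two streams by its first field.

    Assumption: lines are tab-separated and start with MATCH or NOMATCH.
    Anything else (for example log or error lines) is counted and skipped.
    """
    counts = {"MATCH": 0, "NOMATCH": 0, "OTHER": 0}
    for line in lines:
        marker = line.split("\t", 1)[0].strip()
        if marker == "MATCH":
            match_out.write(line)
            counts["MATCH"] += 1
        elif marker == "NOMATCH":
            nomatch_out.write(line)
            counts["NOMATCH"] += 1
        else:
            counts["OTHER"] += 1
    return counts


# Tiny in-memory demo; the rule ID and sentences are invented.
demo = [
    "MATCH\tSOME_RULE\tEen zin met een fout.\n",
    "NOMATCH\t\tEen zin zonder fout.\n",
]
demo_match, demo_nomatch = io.StringIO(), io.StringIO()
demo_counts = split_matches(demo, demo_match, demo_nomatch)
print(demo_counts)
```

With real data, the same function can be called with three open file objects: the redirected output for reading and two new files for writing. Because it processes one line at a time, it should not need to hold a 20 GB file in memory.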

Ruud_Baars · October 31, 2022, 9:00am

What would the full command line be? I never heard of SentenceSourceChecker…

dnaber · October 31, 2022, 9:13am

I call it directly from IntelliJ. If you want to call it from the command-line, you can build and run it like this (from top LT directory):

mvn install -DskipTests
cd languagetool-dev
mvn clean compile assembly:single
java -cp target/languagetool-dev-6.0-SNAPSHOT-jar-with-dependencies.jar  org.languagetool.dev.dumpcheck.SentenceSourceChecker --csv --language nl --file input.txt

Ruud_Baars · October 31, 2022, 9:23am

Thanks, great. It is running now. CPUs are at 30%. Not as much as I thought. But it is more efficient anyway.

Ruud_Baars · October 31, 2022, 3:29pm

After 18M sentences, there is a dump:

Exception in thread "main" java.lang.RuntimeException: Check failed on sentence: Dat zie je nu weer met de zogenaamde oudelullendagen die in heel andere tijden met gulle hand in cao’s werden uitgereikt aan werknemers van vijftig jaar en ouder.
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:260)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.main(SentenceSourceChecker.java:76)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Could not check sentence (language: Dutch): <sentcontent>Dat zie je nu weer met de zogenaamde oudelullendagen die in heel andere tijden met gulle hand in cao’s werden uitgereikt aan werknemers van vijftig jaar en ouder.</sentcontent>
        at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:217)
        at org.languagetool.JLanguageTool.checkInternal(JLanguageTool.java:987)
        at org.languagetool.JLanguageTool.check(JLanguageTool.java:906)
        at org.languagetool.JLanguageTool.check(JLanguageTool.java:891)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:249)
        ... 1 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Could not check sentence (language: Dutch): <sentcontent>Dat zie je nu weer met de zogenaamde oudelullendagen die in heel andere tijden met gulle hand in cao’s werden uitgereikt aan werknemers van vijftig jaar en ouder.</sentcontent>
        at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
        at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:214)
        ... 5 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Could not check sentence (language: Dutch): <sentcontent>Dat zie je nu weer met de zogenaamde oudelullendagen die in heel andere tijden met gulle hand in cao’s werden uitgereikt aan werknemers van vijftig jaar en ouder.</sentcontent>
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
        ... 7 more
Caused by: java.lang.RuntimeException: Could not check sentence (language: Dutch): <sentcontent>Dat zie je nu weer met de zogenaamde oudelullendagen die in heel andere tijden met gulle hand in cao’s werden uitgereikt aan werknemers van vijftig jaar en ouder.</sentcontent>
        at org.languagetool.JLanguageTool$TextCheckCallable.getOtherRuleMatches(JLanguageTool.java:1989)
        at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:1858)
        at org.languagetool.MultiThreadedJLanguageTool.lambda$performCheck$1(MultiThreadedJLanguageTool.java:200)
        at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
        at java.base/java.lang.String.charAt(String.java:693)
        at org.languagetool.rules.nl.Tools.glueParts(Tools.java:46)
        at org.languagetool.rules.nl.Tools.glueParts(Tools.java:35)
        at org.languagetool.rules.nl.SpaceInCompoundRule.match(SpaceInCompoundRule.java:103)
        at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:1366)
        at org.languagetool.JLanguageTool.access$1500(JLanguageTool.java:74)
        at org.languagetool.JLanguageTool$TextCheckCallable.getOtherRuleMatches(JLanguageTool.java:1946)
ruud@ruud-pc:/media/ruud/data3/LT/languagetool/languagetool-dev$

Would it be possible to catch this mistake, and continue?

dnaber · October 31, 2022, 4:15pm

The actual reason for that bug is this line in multipartcompounds.txt:

oudelullendagen|extra vrije dagen voor oudere werknemers

Does it actually belong there, as it’s not a multi-part word (one with spaces, I mean)?

Anyway, this should not stop the process. I’ll have a look at that.

Ruud_Baars · October 31, 2022, 4:18pm

I corrected this mistake. The line should be there, but with spaces. Must be an old mistake, detected just now.
Clearly, this is not checked in the test for the routine.

dnaber · October 31, 2022, 4:21pm

Should there ever be words without spaces as the first part of the line in multipartcompounds.txt? If not, I’ll extend the test to complain about those.
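A check of this kind could look roughly like the sketch below. It is not the actual LanguageTool test. It assumes the format seen in the line quoted earlier (an entry, a `|`, then a description), and it assumes that blank lines and `#` comments may occur. The function name `entries_without_space` is made up. It flags entries whose first part contains no space, which is the case that triggered the exception above.

```python
def entries_without_space(lines):
    """Return (line_number, entry) pairs for entries that contain no space.

    Assumption: each data line looks like 'entry|description'; blank lines
    and lines starting with '#' are ignored.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entry = line.split("|", 1)[0].strip()
        if " " not in entry:
            problems.append((number, entry))
    return problems


# The first line is the one quoted in the thread; the second is an
# invented example of an entry that does contain spaces.
sample = [
    "oudelullendagen|extra vrije dagen voor oudere werknemers",
    "een twee drie|verzonnen voorbeeld",
]
for number, entry in entries_without_space(sample):
    print(f"line {number}: no space in '{entry}'")
```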

dnaber · October 31, 2022, 4:27pm

There’s now a new option --skip-exceptions for SentenceSourceChecker. You’ll need to add it and also re-build the code with the steps described above (all the steps with mvn). With this, errors will be printed, but they shouldn’t make the script stop.

dnaber · October 31, 2022, 6:31pm

There are two other words without a space, should these also be corrected?

meerpartijenarbeidscontract
meerpartijenarbeidscontracten

Ruud_Baars · November 1, 2022, 6:34am

Yes, I will do that.

By the way, even when working on a RAM disk, the utility uses just 30% of the total capacity of 8 dual-core processors. So it is neither RAM nor disk that is taking the time.

Ruud_Baars · November 2, 2022, 7:39am

Getting better all the time. Anyway, I managed to process all of the corpus, and it shrank from 20GB to 10GB. And it helps getting more relevant differences, ones not already covered by some rule.

dnaber · November 2, 2022, 8:02am

I think some of the lines in that file won’t work, e.g.:

zng=zgn.    Bedoel je misschien de afkorting voor 'zogenaamd'?

Using “zng” in a text just finds a spelling error, so that the replacement rule isn’t triggered. Is that on purpose? Also, in this specific line, there are spaces instead of a tab that separate the suggestion from the pair.
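The separator problem mentioned here could be found with a small check along these lines. This is a sketch under assumptions: it takes the format to be a replacement pair, one tab, then the message, as in the line quoted above, and it treats blank lines and `#` comments as ignorable. The function name `lines_without_tab` is made up.

```python
def lines_without_tab(lines):
    """Return (line_number, text) pairs for data lines that contain no tab.

    Assumption: a valid line is 'wrong=right<TAB>message'; blank lines and
    lines starting with '#' are ignored.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip() or text.lstrip().startswith("#"):
            continue
        if "\t" not in text:
            problems.append((number, text))
    return problems


# First line: spaces as separator (as quoted in the thread).
# Second line: the same entry with a tab.
sample = [
    "zng=zgn.    Bedoel je misschien de afkorting voor 'zogenaamd'?\n",
    "zng=zgn.\tBedoel je misschien de afkorting voor 'zogenaamd'?\n",
]
for number, text in lines_without_tab(sample):
    print(f"line {number}: no tab separator: {text}")
```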

Ruud_Baars · November 2, 2022, 8:22am

Good find. I removed all redundant spaces.
It is possible not all of those replace rules will hit a lot. But all of them have at least 1 occurrence in the 20GB.

It might be helpful to be able to test the number of hits of rules in user texts.
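For a corpus run (not for user texts), a rough count of hits per rule could be derived from the --csv output, as in the sketch below. It assumes that MATCH lines are tab-separated with the rule ID in the second field, which is not confirmed anywhere in this thread. The function name `count_rule_hits` and the rule IDs in the sample are invented.

```python
from collections import Counter


def count_rule_hits(lines):
    """Count MATCH lines per rule ID.

    Assumption: a MATCH line looks like 'MATCH<TAB>rule_id<TAB>sentence'.
    Lines that do not fit this shape are ignored.
    """
    hits = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "MATCH":
            hits[fields[1]] += 1
    return hits


# Invented sample data, only to show the call.
sample = [
    "MATCH\tRULE_A\tEerste zin.\n",
    "MATCH\tRULE_B\tTweede zin.\n",
    "MATCH\tRULE_A\tDerde zin.\n",
    "NOMATCH\t\tVierde zin.\n",
]
for rule_id, count in count_rule_hits(sample).most_common():
    print(rule_id, count)
```

Rules with very few hits in such a count would be candidates for a closer look, though hit counts in a corpus need not reflect hit counts in user texts.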

My plan is to later analyze the replaces for common features, and make XML rules from those. But some sentences are extremely crooked, and longer sentences are extremely hard to get analyzed and create a good rule for.

Ruud_Baars · November 3, 2022, 5:17pm

If you have got a moment, could you please have a look why
java -jar languagetool-commandline.jar -l nl --level PICKY --enable-temp-off $CORPUS > $LTROOT/output1.txt
uses 100% of the CPU, while
java -cp target/languagetool-dev-6.0-SNAPSHOT-jar-with-dependencies.jar org.languagetool.dev.dumpcheck.SentenceSourceChecker --csv --language nl --skip-exceptions --file $CORPUS > output.txt
does not?

dnaber · November 3, 2022, 7:45pm

Can you send an example of a sentence that one command finds and the other doesn’t?

Ruud_Baars · November 4, 2022, 1:20pm

That is not the issue. I meant the amount of CPU used. SentenceSourceChecker does not get any higher than 30%; running 2 of them lifts it to 40% CPU, not higher.

The LanguageTool command line uses all processors up to the maximum; the total load of the PC is 100%.

I have to wait multiple days for a run on the corpus. It is a lot better than using the server from PHP, but it would be great if the wait time for processing could be shorter.

Ruud_Baars · November 4, 2022, 3:11pm

A sentence I find is:
Ik heb al sinds drie weken last van duiziliheid.
The last word is a spelling error. So this line should not be in NOMATCH, I guess.

But does having spellcheck off make things slower? I would not expect that.