
Checking The Complete Wikipedia: unexpected behaviour caused by FEATURE_SECURE_PROCESSING


(Mike Unwalla) #1

I ran LT on the full en Wikipedia, as shown on http://wiki.languagetool.org/checking-the-complete-wikipedia.

The start of the output from LT shows that there is no limit to the sentences to check:
These rules are disabled: []
All spelling rules are disabled
Working on: ../enwiki-20160920-pages-articles-multistream.xml
Sentence limit: no limit
Error limit: no limit

But processing stops before LT has tested all the Wikipedia data:

23,440,000 sentences checked...
23,445,000 sentences checked...
23,450,000 sentences checked...
Exception in thread "main" java.lang.RuntimeException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[66470483,50]
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.hasNext(WikipediaSentenceSource.java:84)
        at org.languagetool.dev.dumpcheck.MixingSentenceSource.hasNext(MixingSentenceSource.java:75)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:175)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.main(SentenceSourceChecker.java:80)
        at org.languagetool.dev.wikipedia.Main.main(Main.java:45)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[66470483,50]
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.handleTextElement(WikipediaSentenceSource.java:144)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.fillSentences(WikipediaSentenceSource.java:127)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.hasNext(WikipediaSentenceSource.java:82)
        ... 4 more

D:\LanguageTool-wikipedia-3.6-SNAPSHOT>

As best I can tell, FEATURE_SECURE_PROCESSING is used by third-party software that LT depends on. (I did not find FEATURE_SECURE_PROCESSING in the LT GitHub repository.)

What, if anything, can I do to check LT rules against all the Wikipedia data?


(Daniel Naber) #2

I just tried fixing this by adding factory.setProperty(XMLConstants.FEATURE_SECURE_PROCESSING, false) to the code, but then I get an exception telling me that property http://javax.xml.XMLConstants/feature/secure-processing is not supported.
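
For reference, here is a minimal sketch of another way to lift the limit. It assumes the JDK's built-in StAX parser, which reads its JAXP processing limits from system properties such as jdk.xml.totalEntitySizeLimit (a value of 0 means no limit). The class and method names are made up for illustration, and this is not necessarily what LT's code does.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of LanguageTool.
class EntityLimitSketch {

  // Creates a StAX factory after lifting the JDK's accumulated-entity-size limit.
  static XMLInputFactory newUnlimitedFactory() {
    // JDK-specific system property; 0 (or a negative value) means "no limit".
    // Set it before creating the factory so the parser can pick it up.
    System.setProperty("jdk.xml.totalEntitySizeLimit", "0");
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // Merge adjacent character events so text is easier to handle.
    factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
    return factory;
  }

  // Counts the characters of all text nodes, with entity references expanded.
  static int countTextChars(String xml) throws XMLStreamException {
    XMLStreamReader reader =
        newUnlimitedFactory().createXMLStreamReader(new StringReader(xml));
    int total = 0;
    while (reader.hasNext()) {
      if (reader.next() == XMLStreamConstants.CHARACTERS) {
        total += reader.getTextLength();
      }
    }
    reader.close();
    return total;
  }

  public static void main(String[] args) throws XMLStreamException {
    System.out.println(countTextChars("<a>x &amp; y</a>"));
  }
}
```

The same property can also be passed on the command line as -Djdk.xml.totalEntitySizeLimit=0, which avoids changing any code. Keep in mind that the limit exists to protect against malicious XML, so lifting it only makes sense for trusted input such as an official dump.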

Anyway, checking the whole (English) Wikipedia might take days or weeks, so you might not want to do that. I suggest splitting the huge XML into smaller parts and checking them one by one. You'll need to make sure that the XML is valid at least at the beginning of each file, so you cannot just split every 100,000 or so lines without then fixing the XML manually.
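
A rough sketch of such a split, in case it helps. It assumes the usual dump layout, where everything before the first <page> line is the header and each <page> and </page> tag sits on its own line. It copies the header into every part and closes each part with </mediawiki>, so each file is complete XML. The class name, method name and part-N.xml file names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of LanguageTool.
class DumpSplitterSketch {

  private static final String FOOTER = "</mediawiki>\n";

  // Splits the dump into files with at most pagesPerPart pages each.
  // Returns the number of part files written to outDir.
  static int split(Path dump, Path outDir, int pagesPerPart) throws IOException {
    StringBuilder header = new StringBuilder();
    boolean inHeader = true;
    int part = 0;
    int pagesInPart = 0;
    BufferedWriter out = null;
    try (BufferedReader in = Files.newBufferedReader(dump, StandardCharsets.UTF_8)) {
      String line;
      while ((line = in.readLine()) != null) {
        String trimmed = line.trim();
        if (inHeader && !trimmed.equals("<page>")) {
          // Everything before the first page (root element, siteinfo) is the header.
          header.append(line).append('\n');
          continue;
        }
        inHeader = false;
        if (trimmed.equals("</mediawiki>")) {
          break; // end of the dump; the footer is written below
        }
        if (out == null) {
          // Start a new part and repeat the header so the file is valid on its own.
          part++;
          out = Files.newBufferedWriter(
              outDir.resolve("part-" + part + ".xml"), StandardCharsets.UTF_8);
          out.write(header.toString());
        }
        out.write(line);
        out.write('\n');
        if (trimmed.equals("</page>") && ++pagesInPart >= pagesPerPart) {
          // Only cut at a page boundary, then close the root element.
          out.write(FOOTER);
          out.close();
          out = null;
          pagesInPart = 0;
        }
      }
      if (out != null) {
        out.write(FOOTER);
        out.close();
        out = null;
      }
    } finally {
      if (out != null) {
        out.close(); // only reached if an exception interrupted the loop
      }
    }
    return part;
  }
}
```

For example, DumpSplitterSketch.split(Paths.get("enwiki.xml"), Paths.get("parts"), 100000) would write parts/part-1.xml, parts/part-2.xml and so on, provided the parts directory already exists. Each part can then be checked separately.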


(Daniel Naber) #3

This issue should now be fixed in the next nightly build (see the commit).