Porting Language Tool to Python

tokestermw · November 29, 2017, 12:31am

Hi, I’m keen on porting some of the Language Tool functionality to Python (open sourced, probably MIT license).

More specifically, I’d like to match on tokens based on their text, POS tag, etc., given the rules in grammar.xml. I was also going to translate grammar.xml to grammar.yml for readability. There will be some enhancements, like adding dependency tags to match on (so possibly another grammar.yml).

I wasn’t sure what was permitted under your license. Would it be fine to copy grammar.xml over to my repo (with the LT license) and have a README that links to the GitHub repo?

dnaber · November 29, 2017, 9:07am

Frankly, as the developer of LT I’m not a big fan of this idea. Like many Open Source projects, LT could be so much better if we had more contributors. But we don’t. Now if the very limited resources we have get spread over even more versions of LT, I think that would not help improve LT. Also, we offer a very easy to use HTTP server that can be started with a single command and that returns JSON which can be used from any programming language.
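For readers who want to try the server route from Python, here is a minimal sketch using only the standard library. It assumes a LanguageTool server is already running locally on port 8081 and exposes the `/v2/check` endpoint; the helper names `build_request`, `summarize` and `check` are made up for this illustration, and the JSON fields read here (`matches`, `offset`, `length`, `message`, `replacements`) should be verified against the server version you run.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a LanguageTool server is listening locally on port 8081.
LT_URL = "http://localhost:8081/v2/check"


def build_request(text, language="en-US", url=LT_URL):
    """Build a form-encoded POST request for the check endpoint."""
    data = urllib.parse.urlencode({"text": text, "language": language})
    return urllib.request.Request(url, data=data.encode("utf-8"), method="POST")


def summarize(response_json):
    """Reduce a decoded response to (offset, length, message, suggestions)."""
    results = []
    for match in response_json.get("matches", []):
        suggestions = [r["value"] for r in match.get("replacements", [])]
        results.append(
            (match["offset"], match["length"], match["message"], suggestions)
        )
    return results


def check(text, language="en-US"):
    """Send text to the server and return the summarized matches."""
    with urllib.request.urlopen(build_request(text, language)) as resp:
        return summarize(json.load(resp))
```

Calling `check("We can elaborate this distinction as follow.")` would then return one tuple per reported problem, provided the server is reachable.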

Another aspect is that LT relies on quite a few external libraries, and they will probably not all exist for Python, so the features that depend on them will be missing. We also have complex Java-based rules, which will either need to be ported or simply be missing. Plus statistics-based detection of errors. Plus ongoing work on neural networks.

Of course the license allows you to take the LT files and use them in your project, no matter what language you use. Using .yml instead of .xml might be a good idea anyway (also for the Java version).

tokestermw · November 29, 2017, 6:02pm

I understand.

Though I’m thinking of porting only grammar.xml and not the entire program.

For example the NLP package spaCy has a matcher API. I could write some rules for matching on grammar, but I can avoid the cold start problem by porting over the LT rules.

So it’s less an LT port than a grammar matching package for spaCy, augmented greatly with LT rules (grammar.xml). There will be differences too, e.g., spaCy and LT return different sets of properties to match on.

I agree that if it’s just about using LT from Python, using the server would be best.

If there’s interest I could contribute a yml translation script.
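A translation script along these lines could start from something like the sketch below. The sample rule is hand-written in the general style of grammar.xml for illustration, not copied from LanguageTool, and the function name `rule_to_dict` and the output keys are assumptions. It only handles plain-text tokens and an optional `postag` attribute; regular-expression tokens, exceptions and other rule features are not covered. LT's POS tags also differ from spaCy's coarse `POS` values, so they are stored under `TAG` here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general style of grammar.xml (illustrative only).
SAMPLE_RULE = """
<rule id="AS_FOLLOW" name="as follow (as follows)">
  <pattern>
    <token>as</token>
    <token>follow</token>
  </pattern>
  <message>Did you mean <suggestion>as follows</suggestion>?</message>
  <example correction="as follows">We can elaborate this distinction <marker>as follow</marker>.</example>
</rule>
"""


def rule_to_dict(rule_xml):
    """Convert one simple <rule> element into a YAML-friendly dict."""
    rule = ET.fromstring(rule_xml.strip())
    pattern = []
    for token in rule.findall("./pattern/token"):
        spec = {}
        if token.text and token.text.strip():
            spec["LOWER"] = token.text.strip().lower()
        if token.get("postag"):
            spec["TAG"] = token.get("postag")
        pattern.append(spec)
    message = rule.find("message")
    return {
        rule.get("id").lower(): {
            # itertext() flattens nested <suggestion> and <marker> elements.
            "corrections": ["".join(message.itertext())] if message is not None else [],
            "examples": ["".join(e.itertext()) for e in rule.findall("example")],
            "patterns": [pattern],
        }
    }
```

The resulting dict could then be written out with a YAML library such as PyYAML (`yaml.safe_dump`), which is not part of the standard library.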

Also didn’t know about the neural network feature. Looks good.

dnaber · November 29, 2017, 6:24pm

Could you provide an example of what a rule in yml would look like?

tokestermw · November 29, 2017, 6:33pm

Here’s something that I’m working on (for the “as follow” rule).

typos: 
  as_follow_as_follows: 
    corrections: 
      - Do you mean "as follows"?
    description: ~
    examples: 
      - We can elaborate this distinction as follow.
    patterns: 
      # direct string match
      - as follow
      # match on list of dictionaries
      - 
        - 
          LOWER: as
          POS: ADP
        - 
          LOWER: follow
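
To make the intended semantics of this rule concrete, here is a minimal pure-Python sketch of how the two pattern styles could be matched. It is not spaCy's matcher and not the poster's implementation: tokens are plain dicts with hand-assigned `LOWER` and `POS` values (a real tagger may label them differently), and the names `token_matches`, `find_matches` and `apply_rule` are made up for this illustration.

```python
# The rule above, written as a Python dict so no YAML library is needed.
RULE = {
    "corrections": ['Do you mean "as follows"?'],
    "patterns": [
        "as follow",  # direct string match
        [{"LOWER": "as", "POS": "ADP"}, {"LOWER": "follow"}],  # list of dicts
    ],
}


def token_matches(token, spec):
    """A token matches when every attribute named in the spec is equal."""
    return all(token.get(key) == value for key, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) token spans where the pattern matches."""
    if isinstance(pattern, str):
        # Treat a plain string as a sequence of lowercase word matches.
        pattern = [{"LOWER": word} for word in pattern.lower().split()]
    size = len(pattern)
    spans = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(token_matches(t, s) for t, s in zip(window, pattern)):
            spans.append((start, start + size))
    return spans


def apply_rule(tokens, rule):
    """Collect the distinct spans matched by any pattern of the rule."""
    spans = set()
    for pattern in rule["patterns"]:
        spans.update(find_matches(tokens, pattern))
    return sorted(spans)


# Hand-tagged tokens for the example sentence (tags are assumptions).
TAGGED = [("We", "PRON"), ("can", "AUX"), ("elaborate", "VERB"),
          ("this", "DET"), ("distinction", "NOUN"), ("as", "ADP"),
          ("follow", "VERB"), (".", "PUNCT")]
TOKENS = [{"LOWER": word.lower(), "POS": pos} for word, pos in TAGGED]
```

Both patterns hit the same two tokens here, so `apply_rule(TOKENS, RULE)` reports a single span covering "as follow".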

qsam · September 4, 2019, 6:32am

Not to resurrect a dead thread, but I am interested in this as well. My Python skills are much better than my Java, and I think that is true for many in the NLP community. Also, it seems like, at least for the word2vec component, some parts of LT are done in Python anyway.

Given the timing of this question and its contents, I assume that OP is the author of the spacy-grammar extension, which seems like a good foundation.

johanneswilm · September 5, 2019, 9:32am

Fidus Writer is a Python program that uses LanguageTool through the API. But it’s a pain to package due to the Java dependency. It’s not quite problematic enough to do much about it, but if there is a Python version available some day, I’m sure we’ll use that instead.

bminixhofer · December 31, 2020, 9:17am

Sorry to revive this yet again but I made something to address these problems:

NLPRule is a library to parse and run LanguageTool rules from grammar.xml and disambiguation.xml. Currently all disambiguation rules and about 80% of the grammar rules for English and German are supported.

It is written in Rust for speed but has bindings for Python.

I’m actively maintaining (and probably enhancing) it. I’m not quite happy with the speed yet but I wouldn’t be surprised if it is already faster than LanguageTool.