[GSoC] Idea - DSL for rules in Kotlin

Hello, everyone,

Since I got interested in LanguageTool, I’ve been thinking a lot about how great it would be to have a good, flexible, and uniform way to write rules. I’ve been in love with this idea for a few weeks now.

First, a few things to keep in mind while reading this post:

  1. Kotlin compiles to JVM bytecode and thus can be added to the project painlessly: all existing Java code will continue working. Furthermore, Kotlin functions can be called from Java and vice versa.
  2. None of the existing XML rules have to go anywhere, but it’s worth noting that all of them could be written as Kotlin DSL rules; I’ll show an example below.

All right, I’ll just start with an example and then talk about it a bit.
Here’s the first English rule in grammar.xml:

<category id="AMERICAN_ENGLISH" name="American English phrases" type="locale-violation">
        <rule id="ZIP_CODE_POSTCODE" name="zip code/postcode">
            <pattern>
                <token>zip</token>
                <token postag_regexp="yes" postag="NN:UN|NNS" regexp="yes">codes?</token>
            </pattern>
            <message>The term '\1 \2' is common for American English. Did you mean <suggestion><match no="2" postag_regexp="yes" postag="NN(S)|NN:UN" postag_replace="NN$1">postcode</match></suggestion>?</message>
            <url>www.learnenglish.de/mistakes/USvsBrEnglish.html</url>
            <short>AmE/BrE: zip code/postcode</short>
            <example correction="postcode">Please enter your <marker>zip code</marker>.</example>
            <example correction="Postcodes"><marker>Zip codes</marker> are not always necessary.</example>
        </rule>
</category>

And here’s the same rule in Kotlin DSL:

category(id="AMERICAN_ENGLISH", name="American English phrases", type="locale-violation") {
        rule(id="ZIP_CODE_POSTCODE", name="zip code/postcode") {
            pattern {
                token { +"zip" }
                token(postag_regexp="yes", postag="NN:UN|NNS", regexp="yes") { +"codes?" }
            }
            message {
                +"The term '\\1 \\2' is common for American English. Did you mean "
                suggestion { match(no="2", postag_regexp="yes", postag="NN(S)|NN:UN", postag_replace="NN\$1") { +"postcode" } }
                +"?"
            }
            url { +"www.learnenglish.de/mistakes/USvsBrEnglish.html" }
            short { +"AmE/BrE: zip code/postcode" }
            example(correction="postcode") { +"Please enter your "; marker { +"zip code" }; +"." }
            example(correction="Postcodes") { marker { +"Zip codes" }; +" are not always necessary." }
        }
}

In Kotlin this rule could have been written more simply; I’m just showing that it can look basically the same.

What’s the point?

  1. This rule can easily be compiled into XML. It might be something like rule.toXML() / englishRules.saveTo(“…/grammar.xml”), or something else.
  2. It can also be turned into a Java “Rule” object. Maybe it could even extend the “Rule” class directly; I’ll have to check.
  3. For custom Java rules, the same syntax would be used, but instead of pattern / tokens there would be a custom matcher() function (Kotlin is quite a functional language).
  4. All XML rules can be transformed into this format (I would write a script).
  5. Uniformity and flexibility: all rules would be in the same format, which would make it possible to add a new output format for rules (compile them to something else), change the existing one (for example, add new tags to existing XML rules), or just navigate easily and have all rules in the same place.
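
To make point 1 a bit more concrete, here is a minimal sketch of how such builders could produce XML. Node, tag(), rule(), and toXml() are made-up names for illustration only, not existing LanguageTool or Kotlin APIs, and the output is deliberately not pretty-printed.

```kotlin
// Hypothetical mini-DSL; none of these names exist in LanguageTool.
class Node(private val name: String, private val attrs: List<Pair<String, String>>) {
    private val children = mutableListOf<Any>()   // nested Nodes or text (String)

    // +"text" appends a text child, as in the rule example above
    operator fun String.unaryPlus() {
        children.add(this)
    }

    // Generic child element: tag("token", "regexp" to "yes") { +"codes?" }
    fun tag(name: String, vararg attrs: Pair<String, String>, body: Node.() -> Unit = {}) {
        children.add(Node(name, attrs.toList()).apply(body))
    }

    // Serializes this node and its children as compact XML
    fun toXml(): String {
        val attrText = attrs.joinToString("") { (k, v) -> " $k=\"${escape(v)}\"" }
        val inner = children.joinToString("") {
            if (it is Node) it.toXml() else escape(it.toString())
        }
        return "<$name$attrText>$inner</$name>"
    }
}

// Escapes the characters that are special in XML text and attribute values
private fun escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

// Entry point of the sketch: builds a <rule> element
fun rule(id: String, name: String, body: Node.() -> Unit): Node =
    Node("rule", listOf("id" to id, "name" to name)).apply(body)

val zipRule = rule(id = "ZIP_CODE_POSTCODE", name = "zip code/postcode") {
    tag("pattern") {
        tag("token") { +"zip" }
        tag("token", "regexp" to "yes") { +"codes?" }
    }
    tag("short") { +"AmE/BrE: zip code/postcode" }
}

fun main() {
    println(zipRule.toXml())
}
```

A real DSL would presumably have dedicated pattern { }, token { }, and message { } functions instead of the generic tag(); the generic version just keeps the sketch short.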

So, if you’re as interested as I am (or a bit less), I can write a prototype and show a working example of a rule being generated into XML (and/or Java). I hope I’ve made it clear that enabling this feature doesn’t require taking apart any existing parts: it’s purely an addition.

Kotlin DSL for HTML
My very simple DSL for building XML maps - based on the previous example (kotlinx.html)

Hi
I’m sorry to see that no one has given any feedback. Is that a sign of dislike for this idea? :(

At least for me, the advantages aren’t clear yet when compared to the amount of work this would cause…

I agree with David. I was contemplating the same idea (but providing the DSL with Groovy), but the benefits didn’t seem to be worth the trouble.

If we’re talking about transferring all rules to the DSL, the amount of work is unreasonably large. But I’d imagine this first of all as an alternative way for people to create rules, i.e. they write rules in the DSL => the rules are generated in the .xml file. Or, if they have custom matchers, a “Rule” object is created.

So, the minimal amount of work for that would be:

  • Writing the DSL itself (quite simple, since it just mirrors the XML)
  • Writing documentation and a user guide
  • Setting up a proper way to work with “grammar.xml” (adding new DSL rules and keeping modifications in sync)

Sure, I understand what you probably mean: there are bigger priorities in the project, and this is just how I saw a potential improvement. The idea of offering a simpler and more flexible interface for writing rules seemed amazing to me.

Thanks for replies :}

Hi. I am one of the less technical rule makers (Dutch). To me, your example is just as complex as the XML is. The web form for prototyping a rule is much easier. I guess the new format is familiar to you, but it seems less readable than XML to me.

Thank you for your feedback!
Yeah, you’re probably right :]

Hi,

There’s one thing where I’d expect Kotlin rules to be much better than XML ones, and that is performance. Compiled rules should be faster than XML ones. Am I correct?

If that’s the case, the clear advantage of having such a DSL is that, at build time, XML rules could be converted into Kotlin/Groovy and then compiled, which would bring better performance.

Does it make sense?

But then we’d lose the advantage of having an easy-to-edit format (XML) that anybody can edit without compiling the software, wouldn’t we?

If it comes with the benefit of better performance, I think it may be worth it. I’m not sure how many of the people who download LT change the rules.

In any case, something could be written so that conversion is a build-time option.

I think it’s worth giving it a try and, depending on the results, either keep working or discard the idea. A 5% improvement probably means keeping the flexibility is better, but an 80% improvement may change people’s minds.

My point is that the number of people contributing rules is not large as it is. Making things even more complex will make finding maintainers even more difficult.

Editing XML rules only is a good way to start, and could be all that is needed for a language to get started. That is the way I started with Dutch 10 years ago. Even XML is difficult enough for people that know the language well, but are unfamiliar with programming.

My 2 cents: more professional language specialists are needed, not more programmers. The combined skill is extremely rare (and expensive).

And that’s exactly what I was suggesting: not to replace how contributors write the rules, but to give an extra option for LT to “use” the rules.

As @xavivars pointed out, this is only an additional format. So, all rule writers can continue using XML. At the same time, more complex rules can be written in Kotlin.
And yes, this would save time in three ways: build time, because the XML rules wouldn’t all have to be parsed each time; programmers’ time, because a Kotlin DSL puts a lot of power in one’s hands; and, even though it compiles to the same JVM bytecode, likely run time as well, since Kotlin has very efficiently implemented data structures and its inline functions give it a real boost.

Here is an example of LongSentenceRule’s match method written in Kotlin. It doesn’t make a big difference logically, but we could embed this in the DSL construction if we wanted to.

// Kotlin
@Throws(IOException::class)
override fun match(sentence: AnalyzedSentence): Array<RuleMatch> {
  // A token counts as a word unless it is a sentence boundary or matches NON_WORD_REGEX
  fun isWord(aToken: AnalyzedTokenReadings): Boolean =
    !aToken.isSentenceStart() &&
    !aToken.isSentenceEnd() &&
    !NON_WORD_REGEX.matcher(aToken.getToken()).matches()

  val words = sentence.getTokensWithoutWhitespace().filter { isWord(it) }
  if (words.size <= maxWords) {
    return emptyArray()   // sentence is short enough
  }

  // words[maxWords] is the first word over the limit; as in the Java version,
  // the match starts at the word before it (or at 0 if there is none)
  val errorToken = words[maxWords]
  val prevStartPos = if (maxWords >= 1) words[maxWords - 1].getStartPos() else 0
  return arrayOf(RuleMatch(this, sentence, prevStartPos,
                           errorToken.getEndPos(), getMessage()))
}
//Java:
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
    List<RuleMatch> ruleMatches = new ArrayList<>();
    AnalyzedTokenReadings[] tokens = sentence.getTokensWithoutWhitespace();
    String msg = getMessage();
    if (tokens.length < maxWords + 1) {   // just a short-circuit
      return toRuleMatchArray(ruleMatches);
    } else {
      int numWords = 0;
      int startPos = 0;
      int prevStartPos;
      for (AnalyzedTokenReadings aToken : tokens) {
        String token = aToken.getToken();
        if (!aToken.isSentenceStart() && !aToken.isSentenceEnd() && !NON_WORD_REGEX.matcher(token).matches()) {
          numWords++;
          prevStartPos = startPos;
          startPos = aToken.getStartPos();
          if (numWords > maxWords) {
            RuleMatch ruleMatch = new RuleMatch(this, sentence, prevStartPos, aToken.getEndPos(), msg);
            ruleMatches.add(ruleMatch);
            break;
          }
        }
      }
    }
    return toRuleMatchArray(ruleMatches);
}

Some rules’ code could be shortened to a few lines using Kotlin’s syntactic sugar and functional programming. I haven’t demonstrated that here; I just took an almost random rule and rewrote it as an example.

For exploratory purposes, I would recommend checking out Kotlin. It might come in handy, even if not in this project.
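
For what it’s worth, with stand-in types the core of such a rule might shrink to something like the sketch below. Token, Match, longSentenceMatch, and the NON_WORD pattern are hypothetical and only for illustration; they are not LT classes, and the real non-word regex may well differ.

```kotlin
// Stand-ins for illustration only; these are NOT LanguageTool classes.
data class Token(val text: String, val startPos: Int, val endPos: Int)
data class Match(val fromPos: Int, val toPos: Int, val message: String)

// Assumed definition of a "non-word" token: punctuation only.
val NON_WORD = Regex("""\p{Punct}+""")

// Returns a match covering the first word over the limit (plus the word before it),
// or null if the sentence has at most maxWords words.
fun longSentenceMatch(tokens: List<Token>, maxWords: Int, message: String): Match? {
    val words = tokens.filterNot { NON_WORD.matches(it.text) }
    val errorToken = words.getOrNull(maxWords) ?: return null
    val fromPos = words.getOrNull(maxWords - 1)?.startPos ?: 0
    return Match(fromPos, errorToken.endPos, message)
}

fun main() {
    // Build tokens with character offsets from a space-separated sample sentence
    var pos = 0
    val tokens = "This sentence , sadly , is too long".split(" ").map { word ->
        Token(word, pos, pos + word.length).also { pos += word.length + 1 }
    }
    println(longSentenceMatch(tokens, maxWords = 4, message = "Sentence is too long"))
}
```

Returning a nullable Match instead of an array is a simplification for the sketch; a real rule would still have to return RuleMatch[] to fit the existing API.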

There are two main solutions:

  • Updated XML files are automatically compiled to Kotlin (either immediately or, for example, once a day)
  • XML is used as the main source (so each new build translates it to Kotlin and compiles it)

Either way, LT users won’t face this problem.
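
Either option needs an XML-to-DSL translation step. As a rough sketch of what that could look like, the code below walks a parsed XML element and prints DSL-style source text. toDsl, quote, and parseXml are hypothetical helper names, and the sketch assumes that tag and attribute names are valid Kotlin identifiers and that whitespace-only text between elements can be dropped.

```kotlin
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.w3c.dom.Node

// Hypothetical helper: parses an XML string and returns its root element.
fun parseXml(xml: String): Element =
    DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(ByteArrayInputStream(xml.toByteArray()))
        .documentElement

// Escapes backslash, quote, and dollar so the text is a valid Kotlin string literal.
fun quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("$", "\\$") + "\""

// Renders one XML element (recursively) as DSL source text.
fun toDsl(el: Element, indent: String = ""): String {
    val attrs = (0 until el.attributes.length)
        .map { el.attributes.item(it) }
        .joinToString(", ") { "${it.nodeName}=${quote(it.nodeValue)}" }
    val header = indent + el.tagName + if (attrs.isEmpty()) "" else "($attrs)"
    val body = (0 until el.childNodes.length)
        .map { el.childNodes.item(it) }
        .mapNotNull { child ->
            when {
                child is Element -> toDsl(child, "$indent    ")
                child.nodeType == Node.TEXT_NODE && child.nodeValue.isNotBlank() ->
                    "$indent    +${quote(child.nodeValue)}"
                else -> null   // comments, whitespace-only text, etc. are skipped
            }
        }
    return if (body.isEmpty()) "$header {}"
           else "$header {\n${body.joinToString("\n")}\n$indent}"
}

fun main() {
    val xml = """<rule id="DEMO"><pattern><token>zip</token><token regexp="yes">codes?</token></pattern></rule>"""
    println(toDsl(parseXml(xml)))
}
```

The order in which attributes come out depends on the XML parser, so a real converter would probably want to fix an order explicitly.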