Best practice for creating a new language-datasheet module?

Hi!

I am at the beginning of a larger project that involves consistency checks of datasheets. The checks cover things like using the same wording for the same concept and unifying units, e.g. hertz/Hz, microampere/μA, etc.

The language of those sheets is “en_US”, but that could change. My question is: what is the best way to start a new module, e.g. language-datasheet, in the LanguageTool environment?

Will I have to create a language-en-datasheet module, or can I generalize those consistency rules and make them independent of the language? In my “dream world” I would have a single language-datasheet module that I could use for multiple languages in the end, but is that possible with the LanguageTool architecture?

I have only just started to dig into LanguageTool, so I hope this is not a stupid question. If my question is clear, I’d be glad if somebody could help me get started.

Thank you and best regards,

Stefan Falk

Hi Stefan,

thanks for your interest in LT. If your rules will be pattern rules in XML, you might want to have a look here: Tips and Tricks - LanguageTool Wiki. You should be able to use one XML rule file which you can include from the XML rule files of several languages.

I’m not sure what rules you need: only your own rules, or your own rules in addition to what LT can do?

Regards
Daniel

Hi! Thanks for answering!

Well, let’s say you want to detect inconsistencies in a technical document that describes a piece of hardware, e.g. a microcontroller. In the best case I would have some semantic rules (assume I have an oracle here) that tell me when to use V_{BAT} (LaTeX), which refers to e.g. the input voltage, and when to use plain VBAT to refer to a specific pin of that controller.

It might even be necessary to dynamically load some antipatterns, e.g. VBAT|V_bat|V_Bat, that are generated from the document being processed.
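One way to generate such an alternation from the document itself is sketched below in plain Java, without any LanguageTool classes. The normalization (drop underscores, ignore case) and the "looks like a signal name" heuristic are assumptions made for illustration, and LaTeX forms such as V_{BAT} are deliberately left out because they carry a different meaning here.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SpellingVariants {

    // Assumption: two spellings name the same thing if they are equal after
    // dropping underscores and ignoring case.
    static String normalize(String token) {
        return token.replace("_", "").toLowerCase(Locale.ROOT);
    }

    // Heuristic (assumption): a token looks like a signal name if it contains an
    // underscore or at least two uppercase letters. This filters out "The"/"the".
    static boolean looksLikeSignalName(String token) {
        return token.contains("_")
                || token.chars().filter(Character::isUpperCase).count() >= 2;
    }

    /** Groups tokens by normalized form; keeps groups with several spellings. */
    static Map<String, Set<String>> findVariants(String text) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String token : text.split("[^\\p{L}\\p{N}_]+")) {
            if (!token.isEmpty()) {
                groups.computeIfAbsent(normalize(token), k -> new LinkedHashSet<>())
                      .add(token);
            }
        }
        groups.values().removeIf(spellings -> spellings.size() < 2
                || spellings.stream().noneMatch(SpellingVariants::looksLikeSignalName));
        return groups;
    }

    /** Builds a regex alternation that matches any spelling, each quoted literally. */
    static String toAlternation(Set<String> spellings) {
        return spellings.stream().map(Pattern::quote).collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        String text = "Connect VBAT to the supply. "
                + "The V_bat pin and the V_Bat pin are identical.";
        findVariants(text).forEach((key, spellings) ->
                System.out.println(key + " -> " + toAlternation(spellings)));
    }
}
```

The resulting alternation could then be fed into whatever rule or antipattern is built for that document; how exactly that is wired into LT is not covered by this sketch.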

The question is how I could combine my semantic rules with LT, if that’s even possible.

Another way of looking at it is: I don’t just want to detect grammatical errors, I want to extend it to detect inconsistencies.

Best regards,
Stefan

If you’re using LT via the Java API you can simply add rules dynamically. Load them with PatternRuleLoader from an XML file and add them with JLanguageTool.addRule(). If you’re not using the API you can modify the grammar.xml files directly. If you don’t want to do that you can add a language module, like we did for simple German: languagetool/languagetool-language-modules/de-DE-x-simple-language at master · languagetool-org/languagetool · GitHub

Alright, thank you!

That would mean I could also “activate” and “deactivate” rules e.g. based on context, right? Assume I know that I’m currently in the chapter “Introduction” where different rules apply than in general chapters.

If I may ask another thing: what would be a good approach to detect the following? A quantitative statement must not be preceded by a vague determiner, so something like “some µA” is not allowed. The unit is of course exchangeable (mm, A, min, sec, etc.). Is it better to put that into the grammar.xml file, or would it be more practical to implement it as a Java rule?

I hope these kinds of questions are okay in this forum. Please let me know if they are off-topic here.

Best regards,
Stefan

Yes, if you know the context you can call LT several times with different rules, once per context.
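A rough sketch of the bookkeeping for "once per context", in plain Java without LanguageTool on the classpath. The rule IDs and chapter names are hypothetical. With a real JLanguageTool instance, the rules outside the active set would be switched off (to my understanding via disableRule(), and switched back on via enableRule()) before checking that chapter's text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ContextRuleSelection {

    // Hypothetical rule IDs, for illustration only.
    static final Set<String> ALL_RULES = new TreeSet<>(Arrays.asList(
            "DS_UNIT_SPELLING", "DS_VAGUE_QUANTITY", "DS_PIN_NAMING"));

    // Hypothetical mapping: which rules are switched off in which chapter.
    static final Map<String, Set<String>> DISABLED_BY_CHAPTER = new HashMap<>();
    static {
        DISABLED_BY_CHAPTER.put("Introduction",
                new TreeSet<>(Collections.singletonList("DS_VAGUE_QUANTITY")));
    }

    /** Rules that stay active for a chapter; unknown chapters keep all rules. */
    static Set<String> activeRules(String chapter) {
        Set<String> active = new TreeSet<>(ALL_RULES);
        active.removeAll(
                DISABLED_BY_CHAPTER.getOrDefault(chapter, Collections.emptySet()));
        return active;
    }

    public static void main(String[] args) {
        for (String chapter : Arrays.asList("Introduction", "Electrical Characteristics")) {
            // This is the point where the checker would be configured with the
            // active rules and run on the text of this chapter only.
            System.out.println(chapter + " -> " + activeRules(chapter));
        }
    }
}
```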

“some µA” can probably be detected with a rule in grammar.xml. The first token could be a regex that matches anything that’s not a number, the second could be a regex that lists all the units.
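To try out the two-token idea before writing it as a grammar.xml rule, something like the following plain Java sketch might help. It does not use LanguageTool at all; the unit list and the number format are illustrative assumptions, and both the micro sign (U+00B5) and the Greek mu (U+03BC) are accepted because datasheets tend to mix them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class VagueQuantityCheck {

    // Tokens that count as a number: 5, 3.3, 1,5, -40 (illustrative, not exhaustive).
    private static final Pattern NUMBER = Pattern.compile("[+-]?\\d+([.,]\\d+)?");

    // Illustrative unit list; accepts the micro sign (U+00B5) and Greek mu (U+03BC).
    private static final Pattern UNIT =
            Pattern.compile("[\\u00B5\\u03BC]A|mA|A|mm|min|sec|Hz");

    /** Returns every "non-number + unit" token pair found in the text. */
    static List<String> findVagueQuantities(String text) {
        List<String> hits = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // Drop trailing punctuation so that a unit at the end of a sentence counts.
            String unit = tokens[i].replaceAll("[.,;:!?)]+$", "");
            String previous = tokens[i - 1];
            if (UNIT.matcher(unit).matches() && !NUMBER.matcher(previous).matches()) {
                hits.add(previous + " " + unit);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findVagueQuantities("The pin draws some \u00B5A in sleep mode."));
        System.out.println(findVagueQuantities("The pin draws 5 \u00B5A in sleep mode."));
    }
}
```

Treating "anything that is not a number" as suspicious will also flag harmless text such as "Plan A", so in practice the first token may need to be narrowed to a list of determiners.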

Alright, thank you very much for your help!

Just one last thing. I’m trying to create an additional module “language-technical-datasheet” and add it to the standalone [1].

I’m getting “No language file available named techsheet at languages/techsheet!” when I try to start it, obviously because there is no dictionary “techsheet”.

My question would be if it is possible to write a universal extension for all languages or would I have to extend each language I want to support with data sheet specific rules? Like for SimpleGerman but in this case I’d have:

GerTechnicalDatasheet extends German
EnTechnicalDatasheet extends English

etc.

[1] TechnicalDatasheet.java

public class TechnicalDatasheet extends Language {

  @Override
  public String getShortName() {
    return "techsheet";
  }

  @Override
  public String getName() {
    return "Technical Datasheet (TS)";
  }

  @Override
  public String[] getCountries() {
    return new String[] {"GB"};
  }

  @Override
  public Contributor[] getMaintainers() {
    return new Contributor[] { new Contributor("Stefan Falk") };
  }

  @Override
  public List<Rule> getRelevantRules(ResourceBundle messages) throws IOException {
    return null;
  }

}

Could you send the complete stacktrace of the error? It doesn’t look familiar.

There’s a method getRuleFileNames() that you can override; it returns the grammar.xml files to be used. But if you want to keep the spell checking of the languages, I guess you’ll have to write one subclass per language (I’m not 100% sure).

Hi!

I think I just used it wrong. I now extend e.g. EnglishRule and use just the English dictionary. I think that’s going to work for me :)

Okay I already said “one more thing” but there’s just one more thing …

I can see that the grammar.xml rules are getting unmarshalled to PatternRule objects. In JLanguageTool.java the method activateDefaultPatternRules() does that for example.

I was wondering if I can add PatternRules programmatically myself. The grammar.xml would be quite “static”. What if I would like to, e.g., replace a regex pattern “µA|mA|A” with “µV|mV|V”? I know I could just add all those rules or concatenate the patterns in the grammar file, but since I’m going to have a few context-based rules, I would like to add/remove or activate/deactivate those rules on demand.

Would that be possible or is there a place where I can access JLanguageTool.addRule() in a language module e.g. “language-en”?

A PatternRule is just an ordinary class, so you can create objects and call addRule(myRule). That’s about the same as adding a rule to grammar.xml, just programmatically. But you could also add all rules to grammar.xml and then enable/disable them programmatically as you need them.

I just saw that I can do the adding part in my extended language:

public class TechnicalDatasheet extends English {

  @Override
  public String[] getCountries() {
    return new String[]{"DS"}; // Not really a country but a Domain (Datasheet)
  }
  
  @Override
  public String getName() {
    return "Technical Datasheet";
  }
  
  @Override
  public List<Rule> getRelevantRules(ResourceBundle messages) throws IOException {
    List<Rule> rules = new ArrayList<>();
    rules.addAll(super.getRelevantRules(messages));
    // ADD RULES
    return rules;
  }

}

The only thing missing now would be how to deactivate a rule on demand e.g. based on a context.

I was assuming you created a JLanguageTool object yourself. With that you can activate/deactivate rules. If you don’t do that (and just use an extended LT on the command-line or via GUI), you cannot activate/deactivate rules dynamically.

Oh alright, that makes sense. To start with, I just extended everything step by step to get my “English (Datasheet)” language into the standalone. I see, there I won’t have easy access to the JLanguageTool instance, if at all.

I think I’m slowly getting into it :)