Profiling rules in languagetool-commandline

I was trying to take a shot at profiling my rules via the -p argument of the command-line tool, but I see that most of them produce very close results. Looking at the code, I realized that the analyzeSentence step is included in the timing for each rule.
But if the disambiguator has dozens of rules in it, the time spent in the actual checking rule does not have much impact on the total.
Would it make sense to separate the analyze step out of the per-rule benchmark, perhaps giving it its own timing?
If yes, I could prepare the change.
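
To illustrate, here is a minimal sketch of the separation I have in mind, written against the public JLanguageTool API (the actual profiling code in the command-line tool is structured differently, and AmericanEnglish plus the sample text are just placeholders):

import java.util.List;
import org.languagetool.AnalyzedSentence;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.TextLevelRule;

public class SeparatedProfiling {
  public static void main(String[] args) throws Exception {
    JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
    String text = "This is a example text. Here is an other sentence.";

    // Tagging + disambiguation run once, timed on their own:
    long analyzeStart = System.currentTimeMillis();
    List<AnalyzedSentence> sentences = lt.analyzeText(text);
    System.out.println("analyze: " + (System.currentTimeMillis() - analyzeStart) + "ms");

    // Per-rule timing now covers only the matching step:
    for (Rule rule : lt.getAllActiveRules()) {
      if (rule instanceof TextLevelRule) {
        continue;  // text-level rules match whole texts; skipped here for simplicity
      }
      long start = System.currentTimeMillis();
      for (AnalyzedSentence sentence : sentences) {
        rule.match(sentence);
      }
      System.out.println(rule.getId() + ": " + (System.currentTimeMillis() - start) + "ms");
    }
  }
}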

Thanks
Andriy

Is anybody using rule profiling?
Can I separate rule timing from tagging/disambiguation?
And can I add timing for tagging/disambiguation as well?
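
(To sketch what I mean, assuming the current JLanguageTool API, where getRawAnalyzedSentence tags a sentence without disambiguating it, so the two steps can be measured separately; this is just an illustration, not the actual profiling code:)

import org.languagetool.AnalyzedSentence;
import org.languagetool.JLanguageTool;
import org.languagetool.Language;
import org.languagetool.language.AmericanEnglish;

public class StepTiming {
  public static void main(String[] args) throws Exception {
    Language lang = new AmericanEnglish();
    JLanguageTool lt = new JLanguageTool(lang);
    String sentence = "This is a simple example sentence.";

    // Tokenizing and tagging only, no disambiguation yet:
    long t0 = System.nanoTime();
    AnalyzedSentence tagged = lt.getRawAnalyzedSentence(sentence);
    long tagNanos = System.nanoTime() - t0;

    // Disambiguation timed as its own step, applied to the tagged sentence:
    long t1 = System.nanoTime();
    AnalyzedSentence result = lang.getDisambiguator().disambiguate(tagged);
    long disambigNanos = System.nanoTime() - t1;

    System.out.println("tagging: " + tagNanos / 1_000_000 + "ms, "
        + "disambiguation: " + disambigNanos / 1_000_000 + "ms, "
        + result.getTokensWithoutWhitespace().length + " tokens");
  }
}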

I’ve written some benchmarking code that measures only the rule matching step (i.e. analysis is done once and is not included in the measurement), but I haven’t gotten around to documenting it and preparing a pull request yet.

If you want to have a look already:
Here’s the code: the fabrichter/languagetool fork on GitHub (see its commit history).
Here’s how you can run it (I’ll try to add better documentation soon):

java -DbenchmarkLanguages="en-US" -DbenchmarkData=data/ -DbenchmarkResults=results/ -DngramIndex=ngram-data/ -cp benchmarks.jar:libs/lucene-backward-codecs.jar org.languagetool.RuleBenchmark

I’ve pushed small improvements to the rule profiling in the command-line tool:

  • time the disambiguation step separately
  • use a fixed column width for the rule timing table (formatting sketch below)
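
The fixed-width table is plain printf-style formatting, roughly like this (the column widths and the placeholder row are illustrative, not the exact output of the profiling code):

// Header plus one data row with fixed column widths; in the real code the
// rule id and time come from the profiling loop, these values are placeholders.
System.out.printf("%-40s %12s%n", "rule", "time (ms)");
System.out.printf("%-40s %12d%n", "EXAMPLE_RULE_ID", 0L);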

P.S. I just realized I cut the iteration count down to 3 (profiling was taking a really long time for me). We could increase it again, but I’d say 5 should be enough; 10 is probably overkill and makes profiling take too long even for small texts.