Idea: add a timestamp to each rule upon creation

Jan_Schreiber · February 10, 2019, 11:02pm

This GitHub issue (specific to German) gave me an idea.

Rules have a tendency to become outdated over the years, so we might end up suggesting exactly the wrong thing in a few cases someday. After every spelling reform, we should be able to go through the rules and modify the outdated ones.

The rule element in XML could look like this in the future: <rule id="x" name="y" timestamp="z">. The timestamp attribute should be mandatory. The existing rules could get the timestamp 12-31-2018, for example.

This might help with identifying outdated rules.
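As a rough illustration of how such an attribute could be used after a reform, here is a minimal sketch. It assumes the `timestamp` attribute proposed above and the `MM-DD-YYYY` format from the `12-31-2018` example. The sample rules, their ids, and the cutoff date are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Hypothetical sample; real grammar files are much larger and use a DOCTYPE.
SAMPLE = """<rules>
  <rule id="OLD_SPELLING" name="old" timestamp="12-31-2018"/>
  <rule id="NEW_SPELLING" name="new" timestamp="03-15-2021"/>
  <rule id="NO_STAMP" name="unstamped"/>
</rules>"""


def rules_older_than(xml_text, cutoff):
    """Return (ids of rules stamped before `cutoff`, ids of rules with no timestamp)."""
    outdated, missing = [], []
    for rule in ET.fromstring(xml_text).iter("rule"):
        stamp = rule.get("timestamp")
        if stamp is None:
            missing.append(rule.get("id"))
        elif datetime.strptime(stamp, "%m-%d-%Y").date() < cutoff:
            outdated.append(rule.get("id"))
    return outdated, missing


# Assumed reform date, chosen only for the example.
outdated, missing = rules_older_than(SAMPLE, date(2020, 1, 1))
print("review after reform:", outdated)
print("no timestamp:", missing)
```

Reporting the rules without a timestamp separately would also let the same check flag violations if the attribute became mandatory.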

Dallun511 · February 10, 2019, 11:10pm

@Jan_Schreiber I think giving the rules a timestamp is a very sensible idea, because it also makes it easy to determine which reform they were written for.

I would also recommend that each rule gets a timestamp for its last edit, if that is possible.

Regards, Dallun511

Mike_Unwalla · February 11, 2019, 8:44am

@Jan_Schreiber, I like your idea.

I wonder whether it is possible to automate the time stamp so that the rule writer/editor does not need to do anything.

In addition, I would like much better project management for the rules that we create. Who wrote the rule? What are the changes to the rule over time?

Ruud_Baars · February 11, 2019, 8:58am

Some version management would be okay, but may be hard to do, since there is a strong interaction between code, dictionaries and rules.

But it could be a lot of overhead too.
One could simply start by agreeing on doing part of this. But if all code is local and editable, there is no way to enforce it.

Mike_Unwalla · February 11, 2019, 6:50pm

Possibly not. We should not force users to change their rules. We could have a check in testrules as an alternative to a mandatory timestamp.

tiagosantos · February 11, 2019, 9:21pm

This looks like a good idea. As part of the daily regression test chain, the log of rule IDs can be diffed against the previous day's log. That shows whether any rule was added or removed. The result can be appended to a separate file that keeps a cumulative log, with dates, of the rules added and removed per language per day.
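The daily comparison described here could be sketched roughly as follows. The function names, the tab-separated log format, and the sample ids are assumptions made for illustration, not part of the existing regression scripts.

```python
from datetime import date


def diff_rule_ids(previous, current):
    """Return (added, removed) rule ids between two daily id lists."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)


def log_entries(day, language, added, removed):
    """Format one cumulative-log line per added or removed rule."""
    entries = [f"{day.isoformat()}\t{language}\tadded\t{rule_id}" for rule_id in added]
    entries += [f"{day.isoformat()}\t{language}\tremoved\t{rule_id}" for rule_id in removed]
    return entries


# Hypothetical id lists taken from two consecutive daily runs.
yesterday = ["RULE_A", "RULE_B", "RULE_C"]
today = ["RULE_B", "RULE_C", "RULE_D"]

added, removed = diff_rule_ids(yesterday, today)
for entry in log_entries(date(2019, 2, 11), "pt", added, removed):
    print(entry)
```

Appending these lines to one file per language would give the cumulative log. Note that a renamed id shows up as one removal plus one addition.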

arysin · February 13, 2019, 6:03pm

Can’t (shouldn’t?) this information be obtained from the git history?
Sometimes a rule was created a long time ago but has been adjusted often. Getting editors to remember to update a timestamp attribute is hard, while git can tell exactly which lines changed, when, and by whom.
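One way to ask git about a single rule is `git log -L<start>,<end>:<file>`, which follows a line range through history. The sketch below finds the current line range of a rule and builds that command. It assumes each rule starts on a line containing `<rule ` with its own `id` attribute and ends on the next line containing `</rule>`. The helper names are made up for this example.

```python
def rule_line_range(lines, rule_id):
    """Return the 1-based (start, end) line numbers of the rule, or None if absent."""
    start = None
    for number, line in enumerate(lines, start=1):
        if start is None and "<rule " in line and f'id="{rule_id}"' in line:
            start = number
        if start is not None and "</rule>" in line:
            return start, number
    return None


def git_log_command(path, start, end):
    """Build the git command that shows the history of lines start..end of path."""
    return ["git", "log", f"-L{start},{end}:{path}"]


# Hypothetical file content; in practice, read the lines from grammar.xml.
sample = [
    "<rules>",
    '  <rule id="A" name="a">',
    "    <pattern/>",
    "  </rule>",
    '  <rule id="B" name="b">',
    "  </rule>",
    "</rules>",
]
start, end = rule_line_range(sample, "B")
print(" ".join(git_log_command("grammar.xml", start, end)))
```

The printed command can be run by hand or passed to `subprocess.run` from inside a checkout.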

dnaber · February 13, 2019, 7:04pm

We now have some external person to work on something related: a script that detects changes in rule matching. It’s basically like the nightly diff, but it can clearly detect whether a rule match is new, has been removed, or has been modified (whereas the diff script often gets confused by bigger changes in the output).

tiagosantos · February 14, 2019, 12:36pm

It can. Maybe it is just me, but I am unable (or unwilling) to find out when someone created or changed a specific rule in a file that has hundreds or thousands of commits, since I would have to review all the "[xx] rule added/improved" commits (and all the other commits with unrelated or inappropriate descriptions). This would be easy if there were git-enforceable rules for commit naming, but I am unaware of any such thing.

Great!

dnaber · February 14, 2019, 1:18pm

Just an idea (i.e. I don’t have plans to work on this): git is very fast, so one should be able to write a script that walks backwards in time through all commits of a grammar.xml, extracts one specific rule (by id), formats it as normalized XML (using the same indentation every time), and then prints all the versions of that rule. The script can stop once it no longer finds the rule. It would work on rule ids, so it won’t work if a rule id gets changed.
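A minimal sketch of that idea, under some loudly stated assumptions. The rule carries its own `id` and is not self-closing. "Normalized" here only means stripping each line's indentation, not real XML pretty-printing. The path passed to `git show` is relative to the repository root and the file was never renamed. All function names are made up for this example. Unlike the description above, it reports only the versions that differ, each with the oldest commit that contains it.

```python
import re
import subprocess


def extract_rule(xml_text, rule_id):
    """Return the rule with this id, one stripped line per line, or None."""
    pattern = r'<rule\b[^>]*\bid="' + re.escape(rule_id) + r'"[^>]*>.*?</rule>'
    match = re.search(pattern, xml_text, re.DOTALL)
    if match is None:
        return None
    lines = (line.strip() for line in match.group(0).splitlines())
    return "\n".join(line for line in lines if line)


def rule_history(versions, rule_id):
    """Yield (commit, rule_text) per distinct version, newest first.

    `versions` yields (commit, file_text) pairs, newest first. Each version is
    reported with the oldest commit that contains it. The walk stops at the
    first commit where the rule is missing.
    """
    pending = None
    for commit, text in versions:
        rule = extract_rule(text, rule_id)
        if rule is None:
            break
        if pending is not None and rule != pending[1]:
            yield pending
        pending = (commit, rule)
    if pending is not None:
        yield pending


def file_versions(path):
    """Yield (commit, file_text) for each commit touching path (needs a git checkout)."""
    log = subprocess.run(["git", "log", "--format=%H", "--", path],
                         capture_output=True, text=True, encoding="utf-8", check=True)
    for sha in log.stdout.split():
        shown = subprocess.run(["git", "show", f"{sha}:{path}"],
                               capture_output=True, text=True, encoding="utf-8", check=True)
        yield sha, shown.stdout
```

Inside a checkout, something like `for commit, rule in rule_history(file_versions(path), "SOME_ID"): print(commit, rule, sep="\n")` would print the history, where `path` and `SOME_ID` are placeholders. Separating `rule_history` from `file_versions` keeps the comparison logic testable without a repository.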

tiagosantos · February 14, 2019, 4:12pm

Regarding the indentation, there is already a standardized rule and an enforcement script for Portuguese and Galician. I would like to see such standardisation project-wide. It would be much easier to port rules to other languages without indentation complaints.
The main issue I see with that idea would be keeping the comments intact, although if the script only checks rule IDs, it would not need to touch them. Looking forward to seeing it.

dnaber · February 14, 2019, 4:18pm

I agree this makes sense. Where is that script and how is it enforced? A proper solution would be to enforce it on commit.
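Enforcing this on commit could, for instance, be done with a git pre-commit hook that rejects badly indented files instead of rewriting them. The sketch below is only one possible shape. The indentation table is a subset of the levels used by the Portuguese indent.sh script. The function names are hypothetical. For simplicity it reads the working-tree copy of each staged file rather than the staged content.

```python
import re
import subprocess

# Expected number of leading spaces per tag (subset of the indent.sh levels).
EXPECTED_INDENT = {
    "category": 1, "rulegroup": 2, "rule": 4,
    "pattern": 6, "antipattern": 6, "message": 6, "example": 6,
    "marker": 8, "token": 10, "exception": 12,
}
LEADING_TAG = re.compile(r"^([ \t]*)</?([A-Za-z]+)")


def indentation_violations(text):
    """Return (line_number, tag, expected, found) for each mis-indented line."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = LEADING_TAG.match(line)
        if not match:
            continue
        indent, tag = match.groups()
        expected = EXPECTED_INDENT.get(tag)
        if expected is not None and indent != " " * expected:
            violations.append((number, tag, expected, len(indent)))
    return violations


def staged_xml_files():
    """List staged .xml files that were added, copied or modified."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True)
    return [name for name in result.stdout.splitlines() if name.endswith(".xml")]


def main():
    """Return 1 (which blocks the commit) if any staged file is mis-indented."""
    failed = False
    for name in staged_xml_files():
        with open(name, encoding="utf-8") as handle:
            for number, tag, expected, found in indentation_violations(handle.read()):
                print(f"{name}:{number}: <{tag}> has indent {found}, expected {expected}")
                failed = True
    return 1 if failed else 0
```

To use it as a hook, `.git/hooks/pre-commit` would call `sys.exit(main())`. Since local hooks are not shared through the repository, a server-side or CI check would still be needed to enforce the format for everyone.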

arysin · February 15, 2019, 2:29am

There are also the git blame and git bisect commands; those could be useful.
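For example, `git blame --line-porcelain -L <start>,<end> <file>` prints author and time headers for every line in a range, so the last edit of a rule can be read from it without a stored timestamp. The following is a minimal sketch of parsing that output. The sample text is a hand-written, shortened imitation of the porcelain format, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def last_change(porcelain):
    """Return (author, UTC datetime) of the newest line in blame output, or None."""
    newest = None
    author = None
    for line in porcelain.splitlines():
        if line.startswith("\t"):
            continue  # file content lines are prefixed with a tab
        if line.startswith("author "):
            author = line[len("author "):]
        elif line.startswith("author-time "):
            stamp = int(line.split()[1])
            if newest is None or stamp > newest[1]:
                newest = (author, stamp)
    if newest is None:
        return None
    return newest[0], datetime.fromtimestamp(newest[1], tz=timezone.utc)


# Shortened, made-up sample of `git blame --line-porcelain` output for two lines.
SAMPLE_BLAME = (
    "1111111111111111111111111111111111111111 1 1 1\n"
    "author Alice\n"
    "author-mail <alice@example.org>\n"
    "author-time 1546300800\n"
    "author-tz +0000\n"
    "summary add rule\n"
    "filename grammar.xml\n"
    '\t<rule id="A" name="a">\n'
    "2222222222222222222222222222222222222222 2 2 1\n"
    "author Bob\n"
    "author-mail <bob@example.org>\n"
    "author-time 1550000000\n"
    "author-tz +0000\n"
    "summary improve rule\n"
    "filename grammar.xml\n"
    "\t<pattern/>\n"
)

who, when = last_change(SAMPLE_BLAME)
print(who, when.date().isoformat())
```

Blame only shows the latest commit per surviving line, so it answers "when was this rule last touched" but not "when was it created" if every line has since been rewritten.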

tiagosantos · February 17, 2019, 6:50pm

github.com

languagetool-org/languagetool/blob/master/languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/indent.sh

echo 'Please wait...'

# Put each <exception> that follows other content on its own line.
sed -ri 's/([^ ])(<exception)/\1\n            \2/' "$@"

# Re-indent each opening or closing tag according to its nesting level.
sed -ri 's/^[ \t]*(<\/?category)/ \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(rulegroup|!DOCTYPE|phrases|unification))/  \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(rule[ >]|!--|!ENTITY|phrase[ >]|equivalence))/    \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(marker|suggestion|and|or|wd))/        \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(antipattern|pattern|regexp|filter|message|url|short|example|disambig|includephrases))/      \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(token|unify|phraseref))/          \1/' "$@"
sed -ri 's/^[ \t]*(<\/?(exception|feature))/            \1/' "$@"

# Collapse empty elements such as <token ...></token> into <token .../>.
sed -ri 's/(<(token|exception|suggestion|match|disambig|feature|phraseref|wd)[^>]*?)><\/\2>/\1\/>/' "$@"
# Strip trailing whitespace while keeping the existing line ending (LF or CRLF).
sed -ri 's/[ \t]+(\r?)$/\1/' "$@"
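# Remove stray spaces before the end of a tag.
sed -ri 's/" >/">/' "$@"
sed -ri 's/ \/>/\/>/' "$@"
# Replace three dots at the start or end of an example with a real ellipsis.
sed -ri 's/\.\.\.<\/example>/…<\/example>/' "$@"
sed -ri 's/<example>\.\.\./<example>…/' "$@"

echo "$* indented"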

Just run it on the grammar/disambiguation file you want. I preferred two spaces over tabs for each indentation level. If you like the solution project-wide, it can be added to the script you use when preparing a new LT version. It corrects the entire file.

They are useful, but they are not exactly that, and using them is not nearly as convenient or as easy to audit.