Example 'generator'

Ruud_Baars · March 11, 2018, 5:28pm

One of the most tedious tasks when working on rules is finding real-life examples and adding them.
Using LT as a local server, its API and an enormous text file of sentences, one PHP program writes out the API output; a second one ‘flattens’ this into records; a third transforms these into prototype examples and adds them to grammar.xml as a comment in the appropriate rule.
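
For readers who want a feel for the first steps, here is a rough sketch in Python (not the PHP programs described above, which are not shown in this thread). It assumes a LanguageTool server listening on `http://localhost:8081/v2/check` and relies on the `offset`, `length` and `rule.id` fields of the JSON response. The function names, the record layout and the bare `<example>` output are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

# Assumption: a local LanguageTool server on its usual port.
LT_URL = "http://localhost:8081/v2/check"


def check_sentence(sentence, language="nl"):
    """Send one sentence to the local server and return the parsed JSON."""
    data = urllib.parse.urlencode(
        {"text": sentence, "language": language}
    ).encode("utf-8")
    with urllib.request.urlopen(LT_URL, data=data) as response:
        return json.load(response)


def flatten(sentence, api_result):
    """Turn one API response into flat records, one per match.

    Offsets are assumed to line up with Python string indices, which may
    not hold for characters outside the Basic Multilingual Plane.
    """
    records = []
    for match in api_result.get("matches", []):
        start = match["offset"]
        end = start + match["length"]
        records.append({
            "rule_id": match["rule"]["id"],
            "sentence": sentence,
            "start": start,
            "end": end,
            "marked": sentence[start:end],
        })
    return records


def to_prototype_example(record):
    """Wrap the marked area in <marker> tags to get a prototype example."""
    s = record["sentence"]
    return "<example>{}<marker>{}</marker>{}</example>".format(
        escape(s[:record["start"]]),
        escape(record["marked"]),
        escape(s[record["end"]:]),
    )
```

Keeping `flatten` and `to_prototype_example` free of network calls means they can be tried on saved API output without a running server.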

The number of examples is limited to 10% of the errors found: 10 when possible, and never more than 100. The shortest examples get priority, and examples with different marked areas are preferred too.
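
One possible reading of that limit, sketched in Python. The rounding (upwards) and the exact way distinct marked areas are preferred are assumptions, and the function names are hypothetical; the actual PHP code may differ.

```python
def example_quota(found):
    """10% of the matches, at least 10 when available, never more than 100."""
    ten_percent = (found + 9) // 10  # assumption: round up, integer arithmetic
    return min(found, 100, max(10, ten_percent))


def select_examples(matches, quota=None):
    """Pick examples from a list of (sentence, marked_text) tuples.

    Shortest sentences come first; a sentence whose marked text has not
    been seen yet is preferred over a repeat of the same marked text.
    """
    if quota is None:
        quota = example_quota(len(matches))
    ordered = sorted(matches, key=lambda m: len(m[0]))
    chosen, repeats, seen = [], [], set()
    for sentence, marked in ordered:
        if marked in seen:
            repeats.append((sentence, marked))
        else:
            seen.add(marked)
            chosen.append((sentence, marked))
    # Repeats only fill whatever room the distinct marked areas leave.
    return (chosen + repeats)[:quota]
```

With this reading, 50 matches give 10 examples, 250 give 25, and anything from 1000 matches upwards is capped at 100.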

It is far from perfect. There are some assumptions in it, and the way I change the xml is certainly not professional programming. But if anyone is interested, feel free to contact me.

tiagosantos · March 11, 2018, 9:29pm

This would need to be opt-in, but I would almost bet that sensitive information (even if not necessarily private) would be pushed to the permanent git record and eventually distributed via daily builds and/or standard releases. How does one mitigate that?

SkyCharger001 · March 11, 2018, 10:50pm

Most sensitive information requires more than one sentence for context, so I think you could simply do a ‘distributed spread’ when selecting the sentences to be used… ‘destroying’ the sensitive information in the process.

Ruud_Baars · March 12, 2018, 6:43am

If there is sensitive info, it is also in the corpus.
The examples have to be edited anyway, because most will be too long or will contain multiple errors.
You could easily remove the commented examples before distribution.
If you think it is not useful, just don’t use it.
And a great plus: it shows false positives as well, which is a good way to improve the rule with exceptions.
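
Removing the commented examples before distribution could look roughly like the Python sketch below. It assumes each block is an XML comment that opens with `<!-- possible examples` (the marker named further down in this thread); the function name is hypothetical, and the real comment layout in grammar.xml may need a different pattern.

```python
import re

# Assumption: generated blocks are XML comments starting with this marker.
# XML comments cannot contain "--", so the first "-->" closes the comment.
POSSIBLE_EXAMPLES = re.compile(
    r"[ \t]*<!--\s*possible examples.*?-->[ \t]*\n?",
    re.DOTALL,
)


def strip_possible_examples(xml_text):
    """Return the XML text without the generated example comments."""
    return POSSIBLE_EXAMPLES.sub("", xml_text)
```

Other comments are left alone, since only comments that begin with the marker match the pattern.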

Ruud_Baars · March 12, 2018, 6:54am

Do you feel like helping to get more real life examples in the grammar.xml? In that case I will submit a version soon.

tiagosantos · March 12, 2018, 6:52pm

@Ruud_Baars

Not at all. This is great. It is great to make sure that there are no significant regressions.
I had misunderstood and thought this would be fed with the online queries, which could then be exploited to harm the project.

Ruud_Baars · March 12, 2018, 7:00pm

You can check it out in the current Dutch grammar.xml. There are a lot of comments in there, marked as <!-- possible examples.

I am working through those from the top downwards, leaving at least 5 examples (when available).