There’s a thing confusing me.
The stored pair <uncorrected sentence, corrected sentence> may be impossible to reproduce using the current version of LT (e.g. the pair was recorded with an old version of LT, the rules have since changed, and the current version suggests other replacements). How likely do you think that is?
Since I cannot explore that private stored data, I could provide a command-line tool to check whether this problem exists. What do you think?
I’m not sure I understand the problem: we have the original sentence and a corrected sentence (plus some metadata like the rule id), so why would it be necessary to reproduce the correction? Is it that you don’t want to use “old” suggestions because LT might have become better by now?
Given a pair of a sentence and its correction, I’d like to get all the suggestions LT gives for it. This is the simplest way to get data in the following format:
| typo | features | suggestion | was selected by user |
It’s handy to have the data in this form when training the model.
To get this data I want to send the bad sentence to LT, receive all the suggested replacements, and mark which ones were selected by the user and which weren’t. But I think some suggestion mechanisms have changed, and that can affect this workflow. It would be interesting to explore the scale of that problem.
I see. It’s hard to tell how much the suggestions change. Probably not that much. Here’s how we log data now (leaving out the sentence here):
+----------------+-----------------+----------------+
| suggestion_pos | covered         | replacement    |
+----------------+-----------------+----------------+
| 0              | rhe             | the            |
| 0              | womens          | women          |
| 0              | frustated       | frustrated     |
| 2              | litteraly       | literally      |
+----------------+-----------------+----------------+
suggestion_pos is the position of the selected suggestion in the list. It has a special value of 99 for those cases where the user doesn’t use one of the suggestions, but types in their own.
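A minimal sketch of how the logged suggestion_pos could be turned into the "was selected by user" label, assuming the full suggestion list for the covered word is available in LT's order. The class and method names are hypothetical, the features column is left out, and the suggestion list in the usage note is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
final class SuggestionLabeler {
    // Value the log uses when the user typed their own replacement.
    static final int OWN_REPLACEMENT_POS = 99;

    // One training row: the covered typo, one suggestion, and its label.
    static final class LabeledRow {
        final String typo;
        final String suggestion;
        final boolean wasSelected;

        LabeledRow(String typo, String suggestion, boolean wasSelected) {
            this.typo = typo;
            this.suggestion = suggestion;
            this.wasSelected = wasSelected;
        }
    }

    // Emits one row per suggestion; only the one at suggestionPos is marked selected.
    static List<LabeledRow> label(String covered, List<String> suggestions, int suggestionPos) {
        List<LabeledRow> rows = new ArrayList<>();
        for (int i = 0; i < suggestions.size(); i++) {
            // With the special value 99, no listed suggestion counts as selected.
            boolean selected = suggestionPos != OWN_REPLACEMENT_POS && i == suggestionPos;
            rows.add(new LabeledRow(covered, suggestions.get(i), selected));
        }
        return rows;
    }
}
```

For example, `SuggestionLabeler.label("litteraly", Arrays.asList("literal", "literary", "literally"), 2)` marks only the third row as selected. If the current LT version returns the suggestions in a different order than the logged one, the label lands on the wrong row, which is exactly the reproducibility concern raised above.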
So, finally, can I count on the following format of the data?
original sentence | corrected sentence | suggestion position | covered | replacement
Or is something in this list missing from the logged data, or vice versa?
Yes, you can assume that. There’s more metadata in there, but I don’t think it can be used now.
I’m generating prototype test data and prototyping the model comparison tool, so any kind of info about the data format is welcome.
I’m not asking for the data, just for the format of that data.
I get this error (I’ve added output about the sentence):
sentence : En Venezuela es común que los hijos se independicen hasta que se casan.
correction: En Venezuela es común que los hijos se independicen hasta que se casein.
java.lang.StringIndexOutOfBoundsException: String index out of range: 72
    at java.lang.String.substring(String.java:1963)
    at io.github.oserikov.languagetool.Utils.startOfErrorString(Utils.java:44)
    at io.github.oserikov.languagetool.Main.processRow(Main.java:187)
    at io.github.oserikov.languagetool.Main.processDBData(Main.java:104)
    at io.github.oserikov.languagetool.Main.main(Main.java:69)
Aww, my bad. Will fix in a couple of minutes.
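As printed, the original sentence is 71 characters long and the correction is 72, so the index 72 in the trace would be consistent with an offset taken from the longer string being applied to the shorter one. That is only a guess, since the body of Utils.startOfErrorString is not shown here. A minimal sketch of bounds-safe helpers for locating where two sentences start to differ (all names below are hypothetical, not the actual fix):

```java
// Hypothetical sketch, not the actual io.github.oserikov.languagetool.Utils code.
final class DiffBounds {
    // Length of the longest common prefix; never reads past the shorter string.
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Like String.substring(beginIndex), but clamps the index into
    // [0, s.length()] instead of throwing StringIndexOutOfBoundsException.
    static String safeSubstring(String s, int beginIndex) {
        int start = Math.max(0, Math.min(beginIndex, s.length()));
        return s.substring(start);
    }
}
```

Clamping hides the symptom rather than the cause, so in a feature extractor it may be preferable to log and skip rows whose offsets do not fit the sentence.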
Could you please run the following query
SELECT COUNT(*) FROM corrections WHERE language = 'ru-RU' AND rule_id = 'MORFOLOGIK_RULE_RU_RU'
on the logs database?
The language code is just
ru-RU), but then I get:
Ok, thank you!
Could you please run the updated feature extractor? I’ve added extraction of the suggestion position (I forgot to do that earlier).
Done, sent the result via private message.
The features seem to be shuffled, but to order the corrections by the model’s score it’s handy to be able to group all the suggestions for the same sentence together, so I’ve added an id column: a hash value unique to each group of corrections.
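One possible way to derive such a group id, sketched under the assumption that a group is identified by the original sentence plus the covered text. How the extractor actually computes its hash is not stated in this thread, and the class name below is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: a stable id shared by all suggestion rows of one correction.
final class GroupId {
    static String of(String sentence, String covered) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(sentence.getBytes(StandardCharsets.UTF_8));
            // Separator byte so ("ab", "c") and ("a", "bc") hash differently.
            md.update((byte) 0);
            md.update(covered.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected in practice: SHA-256 ships with standard Java runtimes.
            throw new IllegalStateException(e);
        }
    }
}
```

Rows with the same id can then be sorted by model score within their group, even after the file has been shuffled. A hash is not strictly unique, but with SHA-256 collisions are negligible for this purpose.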
I’m looking for a way to programmatically bind each MORFOLOGIK_RULE_%_% rule id to an org.languagetool.language.Language subclass. Has anyone done that before?
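As a first step, the language code can be read off the rule id itself. This is a minimal sketch, assuming every such id has the shape MORFOLOGIK_RULE_<LANG>_<COUNTRY>, as MORFOLOGIK_RULE_RU_RU does. The class name is hypothetical, and resolving the resulting code to a Language subclass is left out because it depends on the LT version in use.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: derives a code such as "ru-RU" from a Morfologik rule id.
final class RuleIdLanguage {
    private static final Pattern MORFOLOGIK =
            Pattern.compile("MORFOLOGIK_RULE_([A-Z]{2,3})_([A-Z]{2})");

    // Returns empty for ids that do not follow the assumed shape.
    static Optional<String> languageCode(String ruleId) {
        Matcher m = MORFOLOGIK.matcher(ruleId);
        if (!m.matches()) {
            return Optional.empty();
        }
        String lang = m.group(1).toLowerCase(Locale.ROOT);
        return Optional.of(lang + "-" + m.group(2));
    }
}
```

Ids that do not match come back as `Optional.empty()` and would need a hand-written mapping, which is why seeing the distinct language and rule_id pairs from the logs is useful.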
@dnaber, could you please run the following query?
SELECT DISTINCT language, rule_id FROM corrections
Result sent via private message.