Information Extractor from Wikipedia corpus

gsider · June 6, 2021, 7:12pm

@antonopoulosn and I created a SpecificCaseRule for Greek. Rather than hard-coding the txt file with the corresponding expressions, we decided to extract the most frequent expressions in a specific case from the Wikipedia corpus.
Would the Java code that we used be useful for LanguageTool? And would you like us to push it?

Thank you in advance!

dnaber · June 6, 2021, 7:35pm

That sounds like a good fit for languagetool-dev, the place where we put code that doesn’t run at runtime but is used by developers: languagetool/languagetool-dev at master · languagetool-org/languagetool · GitHub

gsider · June 6, 2021, 7:47pm

That’s great! Do you think it would be useful if we created an abstract class like WikiInfoExtractor that every developer can extend in order to extract the info they need? Or is there no point in that, and should we move on and contribute only a single class for the particular purpose mentioned above?

dnaber · June 6, 2021, 8:11pm

We already have WikipediaSentenceExtractor, for extracting the text. Your case is about extracting titles? A single class might be good enough, and languagetool-wikipedia would probably be the right place, contrary to what I said above.

gsider · June 6, 2021, 8:28pm

No, it’s not about titles. The particular class is about getting the most frequent expressions that need to be written with an initial capital letter (e.g. New York in English).
The idea behind the abstract class I mentioned above is to automatically generate the txt files that the rules use, by finding the most frequent cases to which the rule applies in the Wikipedia corpus (instead of hard-coding them). This would, however, make the creation of a new rule a bit more demanding and time-consuming.
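
To make the idea concrete, here is a minimal sketch of what such a base class could look like. It is not the code we actually used. The method names (extractCandidates, mostFrequent, writeRuleFile), the subclass CapitalizedExpressionExtractor, the regular expression and the minCount threshold are all assumptions made for illustration, and the sentences are assumed to come from an existing extractor as plain strings.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical base class: counts candidate expressions and writes the frequent ones to a txt file. */
abstract class WikiInfoExtractor {

    /** Returns the candidate expressions found in one sentence; each subclass defines its own notion. */
    protected abstract List<String> extractCandidates(String sentence);

    /** Returns candidates seen at least minCount times, most frequent first, ties in alphabetical order. */
    public List<String> mostFrequent(Iterable<String> sentences, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String candidate : extractCandidates(sentence)) {
                counts.merge(candidate, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                result.add(entry.getKey());
            }
        }
        result.sort(Comparator.comparing((String s) -> counts.get(s)).reversed()
            .thenComparing(Comparator.<String>naturalOrder()));
        return result;
    }

    /** Writes one expression per line, UTF-8 encoded, to the given file. */
    public void writeRuleFile(Iterable<String> sentences, int minCount, Path output) throws IOException {
        Files.write(output, mostFrequent(sentences, minCount), StandardCharsets.UTF_8);
    }
}

/** Example subclass (assumption): runs of two or more words that each start with an uppercase letter. */
class CapitalizedExpressionExtractor extends WikiInfoExtractor {

    // \p{Lu} and \p{L} are Unicode categories, so Greek text is matched as well as English.
    // Known limitation: a capitalized sentence-initial word directly followed by a name is included in the match.
    private static final Pattern EXPRESSION =
        Pattern.compile("\\p{Lu}\\p{L}+(?: \\p{Lu}\\p{L}+)+");

    @Override
    protected List<String> extractCandidates(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EXPRESSION.matcher(sentence);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

A rule author would then only write the subclass and call something like new CapitalizedExpressionExtractor().writeRuleFile(sentences, 50, path), where the threshold of 50 is an arbitrary example value. The extra work per rule is the subclass itself, which is the cost mentioned above.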