May 30th
- Add more comments for my work.
- Search resources for Chinese ngram.
May 30th
Thanks for the zip! The download took a while, but I’ll look into it.
May 31st
I plan to train two models, one character-based and one word-based, and then figure out which works better.
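To make the difference between the two model types concrete, here is a minimal Python sketch of how the n-grams would be extracted in each case. The function names and the example sentence are hypothetical, and the word-based variant assumes the text has already been segmented into space-separated words by some external tool.

```python
from collections import Counter

def char_ngrams(sentence, n):
    """Return all n-grams over the characters of an unsegmented sentence."""
    chars = [c for c in sentence if not c.isspace()]
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def word_ngrams(segmented_sentence, n):
    """Return all n-grams over words; assumes words are separated by spaces."""
    words = segmented_sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical example; the segmentation below is supplied by hand.
char_counts = Counter(char_ngrams("我喜欢学习", 2))
word_counts = Counter(word_ngrams("我 喜欢 学习", 2))
```

The character-based variant needs no segmenter, while the word-based variant depends on segmentation quality, which is one of the trade-offs to compare.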
June 1st
June 5th
I trained on part of the zhwiki corpus (about 1 GB). After that, I got 3 models.
I think it is better to combine them to detect errors.
Plan:
June 6th
June 6th
I have trained the language models and written a Python version of the detector to evaluate my idea and the models.
My idea is to combine bigram, trigram, and 4-gram models to score every character in a sentence and find potential errors. It works well. I now plan to start implementing it in Java.
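One possible way to combine the three models is sketched below in Python. This is only an illustration under stated assumptions, not the actual detector: each model is assumed to be a plain dict from n-gram string to log10 probability, unseen n-grams get a hypothetical `floor` value, a character's score is the mean over every n-gram that covers it, and the threshold is arbitrary.

```python
def score_characters(sentence, models, floor=-10.0):
    """Score each character as the mean log10 probability of every n-gram covering it.

    models maps n -> {n-gram string: log10 probability} (assumed layout);
    n-grams missing from a table are given the `floor` value.
    """
    totals = [0.0] * len(sentence)
    counts = [0] * len(sentence)
    for n, table in models.items():
        for start in range(len(sentence) - n + 1):
            logp = table.get(sentence[start:start + n], floor)
            # Credit this n-gram's score to every character it spans.
            for pos in range(start, start + n):
                totals[pos] += logp
                counts[pos] += 1
    return [t / c if c else floor for t, c in zip(totals, counts)]

def suspicious_positions(sentence, models, threshold=-5.0):
    """Return indices of characters whose combined score falls below the threshold."""
    scores = score_characters(sentence, models, floor=-10.0)
    return [i for i, s in enumerate(scores) if s < threshold]

# Toy tables with made-up values, only to show the call shape.
toy_models = {2: {"ab": -1.0, "bc": -1.0}, 3: {"abc": -1.0}, 4: {}}
flagged = suspicious_positions("abx", toy_models, threshold=-6.0)
```

Averaging over all covering n-grams means a single bad character also lowers the scores of its neighbours, so the threshold would need tuning on real data.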
Do you have an evaluation, i.e. how “good” does it work?
I tested it in two ways:
- Data from <category name="成语错误" id="CN_CAT2"> (the idiom-error category) of grammar.xml. All the pieces passed.
- The misspelling-error subset of the SIGHAN 2015 data. Tested with 28 pieces; 24 passed.
I read through the LT code base today. Here is my question.
I have arpa files which include the ngram data. edu.berkeley.nlp.lm provides a way to read them, but it is a little slow. How should I implement this rule (detector and corrector)? Can I call a Python script from Java to get suggestions, or do I need to write a Java version of the rule instead?
Calling Python is not really an option (other than for testing). The rule can, like any rule, just extend Rule and implement the match method, I think. How exactly the rule gets its data is up to you. Feel free to use the approach we use for ngram data, because its Lucene index is both fast and memory-friendly. But if you find a better approach, let us know.
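For testing outside Java, the arpa files mentioned above are plain text and can be read with a few lines of Python. The sketch below is a minimal reader, assuming the usual ARPA layout (a `\data\` header, one `\N-grams:` section per order, lines of the form log10 probability, n-gram, optional backoff weight, and a closing `\end\`). The function name and sample data are made up for illustration.

```python
def parse_arpa(lines):
    """Parse ARPA-format lines into {order: {ngram tuple: (log10 prob, backoff or None)}}."""
    models = {}
    order = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and the count header carry no n-gram entries
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # "\2-grams:" -> 2
            models[order] = {}
            continue
        if order is None:
            continue  # free text before the first section
        fields = line.split()
        logp = float(fields[0])
        words = tuple(fields[1:1 + order])
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        models[order][words] = (logp, backoff)
    return models

# Made-up sample in ARPA layout, only to show the call shape.
sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.5\t我\t-0.3
-0.7\t们\t-0.2

\\2-grams:
-0.2\t我 们

\\end\\
""".splitlines()
arpa_models = parse_arpa(sample)
```

Holding every entry in a dict like this is only practical for small test models; for the full data, an indexed or binary representation is the more realistic route.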
June 11th
berkeleylm provides a way to convert an arpa-format language model to binary. That makes it faster.
Updated the skeleton code of the new rule.
I was just going to try this, but it didn’t work. When I unzip this over LT 4.1 and replace all files except META-INF/org/languagetool/language-module.properties, I get java.lang.NoClassDefFoundError: com/hankcs/hanlp/utility/SentencesUtil. I could probably fix this myself, but maybe there’s an easy solution. Have you tried the JAR yourself?
Fixed now.
So I can re-download the JAR from the same link and it’s a fixed version?
In my case, I copy the files from language-zh.jar to their respective locations in target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT. Then I can use java -jar languagetool-commandline -l zh-CN/zh-TW <text> to check text.
It doesn’t work for you now?
How can I use this? java -jar language-zh-4.2-SNAPSHOT-jar-with-dependencies.jar just prints Error: Could not find or load main class it.bitrack.main.Main for me.
June 15th
June 20th
June 21st
RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos, ""); in the match method.
June 25th