- Add more comments for my work.
- Search resources for Chinese ngram.
Thanks for the zip! The download took a while, but I’ll look into it.
I plan to train two models, one character-based and one word-based, and then figure out which works better.
I trained the models on part of the zhwiki corpus (about 1 GB). After that, I had 3 models.
I think it is better to combine them to detect errors.
I have trained the language models and written a Python version of the detector to evaluate my idea and the models.
My idea is to combine the bigram, trigram and 4-gram models to score every character in the sentence and find the potential errors. It works well. Now I plan to start implementing it in Java.
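The thread does not show the Python prototype, so the following is only a minimal sketch of what such per-character scoring could look like. It assumes add-one smoothing and a plain average over every 2-, 3- and 4-gram that covers a character; the function names (train_counts, score_characters, most_suspicious) and the toy corpus are hypothetical and not taken from the actual implementation.

```python
import math
from collections import Counter


def train_counts(corpus, max_n=4):
    """Count all character n-grams of order 1..max_n in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def ngram_logprob(counts, gram, vocab_size):
    """Add-one smoothed log P(last character | preceding characters)."""
    context = gram[:-1]
    return math.log((counts[gram] + 1) / (counts[context] + vocab_size))


def score_characters(sentence, counts, orders=(2, 3, 4)):
    """For each character, average the log-probabilities of all n-grams covering it."""
    vocab_size = sum(1 for gram in counts if len(gram) == 1)
    totals = [0.0] * len(sentence)
    hits = [0] * len(sentence)
    for n in orders:
        for i in range(len(sentence) - n + 1):
            lp = ngram_logprob(counts, sentence[i:i + n], vocab_size)
            for j in range(i, i + n):
                totals[j] += lp
                hits[j] += 1
    return [t / h if h else 0.0 for t, h in zip(totals, hits)]


def most_suspicious(sentence, counts):
    """Return the index of the lowest-scoring character (assumed error position)."""
    scores = score_characters(sentence, counts)
    return min(range(len(sentence)), key=scores.__getitem__)


# Toy data: the "corpus" only ever contains the string "abcdef".
counts = train_counts(["abcdef"] * 20)
print(most_suspicious("abxdef", counts))  # prints 2, the position of the unseen "x"
```

A real detector would load counts from the trained models instead of a toy corpus, and would probably flag every character below a tuned threshold rather than only the single lowest one.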
Do you have an evaluation, i.e. how “good” does it work?
I tested it in two ways:
- The examples in <category name="成语错误" id="CN_CAT2"> (idiom errors) of grammar.xml. All of them passed.
- A subset of the misspelling data from SIGHAN 2015. Of 28 test items, 24 passed (about 86%).
I read through the LT code base today. Here is my question.
I have arpa files that contain the ngram data.
edu.berkeley.nlp.lm provides a way to read them, but it is a little slow.
I want to know how I should implement this rule (detector and corrector). Can I call a Python script from Java to get suggestions, or do I need to write a Java version of the rule instead?
Calling Python is not really an option (other than for testing). The rule can, like any rule, just extend Rule and implement the match method I think. How exactly the rule gets its data is up to you, feel free to use the approach we use for ngram data, because its Lucene index is both fast and memory friendly. But if you find a better approach, let us know.
berkeleylm provides a way to convert an arpa-format language model to a binary file. That makes it faster.
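For context, the arpa format is plain text: a \data\ header, one \N-grams: section per order, and lines of the form "log10 probability, ngram, optional back-off weight". Parsing that text on every start is part of why reading it is slow compared to a binary file. The sketch below is a hypothetical minimal reader written for illustration; it is not berkeleylm's or LanguageTool's code, and the read_arpa name and the sample data are made up.

```python
def read_arpa(lines):
    """Parse ARPA-format lines into {order: {ngram_tuple: (log10_prob, backoff)}}."""
    model = {}
    order = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the count header.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section header such as "\2-grams:" switches the current order.
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue  # free text before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; highest-order entries have none.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[order][words] = (logprob, backoff)
    return model


SAMPLE = r"""
\data\
ngram 1=2
ngram 2=1

\1-grams:
-1.0 a -0.5
-1.2 b -0.4

\2-grams:
-0.3 a b

\end\
"""

model = read_arpa(SAMPLE.splitlines())
print(model[2][("a", "b")])
```

For a model trained on about 1 GB of text, a dictionary like this would be large, which is one reason a binary format or an on-disk index is attractive.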
Update skeleton code of the new rule.
I was just going to try this, but it didn’t work. When I unzip this over LT 4.1 and replace all files except META-INF/org/languagetool/language-module.properties, I get java.lang.NoClassDefFoundError: com/hankcs/hanlp/utility/SentencesUtil. I could probably fix this myself, but maybe there’s an easy solution. Have you tried the JAR yourself?
So I can re-download the JAR from the same link and it’s a fixed version?
In my case, I copy the files from target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT respectively. Then I can use java -jar languagetool-commandline -l zh-CN/zh-TW <text> to check text.
It doesn’t work for you now?
How can I use this? java -jar language-zh-4.2-SNAPSHOT-jar-with-dependencies.jar just prints Error: Could not find or load main class it.bitrack.main.Main for me.
RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos, "");in