Help: Add custom dictionary spellcheck - VOA English [Solved]

tathagat.banerjee · March 15, 2016, 6:54am

Hi,

Looking to add a custom spellcheck, specifically based on VOA English - http://docs.voanews.eu/en-US-LEARN/2014/02/15/7f8de955-596b-437c-ba40-a68ed754c348.pdf

Needed some assistance. So far, I have:

Created a dict file using the process in Spell check - LanguageTool Wiki - this seems to work fine and i have a en_VOA.dict

I also created a .info file using:
# Dictionary properties fsa.dict.separator=+ fsa.dict.encoding=iso-8859-1 fsa.dict.encoder=prefix

I placed the files en_VOA.dict and en_VOA.info in:
/Users/*****/workspace/LanguageTool-3.2/org/languagetool/resource/en/

This does not work. Assistance would be appreciated.

Basically VOA English has a much smaller set of words, ~ 1500. The idea is to highlight any word which is not in this set of words.

Thanks so much,
Tat

tathagat.banerjee · March 15, 2016, 6:57am

Also, as an add on question. The command in the documentation is:

java -cp languagetool.jar org.languagetool.dev.SpellDictionaryBuilder de-DE /path/to/dictionary.txt org/languagetool/resource/en/hunspell/en_US.info - -o /tmp/output.dict

Why is the above highlighted section de-DE? What exactly is that doing?

Thanks so much.

tathagat.banerjee · March 15, 2016, 7:03am

Hi,

Can i please get a opinion on the best way to retard the spell checker.

I think making up a new language for reducing the number of words may be too much.

On the other hand, if i simply replace one of the dict files, I loose the POS tag information, which means other rules will not work?

What is the exact process to create a new language? Is my understanding of the steps involved correct?

Create a input file <voa_input.txt> with the 1500 words.
Create a .dict file using:
java -cp languagetool.jar org.languagetool.dev.SpellDictionaryBuilder de-DE /path/to/dictionary.txt org/languagetool/resource/en/hunspell/en_US.info - -o /tmp/output.dict
a. What should i call this for the de-DE? I can see xx-XX is possible?
b. Where does the POS data go in for the 1500 words when I am creating the .dict file?

Alternately, do I just manually build a binary POS dictionary? That is, manually populate a text file (~1500 words only) with inflected, base and POS; and then run the POSDictionaryBuilder?

Finally - what is the binary synthesizer dictionary and how is it used?

Thanks so much.

Tat

dnaber · March 15, 2016, 1:12pm

Hi, I’m not sure whether a new spell checker dictionary is the right approach here. Any correct word that’s not in the list of 1500 words would be flagged as a spelling error, which doesn’t seem quite right. Instead, I’d suggest adding a Java rule that loads the valid words from a plain text file and complains about every word not in that list. This way you can also offer useful alternatives more easily, while a spell checker can only offer alternatives based on spelling.

Adding a language is documented at Adding A New Language - LanguageTool Wiki, but this might be overkill. If you add the new Java rule as described above and make it the only active rules that’s much easier.

Mike_Unwalla · March 15, 2016, 1:29pm

The idea is to highlight any word which is not in this set of words.

In disambiguation.xml, make a rule or rules to give the VOA words a special POS (say VOA_APPROVED).

In grammar.xml, make a rule that shows all words, except those that have the POS VOA_APPROVED.

That is the method I use for STE term checker.

tathagat.banerjee · March 16, 2016, 5:53am

Hi Daniel,

I have no experience in Java, and limited coding experience generally. Still, am very keen to learn. I will have a look at the DemoRule.java and try to understand how to do this. Thanks so much.

Tat

tathagat.banerjee · March 16, 2016, 6:04am

Hi Mike,

Your suggestion blew me away. I swear, started looking at LangaugeTool about a day ago. Here is what i did:

In disambiguation.xml -

tathagat.banerjee · March 16, 2016, 6:04am

In grammer.xml -

That worked as expected. I cannot believe this was so simple (2 and a bit hours and 1 question?). Whoever came up with LT is a smart smart person.

If i could get some assistance on the last bit. How do i do an exclusion. I tried to use antipattern instead of pattern, but this blew LT up.

Thanks so much.

Tat

Again - thank you for your previous suggestion.

dnaber · March 16, 2016, 8:13am

There are several ways to have exceptions, you can search for negate= or <exception> in Development Overview - LanguageTool Wiki. But what exactly happened when you used <antipattern>?

Mike_Unwalla · March 16, 2016, 8:31am

For the 1-word approved terms, a separate rule for each term is not necessary in disambiguation.xml. You can use:

<token regexp="yes" inflected="yes">about|book|man|woman|...</token>

For information about ‘inflected’ and ‘regexp’, refer to Development Overview - LanguageTool Wiki.

For a grammar rule that finds all the words that are not approved, you can use:

<token><exception postag="VOA_APPROVED"/></token>

tathagat.banerjee · March 21, 2016, 12:26am

Hi Daniel,

Just realized you are in the core team. Appreciate you looking at this.

grammer.xml

The error i get is:

Tathagats-iMac:LanguageTool-3.2 tathagatbanerjee$ java -jar languagetool.jar
java.lang.RuntimeException: java.lang.RuntimeException: Could not activate rules
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:310)
    at org.languagetool.gui.LanguageToolSupport.init(LanguageToolSupport.java:333)
    at org.languagetool.gui.LanguageToolSupport.<init>(LanguageToolSupport.java:146)
    at org.languagetool.gui.Main.createGUI(Main.java:322)
    at org.languagetool.gui.Main.access$1800(Main.java:53)
    at org.languagetool.gui.Main$7.run(Main.java:859)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:311)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:756)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:726)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:201)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
Caused by: java.lang.RuntimeException: Could not activate rules
    at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:183)
    at org.languagetool.MultiThreadedJLanguageTool.<init>(MultiThreadedJLanguageTool.java:76)
    at org.languagetool.MultiThreadedJLanguageTool.<init>(MultiThreadedJLanguageTool.java:67)
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:292)
    ... 19 more
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:76)
    at org.languagetool.Language.getPatternRules(Language.java:345)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:328)
    at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:180)
    ... 22 more
Caused by: java.lang.IllegalStateException: Neither pattern tokens nor regex is set
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:550)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:324)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:73)
    ... 25 more

Thanks so much,

Tat

tathagat.banerjee · March 21, 2016, 12:27am

Also, if needed:

disambiguation.xml

Looking at the stack trace, nothing seems obvious. How do i log the complete error in java nicely? I can send that to you if necessary.

Thanks so much.

Tat

tathagat.banerjee · March 21, 2016, 2:12am

Hi Mike,

I used <token><exception postag="VOA_APPROVED"/></token> and it worked really well. Thank you so much for the suggestion. The inflected and regex are also quite useful.

Can i ask though - the behavior of disambiguation.xml is puzzling.

A. Using a single line of words - this works fine. One issue here is that the second full stop is getting picked up.


  
    a|able|about|above|baby|back|bacteria|bad

B. Two lines of words - So then I tried to break up the rule to put in the exception for the various punctuation marks. Breaking up the words does not seem to work. None of the words got picked up.


  
    a|able|about|above
    baby|back|bacteria|bad

C. OR - Using OR does not seem to work either.


  
  
    a|able
    about|above

D. Exception - Similarly, using exception does not work either.


  
    a|able|about|above|baby|back|bacteria|bad                
    .|,|;

On the above, can I ask:

The STE dictionary you created, is every word in the same <token> tag? How did you handle punctuation marks or do these flag as errors?

Thanks so much.

Tat

P.S. I just discovered the </> button and it is like magic!

dnaber · March 21, 2016, 7:33am

This is the error: Every rule needs either a <pattern> or a <regexp> section, only using <antipattern> doesn’t work.

Mike_Unwalla · March 21, 2016, 9:50am

Method A is the correct method. Look at the postags in Tagger Result.The ones that you specified as VOA_APPROVED have that tag.Because you did not specify the punctuation marks as VOA_APPROVED, your grammar rule finds the token. To correct the problem, one method is to include an exception for punctuation marks.

Method B tries to match 2 tokens, the first from the set (a, able, about, above) and the second from the set (baby, back, bacteria, bad). Also, the postag is applied only to 1 token. To apply it to the 2 tokens, use:

<disambig action="add"><wd pos="VOA_APPROVED"/><wd pos="VOA_APPROVED"/>

(Also, related, look at ‘marker’ in the help.)

Method D. The exception must be on the first token.Currently, you try to match 2 tokens.

For the STE dictionary, there are many different rules to apply the postags. For example, I apply different postags for different types of STE term. Also, each multi-word terms must be in a separate rule.

The STE grammar rule that finds non-STE terms ignores 1-character tokens.

tathagat.banerjee · March 22, 2016, 1:24am

Thank you so much Mike. Very much appreciated. I got it to work using:

It is kind of dodgy, but i could not get [[:punct:]] to work for some reason.

Thanks so much.

Tat

Mike_Unwalla · March 22, 2016, 8:57am

Deleted: I did not read your message correctly. Sorry.

dnaber · March 22, 2016, 10:31am

[[:punct:]] isn’t Java syntax I think. You can see some differences in the table at Regex Tutorial - POSIX Bracket Expressions

tathagat.banerjee · March 29, 2016, 11:27pm

@dnaber, @Mike_Unwalla - you guys are both awesome. Thank you so much for the assistance.

LT is meeting my use case. I will continue to use this tool and hopefully can contribute over time. Thanks so much.

Tat