Help: Add custom dictionary spellcheck - VOA English [Solved]


(Tathagat Banerjee) #1

Hi,

Looking to add a custom spellcheck, specifically based on VOA English - http://docs.voanews.eu/en-US-LEARN/2014/02/15/7f8de955-596b-437c-ba40-a68ed754c348.pdf

Needed some assistance. So far, I have:

Created a dict file using the process in http://wiki.languagetool.org/hunspell-support - this seems to work fine and I have an en_VOA.dict

I also created a .info file using:
# Dictionary properties
fsa.dict.separator=+
fsa.dict.encoding=iso-8859-1
fsa.dict.encoder=prefix

I placed the files en_VOA.dict and en_VOA.info in:
/Users/*****/workspace/LanguageTool-3.2/org/languagetool/resource/en/

This does not work. Assistance would be appreciated.

Basically VOA English has a much smaller set of words, ~ 1500. The idea is to highlight any word which is not in this set of words.

Thanks so much,
Tat


(Tathagat Banerjee) #2

Also, as an add-on question. The command in the documentation is:

java -cp languagetool.jar org.languagetool.dev.SpellDictionaryBuilder de-DE /path/to/dictionary.txt org/languagetool/resource/en/hunspell/en_US.info - -o /tmp/output.dict

Why does the command above use de-DE? What exactly is that argument doing?

Thanks so much.


(Tathagat Banerjee) #3

Hi,

Can I please get an opinion on the best way to restrict the spell checker to a smaller word list?

I think making up a new language for reducing the number of words may be too much.

On the other hand, if I simply replace one of the dict files, I lose the POS tag information, which means other rules will not work?

What is the exact process to create a new language? Is my understanding of the steps involved correct?

  1. Create an input file with the 1500 words.
  2. Create a .dict file using:
    java -cp languagetool.jar org.languagetool.dev.SpellDictionaryBuilder de-DE /path/to/dictionary.txt org/languagetool/resource/en/hunspell/en_US.info - -o /tmp/output.dict
    a. What should I use in place of de-DE? I can see xx-XX is possible?
    b. Where does the POS data go in for the 1500 words when I am creating the .dict file?

Alternately, do I just manually build a binary POS dictionary? That is, manually populate a text file (~1500 words only) with inflected, base and POS; and then run the POSDictionaryBuilder?

Finally - what is the binary synthesizer dictionary and how is it used?

Thanks so much.

Tat


(Daniel Naber) #4

Hi, I'm not sure whether a new spell checker dictionary is the right approach here. Any correct word that's not in the list of 1500 words would be flagged as a spelling error, which doesn't seem quite right. Instead, I'd suggest adding a Java rule that loads the valid words from a plain text file and complains about every word not in that list. This way you can also offer useful alternatives more easily, while a spell checker can only offer alternatives based on spelling.

Adding a language is documented at http://wiki.languagetool.org/adding-a-new-language, but this might be overkill. If you add the new Java rule as described above and make it the only active rule, that's much easier.
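
A minimal sketch of the word-list idea in plain Java, for illustration only. It is not a LanguageTool rule: the class ApprovedWordChecker and its methods are hypothetical names, and a real rule would have to be built on LanguageTool's own rule classes. The sketch only shows the core check: load the approved words from a plain text file (one word per line is an assumption) and report every word that is not in the list.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LanguageTool.
class ApprovedWordChecker {
    // A "word" here is a run of letters or apostrophes; punctuation is skipped.
    private static final Pattern WORD = Pattern.compile("[\\p{L}']+");
    private final Set<String> approved = new HashSet<>();

    ApprovedWordChecker(List<String> words) {
        for (String w : words) {
            String t = w.trim().toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) {
                approved.add(t);
            }
        }
    }

    // Assumes a UTF-8 text file with one approved word per line.
    static ApprovedWordChecker fromFile(Path file) throws IOException {
        return new ApprovedWordChecker(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Returns every word of the text that is not in the approved list.
    List<String> findUnapproved(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group();
            if (!approved.contains(word.toLowerCase(Locale.ROOT))) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ApprovedWordChecker checker =
            new ApprovedWordChecker(Arrays.asList("a", "baby", "is", "able"));
        System.out.println(checker.findUnapproved("A baby is adorable.")); // prints [adorable]
    }
}
```

Note that this sketch compares surface forms only, so an inflected form such as "babies" would be reported unless it is listed explicitly.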


(Mike Unwalla) #5

The idea is to highlight any word which is not in this set of words.

In disambiguation.xml, make a rule or rules to give the VOA words a special POS (say VOA_APPROVED).

In grammar.xml, make a rule that shows all words, except those that have the POS VOA_APPROVED.

That is the method I use for STE term checker.


(Tathagat Banerjee) #6

Hi Daniel,

I have no experience in Java, and limited coding experience generally. Still, I am very keen to learn. I will have a look at the DemoRule.java and try to understand how to do this. Thanks so much.

Tat


(Tathagat Banerjee) #7

Hi Mike,

Your suggestion blew me away. I swear, I started looking at LanguageTool about a day ago. Here is what I did:

In disambiguation.xml -


(Tathagat Banerjee) #8

In grammar.xml -

That worked as expected. I cannot believe this was so simple (2 and a bit hours and 1 question?). Whoever came up with LT is a smart smart person.

If I could get some assistance on the last bit: how do I do an exclusion? I tried to use antipattern instead of pattern, but this blew LT up.

Thanks so much.

Tat

Again - thank you for your previous suggestion.


(Daniel Naber) #9

There are several ways to have exceptions, you can search for negate= or <exception> in http://wiki.languagetool.org/development-overview. But what exactly happened when you used <antipattern>?


(Mike Unwalla) #11

For the 1-word approved terms, a separate rule for each term is not necessary in disambiguation.xml. You can use:

<token regexp="yes" inflected="yes">about|book|man|woman|...</token>

For information about 'inflected' and 'regexp', refer to http://wiki.languagetool.org/development-overview.

For a grammar rule that finds all the words that are not approved, you can use:

<token><exception postag="VOA_APPROVED"/></token>
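
With around 1500 approved words, typing the alternation by hand is error-prone. The following is a rough sketch of generating the <token> line from a word list in Java. The class TokenRegexBuilder is a hypothetical helper, not part of LanguageTool, and it assumes every entry is a single word made only of letters, so that no regex or XML escaping is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
class TokenRegexBuilder {
    // Joins the words with "|" and wraps them in a <token> element.
    static String buildToken(List<String> words) {
        List<String> cleaned = new ArrayList<>();
        for (String raw : words) {
            String word = raw.trim();
            if (word.isEmpty() || cleaned.contains(word)) {
                continue; // skip blank entries and duplicates
            }
            if (!word.matches("\\p{L}+")) {
                // Anything with dots, spaces, '&' etc. needs manual attention.
                throw new IllegalArgumentException("Needs manual escaping: " + word);
            }
            cleaned.add(word);
        }
        return "<token regexp=\"yes\" inflected=\"yes\">"
            + String.join("|", cleaned) + "</token>";
    }

    public static void main(String[] args) {
        System.out.println(buildToken(Arrays.asList("about", "book", "man", "woman")));
    }
}
```

Multi-word terms are deliberately rejected here, since they cannot be expressed as a single token.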

(Tathagat Banerjee) #12

Hi Daniel,

Just realized you are in the core team. Appreciate you looking at this.

grammar.xml

The error i get is:

Tathagats-iMac:LanguageTool-3.2 tathagatbanerjee$ java -jar languagetool.jar
java.lang.RuntimeException: java.lang.RuntimeException: Could not activate rules
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:310)
    at org.languagetool.gui.LanguageToolSupport.init(LanguageToolSupport.java:333)
    at org.languagetool.gui.LanguageToolSupport.<init>(LanguageToolSupport.java:146)
    at org.languagetool.gui.Main.createGUI(Main.java:322)
    at org.languagetool.gui.Main.access$1800(Main.java:53)
    at org.languagetool.gui.Main$7.run(Main.java:859)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:311)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:756)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:726)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:201)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
Caused by: java.lang.RuntimeException: Could not activate rules
    at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:183)
    at org.languagetool.MultiThreadedJLanguageTool.<init>(MultiThreadedJLanguageTool.java:76)
    at org.languagetool.MultiThreadedJLanguageTool.<init>(MultiThreadedJLanguageTool.java:67)
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:292)
    ... 19 more
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:76)
    at org.languagetool.Language.getPatternRules(Language.java:345)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:328)
    at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:180)
    ... 22 more
Caused by: java.lang.IllegalStateException: Neither pattern tokens nor regex is set
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:550)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:324)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:73)
    ... 25 more

Thanks so much,

Tat


(Tathagat Banerjee) #13

Also, if needed:

disambiguation.xml

Looking at the stack trace, nothing seems obvious. How do I log the complete error in Java nicely? I can send that to you if necessary.

Thanks so much.

Tat


(Tathagat Banerjee) #14

Hi Mike,

I used <token><exception postag="VOA_APPROVED"/></token> and it worked really well. Thank you so much for the suggestion. The inflected and regex are also quite useful.

Can I ask though - the behavior of disambiguation.xml is puzzling.

A. Using a single line of words - this works fine. One issue here is that the second full stop is getting picked up.

<rule name="VOA" id="VOA">
  <pattern case_sensitive="no">
    <token regexp="yes" inflected="yes">a|able|about|above|baby|back|bacteria|bad</token>
  </pattern>
  <disambig action="add"><wd pos="VOA_APPROVED"/>
  </disambig>
</rule>

B. Two lines of words - So then I tried to break up the rule to put in the exception for the various punctuation marks. Breaking up the words does not seem to work. None of the words got picked up.

<rule name="VOA" id="VOA">
  <pattern case_sensitive="no">
    <token regexp="yes" inflected="yes">a|able|about|above</token>
    <token regexp="yes" inflected="yes">baby|back|bacteria|bad</token>
  </pattern>
  <disambig action="add"><wd pos="VOA_APPROVED"/>
  </disambig>
</rule>

C. OR - Using OR does not seem to work either.

<rule name="VOA" id="VOA">
  <pattern case_sensitive="no">
  <or>
    <token inflected="yes" regexp="yes">a|able</token>
    <token inflected="yes" regexp="yes">about|above</token>
  </or>
  </pattern>
  <disambig action="add"><wd pos="VOA_APPROVED"/>
  </disambig>
</rule>

D. Exception - Similarly, using exception does not work either.

<rule name="VOA" id="VOA">
  <pattern case_sensitive="no">
    <token regexp="yes" inflected="yes">a|able|about|above|baby|back|bacteria|bad                </token>
    <token regexp="yes"><exception>.|,|;</exception></token>
  </pattern>
  <disambig action="add"><wd pos="VOA_APPROVED"/></disambig>        
</rule>

On the above, can I ask:

  • The STE dictionary you created, is every word in the same <token> tag? How did you handle punctuation marks or do these flag as errors?

Thanks so much.

Tat

P.S. I just discovered the </> button and it is like magic!


(Daniel Naber) #15

This is the error: Every rule needs either a <pattern> or a <regexp> section, only using <antipattern> doesn't work.


(Mike Unwalla) #16

Method A is the correct method. Look at the postags in Tagger Result. The ones that you specified as VOA_APPROVED have that tag. Because you did not specify the punctuation marks as VOA_APPROVED, your grammar rule finds those tokens. To correct the problem, one method is to include an exception for punctuation marks.

Method B tries to match 2 tokens, the first from the set (a, able, about, above) and the second from the set (baby, back, bacteria, bad). Also, the postag is applied only to 1 token. To apply it to the 2 tokens, use:

<disambig action="add"><wd pos="VOA_APPROVED"/><wd pos="VOA_APPROVED"/></disambig>

(Also, related, look at 'marker' in the help.)

Method D. The exception must be on the first token. Currently, you try to match 2 tokens.

For the STE dictionary, there are many different rules to apply the postags. For example, I apply different postags for different types of STE term. Also, each multi-word term must be in a separate rule.

The STE grammar rule that finds non-STE terms ignores 1-character tokens.
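
To illustrate the point about Method B: consecutive <token> elements describe a sequence, not a set of alternatives. The sketch below simulates that difference in plain Java. It is not LanguageTool code; the class SequenceMatchDemo and its methods are hypothetical, and inflection and case handling are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation, not LanguageTool's matcher.
class SequenceMatchDemo {
    // Like Method B: token i must be in 'first' AND token i+1 in 'second'.
    static List<Integer> matchPair(List<String> tokens, Set<String> first, Set<String> second) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (first.contains(tokens.get(i)) && second.contains(tokens.get(i + 1))) {
                starts.add(i);
            }
        }
        return starts;
    }

    // Like Method A: a single token that may be any word from one set.
    static List<Integer> matchSingle(List<String> tokens, Set<String> words) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (words.contains(tokens.get(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> first = new HashSet<>(Arrays.asList("a", "able", "about", "above"));
        Set<String> second = new HashSet<>(Arrays.asList("baby", "back", "bacteria", "bad"));
        Set<String> all = new HashSet<>(first);
        all.addAll(second);
        List<String> tokens = Arrays.asList("a", "baby", "is", "able");

        System.out.println(matchPair(tokens, first, second)); // prints [0]
        System.out.println(matchSingle(tokens, all));         // prints [0, 1, 3]
    }
}
```

The pair pattern only fires where a word from the first set is directly followed by a word from the second set, which is why splitting the list across two tokens left most words untagged.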


(Tathagat Banerjee) #17

Thank you so much Mike. Very much appreciated. I got it to work using:

<pattern>
  <token>
    <or>
      <exception postag="VOA_APPROVED"/>
      <exception regexp="yes">.{1}</exception>
    </or>
  </token>
</pattern>

It is kind of dodgy, but I could not get [[:punct:]] to work for some reason.

Thanks so much.

Tat


(Mike Unwalla) #18

Deleted: I did not read your message correctly. Sorry.


(Daniel Naber) #19

[[:punct:]] isn't Java syntax I think. You can see some differences in the table at http://www.regular-expressions.info/posixbrackets.html
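
A small Java sketch of the difference, using only java.util.regex. In Java the POSIX punctuation class is written \p{Punct}, while [[:punct:]] is accepted but read as a nested character class containing the literal characters ':', 'p', 'u', 'n', 'c' and 't'. The class name PunctRegexDemo is made up for this example, and whether \p{Punct} behaves the same way inside a LanguageTool token is an assumption that has not been tested here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses only the standard java.util.regex API.
class PunctRegexDemo {
    // Java's spelling of the POSIX punctuation class (ASCII punctuation).
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    // POSIX bracket syntax: compiles in Java, but means something else.
    private static final Pattern POSIX_STYLE = Pattern.compile("[[:punct:]]");

    static boolean isPunct(String token) {
        return PUNCT.matcher(token).matches();
    }

    static boolean matchesPosixStyle(String token) {
        return POSIX_STYLE.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {".", ",", ";", "a", "p"}) {
            System.out.println("'" + token + "'"
                + " \\p{Punct}=" + isPunct(token)
                + " [[:punct:]]=" + matchesPosixStyle(token));
        }
    }
}
```

Unlike the .{1} exception, \p{Punct} does not match one-letter words such as "a" or "I", so those would still be checked against the approved list.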


(Tathagat Banerjee) #20

@dnaber, @Mike_Unwalla - you guys are both awesome. Thank you so much for the assistance.

LT is meeting my use case. I will continue to use this tool and hopefully can contribute over time. Thanks so much.

Tat