Adding a new language: Hindi

pratikbsp · March 12, 2015, 6:30pm

Hello,
I am Pratik. I find LanguageTool interesting, so I would like to contribute by adding Hindi as a language. Hindi is used by nearly 497 million people in everyday life. I would be very pleased to contribute to LanguageTool and be part of the LanguageTool community.
Please guide me on how to contribute to LanguageTool.

dnaber · March 12, 2015, 8:01pm

Hi Pratik, thanks for your interest in LanguageTool. Please see our introduction at Development Overview - LanguageTool Wiki for a start. As Hindi is not supported yet, it cannot be selected as a language in LT, but you can write error detection rules in any other language with a similar tokenization (i.e. any language that uses whitespace between words, as most languages do).

So as a first step you can write simple rules that refer to words, not yet to the words’ part-of-speech. Later a dictionary could be added for that. We’d be happy to welcome you as a contributor, just let us know if you have any questions (here or on our mailing list).

Regards
Daniel

pratikbsp · March 12, 2015, 10:02pm

Is there any way to add Hindi?

dnaber · March 12, 2015, 10:15pm

It’s documented at Adding A New Language - LanguageTool Wiki. But I recommend first writing your rules in the grammar.xml file of a different language. Once you’ve written some rules, the developers can help you with this step if you’re not familiar with Java.

pratikbsp · March 15, 2015, 7:35am

I’m a Java developer. How do I begin writing the grammar.xml file? The only language other than Hindi that I know is English. Should I first write grammar.xml rules for English and then for Hindi?

dnaber · March 15, 2015, 9:24am

I suggest you take the English grammar.xml, delete the English rules and add your rules for Hindi there. Later, when you have added Hindi as a language, you can copy your rules to the Hindi grammar.xml. Have you already forked LanguageTool on GitHub? I suggest you do so, and work in that fork until Hindi is ready to be included in the “official” LanguageTool.

How to write rules is documented at Development Overview - LanguageTool Wiki.

pratikbsp · March 15, 2015, 3:16pm

I added a few rules, but when I tried to open languagetool.jar it showed an error, whereas when the rules were in Hindi it worked fine.
For example, a rule looks like this:

<rule name="संभव मुद्रण गलती" type="गलत वर्तनी">    
        
    <rule id="कि कुंजि" name="कि कुंजि  (की कुंजी)">    
                <pattern>
                    <token >कि</token>
                    <token >कुंजि</token>
                </pattern>
                <message>क्या आपका तात्पर्य <suggestion><match no="1"/>की कुंजी </suggestion>से हैं ?</message>
                <example type="correct">समय, सफलता  <marker>की कुंजी </marker>है।</example>
                <example type="incorrect" correction="की कुंजी">समय, सफलता <marker>कि कुंजि</marker>है।</example>
            </rule>

	</category>

dnaber · March 15, 2015, 3:57pm

What was the exact error message?

pratikbsp · March 15, 2015, 4:27pm

C:\Users\pratikbsp\LanguageTool-2.8>java -jar languagetool.jar
java.lang.RuntimeException: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:289)
    at org.languagetool.gui.LanguageToolSupport.init(LanguageToolSupport.java:315)
    at org.languagetool.gui.LanguageToolSupport.<init>(LanguageToolSupport.java:142)
    at org.languagetool.gui.Main.createGUI(Main.java:322)
    at org.languagetool.gui.Main.access$1800(Main.java:54)
    at org.languagetool.gui.Main$7.run(Main.java:871)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:251)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:733)
    at java.awt.EventQueue.access$200(EventQueue.java:103)
    at java.awt.EventQueue$3.run(EventQueue.java:694)
    at java.awt.EventQueue$3.run(EventQueue.java:692)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:703)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:76)
    at org.languagetool.Language.getPatternRules(Language.java:459)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:315)
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:285)
    … 19 more
Caused by: java.lang.IllegalArgumentException: No IssueType found for name '??? ???'. Valid values: [terminology, mistranslation, omission, untranslated, addition, duplication, inconsistency, grammar, legal, register, locale-specific-content, locale-violation, style, characters, misspelling, typographical, formatting, inconsistent-entities, numbers, markup, pattern-problem, whitespace, internationalization, length, non-conformance, uncategorized, other]
    at org.languagetool.rules.ITSIssueType.getIssueType(ITSIssueType.java:46)
    at org.languagetool.rules.patterns.PatternRuleHandler.prepareRule(PatternRuleHandler.java:625)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:543)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:556)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:556)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:321)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1789)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2965)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:73)
    … 22 more

pratikbsp · March 15, 2015, 4:28pm

It worked fine while the language was English, but gave this error when I added Hindi.

dnaber · March 15, 2015, 4:52pm

This part of the error message explains that you used an invalid value for “type” and what the valid values are:

No IssueType found for name '??? ??????'. Valid values: [terminology, mistranslation,
omission, untranslated, addition, duplication, inconsistency, grammar, legal, register,
locale-specific-content, locale-violation, style, characters, misspelling, typographical,
formatting, inconsistent-entities, numbers, markup, pattern-problem, whitespace,
internationalization, length, non-conformance, uncategorized, other]
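
A check like this can also be run before starting LanguageTool. The following is a minimal sketch, not part of LanguageTool: it assumes the issue type sits in the `type` attribute of `category`, `rulegroup` or `rule` elements, and it parses a small hypothetical snippet rather than the real grammar.xml (which declares a DOCTYPE and entities that this simple parser may not handle). The list of valid values is copied from the error message above; `find_invalid_types` and `SAMPLE` are names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Valid values copied from the error message quoted in this thread.
VALID_ISSUE_TYPES = {
    "terminology", "mistranslation", "omission", "untranslated", "addition",
    "duplication", "inconsistency", "grammar", "legal", "register",
    "locale-specific-content", "locale-violation", "style", "characters",
    "misspelling", "typographical", "formatting", "inconsistent-entities",
    "numbers", "markup", "pattern-problem", "whitespace",
    "internationalization", "length", "non-conformance", "uncategorized",
    "other",
}

# Assumption: these are the elements whose `type` is an issue type.
# <example type="..."> uses its own values and is deliberately skipped.
TYPED_ELEMENTS = ("category", "rulegroup", "rule")


def find_invalid_types(xml_text):
    """Return (tag, type value) pairs whose `type` is not a valid issue type."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():
        if element.tag not in TYPED_ELEMENTS:
            continue
        issue_type = element.get("type")
        if issue_type is not None and issue_type not in VALID_ISSUE_TYPES:
            problems.append((element.tag, issue_type))
    return problems


# Hypothetical snippet: a translated type value on the category element.
SAMPLE = """
<rules>
  <category name="Sample category" type="गलत वर्तनी">
    <rule id="SAMPLE_RULE" name="Sample rule">
      <pattern><token>foo</token></pattern>
      <message>Sample message</message>
      <example type="incorrect">foo</example>
    </rule>
  </category>
</rules>
"""

print(find_invalid_types(SAMPLE))
```

The point of the sketch is that names and messages can be written in Hindi, while `type` has to stay one of the fixed English keywords such as `misspelling`.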

pratikbsp · March 15, 2015, 5:37pm

Thanks, it’s working now.
What is the minimum number of rules required to get started? I have added approximately 25 rules so far.
How long does it take to include a language?

dnaber · March 16, 2015, 8:13am

Have you forked our GitHub repo to make your changes? If not, please do so, so we can look at your changes.

Also, have you tested your rules against larger texts yet? We have documented this here: Developing robust rules - LanguageTool Wiki

pratikbsp · March 16, 2015, 5:54pm

I have written all the rules and have forked the repository on GitHub, but because of its large size I haven’t managed to clone it yet. Cloning stopped midway many times.
To check the rules against a Wikipedia dump, I used this:
C:\Users\pratikbsp\LanguageTool-2.8\LanguageTool-wikipedia-2.9-SNAPSHOT>java -jar languagetool-wikipedia.jar check-data - - hi hiwiki-latest-pages-articles.xml
It says l and f are missing. How do I specify l and f?

dnaber · March 16, 2015, 6:31pm

The command syntax at Developing robust rules - LanguageTool Wiki wasn’t up-to-date, I’ve fixed that.

pratikbsp · March 16, 2015, 7:28pm

I used the following command and it gave me an error:

C:\Users\pratikbsp\LanguageTool-2.8\LanguageTool-wikipedia-2.9-SNAPSHOT>java -jar languagetool-wikipedia.jar check-data -l en -f hiwiki-latest-pages-articles.xml -r jarasa --max-errors 100
WARNING: Could not find rule 'jarasa'
Only these rules are enabled: [jarasa]
All spelling rules are disabled
Working on: hiwiki-latest-pages-articles.xml
Sentence limit: no limit
Error limit: 100
English: 0 total matches
English: °NaN rule matches per sentence
Exception in thread "main" java.io.FileNotFoundException: hiwiki-latest-pages-articles.xml (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at org.languagetool.dev.dumpcheck.MixingSentenceSource.create(MixingSentenceSource.java:45)
    at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:165)
    at org.languagetool.dev.dumpcheck.SentenceSourceChecker.main(SentenceSourceChecker.java:78)
    at org.languagetool.dev.wikipedia.Main.main(Main.java:45)

Where do I need to put the dump files and languagetool-wikipedia? Do I need to specify the location of the grammar.xml file so that the id of my rule can be found?

dnaber · March 17, 2015, 7:56am

You can put the dump file anywhere as long as you specify the full path to it. The grammar.xml that will be used is this one: org/languagetool/rules/en/grammar.xml
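
As an illustration of the full-path advice, here is a small sketch that turns the jar and dump locations into absolute paths before assembling the command. It only builds the argument list and does not run anything. The `check-data`, `-l`, `-f`, `-r` and `--max-errors` arguments are the ones used in this thread; `build_check_command` is a hypothetical helper name, and the file names are just the ones mentioned above.

```python
import os


def build_check_command(jar, dump, language, rule_id, max_errors):
    """Build the check-data argument list with absolute jar and dump paths."""
    return [
        "java", "-jar", os.path.abspath(jar),
        "check-data",
        "-l", language,
        # An absolute path avoids depending on the current directory.
        "-f", os.path.abspath(dump),
        "-r", rule_id,
        "--max-errors", str(max_errors),
    ]


command = build_check_command(
    "languagetool-wikipedia.jar",
    "hiwiki-latest-pages-articles.xml",
    language="en",
    rule_id="jarasa",
    max_errors=100,
)
print(" ".join(command))

# A quick existence check catches a wrong dump location before Java starts.
if not os.path.isfile(command[command.index("-f") + 1]):
    print("Dump file not found at the resolved path")
```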

pratikbsp · March 17, 2015, 10:15am

I’m having difficulty checking the large dump.
I used this command:
C:\Users\pratikbsp>java -jar C:\Users\pratikbsp\LanguageTool-wikipedia-2.9-SNAPSHOT\languagetool-wikipedia.jar check-data -l en -f C:\Users\pratikbsp\hiwiki-latest-pages-articles.xml -r jarasa --max-errors 100

It says it couldn’t find the rule jarasa.
The Hindi rules are in en/grammar.xml.
This is how the rule looks in grammar.xml:

<rule id="‌jarasa" name="बस थोड़ा‌सा (बस जरा सा)">    
            <pattern>
		<token>बस </token>
                <token >जरासा</token>
            </pattern>
            <message>क्या आपका तात्पर्य <suggestion>बस जरा सा</suggestion> से हैं ? क्योँकि जरासा एक गलत शब्द है।</message>
            <example type="incorrect">मेरा<marker>बस जरा सा</marker>काम बचा है।</example>
	    <example type="correct">मेरा<marker>बस जरासा</marker>काम बचा है।</example>
        </rule>
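
The thread does not establish why the rule was not found. One thing that can be ruled out cheaply is an invisible character inside the `id` attribute, for example a zero-width non-joiner, which input methods for Devanagari can insert. Such an id looks like `jarasa` on screen but is not equal to the plain string passed to `-r`. This is only an assumption about a possible cause, and the sketch below works on a hypothetical snippet rather than on the real grammar.xml; `suspicious_rule_ids` is a made-up helper name.

```python
import unicodedata
import xml.etree.ElementTree as ET


def suspicious_rule_ids(xml_text):
    """Return (id, [character names]) for rule ids with invisible characters."""
    root = ET.fromstring(xml_text)
    findings = []
    for rule in root.iter("rule"):
        rule_id = rule.get("id", "")
        hidden = [
            unicodedata.name(ch, "U+%04X" % ord(ch))
            for ch in rule_id
            # Cf = format characters (zero-width joiners and the like),
            # Zs = space separators.
            if unicodedata.category(ch) in ("Cf", "Zs")
        ]
        if hidden:
            findings.append((rule_id, hidden))
    return findings


# Hypothetical snippet: the first id starts with U+200C, the second is clean.
SAMPLE = (
    '<rules>'
    '<rule id="\u200cjarasa" name="first"/>'
    '<rule id="jarasa2" name="second"/>'
    '</rules>'
)

for rule_id, names in suspicious_rule_ids(SAMPLE):
    print(repr(rule_id), names)
```

Printing the id with `repr` makes the hidden character visible as an escape sequence.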

dnaber · March 17, 2015, 10:48am

That’s difficult to debug from here. I think you should first clone your fork of the LT repository. Here are some tips to do so for large repos: How do I clone a git repo that has become too large? - Stack Overflow

pratikbsp · March 17, 2015, 12:05pm

I have cloned the repo. Should I do this now from the repo directory?