Adding a new language Hindi


(pratikbsp) #1

Hello,
I am Pratik. I find LanguageTool interesting, so I would like to contribute by adding Hindi as a language. Hindi is used by nearly 497 million people in everyday life. I would be very pleased to contribute to LanguageTool and be a part of the LanguageTool community.
Please guide me on how to contribute to LanguageTool.


(Daniel Naber) #2

Hi Pratik, thanks for your interest in LanguageTool. Please see our introduction at http://wiki.languagetool.org/development-overview for a start. As Hindi is not supported yet, it cannot be selected as a language in LT, but you can write error detection rules in any other language with a similar tokenization (i.e. any language that uses whitespace between words, as most languages do).

So as a first step you can write simple rules that refer to words, not yet to the words' part-of-speech. Later a dictionary could be added for that. We'd be happy to welcome you as a contributor, just let us know if you have any questions (here or on our mailing list).
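
As a rough illustration of such a word-based rule, here is a minimal sketch for English. The rule id, name, and example sentences are made up for illustration, and the exact attributes your LanguageTool version expects may differ, so please check the wiki page above.

```xml
<!-- Hypothetical rule: matches the two literal words "bed" and "English". -->
<rule id="BED_ENGLISH" name="Possible typo 'bed English'">
  <pattern>
    <!-- Each token matches one word; no part-of-speech information is needed. -->
    <token>bed</token>
    <token>English</token>
  </pattern>
  <message>Did you mean <suggestion>bad English</suggestion>?</message>
  <!-- The marker encloses the part of the sentence the rule should flag. -->
  <example type="incorrect">Sorry for my <marker>bed English</marker>.</example>
  <example type="correct">Sorry for my bad English.</example>
</rule>
```

A rule like this only compares word forms, which is why it can be written before any dictionary exists for the language.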

Regards
Daniel


(pratikbsp) #3

Is there any way to add Hindi?


(Daniel Naber) #4

It's documented at http://wiki.languagetool.org/adding-a-new-language. But I recommend first writing your rules in the grammar.xml file of a different language. Once you've written some rules, the developers can help you with this step if you're not familiar with Java.


(pratikbsp) #5

I'm a Java developer. How do I begin writing the grammar.xml file? The only language I know other than Hindi is English. Should I first write grammar.xml files for English and then for Hindi?


(Daniel Naber) #6

I suggest you take the English grammar.xml, delete the English rules and add your rules there for Hindi. Later, when you have added Hindi as a language, you can copy your rules to the Hindi grammar.xml. Have you already forked LanguageTool at github? I suggest you do so, and work in that fork until Hindi is ready to be included into the "official" LanguageTool.

How to write rules is documented at http://wiki.languagetool.org/development-overview.


(pratikbsp) #7

I added a few rules, but when I tried to open languagetool.jar it showed an error, whereas when the rules were in Hindi it worked fine.
For example, a rule looks like this:

<rule name="संभव मुद्रण गलती" type="गलत वर्तनी">


    <rule id="कि कुंजि" name="कि कुंजि  (की कुंजी)">    
                <pattern>
                    <token >कि</token>
                    <token >कुंजि</token>
                </pattern>
                <message>क्या आपका तात्पर्य <suggestion><match no="1"/>की कुंजी </suggestion>से हैं ?</message>
                <example type="correct">समय, सफलता  <marker>की कुंजी </marker>है।</example>
                <example type="incorrect" correction="की कुंजी">समय, सफलता <marker>कि कुंजि</marker>है।</example>
            </rule>

	</category>

(Daniel Naber) #8

What was the exact error message?


(pratikbsp) #9

C:\Users\pratikbsp\LanguageTool-2.8>java -jar languagetool.jar
java.lang.RuntimeException: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:289)
    at org.languagetool.gui.LanguageToolSupport.init(LanguageToolSupport.java:315)
    at org.languagetool.gui.LanguageToolSupport.<init>(LanguageToolSupport.java:142)
    at org.languagetool.gui.Main.createGUI(Main.java:322)
    at org.languagetool.gui.Main.access$1800(Main.java:54)
    at org.languagetool.gui.Main$7.run(Main.java:871)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:251)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:733)
    at java.awt.EventQueue.access$200(EventQueue.java:103)
    at java.awt.EventQueue$3.run(EventQueue.java:694)
    at java.awt.EventQueue$3.run(EventQueue.java:692)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:703)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/en/grammar.xml'
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:76)
    at org.languagetool.Language.getPatternRules(Language.java:459)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:315)
    at org.languagetool.gui.LanguageToolSupport.reloadLanguageTool(LanguageToolSupport.java:285)
    ... 19 more
Caused by: java.lang.IllegalArgumentException: No IssueType found for name '??? ??????'. Valid values: [terminology, mistranslation, omission, untranslated, addition, duplication, inconsistency, grammar, legal, register, locale-specific-content, locale-violation, style, characters, misspelling, typographical, formatting, inconsistent-entities, numbers, markup, pattern-problem, whitespace, internationalization, length, non-conformance, uncategorized, other]
    at org.languagetool.rules.ITSIssueType.getIssueType(ITSIssueType.java:46)
    at org.languagetool.rules.patterns.PatternRuleHandler.prepareRule(PatternRuleHandler.java:625)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:543)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:556)
    at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:556)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:321)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1789)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2965)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:73)
    ... 22 more


(pratikbsp) #10

It worked fine while the language was English, but gave this error when I added Hindi.


(Daniel Naber) #11

This part of the error message explains that you used an invalid value for "type" and what the valid values are:

No IssueType found for name '??? ??????'. Valid values: [terminology, mistranslation, omission, untranslated, addition, duplication, inconsistency, grammar, legal, register, locale-specific-content, locale-violation, style, characters, misspelling, typographical, formatting, inconsistent-entities, numbers, markup, pattern-problem, whitespace, internationalization, length, non-conformance, uncategorized, other]
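
In other words, the Hindi text can stay in "name", but "type" has to be one of the values listed. A sketch of how the snippet might look, assuming the outer element was meant to be a category (its closing tag suggests so) and that misspelling is the type you want. The ASCII rule id is a made-up placeholder, and the `<match no="1"/>` is dropped on the assumption that the suggestion should be exactly की कुंजी; depending on your LanguageTool version, further attributes may be required.

```xml
<!-- Assumption: the outer element is a <category>; "misspelling" is one of the valid type values. -->
<category name="संभव मुद्रण गलती" type="misspelling">
    <!-- KI_KUNJI is a hypothetical id chosen for this sketch. -->
    <rule id="KI_KUNJI" name="कि कुंजि (की कुंजी)">
        <pattern>
            <token>कि</token>
            <token>कुंजि</token>
        </pattern>
        <message>क्या आपका तात्पर्य <suggestion>की कुंजी</suggestion> से हैं?</message>
        <example type="correct">समय, सफलता की कुंजी है।</example>
        <example type="incorrect" correction="की कुंजी">समय, सफलता <marker>कि कुंजि</marker> है।</example>
    </rule>
</category>
```
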

(pratikbsp) #12

Thanks, it's working now.
What is the minimum number of rules required to get started? I have added approximately 25 rules so far.
How long does it take to include a language?


(Daniel Naber) #13

Have you forked our GitHub repo to make your changes? If not, please do so, so we can look at your changes.

Also, have you tested your rules against larger texts yet? We have documented this here: http://wiki.languagetool.org/developing-robust-rules#toc3


(pratikbsp) #14

I have written all the rules and have forked the repository on GitHub, but because of its large size I haven't managed to clone it yet; cloning has stopped midway many times.
To check the rules against the Wikipedia dump, I used this:
C:\Users\pratikbsp\LanguageTool-2.8\LanguageTool-wikipedia-2.9-SNAPSHOT>java -jar languagetool-wikipedia.jar check-data - - hi hiwiki-latest-pages-articles.xml
It says l and f are missing. How do I specify l and f?


(Daniel Naber) #15

The command syntax at http://wiki.languagetool.org/developing-robust-rules#toc3 wasn't up-to-date, I've fixed that.


(pratikbsp) #16

I used the following command and it gave me an error:

C:\Users\pratikbsp\LanguageTool-2.8\LanguageTool-wikipedia-2.9-SNAPSHOT>java -jar languagetool-wikipedia.jar check-data -l en -f hiwiki-latest-pages-articles.xml -r jarasa --max-errors 100
WARNING: Could not find rule 'jarasa'
Only these rules are enabled: [jarasa]
All spelling rules are disabled
Working on: hiwiki-latest-pages-articles.xml
Sentence limit: no limit
Error limit: 100
English: 0 total matches
English: °NaN rule matches per sentence
Exception in thread "main" java.io.FileNotFoundException: hiwiki-latest-pages-articles.xml (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at org.languagetool.dev.dumpcheck.MixingSentenceSource.create(MixingSentenceSource.java:45)
    at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:165)
    at org.languagetool.dev.dumpcheck.SentenceSourceChecker.main(SentenceSourceChecker.java:78)
    at org.languagetool.dev.wikipedia.Main.main(Main.java:45)

Where do I need to put the dump files and languagetool-wikipedia? Do I need to specify the location of the grammar.xml file so that the id of my rule is found?


(Daniel Naber) #17

You can put the dump file anywhere as long as you specify the full path to it. The grammar.xml that will be used is this one: org/languagetool/rules/en/grammar.xml


(pratikbsp) #18

I'm having difficulty checking a large dump.
I used this command:
C:\Users\pratikbsp>java -jar C:\Users\pratikbsp\LanguageTool-wikipedia-2.9-SNAPSHOT\languagetool-wikipedia.jar check-data -l en -f C:\Users\pratikbsp\hiwiki-latest-pages-articles.xml -r jarasa --max-errors 100

It says it couldn't find rule jarasa.
The Hindi rules are in en/grammar.xml.
This is how the rule looks in grammar.xml:

<rule id="‌jarasa" name="बस थोड़ा‌सा (बस जरा सा)">    
            <pattern>
		<token>बस </token>
                <token >जरासा</token>
            </pattern>
            <message>क्या आपका तात्पर्य <suggestion>बस जरा सा</suggestion> से हैं ? क्योँकि जरासा एक गलत शब्द है।</message>
            <example type="incorrect">मेरा<marker>बस जरा सा</marker>काम बचा है।</example>
	    <example type="correct">मेरा<marker>बस जरासा</marker>काम बचा है।</example>
        </rule>

(Daniel Naber) #19

That's difficult to debug from here. I think you should first clone your fork of the LT repository. Here are some tips to do so for large repos: http://stackoverflow.com/questions/18850860/how-do-i-clone-a-git-repo-that-has-become-too-large
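
Once you have the clone, a few things in the rule as pasted may be worth checking, though I can't confirm from here that any of them causes the "Could not find rule" warning: whether the id attribute contains an invisible character before jarasa (the value given to -r has to match the id exactly), whether the trailing space inside the first token prevents a match, and whether the two examples are swapped (the "incorrect" example should contain the error). A sketch with those points changed, purely as an assumption about what was intended:

```xml
<!-- Sketch only: id retyped as plain ASCII, token text without surrounding spaces,
     and the erroneous form placed in the "incorrect" example. -->
<rule id="jarasa" name="बस थोड़ासा (बस जरा सा)">
    <pattern>
        <token>बस</token>
        <token>जरासा</token>
    </pattern>
    <message>क्या आपका तात्पर्य <suggestion>बस जरा सा</suggestion> से हैं? क्योँकि जरासा एक गलत शब्द है।</message>
    <!-- Spaces around the marker keep the neighbouring words as separate tokens. -->
    <example type="incorrect" correction="बस जरा सा">मेरा <marker>बस जरासा</marker> काम बचा है।</example>
    <example type="correct">मेरा बस जरा सा काम बचा है।</example>
</rule>
```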


(pratikbsp) #20

I have cloned the repo. Should I do this now from the repo directory?