Adding a new language, Norwegian

Juan_Martorell · April 22, 2016, 11:59am

Sure! Just download the files in FreeLing/data/nb/dictionary/entries at master · TALP-UPC/FreeLing · GitHub (remember, the raw data) and you can build a quick and dirty dictionary:

$ cat *.txt | sort > dictionary.dump
$ wc -l dictionary.dump
867603 dictionary.dump
$

There are fewer than the 1.1 million forms you found before, but they have a predictable EAGLES format, which is a good starting point, and besides, you can grow it later.

I have an ongoing effort to grow the Spanish dictionary in my lab. Feel free to browse it, ask questions, and comment on it. If people around here find that work useful, I can easily turn it into a language-agnostic project.

Eirik · April 30, 2016, 10:36pm

Thanks Juan, that link was very helpful.

I was able to create a binary dictionary file now. However, when trying to restore a raw text file from the created binary, I run into an error:

An unhandled exception occurred. Stack trace below.
java.io.IOException: Invalid file header, probably not an FSA.
	at morfologik.fsa.FSAHeader.read(FSAHeader.java:45)
	at morfologik.fsa.FSA.read(FSA.java:312)
	at morfologik.stemming.Dictionary.read(Dictionary.java:102)
	at morfologik.stemming.Dictionary.read(Dictionary.java:65)
	at morfologik.tools.DictDecompile.call(DictDecompile.java:62)
	at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
	at morfologik.tools.CliTool.main(CliTool.java:133)
	at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
	at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:80)
	at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
Done. The dictionary export has been written to dictionary.dump

I don’t know if this is a big deal, but it seems like something might have gone wrong with the dict creation, since it’s unable to restore the original text from the binary.

My original text file when creating the data was tab-separated (I did it myself first, and then later used freeling2lt.pl in Spanish repo)
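For readers following along, here is a minimal Python sketch of the kind of conversion involved. It is not freeling2lt.pl itself. It assumes each FreeLing dictionary line is a form followed by one or more lemma/tag pairs separated by whitespace, and that the target is one tab-separated form, lemma, tag line per analysis; both the input layout and the column order are assumptions, and `freeling_to_tab` is a hypothetical helper name.

```python
def freeling_to_tab(lines):
    """Yield 'form<TAB>lemma<TAB>tag' strings, one per analysis.

    Assumption: each input line looks like 'form lemma1 tag1 [lemma2 tag2 ...]'.
    """
    for line in lines:
        fields = line.split()
        # Skip blank lines and lines that do not hold complete lemma/tag pairs.
        if len(fields) < 3 or len(fields) % 2 == 0:
            continue
        form, rest = fields[0], fields[1:]
        for lemma, tag in zip(rest[0::2], rest[1::2]):
            yield f"{form}\t{lemma}\t{tag}"


if __name__ == "__main__":
    sample = ["til til CS til SPS00"]
    for row in freeling_to_tab(sample):
        print(row)
```

Malformed lines are skipped silently here; a real conversion would probably want to log them instead.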

Here are some related files:

dnaber · May 2, 2016, 4:15pm

Is this the .dict file in your git? I can export that like this:

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i ~/Downloads/norwegian.dict -info ~/Downloads/norwegian.info -o /tmp/no.txt

This produces a 27MB text file for me.

Eirik · May 8, 2016, 5:00pm

I found my error - a silly typo. Thanks!

I’ve added the tagger now! I think my next step will be to just add some rules and get them working.

@Juan_Martorell I’m finding the Eagles tags a bit hard to read. Would there be some way to alias these tags for the sake of rule creation?

I could consider writing an XML processor for this task if there isn’t. There’s an open issue on Github, so I might write there.

The goal would be to allow us to use a simpler more verbose syntax, so that people who are familiar with English grammar but not with Eagles, could just have their input converted later.
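A rough idea of what such a preprocessor could look like, as a Python sketch rather than anything LanguageTool provides: it rewrites friendly names inside `postag="..."` attributes into EAGLES tags before the rule file is loaded. The alias names and the `expand_aliases` helper are made up for illustration; only the tags CS, CC and SPS00 come from this thread.

```python
import re

# Hypothetical alias table: readable name -> EAGLES tag.
ALIASES = {
    "conj_subordinating": "CS",
    "conj_coordinating": "CC",
    "preposition": "SPS00",
}


def expand_aliases(rule_xml, aliases=ALIASES):
    """Replace alias names in postag="..." attributes with EAGLES tags.

    Names that are not in the alias table are left untouched, so plain
    EAGLES tags can still be mixed in.
    """
    def repl(match):
        names = match.group(1).split("|")
        tags = [aliases.get(name, name) for name in names]
        return 'postag="' + "|".join(tags) + '"'

    # Only postag="..." matches; postag_regexp="..." is not affected
    # because the attribute name must be followed directly by '="'.
    return re.sub(r'postag="([^"]*)"', repl, rule_xml)


if __name__ == "__main__":
    print(expand_aliases('<token postag="conj_subordinating|preposition" postag_regexp="yes"/>'))
```

Keeping the aliases in a plain table means the conversion is one-way and easy to review; the rule files that LanguageTool reads would still contain ordinary EAGLES tags.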

EDIT: Just want to report that the POS tagger works for Norwegian; i.e. I’m able to write rules using them! Amazing! I feel like I can now focus on writing the first 50-100 rules, then perhaps add more features when things get more familiar.

Juan_Martorell · May 9, 2016, 10:13am

Once you get used to them, they are not that arid. In fact, to me they look tidy now. My suggestion is that you print a cheatsheet on paper like I have with the codes, descriptions and a sample.

Great! Try to implement this easy rule:

Jeg gikk til han → suggest ham
Di venter for han → suggest ham

With synthesis you can also put into the same rule

Jeg gikk til hun → suggest hennes

The latter is less seen (just perpetrated by nøwbies like me), but in order to practice it does not hurt.
My Norwegian grammar is awful, but I hope you understand what I mean.

Eirik · May 9, 2016, 11:00am

That’s a good idea for a rule!

Just one problem: I still haven’t even been able to find a cheatsheet. I’m using this:

github.com

TALP-UPC/FreeLing/blob/master/data/nb/tagset.dat

<DecompositionRules>
A 2 adjective type/O:ordinal;Q:qualificative degree/S:superlative;A:comparative;P:positive gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural function/P:participle;R:preparticiple case/G:genitive definite/D:yes;U:no
C 2 conjunction type/C:coordinating;S:subordinating;A:adverbial
D 2 determiner type/D:demonstrative;P:possessive;T:interrogative;M:amplifier;Q:quantifier person/1:1;2:2;3:3 gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural definite/D:yes;U:no other/P:polite;R:reciprocal
N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural neclass/S:person;G:location;O:organization;V:other nesubclass/0:0;P:0 case/N:nominative;G:genitive definite/D:yes;U:no
P 2 pronoun type/D:demonstrative;Q:quantifier;T:interrogative;P:personal;X:possessive;R:relative;C:reciprocal;F:reflexive person/1:1;2:2;3:3 gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural case/N:nominative;A:accusative polite/P:yes human/H:yes
R 2 adverb type/N:negative;G:general
S 2 adposition type/P:preposition contracted/S:0 gen/M:masculine;F:feminine num/S:singular;P:plural
V 3 verb type/M:main;A:auxiliary;S:semiauxiliary;V:sverb;P:passive mood/I:indicative;M:imperative;P:participle;N:infinitive tense/P:present;S:past
Z 2 number type/d:partitive;m:currency;p:percentage;u:unit
W 0 date
I 0 interjection
</DecompositionRules>
<DirectTranslations>
Fc Fc pos=punctuation|type=comma
Fs Fs pos=punctuation|type=etc
Fd Fd pos=punctuation|type=colon
Fx Fx pos=punctuation|type=semicolon
Fg Fg pos=punctuation|type=hyphen
Fe Fe pos=punctuation|type=quotation

This file has been truncated.

but it’s like I’m hunting for the Rosetta stone and only finding fragments.

I can make sense of some tags, but I don’t know how to read these because of their atypical lengths:

til til CS
til til SPS00

Eirik · May 9, 2016, 12:06pm

  <rulegroup id="HAN_HAM" name="han(s)/ham(do)">
  	<rule>
  		<pattern>
  	    	<token regexp="yes" postag="CS|CC|SPS00" postag_regexp="yes"></token>
  			<token regexp="yes">han</token>
  		</pattern>
  		<message>Did you mean <suggestion>\1 ham</suggestion>?</message>
  	</rule>
  	<rule>
  		<pattern>
  	    	<token regexp="yes" postag="CS|CC|SPS00" postag_regexp="yes"></token>
  			<token regexp="yes">hun</token>
  		</pattern>
  		<message>Did you mean <suggestion>\1 henne</suggestion>?</message>
  	</rule>
  </rulegroup>

This is working well!

Juan_Martorell · May 9, 2016, 1:42pm

Try mine. It’s made from another document. Though it is in Spanish, it’s not difficult to figure out what is what. Please feel free to ask should you have trouble with it.

I guess the coding is similar: just match the position. First is the category, second is type. From there, depending on the type you get all the possible values.
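The positional reading described here can be sketched in Python against the tagset.dat excerpt quoted earlier in the thread. This is an illustration, not FreeLing's own decoder. It assumes that the first letter of a tag selects a DecompositionRules line, that each following character maps to the attributes in the order listed, and that a `0` means "not specified"; the helper names `parse_rules` and `decode` are hypothetical.

```python
# Two rule lines copied from the tagset.dat excerpt quoted earlier in the thread.
RULES = """\
C 2 conjunction type/C:coordinating;S:subordinating;A:adverbial
S 2 adposition type/P:preposition contracted/S:0 gen/M:masculine;F:feminine num/S:singular;P:plural
"""


def parse_rules(text):
    """Map category letter -> (category name, [(attribute, {code: value})])."""
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        letter, name = fields[0], fields[2]  # fields[1] is not needed here
        attrs = []
        for spec in fields[3:]:
            attr, _, values = spec.partition("/")
            codes = dict(v.split(":", 1) for v in values.split(";"))
            attrs.append((attr, codes))
        rules[letter] = (name, attrs)
    return rules


def decode(tag, rules):
    """Read a tag position by position; '0' is treated as unspecified (assumption)."""
    name, attrs = rules[tag[0]]
    result = {"pos": name}
    for char, (attr, codes) in zip(tag[1:], attrs):
        if char != "0":
            result[attr] = codes.get(char, char)
    return result


if __name__ == "__main__":
    rules = parse_rules(RULES)
    print(decode("CS", rules))     # {'pos': 'conjunction', 'type': 'subordinating'}
    print(decode("SPS00", rules))  # {'pos': 'adposition', 'type': 'preposition', 'contracted': '0'}
```

Read this way, the two entries for "til" differ only because a conjunction has one attribute while an adposition has four, which would explain the different tag lengths.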

Jan_Schreiber · May 9, 2016, 5:17pm

Just a side note: This will probably cause a validation error because “han” is unnecessarily flagged as a regular expression.
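As a rough way to catch this before running the real validation, a small Python check could list tokens that set regexp="yes" although their text contains no regular-expression metacharacters. This is only a sketch and not LanguageTool's validator; the `needless_regexp_tokens` name and the exact set of metacharacters are assumptions, and tokens with empty text are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Characters that suggest the token text really is a regular expression.
META = re.compile(r"[\\.\[\]()|?*+{}^$]")


def needless_regexp_tokens(rules_xml):
    """Return the text of tokens marked regexp="yes" that look like plain words."""
    root = ET.fromstring(rules_xml)
    flagged = []
    for token in root.iter("token"):
        text = (token.text or "").strip()
        if token.get("regexp") == "yes" and text and not META.search(text):
            flagged.append(text)
    return flagged


if __name__ == "__main__":
    sample = (
        '<rule><pattern>'
        '<token regexp="yes">han</token>'
        '<token regexp="yes">han|hun</token>'
        '</pattern></rule>'
    )
    print(needless_regexp_tokens(sample))  # ['han']
```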

Eirik · May 21, 2016, 5:14pm

Thanks. I’ll keep in mind to only use the regexp attribute when needed!

Eirik · May 21, 2016, 5:26pm

Thanks for the cheatsheet! It’s starting to make more sense.

My next plan now is to make ~100 rules. I currently have material for only about 10, based on my own errors.

I’m thinking it would be fun to perhaps run checks on Norwegian Wikipedia (tbh there’s a lot of machine-translated garbage there, depending on the article … so a focus on rooting out MT-generated anglicisms might help lift the quality).

If you have any requests as to the sort of errors you’d like covered, that would be nice. Otherwise I will start implementing any ‘top 100 errors’ I find online, which are not covered by a spelling dictionary.

I’d like to focus on adding rules for errors that people actually make somewhat frequently, so having a corpus to go by would be quite useful (though I guess I can use Google to get an indication of frequency).

One more thing: Do you know if it would be possible to host my own languagetool.org website in order to add temporary support for Norwegian on the front page? This way testing Norwegian would be much easier.

dnaber · May 21, 2016, 5:55pm

Hi Eirik,

you can get the website from GitHub (languagetool-org/languagetool-website, whose repository description reads “DO NOT USE, THIS IS OUTDATED”). All you need is a PHP-enabled web server. Also, you’ll need to point the check to your own version of the API, which already supports Norwegian.

An idea about the rules: I don’t know how many people are learning Norwegian, but you could also consider adding rules for language learners. Most languages in LT are not optimized for that use case.

Eirik · May 21, 2016, 6:08pm

Thanks, I’ll set this up on my Nginx! This way I could invite people who may be interested in contributing to paste their Norwegian onto my test website

I think I’ll take this approach:

1. Add errors I make myself
2. Common error lists I find online
3. Common errors made by learners

(3) might make it possible to get some assistance from Tatoeba.

Also, I will be testing the Wikipedia module later, so that I can test my rules on a larger corpus. This should prove interesting, and quicker than working in my own sandbox, where I’m currently writing my own test sentences in a text file!

Eirik · May 29, 2016, 3:46pm

I’ve started the web server as well as the service on default port.

In an attempt to reroute the website to my own languagetool-server, I did a global replace of “https://languagetool.org” to “http://localhost”.

However, I now get an error like “Error: Did not get response from service. Please try again in one minute.” when trying the website.

Any ideas?

My next plan is to add Norwegian in the HTML as needed, to bring up an example text, and then I should be able to use it normally.

dnaber · May 29, 2016, 4:05pm

Please press F12 to open the development panel in your browser. In the network tab, you should see more information about why the request fails. Also, we’re just in the process of switching to a new API, so please make sure you’re using the very latest version of the website and the LT server.

Eirik · May 29, 2016, 4:30pm

Thanks!

I simply had to add the port to the URL and add allow origin *.

Here are the steps, for future reference, in case someone is in the same position and finds this thread.

Clone the website:
git clone https://github.com/languagetool-org/languagetool-website

Link the website to your own running LanguageTool service. Quick and dirty way to do this [unwanted side-effects may show up later] is to do a global replace:
find . -name '*.php' -type f -exec sed -i 's/https:\/\/languagetool\.org/http:\/\/localhost:8081/g' {} \;

Go to the folder of the compiled snapshot you wish to use and start the server:
java -cp languagetool-server.jar org.languagetool.server.HTTPServer --public --allow-origin "*"
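To confirm that the local server answers before debugging the website, a small Python sketch like the following could be used. It assumes the server listens on port 8081 as in the steps above and exposes the `/v2/check` endpoint of the newer API with `language` and `text` form parameters. The language code `no` is an assumption about how the Norwegian module is registered, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_check_request(text, language="no", base_url="http://localhost:8081"):
    """Build (but do not send) a POST request for the check endpoint.

    Assumptions: the /v2/check path and the 'no' language code.
    """
    body = urllib.parse.urlencode({"language": language, "text": text}).encode("utf-8")
    return urllib.request.Request(base_url + "/v2/check", data=body)


def check(text, **kwargs):
    """Send the request to a running server and return the parsed JSON reply."""
    with urllib.request.urlopen(build_check_request(text, **kwargs), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    request = build_check_request("Jeg gikk til han")
    print(request.get_method(), request.full_url)
    # With the server from the steps above running, this sends the request:
    # print(check("Jeg gikk til han"))
```

If this works from the command line but the website still reports no response, the browser's network tab is the next place to look, since that usually points to a wrong URL or a missing allow-origin header.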

Juan_Martorell · November 12, 2016, 8:15pm

Did you manage to make it work, Eirik?

Eirik · November 12, 2016, 11:50pm

Hi Juan,

I did get it working, finally, back a few months ago now. I was able to add my own Norwegian tests, and the modified tokenizer was working just fine. Unfortunately, work got real busy as I had to start working on a different commercial tool. I also am translating a lot to make ends meet … as it’s now been a year of lots of programming without having earned a cent!

As for LanguageTool, I might wish to integrate LT into my commercial system as a plugin, as long as there is no legal or technical obstacle. I’m afraid Norwegian will be something I won’t be able to work on too much in the near future …

I will go over the Norwegian fork I’ve been working on when I get a chance, but I wish there were more people working on it! Perhaps you would like to help me create tests at a later point assuming I’ve already laid the ground work and we just need another few dozen tests to reach the min. required level?

Also, how are Catalan and Spanish doing these days?

jeblad · October 15, 2017, 3:07pm

I’m going to do some work on Norwegian. I will first add some Wikipedia-specific rules, and then perhaps some more general rules.

I believe Wikipedia is the most important community for getting traction for such a project, and especially for getting a working solution for VisualEditor.