Adding a new language, Norwegian

Eirik · April 10, 2016, 4:55pm

Thanks, it worked!

I’m starting to convert Norwegian data. Here is some data I reworked and renamed to be easier to read:

inflected form | original form | grammar data

fisk	fisk	 noun  masculine  appellative  singular  indefinite 
fisken	fisk	 noun  masculine  appellative  singular  definite 
fisker	fisk	 noun  masculine  appellative  plural  indefinite 
fiskene	fisk	 noun  masculine  appellative  plural  definite

I am thinking it would be best to stick with the English abbreviations like NN, NNS, and similar. The main differences from English are that the indefinite/definite dimension is expressed as word endings instead of as determiners, and that Norwegian has 3 genders.

My next step is to follow the “Developing a Tagger Dictionary” procedure and see what happens.

Juan_Martorell · April 10, 2016, 5:47pm

Since you seem quite resourceful, I’d advise using the EAGLES tagset: it’s a good standard, easy to follow, and meant for computer processing.

MarcinMilkowski · April 11, 2016, 4:19pm

That’s true. I don’t particularly like our English tagset, as it is not too informative (for example, it does not distinguish inflected and base forms of personal pronouns in tags, they are all marked as PRP, which makes writing rules painful). Try to use a tagset that retains all the information you already have in the original data set.
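As a rough illustration of "retaining all the information", here is a minimal Python sketch that turns one of the noun rows above into a positional tag. It assumes the slot order of the "N" rule in FreeLing's nb tagset.dat (quoted later in this thread), assumes `0` fills any slot with no value, and assumes "appellative" corresponds to a common noun. `noun_tag` and `convert_line` are hypothetical helper names, and whether FreeLing itself emits exactly this string is not confirmed here.

```python
# Noun feature slots, in the order listed by the "N" rule of FreeLing's
# nb tagset.dat: type, gen, num, neclass, nesubclass, case, definite.
NOUN_SLOTS = [
    {"appellative": "C", "proper": "P"},  # type (appellative = common noun: assumption)
    {"feminine": "F", "masculine": "M", "common": "C", "neuter": "N"},  # gen
    {"singular": "S", "plural": "P"},  # num
    {},  # neclass (not present in the sample data)
    {},  # nesubclass (not present in the sample data)
    {"nominative": "N", "genitive": "G"},  # case
    {"definite": "D", "indefinite": "U"},  # definite
]


def noun_tag(features):
    """Build a positional tag; a slot with no matching feature becomes '0' (assumption)."""
    tag = "N"
    for slot in NOUN_SLOTS:
        tag += next((slot[f] for f in features if f in slot), "0")
    return tag


def convert_line(line):
    """Split 'form<TAB>lemma<TAB>grammar data' and encode the grammar data as a tag."""
    form, lemma, grammar = line.rstrip("\n").split("\t")
    features = grammar.split()
    if features[0] != "noun":
        raise ValueError("this sketch only handles nouns")
    return form, lemma, noun_tag(features[1:])


print(convert_line("fisken\tfisk\t noun  masculine  appellative  singular  definite "))
# ('fisken', 'fisk', 'NCMS000D')
```

Unlike NN/NNS, a tag built this way keeps gender and definiteness, so rules can match on them later.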

Eirik · April 22, 2016, 3:15am

@Juan I’m looking into EAGLES, but I’m having a bit of trouble navigating the website. It’s hard to make heads or tails of it to be honest. If you could steer me to a table of POS tags or something that would be of particular relevance, that would be awesome!

For now, just to see what I’m dealing with here, I’m looking into how the existing English tag set maps over to Norwegian. My plan is then to juxtapose this data with EAGLES later.

In case you’re interested, here is the paper that suggested around 95 % POS tagging accuracy for Norwegian:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/801_Paper.pdf
It talks about Freeling and other things.

@Marcin Thanks. I will make sure to preserve all the original data.

Eirik · April 22, 2016, 3:21am

Apparently FreeLing has quite a few Norwegian modules now. It appears I might be off to a good start if I can use as much as possible from here:

Juan_Martorell · April 22, 2016, 11:59am

Sure! Just download the files in FreeLing/data/nb/dictionary/entries at master · TALP-UPC/FreeLing · GitHub (remember, the raw data) and you can build a quick and dirty dictionary:

$ cat *.txt | sort > dictionary.dump
$ wc -l dictionary.dump
867603 dictionary.dump

That is fewer than the 1.1 million forms you found before, but they have a predictable EAGLES format, which is a good starting point, and besides, you can grow it later.

I have an ongoing effort to grow the Spanish dictionary in my lab. Feel free to browse it, ask questions, and comment on it. If people around here find that work useful, I can easily turn it into a language-agnostic project.

Eirik · April 30, 2016, 10:36pm

Thanks Juan, that link was very helpful.

I was able to create a binary dictionary file now. However, when trying to restore a raw text file from the created binary, I run into an error:

An unhandled exception occurred. Stack trace below.
java.io.IOException: Invalid file header, probably not an FSA.
	at morfologik.fsa.FSAHeader.read(FSAHeader.java:45)
	at morfologik.fsa.FSA.read(FSA.java:312)
	at morfologik.stemming.Dictionary.read(Dictionary.java:102)
	at morfologik.stemming.Dictionary.read(Dictionary.java:65)
	at morfologik.tools.DictDecompile.call(DictDecompile.java:62)
	at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
	at morfologik.tools.CliTool.main(CliTool.java:133)
	at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
	at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:80)
	at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
Done. The dictionary export has been written to dictionary.dump

I don’t know if this is a big deal, but it seems like something might have gone wrong with the dict creation, since it’s unable to restore the original text from the binary.
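One quick way to narrow down an "Invalid file header" error is to look at the first bytes of the file being passed to the exporter. The Python sketch below assumes that a Morfologik FSA file starts with the four bytes `\fsa`; that magic value is an assumption taken from memory of the Morfologik sources, and `has_fsa_magic` and `looks_like_fsa` are hypothetical helper names.

```python
import sys

# Assumed magic bytes at the start of a Morfologik FSA file.
FSA_MAGIC = b"\\fsa"


def has_fsa_magic(head: bytes) -> bool:
    """True if the given leading bytes start with the assumed FSA magic."""
    return head.startswith(FSA_MAGIC)


def looks_like_fsa(path):
    """Read only the first few bytes of the file and compare them to the magic."""
    with open(path, "rb") as handle:
        return has_fsa_magic(handle.read(len(FSA_MAGIC)))


if __name__ == "__main__":
    # Usage: python check_fsa.py norwegian.dict
    for path in sys.argv[1:]:
        verdict = "looks like an FSA" if looks_like_fsa(path) else "does not look like an FSA"
        print(f"{path}: {verdict}")
```

If the check fails, the path may be pointing at something other than the compiled dictionary, such as the tab-separated source text.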

My original text file when creating the data was tab-separated (I did it myself first, and then later used freeling2lt.pl in Spanish repo)
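For reference, the conversion step can be sketched in a few lines of Python. This is not the freeling2lt.pl script itself. It assumes that each FreeLing dictionary line has the shape "form lemma1 tag1 lemma2 tag2 ..." and that the target is one tab-separated "form, lemma, tag" row per reading, mirroring the column order of the sample data earlier in this thread. The sample input line and the function names are made up for illustration.

```python
def freeling_to_rows(line):
    """Turn 'form lemma1 tag1 lemma2 tag2 ...' into (form, lemma, tag) triples."""
    fields = line.split()
    # A valid entry has the form plus one or more lemma/tag pairs.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"unexpected entry: {line!r}")
    form, rest = fields[0], fields[1:]
    return [(form, rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]


def convert(lines):
    """Yield one tab-separated row per reading, skipping blank lines."""
    for line in lines:
        if line.strip():
            for form, lemma, tag in freeling_to_rows(line):
                yield f"{form}\t{lemma}\t{tag}"


# Hypothetical input line with two readings for the same form.
for row in convert(["til til CS til SPS00"]):
    print(row)
```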

Here are some related files:

dnaber · May 2, 2016, 4:15pm

Is this the .dict file in your git? I can export that like this:

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i ~/Downloads/norwegian.dict -info ~/Downloads/norwegian.info -o /tmp/no.txt

This produces a 27MB text file for me.

Eirik · May 8, 2016, 5:00pm

I found my error - a silly typo. Thanks!

I’ve added the tagger now! I think my next step will be to just add some rules and get them working.

@Juan_Martorell I’m finding the Eagles tags a bit hard to read. Would there be some way to alias these tags for the sake of rule creation?

I could consider writing an XML processor for this task if there isn’t. There’s an open issue on Github, so I might write there.

The goal would be to allow a simpler, more verbose syntax, so that people who are familiar with English grammar but not with EAGLES could just have their input converted later.

EDIT: Just want to report that the POS tagger works for Norwegian; i.e. I’m able to write rules using them! Amazing! I feel like I can now focus on writing the first 50-100 rules, then perhaps add more features when things get more familiar.

Juan_Martorell · May 9, 2016, 10:13am

Once you get used to them, they are not that arid. In fact, to me they look tidy now. My suggestion is that you print a cheatsheet on paper like I have with the codes, descriptions and a sample.

Great! Try to implement this easy rule:

Jeg gikk til han → suggest ham
Di venter for han → suggest ham

With synthesis you can also put this into the same rule:

Jeg gikk til hun → suggest hennes

The latter is less common (only perpetrated by nøwbies like me), but it doesn’t hurt as practice.
My Norwegian grammar is awful, but I hope you understand what I mean.

Eirik · May 9, 2016, 11:00am

That’s a good idea for a rule!

Just one problem: I still haven’t even been able to find a cheatsheet. I’m using this:

github.com

TALP-UPC/FreeLing/blob/master/data/nb/tagset.dat

<DecompositionRules>
A 2 adjective type/O:ordinal;Q:qualificative degree/S:superlative;A:comparative;P:positive gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural function/P:participle;R:preparticiple case/G:genitive definite/D:yes;U:no
C 2 conjunction type/C:coordinating;S:subordinating;A:adverbial
D 2 determiner type/D:demonstrative;P:possessive;T:interrogative;M:amplifier;Q:quantifier person/1:1;2:2;3:3 gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural definite/D:yes;U:no other/P:polite;R:reciprocal
N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural neclass/S:person;G:location;O:organization;V:other nesubclass/0:0;P:0 case/N:nominative;G:genitive definite/D:yes;U:no
P 2 pronoun type/D:demonstrative;Q:quantifier;T:interrogative;P:personal;X:possessive;R:relative;C:reciprocal;F:reflexive person/1:1;2:2;3:3 gen/F:feminine;M:masculine;C:common;N:neuter num/S:singular;P:plural case/N:nominative;A:accusative polite/P:yes human/H:yes
R 2 adverb type/N:negative;G:general
S 2 adposition type/P:preposition contracted/S:0 gen/M:masculine;F:feminine num/S:singular;P:plural
V 3 verb type/M:main;A:auxiliary;S:semiauxiliary;V:sverb;P:passive mood/I:indicative;M:imperative;P:participle;N:infinitive tense/P:present;S:past
Z 2 number type/d:partitive;m:currency;p:percentage;u:unit
W 0 date
I 0 interjection
</DecompositionRules>
<DirectTranslations>
Fc Fc pos=punctuation|type=comma
Fs Fs pos=punctuation|type=etc
Fd Fd pos=punctuation|type=colon
Fx Fx pos=punctuation|type=semicolon
Fg Fg pos=punctuation|type=hyphen
Fe Fe pos=punctuation|type=quotation

This file has been truncated.

but it’s like I’m hunting for the Rosetta stone and only finding fragments.

I can make sense of some tags, but I don’t know how to read these because of their atypical lengths:

til til CS
til til SPS00

Eirik · May 9, 2016, 12:06pm

  <rulegroup id="HAN_HAM" name="han(s)/ham(do)">
    <rule>
      <pattern>
        <token regexp="yes" postag="CS|CC|SPS00" postag_regexp="yes"></token>
        <token regexp="yes">han</token>
      </pattern>
      <message>Did you mean <suggestion>\1 ham</suggestion>?</message>
    </rule>
    <rule>
      <pattern>
        <token regexp="yes" postag="CS|CC|SPS00" postag_regexp="yes"></token>
        <token regexp="yes">hun</token>
      </pattern>
      <message>Did you mean <suggestion>\1 henne</suggestion>?</message>
    </rule>
  </rulegroup>

This is working well!

Juan_Martorell · May 9, 2016, 1:42pm

Try mine. It’s derived from another document. Though it is in Spanish, it’s not difficult to figure out what is what. Please feel free to ask should you have trouble with it.

I guess the coding is similar: just match the position. First is the category, second is type. From there, depending on the type you get all the possible values.
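The positional reading described here can be sketched in Python, using two rule lines copied from the tagset.dat excerpt earlier in this thread. The sketch assumes that each character after the category letter maps to the next feature in the rule, in order, and that a character not listed for a feature (such as `0`) means "not specified". `parse_rules` and `decode` are hypothetical helper names, not part of FreeLing or LanguageTool.

```python
# Two rule lines copied from the tagset.dat excerpt quoted above.
RULES = """\
C 2 conjunction type/C:coordinating;S:subordinating;A:adverbial
S 2 adposition type/P:preposition contracted/S:0 gen/M:masculine;F:feminine num/S:singular;P:plural
"""


def parse_rules(text):
    """Map each category letter to (category name, ordered list of feature slots)."""
    rules = {}
    for line in text.splitlines():
        letter, _short_len, name, *features = line.split()
        slots = []
        for feature in features:
            attr, _, pairs = feature.partition("/")
            values = dict(pair.split(":") for pair in pairs.split(";"))
            slots.append((attr, values))
        rules[letter] = (name, slots)
    return rules


def decode(tag, rules):
    """Read a tag position by position against the rule for its first letter."""
    name, slots = rules[tag[0]]
    result = {"pos": name}
    for code, (attr, values) in zip(tag[1:], slots):
        if code in values:
            result[attr] = values[code]
        # Any other code (including '0') is treated as "not specified".
    return result


rules = parse_rules(RULES)
print(decode("CS", rules))
# {'pos': 'conjunction', 'type': 'subordinating'}
print(decode("SPS00", rules))
# {'pos': 'adposition', 'type': 'preposition', 'contracted': '0'}
```

Read this way, the two tags for "til" differ in length only because the adposition rule has more feature slots than the conjunction rule.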

Jan_Schreiber · May 9, 2016, 5:17pm

Just a side note: This will probably cause a validation error because “han” is unnecessarily flagged as a regular expression.
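A small Python sketch of how such tokens could be spotted before running the real validation. It is only a heuristic and not LanguageTool's own check: it flags a token when `regexp="yes"` is set but the token text contains none of a hand-picked set of regex characters. The function name and the character set are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Characters that suggest real regex syntax; a rough heuristic, not LT's own rule.
REGEX_CHARS = re.compile(r"[|.\[\](){}*+?^$\\]")


def needless_regexp_tokens(rule_xml):
    """Return the text of tokens marked regexp="yes" whose text has no regex syntax."""
    root = ET.fromstring(rule_xml)
    flagged = []
    for token in root.iter("token"):
        text = token.text or ""
        # Tokens without text (POS-only tokens) are skipped here.
        if text and token.get("regexp") == "yes" and not REGEX_CHARS.search(text):
            flagged.append(text)
    return flagged


RULE = """
<rule>
  <pattern>
    <token regexp="yes" postag="CS|CC|SPS00" postag_regexp="yes"></token>
    <token regexp="yes">han</token>
  </pattern>
  <message>Did you mean <suggestion>\\1 ham</suggestion>?</message>
</rule>
"""

print(needless_regexp_tokens(RULE))
# ['han']
```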

Eirik · May 21, 2016, 5:14pm

Thanks. I’ll keep in mind to only use the regexp attribute when needed!

Eirik · May 21, 2016, 5:26pm

Thanks for the cheatsheet! It’s starting to make more sense.

My next plan now is to make ~100 rules. I currently have material for only about 10, based on my own errors.

I’m thinking it would be fun to perhaps run checks on Norwegian Wikipedia (tbh there’s a lot of machine-translated garbage there, depending on the article … so a focus on rooting out MT-generated anglicisms might help lift the quality).

If you have any requests as to the sort of errors you’d like covered, that would be nice. Otherwise I will start implementing any ‘top 100 errors’ I find online, which are not covered by a spelling dictionary.

I’d like to focus on adding rules for errors that people actually make somewhat frequently, so having a corpus to go by would be quite useful (though I guess I can use Google to get an indication of frequency).

One more thing: Do you know if it would be possible to host my own copy of the languagetool.org website in order to add temporary support for Norwegian on the front page? This way testing Norwegian would be much easier.

dnaber · May 21, 2016, 5:55pm

Hi Eirik,

you can get the website from GitHub - languagetool-org/languagetool-website: DO NOT USE, THIS IS OUTDATED. All you need is a PHP-enabled web server. Also, you’ll need to point the check to your own version of the API, which already supports Norwegian.

An idea about the rules: I don’t know how many people are learning Norwegian, but you could also consider adding rules for language learners. Most languages in LT are not optimized for that use case.

Eirik · May 21, 2016, 6:08pm

Thanks, I’ll set this up on my Nginx! This way I could invite people who may be interested in contributing to paste their Norwegian into my test website.

I think I’ll take this approach:

1. Add errors I make myself
2. Common error lists I find online
3. Common errors made by learners

(3) might make it possible to get some assistance from Tatoeba.

Also, I will be testing the Wikipedia module later, so that I can test my rules on a larger corpus. This should prove interesting, and quicker than working in my own sandbox, where I’m currently writing my own test sentences in a text file!

Eirik · May 29, 2016, 3:46pm

I’ve started the web server as well as the service on the default port.

In an attempt to reroute the website to my own languagetool-server, I did a global replace of “https://languagetool.org” with “http://localhost”.

However, I now get an error like “Error: Did not get response from service. Please try again in one minute.” when trying the website.

Any ideas?

My next plan is to add Norwegian in the HTML as needed, to bring up an example text, and then I should be able to use it normally.

dnaber · May 29, 2016, 4:05pm

Please press F12 to open the development panel in your browser. In the network tab, you should see more information about why the request fails. Also, we’re just in the process of switching to a new API, so please make sure you’re using the very latest version of the website and the LT server.
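Alongside the network tab, it may help to compare the failing request with the URL a local server would be expected to answer on. The Python sketch below only builds such a URL. It assumes the server listens on port 8081 and exposes a `/v2/check` endpoint taking `language` and `text` parameters; both depend on the LanguageTool version in use, especially during an API switch. The language code `no` and the function name `check_url` are placeholders.

```python
from urllib.parse import urlencode


def check_url(text, language, host="localhost", port=8081, path="/v2/check"):
    """Build a check URL for a local server; port and path are assumptions."""
    query = urlencode({"language": language, "text": text})
    return f"http://{host}:{port}{path}?{query}"


print(check_url("Jeg gikk til han", "no"))
# http://localhost:8081/v2/check?language=no&text=Jeg+gikk+til+han
```

If the website's requests go to plain "http://localhost" with no port, they would reach the web server rather than the LanguageTool service, which could explain a missing response.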