Adding a new language, Norwegian

Eirik · March 29, 2016, 11:59pm

Hi all!

I’d like to add Norwegian as a language if possible. Are there any particular requirements? My day job is as a Norwegian translator, so I would use that as my main source. I can also adapt content from the other languages I know, such as Danish, Swedish and English.

I write code in Perl and JavaScript, mostly language checks and CAT features to enhance online web tools (I was stuck with Google Translator Toolkit for a while …) and general work stuff like extracting POs from e-mails or converting dictionary data.

I hope you will give me an excuse to learn Java as well.

Best,
Eirik Birkeland

dnaber · March 30, 2016, 7:10am

Hi Eirik,

thanks for your interest in LanguageTool! The technical details of adding a new language are documented at Adding A New Language - LanguageTool Wiki. You can either try that, or ask for help with it, as it’s a one-time task so someone from the core team could do it. But anyway I’d suggest you write some rules first - you don’t need Norwegian support as a first step. You can take another language’s grammar.xml file, clean it, and just add Norwegian rules there. Details about developing rules are documented at Development Overview - LanguageTool Wiki.

Please note that we can only add a new language once we see someone is going to maintain it actively for a long time. In other words, the first few months you’d be working in your fork before we add Norwegian to the official version of LanguageTool. This is because too many languages we support became unmaintained in the past.

Regards
Daniel

Juan_Martorell · March 31, 2016, 8:41am

Welcome to the community, Eirik. I am the maintainer for Spanish. I live in Norway, and adding Norwegian (Bokmål) to LT is on my wishlist. So please count on my help if you decide to go forward.

LT began as a set of rules written in XML, supported by Java rules where XML was not enough. As Daniel pointed out, the first step in getting familiar with LT is to look at the rules of a language that is accessible to you. In your case, I think English will do, and from what I know, German shares a lot of grammar with Norwegian, so it will be helpful too.

In my view, the main requirement is a good capacity for abstraction, plus proficiency with regular expressions. This will allow you to design good rules that are broad and general, instead of a large set of brute-force rules. For instance, to detect agreement mismatches like

boken er fint
boka er fint

instead of two separate rules, you can write a single one with

bok(a|en) er fint
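
As a minimal sketch of that idea, here is the same pattern in plain Python rather than in LanguageTool's XML rule syntax. The function name is made up for illustration, and the word boundaries and case-insensitivity are my own additions:

```python
import re

# Brute-force pattern from the post: either definite form of "bok"
# (boka / boken) followed by the neuter adjective form "fint".
# \b and IGNORECASE are assumptions added for this sketch.
PATTERN = re.compile(r"\bbok(a|en) er fint\b", re.IGNORECASE)


def find_agreement_errors(text):
    """Return every matched span where the adjective does not agree."""
    return [m.group(0) for m in PATTERN.finditer(text)]


for sentence in ["boken er fint", "boka er fint", "boken er fin"]:
    print(sentence, "->", find_agreement_errors(sentence))
```

One pattern covers both definite forms, which is the point of the example; a real rule would of course need to cover every noun, and that is where tagging comes in.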

Grammar analysis is done in several stages:

Segmentation
Tagging
Disambiguation
Rule evaluation
Suggestion synthesis

Of these stages, only segmentation and rule evaluation are mandatory. The rest are supported but need to be implemented. There are currently more stages, because there are statistical checks using n-gram data and confusion sets, but I am not yet familiar with them, and the documentation about the process in the wiki is getting a little old.

Segmentation divides the text into sentences and the sentences into tokens.

Tagging assigns all possible Part of Speech (PoS) tags to all tokens.

Disambiguation corrects PoS for the tokens via disambiguation rules.

Rule evaluation is where sentences are matched with the rules.

Suggestion synthesis is where the proposed correction is generated from PoS information, e.g. starting from the singular, the system comes up with the plural of the same lemma.
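
To make the first four stages concrete, here is a toy Python sketch. It is not how LanguageTool is implemented: the dictionary, the tag names and both rules are invented for illustration, and suggestion synthesis is left out.

```python
import re

# Toy tagger dictionary: token -> all possible tags (hypothetical tag names).
TAGGER_DICT = {
    "boken": {"NOUN:masc:def"},
    "boka": {"NOUN:fem:def"},
    "er": {"VERB:pres"},
    "fint": {"ADJ:neut", "ADV"},
    "fin": {"ADJ:mf"},
}


def segment(text):
    """Segmentation: split text into sentences, then sentences into tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+", s.lower()) for s in sentences]


def tag(tokens):
    """Tagging: attach every possible tag to each token (empty if unknown)."""
    return [(tok, set(TAGGER_DICT.get(tok, set()))) for tok in tokens]


def disambiguate(tagged):
    """Disambiguation (toy rule): right after 'er', drop the adverb reading."""
    result = []
    for i, (tok, tags) in enumerate(tagged):
        if i > 0 and tagged[i - 1][0] == "er" and len(tags) > 1:
            tags = tags - {"ADV"}
        result.append((tok, tags))
    return result


def evaluate(tagged):
    """Rule evaluation (toy rule): non-neuter noun + 'er' + neuter adjective."""
    matches = []
    for i in range(len(tagged) - 2):
        (noun, ntags), (verb, _), (adj, atags) = tagged[i:i + 3]
        noun_ok = any(t.startswith("NOUN:") and ":neut" not in t for t in ntags)
        if noun_ok and verb == "er" and atags == {"ADJ:neut"}:
            matches.append(f"{noun} {verb} {adj}")
    return matches


def check(text):
    """Run the stages in order and collect every rule match."""
    found = []
    for tokens in segment(text):
        found.extend(evaluate(disambiguate(tag(tokens))))
    return found


print(check("Boken er fint. Boka er fin."))
```

Note that the rule in evaluate only mentions tags, not the word "bok", so it would cover any noun the dictionary knows.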

This introduction is just to let you know that LT is feature-rich and, even though it may seem intimidating, you can grow from a simple brute-force rule matcher to a full-featured intelligent grammar proofer.

There is a caveat, however. If you are as lazy as I am, you will learn that introducing tagging simplifies your ruleset (for the rule evaluation stage) a lot. Disambiguation brings another noticeable simplification. The more complete the tagger dictionary, the simpler the ruleset gets. This means that you should seriously consider starting ASAP with a tagger dictionary and implementing the disambiguator before the ruleset grows beyond control and technical debt accumulates.

That is an architectural decision to make if you are to maintain Norwegian.

About tooling, I find the rule editor a great entry point for newcomers. Unfortunately, this is not available for unsupported languages, so I suggest we add initial support for Norwegian, maybe on a separate branch, or incubator, if possible.

Eirik · April 1, 2016, 11:36pm

Thanks both for the comprehensive overview!

Glad to hear someone here has an interest in Norwegian!

I’m a big RegEx fan. I got into Perl just 2 yrs ago after reading Mastering Regular Expressions by Jeffrey Friedl. These days I am more into JS / Node though (while lamenting the lack of look-behinds … they might be included in ES7 if we’re lucky).

I’ve written regular expressions for ‘missing commas’ in Norwegian, and tried to stretch the limits for what can be done with only a brute-force regex approach. I’m by no means an expert but always hoping to use it more.

I then looked into POS taggers and other things, but only very briefly - enough to kind of know what’s what, having done a few simple tests with NLTK for Python and similar.

I wonder if data such as seen in the following screenshot can be used for training a Norwegian POS tagger …?

(about 1.1 million unique word forms)

E.g. the noun fisk (fish) above has 4 entries: fisk-fisken-fisker-fiskene (indefinite singular, definite singular, indefinite plural, definite plural) and metadata like word class, singular/plural, etc. I know that someone in Norway did an experiment around 2014 and claimed 95% accuracy for POS tagging Norwegian, which sounds rather good to me. I believe it was based on an adaptation of an English tagger.

Anyway, I have much to familiarize myself with, and will get back to you later on this same thread. Oh, and if I take a while to respond sometimes, it’s because my clients are drowning me with work - sorry!

Eirik · April 2, 2016, 6:44pm

I have one question concerning this step:

Do I also adapt the contents of the renamed file (Norwegian.java)?

dnaber · April 2, 2016, 7:08pm

Yes, just replace English with Norwegian and delete the methods that don’t make sense yet like getTagger, getDisambiguator etc.

Eirik · April 2, 2016, 10:13pm

Thanks, that did the trick!

However, near the end I encountered another problem. I thought at first maybe I had made a mistake earlier, so I redid the entire process, but that didn’t help.

It fails at the embedded HTTP server test …

When I open the error logs I find that 2 separate tests appear to fail with this message:
Caused by: java.io.IOException: No language file available named nb at languages/nb!

dnaber · April 3, 2016, 9:16am

Regarding “No language file available named nb at languages/nb”: please try ‘no’ as the language code; it’s what the language detector we use seems to expect.

Juan_Martorell · April 3, 2016, 10:59am

Please note that the ISO codes for Norwegian are nb for Bokmål and nn for Nynorsk, the two official written standards of Norwegian. There is also the Sami language; however, it is spread across several countries and cannot be called Norwegian.

The point is that it would be impolite to use no as the language code. Maybe Eirik can explain this at the language detector’s site and solve the issue at the root.

Juan_Martorell · April 3, 2016, 11:08am

That is supert! You will therefore be quite comfortable with the tool. Just get familiar with it.

Well, that is really awesome because the data shown in there is quite close to the format needed to create both tag and synthesizer dictionaries. @MarcinMilkowski is the guru on this, but I can say at first glance that some tiny awk script will do the trick. May I have access to the source of the screenshot?

That is more than enough.

Eirik · April 3, 2016, 12:08pm

Daniel,
I’m fine to use no for now! I don’t do nynorsk, and e.g. Google uses no as a language code on its pages, indicated by ?hl=no query parameter suffix as part of URLs.

Juan,
I will use no for now to get things working, but maybe we should aim for nb / nn …? I have no personal interest in nynorsk (nn), but it would be too bad if a potential contributor passed up the opportunity just because nynorsk is not available. Only 10-15% of Norwegians prefer ‘nynorsk’, but the linguistic community for it is rather significant (as indicated by the recent printing of a very comprehensive nynorsk dictionary). My quibble with nynorsk is that learning BOTH in school is like Americans having to learn to write both modern American and modern British English properly, where each teacher will make a laughing stock out of you if you start mixing forms. It’s a kind of torture.

So, Juan, since I am quite biased in this, I think you might present a more coherent case for why nb/nn is needed. If you could use your understanding of the political sensitivities and merge in anything I wrote above that might be useful, that would be great!

Here’s the “Norsk ordbank” (Norwegian word repository):

Here’s where I got it originally:
http://www.edd.uio.no/prosjekt/ordbanken/

This same database is used to run the free dictionary “Bokmålsordboka” (http://www.nob-ordbok.uio.no/), and it is very exhaustive for paradigms.

As for reformatting the data, I will use a Node.js CSV parser to reorganize it as needed.

Also, apparently an ordbank for nynorsk exists:
http://www.nb.no/sprakbanken/show?serial=sbr-1&lang=nb
So, if a nynorsk user comes around we’d be able to start that as well.

Eirik · April 3, 2016, 12:16pm

Technically speaking, the language detector adheres to ISO 639-1 at present, given the below:

Since no encompasses both nn and nb according to the above, I suggest that we start this project as no and welcome contributions for both bokmål and nynorsk. Then, as soon as someone interested in maintaining nynorsk comes along, we can split it into two projects: nn and nb.
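
A small Python sketch of that fallback idea, purely as an illustration. The function and its behaviour are hypothetical and not taken from LanguageTool or the language detector; only the three codes themselves come from ISO 639-1.

```python
# ISO 639-1: "no" covers Norwegian as a whole; "nb" (bokmål) and "nn"
# (nynorsk) are the two written standards.
MACROLANGUAGE = {"nb": "no", "nn": "no"}


def resolve_language(code, available):
    """Hypothetical lookup: exact match first, then the macrolanguage."""
    code = code.lower()
    if code in available:
        return code
    fallback = MACROLANGUAGE.get(code)
    if fallback in available:
        return fallback
    raise LookupError(f"No language file available named {code}")


# While only "no" exists, both specific codes resolve to it.
print(resolve_language("nb", {"en", "no"}))
# After a split into nb and nn, the exact code wins.
print(resolve_language("nn", {"en", "nb", "nn"}))
```

With a scheme like this, starting as no would not block a later split: once nb and nn exist, requests for them stop falling back.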

Eirik · April 3, 2016, 3:49pm

General progress update:

Compilation passed!

Eirik · April 9, 2016, 12:52pm

I’ve added support for sentence tokenization by editing segment.srx (using Ratel & Notepad++). It appears to work fine, and for now I will be adding new rules as I go along.

I think my next step now is to tokenize words, before finally getting into POS tagging.

But I’m a bit confused:
Q: Do I work with the WordTokenizer first and then add the Chunker, or is it the other way around? If I’m not mistaken, the Chunker increases the accuracy of identifying noun phrases or something along those lines.

dnaber · April 9, 2016, 7:48pm

The chunker is quite an advanced concept; only a few languages use it so far. You need to start with the WordTokenizer. It is probably very similar, if not identical, to that of other languages.

Eirik · April 10, 2016, 4:55pm

Thanks, it worked!

I’m starting to convert Norwegian data. Here is some data I reworked and renamed to be easier to read:

inflected form | original form | grammar data

fisk	fisk	 noun  masculine  appellative  singular  indefinite 
fisken	fisk	 noun  masculine  appellative  singular  definite 
fisker	fisk	 noun  masculine  appellative  plural  indefinite 
fiskene	fisk	 noun  masculine  appellative  plural  definite

I am thinking it would be best to stick with the English abbreviations like NN, NNS, and similar. The main differences from English are that the indefinite/definite dimension is expressed as word endings instead of determiners, and that Norwegian has 3 genders.

My next step is to follow the “Developing a Tagger Dictionary” procedure and see what happens.

Juan_Martorell · April 10, 2016, 5:47pm

As you look quite resourceful, I’d advise using the EAGLES tagging: a good standard, easy to follow, and meant for computer processing.

MarcinMilkowski · April 11, 2016, 4:19pm

That’s true. I don’t particularly like our English tagset, as it is not too informative (for example, it does not distinguish inflected and base forms of personal pronouns in tags, they are all marked as PRP, which makes writing rules painful). Try to use a tagset that retains all the information you already have in the original data set.
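
A rough Python sketch of what retaining all the information could look like for the four fisk rows above. The abbreviations are invented for this example (they are neither EAGLES nor an existing LanguageTool tagset), and the three-column tab-separated output (inflected form, base form, tag) is an assumption about the dictionary layout:

```python
# Hypothetical abbreviations for the features seen in the fisk rows.
ABBREV = {
    "noun": "N",
    "masculine": "m",
    "appellative": "app",
    "singular": "sg",
    "plural": "pl",
    "indefinite": "ind",
    "definite": "def",
}


def to_dictionary_line(raw_line):
    """Turn 'form<TAB>lemma<TAB>features' into 'form<TAB>lemma<TAB>tag'.

    Every feature is kept, in its original order, so nothing is lost.
    An unknown feature raises KeyError instead of being silently dropped.
    """
    form, lemma, features = raw_line.split("\t")
    tag = ":".join(ABBREV[feature] for feature in features.split())
    return f"{form}\t{lemma}\t{tag}"


rows = [
    "fisk\tfisk\t noun  masculine  appellative  singular  indefinite ",
    "fiskene\tfisk\t noun  masculine  appellative  plural  definite",
]
for row in rows:
    print(to_dictionary_line(row))
```

Because each feature maps to exactly one abbreviation, the tag can be turned back into the original grammar data, which keeps the door open for remapping to another tagset later.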

Eirik · April 22, 2016, 3:15am

@Juan I’m looking into EAGLES, but I’m having a bit of trouble navigating the website. It’s hard to make heads or tails of it to be honest. If you could steer me to a table of POS tags or something that would be of particular relevance, that would be awesome!

For now, just to see what I’m dealing with here, I’m looking into how the existing English tag set maps over to Norwegian. My plan is then to juxtapose this data with EAGLES later.

In case you’re interested, here is the paper that suggested around 95 % POS tagging accuracy for Norwegian:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/801_Paper.pdf
It talks about Freeling and other things.

@Marcin Thanks. I will make sure to preserve all the original data.

Eirik · April 22, 2016, 3:21am

Apparently FreeLing has quite a few Norwegian modules now. It appears I might be off to a good start if I can use as much as possible from here: