I am thinking it would be best to stick with the English abbreviations like NN, NNS, and similar. The main difference from English is that the indefinite/definite dimension is expressed as word endings instead of as determiners, and Norwegian has 3 genders.
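To make the idea concrete, here is a tiny sketch of how English-style base tags could be extended with the two dimensions Norwegian adds. The tag format (`NN:MAS:DEF` etc.) is entirely made up for illustration, not an existing standard:

```python
# Hypothetical sketch: extending Penn-style noun tags with gender and
# definiteness. The suffix codes below are invented for illustration.

GENDERS = ("MAS", "FEM", "NEU")   # Norwegian's three genders
DEFINITENESS = ("IND", "DEF")     # expressed as endings, not determiners

def extend_tag(base: str, gender: str, definite: str) -> str:
    """Build e.g. 'NN:MAS:DEF' from the English base tag 'NN'."""
    assert gender in GENDERS and definite in DEFINITENESS
    return f"{base}:{gender}:{definite}"

# 'gutten' (the boy) would get the masculine definite tag:
print(extend_tag("NN", "MAS", "DEF"))  # NN:MAS:DEF
```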
My next step is to follow the “Developing a Tagger Dictionary” procedure and see what happens.
That’s true. I don’t particularly like our English tagset, as it is not too informative (for example, it does not distinguish inflected and base forms of personal pronouns in tags, they are all marked as PRP, which makes writing rules painful). Try to use a tagset that retains all the information you already have in the original data set.
@Juan I’m looking into EAGLES, but I’m having a bit of trouble navigating the website. It’s hard to make heads or tails of it to be honest. If you could steer me to a table of POS tags or something that would be of particular relevance, that would be awesome!
For now, just to see what I’m dealing with here, I’m looking into how the existing English tag set maps over to Norwegian. My plan is then to juxtapose this data with EAGLES later.
There are fewer than the 1.1 million forms you found before, but they have a predictable EAGLES format, which is a good starting point, and besides, you can grow it later.
I’m involved in an ongoing effort to grow the Spanish dictionary in my lab. Feel free to browse it, ask questions, and comment on it. If people here find that work useful, I can easily turn it into a language-agnostic project.
I was able to create a binary dictionary file now. However, when trying to restore a raw text file from the created binary, I run into an error:
An unhandled exception occurred. Stack trace below.
java.io.IOException: Invalid file header, probably not an FSA.
at morfologik.fsa.FSAHeader.read(FSAHeader.java:45)
at morfologik.fsa.FSA.read(FSA.java:312)
at morfologik.stemming.Dictionary.read(Dictionary.java:102)
at morfologik.stemming.Dictionary.read(Dictionary.java:65)
at morfologik.tools.DictDecompile.call(DictDecompile.java:62)
at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
at morfologik.tools.CliTool.main(CliTool.java:133)
at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:80)
at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
Done. The dictionary export has been written to dictionary.dump
I don’t know if this is a big deal, but it seems like something might have gone wrong with the dictionary creation, since it’s unable to restore the original text from the binary.
My original text file when creating the data was tab-separated (I did it myself first, and then later used freeling2lt.pl from the Spanish repo).
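One cheap thing to rule out before blaming the binary builder is a malformed input line. This is just a sanity-check sketch (not part of the toolchain), assuming each line should carry exactly three tab-separated columns, which is the format I understand freeling2lt.pl produces:

```python
# Sanity check (a sketch): verify that every line of the tab-separated
# input has exactly three columns before building the binary dictionary.
# The three-column assumption (inflected form, lemma, POS tag) is mine.

def check_dict_lines(lines):
    """Return (line_number, line) pairs that don't have 3 tab-separated fields."""
    bad = []
    for n, line in enumerate(lines, start=1):
        if line.count("\t") != 2:
            bad.append((n, line))
    return bad

sample = ["gutt\tgutt\tNN", "gutten gutt NN"]  # second line uses spaces
print(check_dict_lines(sample))  # flags line 2
```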
I’ve added the tagger now! I think my next step will be to just add some rules and get them working.
@Juan_Martorell I’m finding the EAGLES tags a bit hard to read. Would there be some way to alias these tags for the sake of rule creation?
I could consider writing an XML processor for this task if there isn’t. There’s an open issue on GitHub, so I might write there.
The goal would be to allow us to use a simpler, more descriptive syntax, so that people who are familiar with English grammar but not with EAGLES could just have their input converted later.
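The aliasing idea could be as simple as a lookup table expanded before rules are compiled. The alias names and codes below are examples I made up, not a standard:

```python
# Sketch of the aliasing idea: expand human-readable tag names into
# EAGLES codes (or regexes over them). All entries here are invented
# examples, not an agreed-upon alias set.

ALIASES = {
    "noun-common": "NC.*",
    "noun-proper": "NP.*",
    "verb-main":   "VM.*",
}

def expand_alias(tag: str) -> str:
    """Return the EAGLES regex for an alias, or the tag unchanged."""
    return ALIASES.get(tag, tag)

print(expand_alias("noun-common"))  # NC.*
print(expand_alias("PRP"))         # PRP (no alias, passed through)
```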
EDIT: Just want to report that the POS tagger works for Norwegian; i.e. I’m able to write rules using them! Amazing! I feel like I can now focus on writing the first 50-100 rules, then perhaps add more features when things get more familiar.
Once you get used to them, they are not that arid. In fact, to me they look tidy now. My suggestion is that you print a cheatsheet on paper, like I have, with the codes, descriptions, and a sample.
Great! Try to implement this easy rule:
Jeg gikk til han → suggest ham
Di venter for han → suggest ham
With synthesis you can also put this into the same rule:
Jeg gikk til hun → suggest hennes
The latter error is less common (mostly perpetrated by nøwbies like me), but it doesn’t hurt for practice.
My Norwegian grammar is awful, but I hope you understand what I mean.
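For reference, the first rule above might look something like this in grammar.xml syntax. This is only a sketch: the rule id, the preposition list, and the message text are my own choices, and the Norwegian examples are taken from the post as-is:

```xml
<!-- Sketch of the han/ham rule; id, token list and message are mine. -->
<rule id="HAN_ETTER_PREPOSISJON" name="han vs. ham after preposition">
  <pattern>
    <token regexp="yes">til|for</token>
    <marker>
      <token>han</token>
    </marker>
  </pattern>
  <message>After a preposition, use the object form: <suggestion>ham</suggestion>.</message>
  <example correction="ham">Jeg gikk til <marker>han</marker>.</example>
</rule>
```

The `<marker>` inside the pattern restricts the underline and suggestion to the pronoun itself, so the preposition is matched but not replaced.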
Try mine. It’s made from another document. Though it is in Spanish, it’s not difficult to figure out what is what. Please feel free to ask should you have trouble with it.
I guess the coding is similar: just match the position. First is the category, second is type. From there, depending on the type you get all the possible values.
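The positional decoding described above can be sketched in a few lines. The value tables here are a tiny illustrative subset of the EAGLES codes, just enough to show the idea:

```python
# Positional decoding of an EAGLES-style tag: first character is the
# category, second is the type, and the remaining positions depend on
# the category. The tables below are a small illustrative subset.

CATEGORY  = {"N": "noun", "V": "verb", "P": "pronoun"}
NOUN_TYPE = {"C": "common", "P": "proper"}
GENDER    = {"M": "masculine", "F": "feminine", "N": "neuter", "C": "common"}
NUMBER    = {"S": "singular", "P": "plural"}

def decode_noun_tag(tag: str) -> dict:
    """Decode e.g. 'NCMS' -> noun, common, masculine, singular."""
    return {
        "category": CATEGORY[tag[0]],
        "type":     NOUN_TYPE[tag[1]],
        "gender":   GENDER[tag[2]],
        "number":   NUMBER[tag[3]],
    }

print(decode_noun_tag("NCMS"))
```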
Thanks for the cheatsheet! It’s starting to make more sense.
My next plan now is to make ~100 rules. I currently have material for only about 10, based on my own errors.
I’m thinking it would be fun to perhaps run checks on the Norwegian Wikipedia (tbh there’s a lot of machine-translated garbage there, depending on the article … so a focus on rooting out MT-generated anglicisms might give a real lift in quality).
If you have any requests as to the sort of errors you’d like covered, that would be nice. Otherwise I will start implementing any ‘top 100 errors’ I find online, which are not covered by a spelling dictionary.
I’d like to focus on adding rules for errors that people actually make somewhat frequently, so having a corpus to go by would be quite useful (though I guess I can use Google to get an indication of frequency).
One more thing: Do you know if it would be possible to host my own languagetool.org website in order to add temporary support for Norwegian on the front page? That way, testing Norwegian would be much easier.
An idea about the rules: I don’t know how many people are learning Norwegian, but you could also consider adding rules for language learners. Most languages in LT are not optimized for that use case.
Thanks, I’ll set this up on my Nginx! That way I can invite people who may be interested in contributing to paste their Norwegian text into my test website.
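For anyone following along, a minimal reverse-proxy setup could look like the sketch below. The server name and port are placeholders, and the proxy path is my own choice; the LanguageTool HTTP server itself would be started separately (e.g. with `org.languagetool.server.HTTPServer` from languagetool-server.jar):

```nginx
# Sketch: Nginx in front of a local LanguageTool server listening on
# 127.0.0.1:8081. Hostname and paths below are placeholders.
server {
    listen 80;
    server_name lt-test.example.org;

    location /api/ {
        proxy_pass http://127.0.0.1:8081/;
        proxy_set_header Host $host;
    }
}
```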
I think I’ll take this approach:
1. Add errors I make myself
2. Common error lists I find online
3. Common errors made by learners
(3) might make it possible to get some assistance from Tatoeba.
Also, I will be testing the Wikipedia module later, so that I can test my rules on a larger corpus. This should prove interesting, and quicker than working in my own sandbox, where I’m currently writing test sentences in a text file!
Please press F12 to open the development panel in your browser. In the network tab, you should see more information about why the request fails. Also, we’re just in the process of switching to a new API, so please make sure you’re using the very latest version of the website and the LT server.