Extracting POS data from dictionaries

Hello!

I have been coding a feature for Proofing Tool GUI that will load dictionaries and extract their POS data into a .txt file, ready to copy and paste into added.txt.

My main use for it will be extracting POS data for thousands of proper names (people, countries, cities) from the Portuguese dictionary.

But there is a bonus: I have been adding POS data to the British speller and have so far marked 600+ uncountable nouns as such.

@Mike_Unwalla @tiff

What is the POS tag for uncountable nouns in the English added.txt?

Note that it will still take me a few days to get it working.

Thanks!

Hi Marco,

Non-count nouns are NN:U
Nouns that can be both count nouns and non-count nouns are NN:UN

Refer to \org\languagetool\resource\en\tagset.txt.

@Mike_Unwalla

Please see the screenshot:

Do you mean that uncountable nouns are:
NN:U

And “usually uncountable” are:
NN:U
NN:UN
?

Thanks!

Hi Marco,

I don’t know the word ‘citalopram’. I guess that it is uncountable (NN:U), but I am not sure. The NHS uses the word as an uncountable noun (mass noun): Citalopram: a medicine that treats low mood and panic attacks - NHS.

Lexico tells me that ‘citriculture’ is a mass noun (Dictionary.com | Meanings and Definitions of Words at Dictionary.com). But, LT has the plural POS NNS. Many words that standard dictionaries show as non-count can be used as count nouns in some contexts.

I don’t mean “usually” in any context. I mean “can”, even if that is one in a million.

I mean:

But, I read the explanation in tagset.txt again:
NN Noun, singular or mass: bicycle, earthquake, zipper
I think the text is not correct. I think it should be:
Noun, singular count noun: bicycle, earthquake, zipper

@danielnaber, @tiff, can you clarify the meaning of NN? Is the explanation in tagset.txt correct?

Hello!

I have coded the function in Proofing Tool GUI that extracts POS as text files, ready to copy/paste into added.txt.

In the past few months I have been adding POS data to the British dictionary, mainly “uncountable” and “usually uncountable” nouns.

I have been adding POS data from Wiktionary, since it is based on a real physical dictionary.

I just need to know if “usually uncountable nouns” are:
NN:UN

Here are a few examples of what I extracted:

aerospace aerospace NN:UN
aerotropism aerotropism NN:U
affinition affinition NN:U
Africanity Africanity NN:U
Afrikanerdom Afrikanerdom NN:U
Afrocentrism Afrocentrism NN:U
Afrofuturism Afrofuturism NN:U
afterdamp afterdamp NN:UN
aftergrass aftergrass NN:UN
agamospermy agamospermy NN:U
agflation agflation NN:U
aggressivity aggressivity NN:UN
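
For readers who want to check a list like this before pasting it, here is a minimal sketch that reads entries in the three-field layout shown above (word, lemma, POS tag) and groups the words by tag. It is not the Proofing Tool GUI code. The names `parse_entries` and `group_by_tag` are hypothetical, and the sketch assumes the fields are separated by whitespace and that lines starting with `#` are comments.

```python
from collections import defaultdict

# A few of the entries quoted in the post above.
SAMPLE = """\
aerospace aerospace NN:UN
aerotropism aerotropism NN:U
afterdamp afterdamp NN:UN
agflation agflation NN:U
"""


def parse_entries(text):
    """Yield (word, lemma, tag) tuples from three-field lines.

    Assumption: fields are separated by whitespace (tabs or spaces),
    and blank lines or lines starting with '#' are skipped.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
        yield tuple(parts)


def group_by_tag(entries):
    """Return a dict that maps each POS tag to the list of words carrying it."""
    groups = defaultdict(list)
    for word, _lemma, tag in entries:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(parse_entries(SAMPLE))
for tag, words in groups.items():
    print(tag, len(words), ", ".join(words))
```

Splitting on any whitespace keeps the sketch tolerant of tabs or spaces, but it would reject entries whose word contains a space.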

@Mike_Unwalla @tiff @dnaber

Could you comment regarding the usually uncountable nouns?

It takes one or two seconds to extract the whole list from the GB .dic.

@marcoagpinto, I cannot help you. As best I know, there is no postag for “usually uncountable” nouns.

So, as of 25/MAY/2020, I am only extracting “uncountable nouns”.

In the past few months I have been adding morphologic information to the GB speller, but tons are still missing.

See if the attached files are helpful.

Thanks!

GB_uncountable_20200525.zip (11.4 KB)

@marco, I looked at GB_uncountable_addedtxt_20200525.txt and GB_uncountable_spellingtxt_20200525.txt, which are in GB_uncountable_20200525.zip.

I do not understand what you are doing. What does “adding morphologic information to the GB speller” mean? What information do you add and where do you add it?

File GB_uncountable_addedtxt_20200525.txt contains POS information in the same structure that we use in added.txt. But, added.txt is for all variants of English, not only for en-GB.

Why is it necessary to add POS or spelling for actinium? The POS is in LT and is not recently added (the POS is in LT 4.8). LT 4.8 correctly gives no spelling warning for actinium in any of the language variants.

This screen shot of LT 4.8 shows that most words in GB_uncountable_spellingtxt_20200525.txt do not give a spelling warning for BrE:

The Tagger Result dialog shows that some of the words already have the postag NN:U.

Thus, I am confused.

@Mike_Unwalla

I have been the maintainer of the British speller for ~7 years.

For several months, while searching for possessives and plurals of words already in the dictionary (and new ones), I have also been adding extra information to the words in the .dic, such as “Noun: Uncountable blah blah”, based on Wiktionary.

Now I have coded a feature into my Hunspell tool “Proofing Tool GUI” that extracts the words that have a given POS defined.

For the zip above, I simply set a “source POS” (.dic) and a “target POS” (.txt - LanguageTool) in Proofing Tool GUI and extracted the result in both added.txt and spelling.txt format.

Does this explain it well?

Thanks!

Ahhhh… I extracted the whole GB .dic, so it is possible that tons of words were already in LT.

Also, the dictionary is British, but the words are valid in other variants.
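
Since the whole .dic was extracted, the list could be reduced to the entries LT does not have yet before anything is added. The following is only a sketch of that filtering step under stated assumptions: `missing_entries` is a hypothetical helper, entries are (word, lemma, tag) tuples, and the `existing` list is made-up sample data standing in for whatever is already in LT (the thread only confirms that actinium is already tagged).

```python
def missing_entries(extracted, existing):
    """Return the extracted entries that are not already known.

    Both arguments are iterables of (word, lemma, tag) tuples.
    The order of the extracted entries is preserved.
    """
    known = set(existing)
    return [entry for entry in extracted if entry not in known]


# Sample data only: 'existing' stands in for entries already present in LT.
extracted = [
    ("actinium", "actinium", "NN:U"),
    ("agflation", "agflation", "NN:U"),
]
existing = [
    ("actinium", "actinium", "NN:U"),
]

to_add = missing_entries(extracted, existing)
for word, lemma, tag in to_add:
    print(f"{word}\t{lemma}\t{tag}")
```

Comparing whole tuples means a word that is known with a different tag still counts as missing, which may or may not be what is wanted.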

Yes, thank you.

I know nothing about the .dic file.

What, if anything, do you want me to do now?

@Mike_Unwalla

Simply insert the missing POSes in the English added.txt and spelling.txt :slight_smile:

@marcoagpinto, will do.

I added the missing POS in [en] Add POS NN:U · languagetool-org/languagetool@b7b9c23 · GitHub. But, the message is:
Showing with 790 additions and 104 deletions.

I thought that I had added 686 missing POS only. I will try again.

Done the POS ([en] Add POS NN:U · languagetool-org/languagetool@6ae9742 · GitHub).

I misunderstood the message. 790-104=686. So, there wasn’t an error the first time.

Done the spellings ([en] Add spellings · languagetool-org/languagetool@f385c4d · GitHub).

For each word that gave an error for BrE, I checked the word on www.lexico.com and www.merriam-webster.com. If I thought that the word is applicable to all variants of English, I added it to spelling.txt. If I wasn’t sure, I added it to spelling_en-GB.txt.
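
The decision described above was made by hand, but the rule itself can be written down as a small sketch. The function name `target_spelling_file` and the True/False/None encoding of the judgement are assumptions made for illustration. Only the two file names come from the post, and the post does not say what happens to a word judged British-only, so sending it to the en-GB file is also an assumption.

```python
def target_spelling_file(applies_to_all_variants):
    """Pick the spelling file for a word that gave an error for BrE.

    applies_to_all_variants:
        True  -> the word was judged valid in all variants of English
        None  -> not sure
        False -> judged British-only (assumption: treated like 'not sure')
    """
    if applies_to_all_variants is True:
        return "spelling.txt"
    return "spelling_en-GB.txt"


# Hypothetical judgements, for illustration only.
judgements = {"wordA": True, "wordB": None}
for word, verdict in judgements.items():
    print(word, "->", target_spelling_file(verdict))
```

Defaulting to the variant-specific file when unsure keeps a doubtful word from being accepted in every variant.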

I didn’t add ‘benzedrine’, because that is derived from a proper noun. (LT spell check suggests Benzedrine.) When we have a rule for benzedrine/Benzedrine, we can add the lower-case spelling.

cool :slight_smile: