Extracting POS data from dictionaries

marcoagpinto · May 19, 2020, 12:43pm

Hello!

I have been coding a feature into Proofing Tool GUI that will let me load dictionaries and extract POS data from them into a .txt file, ready to copy/paste into added.txt.

The main use for me will be to extract POS data for thousands of proper names (people, countries, cities) from the Portuguese dictionary.

But there is a bonus: I have been adding POS data to the British speller and so far have 600+ uncountable nouns marked as such.

@Mike_Unwalla @tiff

What is the POS for uncountable nouns in the English added.txt?

Note that it will still take me a few days to get it working.

Thanks!

Mike_Unwalla · May 19, 2020, 1:35pm

Hi Marco,

Non-count nouns are NN:U
Nouns that can be both count nouns and non-count nouns are NN:UN

Refer to \org\languagetool\resource\en\tagset.txt.

marcoagpinto · May 19, 2020, 2:37pm

@Mike_Unwalla

Please see the screenshot:

Do you mean that uncountable are:
NN:U

And “usually uncountable” are:
NN:U
NN:UN
?

Thanks!

Mike_Unwalla · May 19, 2020, 4:20pm

Hi Marco,

I don’t know the word ‘citalopram’. I guess that it is uncountable (NN:U), but I am not sure. The NHS uses the word as an uncountable noun (mass noun): Citalopram: a medicine that treats low mood and panic attacks - NHS.

Lexico tells me that ‘citriculture’ is a mass noun (Dictionary.com | Meanings and Definitions of Words at Dictionary.com). But, LT has the plural POS NNS. Many words that standard dictionaries show as non-count can be used as count nouns in some contexts.

I don’t mean “usually” in any context. I mean “can”, even if that is one in a million.

I mean:

If a noun is count-only, it is NN. Example: computer.
If a noun is non-count (uncountable, mass) only, it is NN:U. Example: antidisestablishmentarianism (Dictionary.com | Meanings and Definitions of Words at Dictionary.com).
If a noun can be both countable and uncountable, it is NN:UN. Example: oil.
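
The three rules above can be written as a small lookup. This is only a sketch in Python: the function name `noun_postag` and its two boolean parameters are invented for illustration and are not part of LanguageTool. It also takes the rules as stated here, before the question about tagset.txt that follows.

```python
def noun_postag(countable: bool, uncountable: bool) -> str:
    """Return the singular-noun postag according to the three rules above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if countable and uncountable:
        return "NN:UN"  # can be both countable and uncountable, e.g. oil
    if uncountable:
        return "NN:U"   # non-count (uncountable, mass) only
    if countable:
        return "NN"     # count-only, e.g. computer
    raise ValueError("a noun must be countable, uncountable, or both")


print(noun_postag(countable=True, uncountable=True))
```

Note that "can be both" covers every noun with any countable use at all, however rare.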

But, I read the explanation in tagset.txt again:
NN Noun, singular or mass: bicycle, earthquake, zipper
I think the text is not correct. I think it should be:
Noun, singular count noun: bicycle, earthquake, zipper

@danielnaber, @tiff, can you clarify the meaning of NN? Is the explanation in tagset.txt correct?

marcoagpinto · May 22, 2020, 1:44pm

Hello!

I have coded the function in Proofing Tool GUI that extracts POS as text files, ready to copy/paste into added.txt.

In the past few months I have been adding POS data to the British dictionary, mainly “uncountable” and “usually uncountable” nouns.

I have been adding POS data from Wiktionary, since it is based on a real physical dictionary.

I just need to know if “usually uncountable nouns” are:
NN:UN

Here are a few examples of what I extracted:

aerospace	aerospace	NN:UN
aerotropism	aerotropism	NN:U
affinition	affinition	NN:U
Africanity	Africanity	NN:U
Afrikanerdom	Afrikanerdom	NN:U
Afrocentrism	Afrocentrism	NN:U
Afrofuturism	Afrofuturism	NN:U
afterdamp	afterdamp	NN:UN
aftergrass	aftergrass	NN:UN
agamospermy	agamospermy	NN:U
agflation	agflation	NN:U
aggressivity	aggressivity	NN:UN
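
Each line above has three tab-separated fields. A minimal sketch for reading such lines and grouping the words by postag follows. It assumes the fields are word form, lemma, and postag, in that order, and that lines starting with `#` are comments; the function name `parse_added_lines` is hypothetical.

```python
from collections import defaultdict


def parse_added_lines(text):
    """Group word forms by postag from lines shaped like form<TAB>lemma<TAB>postag."""
    by_tag = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        # Raises ValueError if a line does not have exactly three fields.
        form, lemma, postag = line.split("\t")
        by_tag[postag].append(form)
    return dict(by_tag)


sample = (
    "aerospace\taerospace\tNN:UN\n"
    "aerotropism\taerotropism\tNN:U\n"
    "afterdamp\tafterdamp\tNN:UN\n"
)
print(parse_added_lines(sample))
```

Grouping by postag makes it easy to review the NN:U and NN:UN candidates separately.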

@Mike_Unwalla @tiff @dnaber

Could you comment regarding the usually uncountable nouns?

It takes one or two seconds to extract the whole list from the GB .dic.

Mike_Unwalla · May 25, 2020, 12:22pm

@marcoagpinto, I cannot help you. As best I know, there is no postag for “usually uncountable” nouns.

marcoagpinto · May 25, 2020, 1:02pm

So, I am only extracting “uncountable nouns” as of the date 25/MAY/2020.

In the past few months I have been adding morphologic information to the GB speller, but tons are still missing.

See if the attached files are helpful.

Thanks!

GB_uncountable_20200525.zip (11.4 KB)

Mike_Unwalla · May 26, 2020, 9:02am

@marco, I looked at GB_uncountable_addedtxt_20200525.txt and GB_uncountable_spellingtxt_20200525.txt, which are in GB_uncountable_20200525.zip.

I do not understand what you are doing. What does “adding morphologic information to the GB speller” mean? What information do you add and where do you add it?

File GB_uncountable_addedtxt_20200525.txt contains POS information in the same structure that we use in added.txt. But, added.txt is for all variants of English, not only for en-GB.

Why is it necessary to add POS or spelling for actinium? The POS is in LT and is not recently added (the POS is in LT 4.8). LT 4.8 correctly gives no spelling warning for actinium in any of the language variants.

This screen shot of LT 4.8 shows that most words in GB_uncountable_spellingtxt_20200525.txt do not give a spelling warning for BrE:

The Tagger Result dialog shows that some of the words already have the postag NN:U.

Thus, I am confused.

marcoagpinto · May 26, 2020, 9:36am

@Mike_Unwalla

I have been the maintainer of the British speller for ~7 years.

For several months, while searching for possessives and plurals of words already in the dictionary (and new ones), I have been adding extra information to the words in the .dic, such as “Noun: Uncountable blah blah”, based on Wiktionary.

Now I have coded a feature into my Hunspell tool “Proofing Tool GUI” that extracts words with defined POS information.

For the zip above, in Proofing Tool GUI I simply set a “source POS” (.dic) and a “target POS” (.txt - LanguageTool) and extracted the matches in both added.txt and spelling.txt format.
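
The source-to-target mapping can be pictured roughly as below. This is not Proofing Tool GUI's code, only a Python sketch under assumptions: each .dic entry is taken to be `word/FLAGS`, optionally followed by a tab and a free-text note such as `Noun: uncountable`, and the first line of the file is the entry count. The function name `extract_pos` and the sample entries are made up.

```python
def extract_pos(dic_lines, source_pos, target_pos):
    """Return (added_lines, spelling_lines) for entries whose note contains source_pos."""
    added, spelling = [], []
    for raw in dic_lines:
        line = raw.strip()
        if not line or line.isdigit():
            continue  # skip blanks and the entry-count line at the top of a .dic
        entry, _, note = line.partition("\t")  # assumed layout: entry<TAB>note
        if source_pos.lower() not in note.lower():
            continue  # no note, or a different POS note
        word = entry.split("/", 1)[0]  # drop the Hunspell affix flags
        added.append(f"{word}\t{word}\t{target_pos}")  # added.txt style line
        spelling.append(word)                          # spelling.txt style line
    return added, spelling


dic = [
    "3",
    "actinium/M\tNoun: uncountable",
    "oil/SM\tNoun: usually uncountable",
    "computer/SM",
]
added, spelling = extract_pos(dic, "Noun: uncountable", "NN:U")
print("\n".join(added))
```

With this matching, an entry noted as “usually uncountable” is not picked up by the source POS `Noun: uncountable`, because the note does not contain that exact phrase.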

Does this explain it well?

Thanks!

marcoagpinto · May 26, 2020, 9:38am

Ahhhh… I extracted the whole GB .dic, so it is possible that tons of words were already in LT.

Also, the dictionary is British, but the words are valid in other variants.
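
Since the whole .dic was extracted, one way to cut the list down would be to drop lines that are already known before sharing it. The sketch below assumes both inputs are plain lists of added.txt-style lines; getting the full set of entries out of LT's tagger data is outside this sketch, and the name `only_missing` is hypothetical.

```python
def only_missing(extracted, existing):
    """Keep extracted lines that do not already appear among the existing lines."""
    known = {line.strip() for line in existing if line.strip()}
    return [line for line in extracted if line.strip() not in known]


extracted = [
    "actinium\tactinium\tNN:U",
    "agflation\tagflation\tNN:U",
]
existing = ["actinium\tactinium\tNN:U"]
print(only_missing(extracted, existing))
```

This compares whole lines, so a word that exists with a different postag would still be kept for review.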

Mike_Unwalla · May 26, 2020, 10:09am

Yes, thank you.

I know nothing about the .dic file.

What, if anything, do you want me to do now?

marcoagpinto · May 26, 2020, 10:25am

@Mike_Unwalla

Simply insert the missing POSes in the English added.txt and spelling.txt

Mike_Unwalla · May 26, 2020, 10:50am

@marcoagpinto, will do.

Mike_Unwalla · May 26, 2020, 2:36pm

I added the missing POS in [en] Add POS NN:U · languagetool-org/languagetool@b7b9c23 · GitHub. But, the message is:
Showing with 790 additions and 104 deletions.

I thought that I had added only 686 missing POS entries. I will try again.

Mike_Unwalla · May 26, 2020, 2:48pm

Done the POS ([en] Add POS NN:U · languagetool-org/languagetool@6ae9742 · GitHub).

I misunderstood the message. 790-104=686. So, there wasn’t an error the first time.
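
The arithmetic follows from how line-based diffs count changes: a modified line is reported as one deletion plus one addition, so the net growth of the file is additions minus deletions. A toy illustration with Python's `difflib` (not how GitHub computes its numbers, just the same idea; `diff_stats` and the sample lines are made up):

```python
import difflib


def diff_stats(old, new):
    """Count added and deleted lines between two lists of lines."""
    additions = deletions = 0
    for line in difflib.ndiff(old, new):
        if line.startswith("+ "):
            additions += 1
        elif line.startswith("- "):
            deletions += 1
        # lines starting with "  " (unchanged) or "? " (hints) are ignored
    return additions, deletions


old = ["apple\tapple\tNN", "oil\toil\tNN"]
new = ["apple\tapple\tNN", "oil\toil\tNN:UN", "agflation\tagflation\tNN:U"]

additions, deletions = diff_stats(old, new)
print(additions, deletions, additions - deletions)
print(790 - 104)  # net new lines in the commit discussed above
```

In the toy data, one line is modified and one is new, so the file grows by one line even though two additions are reported.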

Done the spellings ([en] Add spellings · languagetool-org/languagetool@f385c4d · GitHub).

For each word that gave an error for BrE, I checked the word on www.lexico.com and www.merriam-webster.com. If I thought that the word is applicable to all variants of English, I added it to spelling.txt. If I wasn’t sure, I added it to spelling_en-GB.txt.

I didn’t add ‘benzedrine’, because that is derived from a proper noun. (LT spell check suggests Benzedrine.) When we have a rule for benzedrine/Benzedrine, we can add the lower-case spelling.

marcoagpinto · May 26, 2020, 4:33pm

cool