How can I improve proper name recognition?

bholland · January 16, 2018, 8:50pm

Hello,

I am extremely impressed by this tool. It does a lot in a very smart way and it appears to use NLP rules. I noticed that when a name is not recognized (particularly last names), the word is marked as null/E-NP instead of NNP/I-NNP. I absolutely understand that there will always be limits on which names are accepted, and that a comprehensive list of names is all but impossible. That said, what can I do to improve the search and name-finding capabilities of this application? Right now, I am going through errors and finding names that the tool does not recognize as names (it attempts to correct the spelling). I am adding them to a blacklist using the “ignore words” capability, but the correct solution would be to force the NLP engine to mark each such word as an NNP. How can I go about doing that and sharing the list?

Thanks,
~Ben

tiagosantos · January 17, 2018, 8:47am

This can be done with disambiguation rules, but they caused some false positives, so one of the maintainers decided to remove them.

The other (better) approach is to add them individually, as you are doing.
To add the part-of-speech information, you can use the file added.txt in the folder ./resource/en (the one that contains the hunspell folder you have been working on).

You can add each name as follows:
NAME<tab>NAME<tab>NNP
E.g. Aldiss<tab>Aldiss<tab>NNP (where <tab> stands for a real tab character, not a space)

To make it easy, just use a tool that supports regular expressions (e.g. Notepad++), copy your list into an empty document, and replace (.*)\r\n with \1\t\1\tNNP\r\n.
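If you would rather script the conversion than use an editor, here is a minimal Python sketch of the same replacement. The function name `to_added_lines` and the sample names are only illustrative, and skipping blank lines and duplicates is an extra convenience that the regex above does not do.

```python
def to_added_lines(names):
    """Turn raw names into 'NAME<TAB>NAME<TAB>NNP' lines.

    Blank entries and repeated names are skipped (an assumption made
    for convenience; the editor regex keeps every line as-is).
    """
    seen = set()
    lines = []
    for raw in names:
        name = raw.strip()  # drop surrounding whitespace and line endings
        if not name or name in seen:
            continue
        seen.add(name)
        # Same three tab-separated fields as in the example: NAME, NAME, NNP
        lines.append(f"{name}\t{name}\tNNP")
    return lines


# Hypothetical sample input: one name per entry, as in a pasted list.
sample = ["Aldiss", "Pruit", "", "Aldiss"]
print("\n".join(to_added_lines(sample)))
```

The printed lines can then be appended to added.txt by hand, so you can review them before they go in.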

bholland · January 17, 2018, 4:46pm

Oh now that is easy. Awesome!

Yea. It is better to add them individually. I am very happy that I can POS tag words. That is a huge improvement.

dnaber · January 17, 2018, 8:50pm

The second part (E-NP, I-NNP) is a chunk tag, and for English text it is generated by OpenNLP, so don’t be surprised if adding to added.txt doesn’t change the chunk.

tiagosantos · January 18, 2018, 9:10am

I believe the issue was in the regular POS tag, given that it is the one used in most (>99%) of the rules, and it is the one missing the information.
Besides, the chunker defines the phrase type (i.e. NP means Noun Phrase), so I believe the purpose of the chunker is not to provide proper noun identification. NNP and I-NNP should only be used by the LT tagger.

Do not forget to share. It will be appreciated.

aafreen · January 18, 2018, 10:18am

How can one add every proper noun to a list? And please share the list if you have made one.

bholland · January 18, 2018, 5:17pm

Yes. I will share the list once I have created it. I assumed that this was created using OpenNLP and I also assume that you generate a name finder model using a name list.

Aafreen, you will never have a list of all proper names. The best we can do is grow a list that includes names as they come up and recreate the model file.

Also, NNP is a proper noun and NN is a noun. NP is Noun Phrase. I think I mistakenly put NP when I meant NN. The POS tagger wasn’t correctly identifying the name as an NNP. Because of that, it would offer suggestions as if the word were an ordinary noun instead of skipping it. One really funny case was Pruit (as in Scott Pruit) being turned into Fruit. The proper fix is to tell the POS tagger that Scott Pruit is a name, or to update a name dictionary.

aafreen · January 19, 2018, 5:07am

Okie. And another example:
south africa should be captured as a proper noun… (south is captured as JJ)
But only africa is captured and turned into uppercase.

bholland · January 19, 2018, 3:54pm

Right. So in your case, you would want South Africa to be on a single line, with the whole phrase marked as NNP.

aafreen · January 22, 2018, 10:04am

Yes.
south africa ----> south Africa ------> South Africa (Only the last form is correct.)