I am extremely impressed by this tool. It does a lot in a very smart way and it appears to use NLP rules. I noticed that when a name is not recognized (particularly a last name), the tool marks the word as null/E-NP instead of NNP/I-NNP. I absolutely understand that there will always be limits on which names are accepted, and that a comprehensive list of names is all but impossible. That said, what can I do to improve the search and name-finding capabilities within this application? Right now, I am going through errors and finding names that the tool does not recognize as names (it attempts to correct the spelling). I am adding them to a blacklist using the “ignore words” capabilities, but the correct solution would be to force the NLP engine to mark that word as an NNP. How can I go about doing that and sharing the list?
This can be done with disambiguation rules, but they cause some false positives, so one maintainer decided to remove them.
The other (better) approach is to add them individually, as you are doing.
To add the part-of-speech tag, you can use the file added.txt in the folder ./resource/en (the one that contains the hunspell folder you have been working on).
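You can add each name on its own line as follows: NAME<tab>NAME<tab>NNP (the name, a tab, the name again, a tab, then the tag).
E.g. Aldiss<tab>Aldiss<tab>NNP, with real tab characters in place of <tab>.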
To make it easy, use an editor that supports regular expressions (e.g. Notepad++): copy your list into an empty document and replace (.*)\r\n with \1\t\1\tNNP\r\n.
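If you would rather script the conversion than use an editor, here is a minimal Python sketch of the same replacement. It assumes one name per line in your list. The function name to_added_lines is made up for illustration and is not part of LanguageTool; the only thing taken from this thread is the NAME<tab>NAME<tab>NNP layout.

```python
def to_added_lines(names, tag="NNP"):
    """Turn an iterable of names into 'NAME<TAB>NAME<TAB>tag' lines.

    Hypothetical helper (not part of LanguageTool). Blank entries and
    repeated names are skipped; surrounding whitespace and line endings
    (\n or \r\n) are stripped.
    """
    seen = set()
    lines = []
    for raw in names:
        name = raw.strip()
        if not name or name in seen:
            continue
        seen.add(name)
        lines.append(f"{name}\t{name}\t{tag}")
    return lines


# Example with two names mentioned in this thread.
example = to_added_lines(["Aldiss\r\n", "", "Pruit\n", "Aldiss"])
print("\n".join(example))
```

To use it on a real list, you could pass an open file (for example open("names.txt", encoding="utf-8"), where names.txt is a placeholder for your own file) and append the returned lines to added.txt.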
The second part (E-NP, I-NNP) is a chunk tag, and it’s generated by OpenNLP (for English text), so don’t be surprised if adding to added.txt doesn’t change the chunk.
I believe the issue was in the regular POS tag, given that it is the one used in most (>99%) of the rules, and it is the one missing info.
Besides, the chunker defines the phrase type, i.e. NP means Noun Phrase, so I believe the purpose of the chunker is not to provide proper noun identification. NNP and I-NNP should only be used by the LT tagger.
Yes. I will share the list once I have created it. I assumed that this was created using OpenNLP and I also assume that you generate a name finder model using a name list.
Aafreen, you will never have a list of all proper names. The best we can do is grow a list that includes stuff as it comes up and recreate the model file.
Also, NNP is a proper noun and NN is a noun. NP is Noun Phrase. I think I mistakenly put NP when I meant NN. The POS tagger wasn’t correctly identifying the name as an NNP. Because of that, it would provide recommendations based on nouns and not skip it. One really funny case was Pruit (as in Scott Pruit) was turned to Fruit. The proper fix is to tell the POS tagger that Scott Pruit is a name or update a name dictionary.
Okie. And another example: south africa should be captured as a proper noun… (south is captured as JJ)
But only africa is captured and turned into uppercase.