(ES) Spanish Proper Names

vherox · April 10, 2015, 6:56am

Hello,

I am using the LanguageTool API for lemmatization and POS tagging in Spanish. I get the expected results in most cases, except for proper names, which are not recognized.

Below is my code:
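import java.io.IOException;
import java.util.List;

import org.languagetool.AnalyzedSentence;
import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.JLanguageTool;
import org.languagetool.language.Spanish;

public class Main {

    public static void main(String[] args) throws IOException {
        JLanguageTool langTool = new JLanguageTool(new Spanish());
        // Analyze the text; each AnalyzedSentence holds its tagged tokens
        List<AnalyzedSentence> sentences = langTool.analyzeText("María es de Madrid");
        AnalyzedSentence as = sentences.get(0);
        System.out.println("AS 1: " + as);
        for (AnalyzedTokenReadings reading : as.getTokens()) {
            // Only the first reading of each token is inspected here
            AnalyzedToken token = reading.getAnalyzedToken(0);
            if (token.getLemma() != null) {
                System.out.print("token: " + token);
                System.out.print(" lemma: " + token.getLemma());
                System.out.println(" POS: " + token.getPOSTag());
            }
        }
    }
}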

The output of the above is:

AS 1: <S> María[maría/NCFS000] es[ser/VSIP3S0] de[de/NCFS000,de/SPS00] Madrid[</S>]
token: maría/NCFS000 lemma: maría POS: NCFS000
token: ser/VSIP3S0 lemma: ser POS: VSIP3S0
token: de/NCFS000 lemma: de POS: NCFS000

where “María” does not have a proper name POS tag (NP…) and “Madrid” has neither a lemma nor a POS tag.

Does anyone know if I need any extra libraries or files in order to get lemmas/POS tags for Spanish proper names?

Thanks in advance!

Juan_Martorell · April 10, 2015, 10:08am

First and foremost, thank you for your interest in LanguageTool.

The short answer is that LT does not support proper nouns in Spanish.

You don’t get a tag for the last word because it is interpreted as the end-of-text token. Try with

María es de Madrid.

You’ll get something like

<S> María[maría/NCFS000,] es[ser/VSIP3S0,] de[de/NCFS000,de/SPS00,] Madrid[Madrid/null,].[</S>,]
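If the input cannot be guaranteed to end with punctuation, one possible workaround is to append a period before handing the text to the analyzer. This is only a minimal sketch in plain Java; the class and method names (`SentenceTerminator`, `ensureTerminated`) are made up for illustration and are not part of LanguageTool.

```java
public class SentenceTerminator {

    // Hypothetical helper, not a LanguageTool API: trims the text and
    // appends a period unless it already ends with '.', '!' or '?'.
    static String ensureTerminated(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        char last = trimmed.charAt(trimmed.length() - 1);
        boolean terminated = last == '.' || last == '!' || last == '?';
        return terminated ? trimmed : trimmed + ".";
    }

    public static void main(String[] args) {
        // Normalize each command-line argument before it would be analyzed
        for (String arg : args) {
            System.out.println(ensureTerminated(arg));
        }
    }
}
```

For example, `ensureTerminated("María es de Madrid")` returns `María es de Madrid.`, while text that already ends in `.`, `!` or `?` is only trimmed.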

For

Juan es madrileño.

you may get

<S> Juan[Juan/null,] es[ser/VSIP3S0,] madrileño[madrileño/AQ0MS0,madrileño/NCMS000,].[</S>,]

What happens here is that maría is included in the dictionary as a common noun, possibly by mistake. The dictionary used in LT comes from another project, Freeling, and it came with some obvious errors, some subtle errors and some inconsistencies.

For instance, it includes maría and pepe but lacks Juan and Pedro. Place names are excluded as well, although we can find some demonyms like madrileño or zaragozana.

One way to detect proper nouns is to look for tokens with a null POS (as with Madrid or Juan), but this is risky: some compound words (like escribiéndose) are not detected either and also get a null POS. To illustrate this, consider the example:

Vistiéndose deprisa, Julia tardará más en llegar a Villalba por la carretera de El Pardo con su Toyota que en TALGO.

<S> Vistiéndose[Vistiéndose/null,] deprisa[deprisa/RG,],[,/null,] Julia[julia/NCFS000,] tardará[tardar/VMIF3S0,] más[más/RG,] en[en/SPS00,] llegar[llegar/VMN0000,] a[a/NCFS000,a/SPS00,] Villalba[Villalba/null,] por[por/SPS00,] la[el/DA0FS0,] carretera[carretera/NCFS000,carretero/NCFS000,] de[de/NCFS000,de/SPS00,] El[el/DA0MS0,] Pardo[pardo/NCMS000,] con[con/SPS00,] su[su/DP3CS0,] Toyota[Toyota/null,] que[que/CS,que/PR0CN000,] en[en/SPS00,] TALGO[talgo/NCMS000,].[</S>,]

So, as you can see, you cannot rely on capitalization alone or on a missing POS tag alone, but you may get usable results by combining both with some statistics.
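A rough starting point for combining the two signals might look like the sketch below. It is plain Java with no LanguageTool dependency: the `Token` class is a simplified stand-in invented for this example (in real code the values would come from the analyzer output), and the scoring is an assumption for illustration, not something LT provides. A capital letter counts only when the token is not sentence-initial, and a null POS counts only for alphabetic tokens, since punctuation such as the comma also comes back with a null tag.

```java
import java.util.Arrays;
import java.util.List;

public class ProperNounHeuristic {

    /** Simplified stand-in for one tagged token; not a LanguageTool class. */
    static final class Token {
        final String surface;
        final String posTag;          // null when the dictionary has no entry
        final boolean sentenceStart;  // true for the first word of a sentence

        Token(String surface, String posTag, boolean sentenceStart) {
            this.surface = surface;
            this.posTag = posTag;
            this.sentenceStart = sentenceStart;
        }
    }

    /** Returns 0 (no evidence), 1 (one signal) or 2 (both signals agree). */
    static int score(Token t) {
        if (t.surface.isEmpty()) {
            return 0;
        }
        char first = t.surface.charAt(0);
        // A capital at sentence start carries no information
        boolean capitalSignal = Character.isUpperCase(first) && !t.sentenceStart;
        // Ignore punctuation, which also has a null POS
        boolean unknownSignal = t.posTag == null && Character.isLetter(first);
        return (capitalSignal ? 1 : 0) + (unknownSignal ? 1 : 0);
    }

    public static void main(String[] args) {
        // Tokens and tags taken from the sample analysis in this thread.
        // \u00e9 is e-acute, escaped to keep the source file ASCII-only.
        List<Token> tokens = Arrays.asList(
            new Token("Visti\u00e9ndose", null, true),
            new Token("deprisa", "RG", false),
            new Token(",", null, false),
            new Token("Julia", "NCFS000", false),
            new Token("Villalba", null, false),
            new Token("El", "DA0MS0", false),
            new Token("Pardo", "NCMS000", false),
            new Token("Toyota", null, false),
            new Token("TALGO", "NCMS000", false));
        for (Token t : tokens) {
            System.out.println(t.surface + " -> " + score(t));
        }
    }
}
```

With these inputs only Villalba and Toyota reach a score of 2. Julia, El, Pardo and TALGO score 1, and so does Vistiéndose, which is not a proper noun at all. Separating those one-signal cases is where the statistics would have to come in.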

Please share your findings with us.

HTH

vherox · April 11, 2015, 1:16am

Thank you for the prompt response and clarifications! I’ll repost if I find a valid solution.

Best,

Verónica.