Word2Vec confusion pairs

Hello everyone

I would like to help build the Word2Vec models for confusion pairs. For that I would have to find a large corpus with examples, right? What would that corpus look like? Just sentences containing the confusion pairs? And how big does it need to be?

If you decide to do this for Dutch, I can provide a corpus.

Hi @Ferch42,

Maybe you have already seen this:

Gulp21 has used a public corpus of magazine articles (the link is in the thread), which seems perfect for this task.
I would pass you my material, but I am away from home and not using my development computer, so it will take a few months before I can send you the ‘starter’s kit’.

If you decide to do this for Portuguese, I’ll be happy.

If possible, use a corpus with only one language variant (just Brazilian Portuguese or just European Portuguese), even if you are not building a variant-specific model; otherwise you will get sub-optimal results in some circumstances.

I would argue against using Wikipedia as a corpus: its texts have many typos and mix both variants, plus the different spelling-agreement conventions, sometimes within the same sentence, possibly creating biases and errors in the models.

When training the model, the more data the merrier, although processing more data will be an extra burden.

Regarding this topic, @gulp21 is the expert and he will be the best person to give advice, so I recommend posting further enquiries on that thread.

It should be a large corpus which contains almost no errors. In my experiments, I used corpora with more than 1 million sentences. It is OK if the corpus contains some errors, but not too many, so newspaper articles are a good candidate while forum posts are not. The corpus file is a plain text file containing full sentences. Do not limit the corpus to a specific confusion pair, but try to cover a wide range of styles (e.g. newspaper articles, books, etc.), although it can be hard or even impossible to get free corpora; I’ve only used newspaper articles and Tatoeba and got good results (see the thread linked in the previous post).
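To make the expected corpus format a bit more concrete, here is a minimal training sketch. It assumes the gensim library and a hypothetical file named `corpus.txt` with one sentence per line; the file name and hyperparameters are illustrative only, not the exact setup used for the experiments described in this thread.

```python
# Minimal Word2Vec training sketch (assumption: gensim 4.x is installed,
# and "corpus.txt" is a plain-text file with one sentence per line).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from the corpus file instead of loading everything into
# memory; this matters for corpora with more than 1 million sentences.
# LineSentence splits each line on whitespace, so in practice you would
# tokenize (and possibly lowercase) the sentences beforehand.
sentences = LineSentence("corpus.txt")

# Train the model; these hyperparameters are common defaults, not tuned values.
model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors ("size" in gensim < 4.0)
    window=5,         # context window size
    min_count=5,      # ignore words appearing fewer than 5 times
    workers=4,        # parallel training threads
)

# Save the trained model for later use, e.g. in confusion-pair experiments,
# and inspect the neighbourhood of one word from a confusion pair.
model.save("word2vec.model")
print(model.wv.most_similar("their"))
```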

By the way, it is probably a good idea to document (in a readme or maybe the wiki) which sources have been used for a corpus. This can come in handy if someone wants to do further experiments with machine learning.

OK! Thanks. I will look into it.