Word2Vec confusion pairs

tiagosantos · February 21, 2018, 7:07pm

Maybe you have already seen this:

Gulp21 has used a public corpus from magazines (the link is in the thread), which seems perfect for this task.
I would send you my material, but I am away from home and not using my development computer, so it will be a few months before I can pass you the ‘starter’s kit’.

If you decide to do this for Portuguese I’ll be happy.

If possible, use a corpus with only one language variant (just Brazilian Portuguese or just European Portuguese), even if you are not making a variant-specific model; otherwise you will get sub-optimal results in some circumstances.
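As a rough illustration of keeping a corpus to one variant, here is a minimal sketch that drops sentences containing words typical of the other variant. The marker lists, the function names and the sample sentences are all my own assumptions for the example, not something from gulp21's setup, and a real filter would need far longer lists (or corpus metadata) to be reliable.

```python
import re

# Hypothetical marker words for illustration only; real lists would be much longer.
PT_BR_MARKERS = {"ônibus", "trem", "celular"}
PT_PT_MARKERS = {"autocarro", "comboio", "telemóvel"}


def tokenize(sentence):
    """Lowercase the sentence and return its runs of letters (accents kept)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def keep_variant(sentences, unwanted_markers):
    """Return only the sentences that contain no word from unwanted_markers."""
    return [s for s in sentences if not unwanted_markers & set(tokenize(s))]


# Made-up sample sentences, not taken from any real corpus.
corpus = [
    "O ônibus chegou atrasado.",
    "O autocarro chegou atrasado.",
    "O livro está na mesa.",
]

# Keep Brazilian Portuguese by discarding sentences with European markers.
brazilian_only = keep_variant(corpus, PT_PT_MARKERS)
```

Sentences with no marker at all are kept, so this only removes the obvious cases; it does not catch spelling or grammar differences between the variants.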

I would argue against using Wikipedia as a corpus, since the text has many typos and mixes both variants, plus the different spelling agreement varieties, sometimes within the same sentence, which can create biases and errors in the models.

When training the model, the more data the better, although processing a larger corpus will take more time and resources.

Regarding this topic, @gulp21 is the expert and he will be the best person to give advice, so I recommend posting further enquiries on that thread.