Ignore Latin words in a non-Latin language

biniam · October 4, 2022, 7:52pm

I know numbers are ignored by default. Is it also possible to ignore Latin words in a non-Latin language such as Hindi, the same way numbers are ignored? In the tokenizer code, adding ‘\w’ to TOKENIZING_CHARACTERS, or removing the Latin characters before tokenization, gives weird results.
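
To make the pre-tokenization idea concrete, here is a rough sketch in plain Python. It is not the tokenizer's own code: `strip_latin_words` and `LATIN_WORD` are hypothetical names, and the pattern assumes that only unaccented A-Z/a-z runs count as "Latin words". Each run is replaced with spaces of the same length so that character offsets in the remaining text do not shift.

```python
import re

# Assumption: a "Latin word" is any run of unaccented ASCII letters.
LATIN_WORD = re.compile(r"[A-Za-z]+")


def strip_latin_words(text: str) -> str:
    """Blank out Latin-letter runs, keeping the string length unchanged."""
    return LATIN_WORD.sub(lambda m: " " * len(m.group()), text)


sample = "मैं Python सीख रहा हूँ"
cleaned = strip_latin_words(sample)
print(cleaned.split())
```

Whether this avoids the weird results depends on how the tokenizer treats the leftover whitespace and any punctuation attached to the removed words, so treat it as a starting point only.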