For non-English languages (in my case, Portuguese), what is the best approach? Should I use the less complete tools available for my language, or should I translate the text to English and then use the English tools? Lemmatization, for example, does not work as well in non-English languages.

nbro

Vinícius Araújo
The answer may depend on the task. For instance, if your task is Portuguese-to-Russian translation, the answer may be different than for a classifier. I am not sure, because I am not an NLP expert. However, it is probably worth briefly explaining what kind of processing you are performing on the input text. – Neil Slater Oct 22 '21 at 06:53
1 Answer
Check out spaCy: it's a powerful NLP library that provides models for many languages, including Portuguese.
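As a rough illustration of what that could look like, here is a minimal sketch of lemmatizing Portuguese text directly. It assumes spaCy is installed and that the small Portuguese model `pt_core_news_sm` has been downloaded; the helper names `load_portuguese_pipeline` and `lemmatize` are made up for this example and are not part of spaCy.

```python
from typing import List, Tuple


def load_portuguese_pipeline(model_name: str = "pt_core_news_sm"):
    """Load a spaCy Portuguese pipeline.

    Assumption: the model was installed beforehand with
    `python -m spacy download pt_core_news_sm`.
    """
    import spacy  # imported lazily so lemmatize() can be used without spaCy

    return spacy.load(model_name)


def lemmatize(text: str, nlp) -> List[Tuple[str, str]]:
    """Return (token, lemma) pairs, skipping punctuation and whitespace.

    `nlp` is any callable that yields tokens exposing the attributes
    `text`, `lemma_`, `is_punct` and `is_space`, as spaCy tokens do.
    """
    doc = nlp(text)
    return [
        (tok.text, tok.lemma_)
        for tok in doc
        if not (tok.is_punct or tok.is_space)
    ]


# Example usage (hypothetical sentence, needs the model installed):
# nlp = load_portuguese_pipeline()
# for token, lemma in lemmatize("Os meninos estavam correndo pelas ruas.", nlp):
#     print(token, "->", lemma)
```

The quality of the lemmas depends on the model, so it is probably worth checking the output on a sample of your own data before committing to it.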
To answer the more general question: translating into another language undermines the whole purpose of text pre-processing. Translation introduces errors of its own, even when the target is a widely used language like English. More importantly, every language has its own linguistic characteristics, such as grammatical gender, tenses, and rules for forming plurals, adjectives, and adverbs. By translating, you throw all of that information away.

Oliver Mason

Edoardo Guerriero