
I am applying spaCy lemmatization to my dataset, but the code has already been running for 20-30 minutes.

Is there any way to make it faster? Is there an option to run this process on a GPU?

My dataset has 20k rows and 3 columns.

Cathrine
  • Are you using https://spacy.io/api/lemmatizer? I don't know how lemmatization works, but have you tried to parallelize your code? For example, can you split your dataset into sub-datasets and perform lemmatization on each subset on a separate CPU core or GPU? – nbro Jun 17 '20 at 14:48

1 Answer


The lemmatizer (https://spacy.io/api/lemmatizer) just uses lookup tables, and the only upstream task it relies on is POS tagging, so it should be relatively fast. For large amounts of text, spaCy recommends using nlp.pipe, which processes texts in batches and has built-in support for multiprocessing (via the n_process keyword), rather than calling nlp on each text individually.

Also, make sure you disable any pipeline components that you don't plan to use, as they'll just waste processing time. If you're only doing lemmatization, pass disable=["parser", "ner"] to the nlp.pipe call.

Example code that takes all of the above into account is below.

```
import spacy

nlp = spacy.load("en_core_web_sm")

docs = ["We've been running all day.", "Let's be better."]

for doc in nlp.pipe(docs, batch_size=32, n_process=3, disable=["parser", "ner"]):
    print([tok.lemma_ for tok in doc])

# ['-PRON-', 'have', 'be', 'run', 'all', 'day', '.']
# ['let', '-PRON-', 'be', 'well', '.']
```
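
As for the GPU part of the question: spaCy can run its statistical components on a GPU via spacy.prefer_gpu(), which falls back to the CPU when no GPU is available. The lookup lemmatizer itself is CPU-bound, though, so expect the GPU to speed up the tagger rather than the lemmatization step itself. Below is a minimal sketch, assuming a CUDA-capable machine with the matching CuPy package installed (e.g. via one of the spacy[cudaXX] extras); the df name is hypothetical:

```
import spacy

# Must be called before spacy.load(); returns True if a GPU was activated.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

texts = ["We've been running all day.", "Let's be better."]
# For a 20k-row DataFrame, you can pass the column directly: nlp.pipe(df["text"])

# Note: combining n_process with a GPU is not recommended; rely on batching instead.
for doc in nlp.pipe(texts, batch_size=128, disable=["parser", "ner"]):
    print([tok.lemma_ for tok in doc])
```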
primussucks