
This is a noob question.

I load a HuggingFace transformer model onto the GPU and create a HuggingFace pipeline from that model, then run inference through the pipeline.
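Roughly, my setup looks like this (the model name is just a placeholder for illustration):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Example model, stands in for whatever model I actually use
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# device=0 places the model on the first GPU
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=0)

# Inference call
result = classifier("This is a noob question, but the docs are great.")
print(result)
```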

I would like to understand in some depth how the data actually flows during this process, in particular the roles of the GPU, the CPU, and RAM.

For instance,

  1. I see a spike in CPU usage when I run inference. What causes it?
  2. If I have multiple CPU cores and run multiple inference tasks simultaneously, will they be parallelized?
  3. Does it make sense to use something like joblib for inference, given that the model is loaded onto the GPU? Something like the sketch below is what I have in mind.
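To be concrete about item 3, this is the kind of thing I mean (reusing the `classifier` pipeline from the sketch above); I am not sure it actually helps, since the model itself lives on a single GPU:

```python
from joblib import Parallel, delayed

texts = ["first input", "second input", "third input", "fourth input"]

def run_inference(text):
    # Each worker calls the same GPU-backed pipeline
    return classifier(text)

# prefer="threads" avoids trying to pickle the CUDA model into subprocesses
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(run_inference)(t) for t in texts
)
print(results)
```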