This is a noob question.
I load a HuggingFace transformer model onto the GPU and create a HuggingFace pipeline with that model, then run inference through the pipeline, roughly as in the snippet below.
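To make the question concrete, this is roughly what I do (the model name and task are just examples, not the actual ones I use):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Example model; my real model is different but loaded the same way
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# device=0 places the model on the first GPU
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

result = pipe("This is a test sentence.")
print(result)
```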
I would like to understand in some depth how the data actually flows during this process, in particular the roles of the GPU, CPU, and RAM.
For instance,
- I see a spike in CPU usage when I run inference. What causes it?
- If I have multiple CPU cores and run several inference tasks at the same time, will they actually be parallelized?
- Does it make sense to use something like joblib to parallelize inference (see the sketch after this list), given that the model is loaded onto the GPU?
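By "using joblib" I mean something like the following sketch, which reuses the `pipe` object from the snippet above; I'm not sure this is the right approach, which is part of the question:

```python
from joblib import Parallel, delayed

texts = ["first example sentence", "second example sentence", "third example sentence"]

# Run several inference calls concurrently with threads, all hitting
# the same GPU-backed pipeline object `pipe`
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(pipe)(text) for text in texts
)
print(results)
```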