This is a noob question.
I load a HuggingFace transformer model onto the GPU and create a HuggingFace pipeline with that model, then run inference through the pipeline, roughly as in the snippet below.
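To make the question concrete, this is roughly what I do (the model name and task are just examples, not the actual ones I use):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Example model; my real model is different but loaded the same way
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# device=0 places the model on the first GPU
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

result = pipe("This is a test sentence.")
print(result)
```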
I would like to understand in some depth how the data actually flows during this process, in particular the roles of the GPU, CPU, and RAM.
For instance,
- I see a spike in CPU usage when I run inference. What causes it?
- If I have multiple CPU cores and run several inference tasks at the same time, will they actually be parallelized?
- Does it make sense to use something like joblib to parallelize inference (see the sketch after this list), given that the model is loaded onto the GPU?
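By "using joblib" I mean something like the following sketch, which reuses the `pipe` object from the snippet above; I'm not sure this is the right approach, which is part of the question:

```python
from joblib import Parallel, delayed

texts = ["first example sentence", "second example sentence", "third example sentence"]

# Run several inference calls concurrently with threads, all hitting
# the same GPU-backed pipeline object `pipe`
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(pipe)(text) for text in texts
)
print(results)
```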