I use Google's Cloud TPU hardware extensively with TensorFlow, for both training and inference; however, when I run inference I do it in large batches. The TPU takes about 3 minutes to warm up before it actually runs the inference. The official TPU FAQ, on the other hand, says that real-time inference on a TPU is possible, with an overhead of about 10 ms, which would be fast enough for me. But I cannot figure out how to write code that does this, because every time I want to pass something in for inference I have to start the TPU again.
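For reference, this is roughly what my inference path looks like today (a simplified sketch using the TF2 TPUStrategy API; the TPU name and the model path are placeholders for my actual resources):

```python
import tensorflow as tf

def run_batch_inference(batch):
    # Re-connecting and re-initializing the TPU system on every call is
    # where the ~3 minute warm-up goes.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Load the saved model onto the TPU and run the whole batch at once.
    with strategy.scope():
        model = tf.keras.models.load_model("gs://my-bucket/my-transformer-lm")

    return model.predict(batch)
```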
My goal is to run large Transformer-based language models in real time on TPUs. I assumed TPUs would be ideal for this problem, and Google itself already seems to do it.
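What I imagine the code should look like is something along these lines: initialize the TPU once at process start-up and then reuse the same compiled model for each incoming request. This is only a sketch with the same placeholder names as above, and I don't know whether it is the intended pattern:

```python
import tensorflow as tf

# One-time setup: connect to the TPU and load the model just once.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.models.load_model("gs://my-bucket/my-transformer-lm")

@tf.function
def predict(batch):
    # The first call compiles the program; later calls should only pay the
    # per-batch execution overhead the FAQ mentions.
    return strategy.run(model, args=(batch,))

# Then, for each incoming request:
# result = predict(request_tensor)
```

Is keeping a long-lived process like this the right way to get the quoted ~10 ms per-batch latency, or is there a dedicated serving path I should be using instead?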
Quote from the official TPU FAQ:
Executing inference on a single batch of input and waiting for the result currently has an overhead of at least 10 ms, which can be problematic for low-latency serving.