I use Google's Cloud TPU hardware extensively with TensorFlow, for both training and inference; however, when I run inference I do it in large batches. The TPU takes about 3 minutes to warm up before it actually runs the inference. The official TPU FAQ, on the other hand, says that real-time inference on a TPU is possible, with an overhead of about 10 ms, which would be fast enough for me. But I cannot figure out how to write code that does this, because every time I want to pass something in for inference I have to start the TPU again.
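For reference, this is roughly what my inference path looks like today (a simplified sketch using the TF2 TPUStrategy API; the TPU name and the model path are placeholders for my actual resources):

```python
import tensorflow as tf

def run_batch_inference(batch):
    # Re-connecting and re-initializing the TPU system on every call is
    # where the ~3 minute warm-up goes.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Load the saved model onto the TPU and run the whole batch at once.
    with strategy.scope():
        model = tf.keras.models.load_model("gs://my-bucket/my-transformer-lm")

    return model.predict(batch)
```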
My goal is to run large Transformer-based language models in real time on TPUs. I assumed TPUs would be ideal for this problem, and Google itself already seems to do it.
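What I imagine the code should look like is something along these lines: initialize the TPU once at process start-up and then reuse the same compiled model for each incoming request. This is only a sketch with the same placeholder names as above, and I don't know whether it is the intended pattern:

```python
import tensorflow as tf

# One-time setup: connect to the TPU and load the model just once.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.models.load_model("gs://my-bucket/my-transformer-lm")

@tf.function
def predict(batch):
    # The first call compiles the program; later calls should only pay the
    # per-batch execution overhead the FAQ mentions.
    return strategy.run(model, args=(batch,))

# Then, for each incoming request:
# result = predict(request_tensor)
```

Is keeping a long-lived process like this the right way to get the quoted ~10 ms per-batch latency, or is there a dedicated serving path I should be using instead?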
Quote from the official TPU FAQ:
Executing inference on a single batch of input and waiting for the result currently has an overhead of at least 10 ms, which can be problematic for low-latency serving.