Best strategy for inference on multiple GPUs

#124
by symdec

Hello,
A question about serving this model for a near-real-time, multi-user use case.

I'm serving this model behind a FastAPI/uvicorn web server. Right now it works with the model running on a single GPU.
I want to increase serving throughput by using multiple GPUs, with one instance of Whisper on each.
What technologies can I use to queue HTTP requests and route them to the different instances/GPUs (with some load balancing) in order to maximize throughput and minimize latency?
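
For reference, here's a rough sketch of the kind of in-process routing I have in mind: one worker task per GPU pulling from a shared asyncio queue, each holding its own Whisper replica. The model name, endpoint path, and file handling are placeholders, not a tested setup.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import torch
import whisper
from fastapi import FastAPI, File, UploadFile

NUM_GPUS = torch.cuda.device_count()

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()

# One model replica pinned to each GPU; each replica gets its own single-thread
# executor so blocking transcribe() calls don't stall the event loop.
models = [whisper.load_model("large-v2", device=f"cuda:{i}") for i in range(NUM_GPUS)]
executors = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_GPUS)]


async def gpu_worker(gpu_id: int):
    """Pull requests off the shared queue and run them on this worker's GPU."""
    loop = asyncio.get_running_loop()
    while True:
        audio_path, future = await request_queue.get()
        try:
            result = await loop.run_in_executor(
                executors[gpu_id], models[gpu_id].transcribe, audio_path
            )
            future.set_result(result)
        except Exception as exc:
            future.set_exception(exc)
        finally:
            request_queue.task_done()


@app.on_event("startup")
async def start_workers():
    # Spawn one consumer task per GPU; whichever is free picks up the next request.
    for gpu_id in range(NUM_GPUS):
        asyncio.create_task(gpu_worker(gpu_id))


@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Persist the upload, enqueue it, and wait for whichever GPU handles it.
    audio_path = f"/tmp/{file.filename}"
    with open(audio_path, "wb") as f:
        f.write(await file.read())
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((audio_path, future))
    result = await future
    return {"text": result["text"]}
```

That covers routing inside a single process; I'm also open to suggestions at the infrastructure level (e.g. a reverse proxy or queue in front of one server process per GPU) if that scales better.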

Thanks in advance!