I've heard that to train a model like GPT-4 you need a very powerful computer and ~$10M of computing power, but once you've produced the trained ~570GB model, what sort of computing power is necessary to execute specific queries with it?
2 Answers
Executing specific queries, in the context of large language models, is referred to as inference. The hardware that runs GPT-4 has not been disclosed. However, Meta's LLaMA can be run on consumer hardware. llama.cpp can run the 7B model on an M1 Pro MacBook – a decent, but not top of the line, computer:
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
| Model | Original size | Quantized size (4-bit) |
|-------|---------------|------------------------|
| 7B    | 13 GB         | 3.9 GB                 |
| 13B   | 24 GB         | 7.8 GB                 |
| 30B   | 60 GB         | 19.5 GB                |
| 65B   | 120 GB        | 38.5 GB                |
LLaMA-65B can be run on a CPU with 128GB of RAM, although this is unlikely to be efficient compared to renting data centre GPUs. GPT-4 is reported to be a much larger model than LLaMA-65B, with support for a 32K context window. Since the amount of fast GPU memory required scales with the size of the model, and the cost of the attention computation scales quadratically with input sequence length, GPT-4 inference cannot be performed on consumer hardware.
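The memory figures in the table above follow almost directly from parameter count and bit-width. A minimal sketch (the function name and round parameter counts are my own; it gives a lower bound on weight memory, ignoring activations, the KV cache, and the per-block scale factors that quantized formats store, which is why the table's 4-bit numbers are slightly larger):

```python
def model_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough lower bound on weight memory in GiB.

    Ignores activation memory, the KV cache, and the per-group scale
    overhead of real quantization formats.
    """
    return n_params * bits_per_weight / 8 / 2**30

# LLaMA sizes at 16-bit vs. 4-bit weights:
print(f"7B  fp16 : {model_memory_gib(7e9, 16):.1f} GiB")   # ~13 GiB
print(f"7B  4-bit: {model_memory_gib(7e9, 4):.1f} GiB")    # ~3.3 GiB
print(f"65B fp16 : {model_memory_gib(65e9, 16):.1f} GiB")  # ~121 GiB
```

This is also why the table's memory and disk requirements are the same: the weights are loaded into RAM as-is.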

Likely one or multiple A100-based servers (i.e. servers with 8×A100 cards from Nvidia).
Possibly the same with H100 cards, but with quantization you could likely reduce memory usage and get a lot more throughput from the same hardware, so you can serve more queries per unit time. Memory use would be lower with quantization, but otherwise the picture is quite similar to A100. I hear GPT-3.5 runs on around 770 GB of VRAM. One such node costs around $300,000 USD.
4-bit quantization should result in serious memory savings, possibly giving good performance on high-end servers, with, as research shows, possibly very limited impact on output quality (there is a loss of quality that is small and gets smaller on larger models).
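To make the 4-bit idea concrete, here is a toy per-group symmetric quantizer (my own illustration; real schemes such as llama.cpp's Q4 formats are similar in spirit but differ in detail, and production code packs two 4-bit values per byte rather than using one `int8` each):

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group: int = 32):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    w = w.reshape(-1, group)
    # 4-bit signed range used here: [-7, 7]; one fp scale per group.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # plausible weight scale
q, s = quantize_4bit(w)
err = np.abs(dequantize_4bit(q, s) - w).max()
# Storage drops from 32 bits per weight to 4 bits plus one shared scale
# per group of 32, at the cost of a small reconstruction error `err`.
```

The per-weight error is bounded by half a quantization step within each group, which is why the quality loss stays small in practice.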
The real hardware is not disclosed; you will have to wait for large open-source models to become available.
