DeepSeek ditches Nvidia for Huawei chips in V4 launch

inari@piefed.zip · 2 days ago

DeepSeek ditches Nvidia for Huawei chips in V4 launch

gens@programming.dev · 1 day ago

LLMs are limited by memory bandwidth much more than by compute power. You need HBM. Dedicated accelerators only lower power usage.

brucethemoose@lemmy.world · 12 hours ago

This is commonly cited, but not strictly true.

Prompt processing is completely compute limited. And at high batch sizes, where the weights are read once for many tokens generated in parallel, token generation is also quite compute limited. Obviously you want enough bandwidth to match the compute, but it's very compute heavy.
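To put rough numbers on that, here is a back-of-the-envelope roofline sketch. It assumes a dense model, about 2 FLOPs per parameter per generated token, and weights read from memory once per decode step no matter how many sequences are in the batch. It ignores KV cache traffic and attention cost, and the function names and hardware figures are made-up round values for illustration, not specs of any real chip.

```python
def decode_step_time(batch_size, n_params, bytes_per_param, peak_flops, mem_bandwidth):
    """Roofline estimate for one decode step of a dense model.

    Returns (seconds, limiting_factor). Simplified: ignores KV cache reads,
    attention cost, and any overlap inefficiency.
    """
    flops = 2.0 * n_params * batch_size      # ~2 FLOPs per parameter per token
    bytes_read = n_params * bytes_per_param  # weights read once, shared by the whole batch
    t_compute = flops / peak_flops
    t_memory = bytes_read / mem_bandwidth
    limit = "compute" if t_compute > t_memory else "bandwidth"
    return max(t_compute, t_memory), limit


def crossover_batch(bytes_per_param, peak_flops, mem_bandwidth):
    """Batch size at which compute time equals weight-read time."""
    return bytes_per_param * peak_flops / (2.0 * mem_bandwidth)


if __name__ == "__main__":
    # Hypothetical hardware, round numbers only (FLOP/s, bytes/s).
    accelerator = dict(peak_flops=100e12, mem_bandwidth=1e12)
    cpu = dict(peak_flops=1e12, mem_bandwidth=100e9)

    for name, hw in (("accelerator", accelerator), ("cpu", cpu)):
        print(name, "crossover batch:", crossover_batch(2, **hw))
        for batch in (1, 8, 64, 512):
            seconds, limit = decode_step_time(batch, 7e9, 2, **hw)
            print(f"  batch={batch:4d}  {seconds * 1e3:9.2f} ms/step  ({limit} bound)")
```

Under these assumptions the crossover depends only on the ratio of compute to bandwidth, not on model size. Batch size 1 sits far on the bandwidth side, while a server batching hundreds of sequences lands on the compute side.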

You can see this for yourself. Try ~10 prompts in parallel on a CPU in llama.cpp, and it will slow to a crawl, while a GPU with a narrow bus won’t slow down much.

Training is a bit more complicated, but that’s not doable on CPUs anyway.

Now, local inference (aka a batch size of 1), past prompt processing, is heavily bandwidth limited. This is why hybrid inference works alright on CPUs. But this doesn’t really apply to servers, which process many users in parallel with each “pass”.
