• gens@programming.dev
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 day ago

    LLMs are limited by memory bandwidth much more then calculating power. You need HBM. Dedicated accelerators only lower power usage.

    • brucethemoose@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      ·
      12 hours ago

      This is commonly cited, but not strictly true.

      Prompt processing is completely compute limited. And at high batch sizes, where the weights are read once for many tokens generated in parallel, token generation is also quite compute limited. Obviously you want enough bandwidth to match the compute, but its very compute heavy.

      You can see this for yourself. Try ~10 prompts in parallel on a CPU in llama.cpp, and it will slow to a crawl, while a GPU with a narrow bus won’t slow down much.

      Training is a bit more complicated, but that’s not doable on CPUs anyway.

      Now, local inference (aka a batch size of 1), past prompt processing, is heavily bandwidth limited. This is why hybrid inference works alright on CPUs. But this doesn’t really apply to servers, which process many users in parallel with each “pass”.