• KingRandomGuy@lemmy.world
    13 hours ago

    Makes sense, even Flash is fairly sizable! KTransformers also has a “llamafile” backend which uses GGUFs, but ik_llama will almost certainly perform better if you’re not on a NUMA setup. In my case, I’m using a dual-socket motherboard, so KTransformers performs quite a bit better. (I think ik_llama hasn’t implemented extensive NUMA optimizations quite yet, but it sounds like they’re coming.) That said, I normally use KTransformers for native FP8 weights rather than GGUFs.

    • brucethemoose@lemmy.world
      13 hours ago

      It is! 143GB last I checked. I’m on 128GB RAM + 3090, 1 NUMA node, so I think it’s juuust barely too tight. But it should be perfect with a few of the “sparsest” MoEs quantized.

      If KTransformers supports something like that, I may have to finally check it out, since v4 won’t need many esoteric features.
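
For anyone following the "barely too tight" math, here is a rough back-of-the-envelope sketch. The 143GB and 128GB figures come from the comment above, and 24GB is the 3090's VRAM. The `reserved_gb` value is purely an assumption standing in for the OS, KV cache, and runtime buffers; the real overhead depends on the backend and context length. `headroom_gb` is a hypothetical helper, not part of any of the tools mentioned.

```python
def headroom_gb(model_gb, ram_gb, vram_gb, reserved_gb=0):
    """Memory left after loading the weights, in GB (negative means no fit)."""
    return ram_gb + vram_gb - reserved_gb - model_gb


model_gb = 143    # checkpoint size quoted in the thread
ram_gb = 128      # system RAM quoted in the thread
vram_gb = 24      # RTX 3090 VRAM
reserved_gb = 12  # assumption: OS + KV cache + buffers, not a measured value

# Raw capacity only, ignoring all overhead.
print(headroom_gb(model_gb, ram_gb, vram_gb))

# Same check with the assumed overhead reserved.
print(headroom_gb(model_gb, ram_gb, vram_gb, reserved_gb))
```

On paper the weights fit with a few GB to spare, but any realistic overhead eats that margin, which is why shaving a bit more off via quantization would make the difference.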