• brucethemoose@lemmy.world · 16 hours ago

    Not to mention the new attention scheme and (IIRC) the MLP changes.

    I’m very much looking forward to ik_llama.cpp implementing it. I don’t think I can quite fit Flash on my rig (hence no KTransformers for me), but with a little quantization of the sparse expert layers it’d be perfect.
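
    Most of a sparse MoE’s weight budget sits in those expert FFN tensors, which is why quantizing just them buys so much. A minimal accounting sketch, assuming the gguf Python package from the llama.cpp repo, a hypothetical local GGUF path, and the usual “ffn_*_exps” expert-tensor naming (which can differ by model):

        # Rough back-of-envelope, not ik_llama.cpp itself: measure how much of
        # a GGUF checkpoint's size lives in the MoE expert tensors, i.e. the
        # part you'd re-quantize to shrink the model.
        import re
        from gguf import GGUFReader  # pip install gguf

        reader = GGUFReader("model-q8_0.gguf")  # hypothetical path

        expert = re.compile(r"ffn_(gate|up|down)_exps")  # assumed naming
        total = sum(t.n_bytes for t in reader.tensors)
        experts = sum(t.n_bytes for t in reader.tensors if expert.search(t.name))

        print(f"total:   {total / 1e9:6.1f} GB")
        print(f"experts: {experts / 1e9:6.1f} GB ({100 * experts / total:.0f}%)")

    In a big sparse MoE the expert share is usually the large majority of the file, so halving its bits-per-weight takes most of that off the total.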

  • KingRandomGuy@lemmy.world · 14 hours ago

      Makes sense, even Flash is fairly sizable! KTransformers also has a “llamafile” backend that uses GGUFs, but ik_llama will almost certainly perform better if you’re not on a NUMA setup. In my case I’m on a dual-socket motherboard, so KTransformers performs quite a bit better (I don’t think ik_llama has implemented extensive NUMA optimizations quite yet, though it sounds like they’re coming). I normally use KTransformers for native FP8 weights, though.
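
      For anyone wondering which camp their box falls in, Linux exposes the NUMA topology under sysfs: one node means the ik_llama route, two or more means cross-socket memory locality starts to matter. A quick check (standard sysfs paths, nothing KTransformers-specific):

          # Print each NUMA node with its CPU list and local memory total.
          from pathlib import Path

          nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"),
                         key=lambda p: int(p.name[4:]))
          for node in nodes:
              cpus = (node / "cpulist").read_text().strip()
              # first line of meminfo: "Node N MemTotal: <kB> kB"
              mem_kb = (node / "meminfo").read_text().splitlines()[0].split()[-2]
              print(f"{node.name}: cpus {cpus}, {int(mem_kb) / 1e6:.1f} GB")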

    • brucethemoose@lemmy.world · 13 hours ago

        It is! 143 GB, last I checked. I’m on 128 GB RAM + a 3090 with one NUMA node, so I think it’s juuust barely too tight. But it should be perfect with a few of the “sparsest” MoE layers quantized.
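
        The arithmetic behind “juuust barely too tight”, sketched out; the overhead and bits-per-weight figures here are assumptions, not measurements:

            # Fit check for the numbers above: a 143 GB checkpoint against
            # 128 GB RAM + 24 GB VRAM (3090), minus assumed overheads.
            model_gb = 143.0
            ram_gb, vram_gb = 128.0, 24.0
            overhead_gb = 12.0  # assumed: OS + KV cache + compute buffers

            budget = ram_gb + vram_gb - overhead_gb
            print(f"budget {budget:.0f} GB vs model {model_gb:.0f} GB: "
                  f"{'fits' if model_gb <= budget else 'too tight'}")

            # Re-quantizing, say, a quarter of the weight from ~8 bpw down to
            # ~4.5 bpw (assumed split) saves roughly:
            saved = 0.25 * model_gb * (1 - 4.5 / 8)
            print(f"~{saved:.0f} GB off -> {model_gb - saved:.0f} GB: "
                  f"{'fits' if model_gb - saved <= budget else 'still too tight'}")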

        If KTransformers supports something like that, I may have to finally check it out, since v4 won’t need many esoteric features.