@KingRandomGuy

KingRandomGuy@lemmy.world · 12 hours ago

Yeah I can believe their interconnect is better, given their extensive history in networking.

W.r.t. TFLOPs, let me clarify what I meant. Even on traditionally compute-bound workloads (attention, etc.), it’s surprisingly difficult on H200 to make full use of the card’s throughput before hitting VRAM bandwidth limits. Tensor core throughput has grown a lot faster than memory bandwidth has.

I’ve never written a kernel for Huawei chips so I have no idea if they have the same problem. But this problem is there on many datacenter-class NVIDIA chips, which is why they keep introducing features (TMA, TMEM, etc.) to try and lower the time wasted waiting for memory.
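To put a rough number on that bandwidth ceiling, here is a minimal roofline sketch. The peak and bandwidth values are illustrative placeholders in the rough range of vendor-quoted H200-class figures, not measurements, and the function names are made up for this example.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOPs per byte a kernel must perform before compute becomes the limit."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


# Assumed round numbers, for illustration only.
PEAK_FP8_FLOPS = 2.0e15   # ~2 PFLOP/s dense FP8 (assumption)
HBM_BANDWIDTH = 4.8e12    # ~4.8 TB/s (assumption)

if __name__ == "__main__":
    ridge = ridge_point(PEAK_FP8_FLOPS, HBM_BANDWIDTH)
    print(f"ridge point: {ridge:.0f} FLOPs per byte")

    # A kernel doing only 2 FLOPs per byte loaded (one multiply-add per
    # FP8 weight) sits far below the ridge, so it is bandwidth-bound.
    low = attainable_flops(2.0, PEAK_FP8_FLOPS, HBM_BANDWIDTH)
    print(f"attainable at 2 FLOPs/byte: {low / PEAK_FP8_FLOPS:.2%} of peak")
```

With numbers like these, a kernel needs hundreds of FLOPs per byte fetched from VRAM to saturate the tensor cores, which is why on-chip data reuse matters so much.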

KingRandomGuy@lemmy.world · 1 day ago

You can actually get kind of acceptable performance on CPU alone, but you need rather specific CPUs, like Sapphire Rapids (SPR) or newer Intel Xeons. These support AMX, which is almost like a mini tensor core, so you can get decent throughput in TFLOPs out of Granite Rapids (GNR) Xeons. Memory bandwidth with all channels populated is also acceptable, something like ~800 GB/s per socket with maxed out MRDIMMs, which is not too far behind consumer GPUs like the 3090 and 4090.

Not anywhere near the performance of real GPUs of course, and not something acceptable for scale or production workloads, but good enough for local inference.
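For reference, the ~800 GB/s figure is roughly what the usual theoretical-peak formula gives. This is a back-of-the-envelope sketch: the 12 channels per socket and 8800 MT/s MRDIMM speed are my assumptions for a maxed-out GNR configuration, and real sustained bandwidth will be lower than the theoretical peak.

```python
def peak_dram_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    Each channel moves `bus_bytes` per transfer (64-bit data bus = 8 bytes).
    """
    bytes_per_s = channels * mega_transfers_per_s * 1e6 * bus_bytes
    return bytes_per_s / 1e9


if __name__ == "__main__":
    # Assumed configuration: 12 channels per socket, MRDIMMs at 8800 MT/s.
    per_socket = peak_dram_bandwidth_gbs(channels=12, mega_transfers_per_s=8800)
    print(f"theoretical peak per socket: {per_socket:.1f} GB/s")
```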

KingRandomGuy@lemmy.world · 1 day ago

Makes sense, even Flash is fairly sizable! KTransformers also has a “llamafile” backend which uses GGUFs, but ik_llama will almost certainly perform better if you’re not on a NUMA setup. In my case, I’m using a dual socket motherboard, so KTransformers performs quite a bit better (I think ik_llama hasn’t implemented extensive NUMA optimizations quite yet, but sounds like it’s coming), though I normally use KTransformers for native FP8 weights.

KingRandomGuy@lemmy.world · 1 day ago

Yeah, I’d expect KTransformers to add support eventually, especially considering their existing support for previous DeepSeek models. One of the tricky parts is that backends need both FP8 and MXFP4 support. As far as I’m aware, no inference engine supports both on CPU at the moment (llama.cpp added FP4 support recently but doesn’t have FP8, while kt-kernel doesn’t support FP4 yet).
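To give a sense of what the MXFP4 half involves, here is a minimal reference-style decode sketch based on my understanding of the OCP microscaling format: 32 four-bit E2M1 elements share one E8M0 power-of-two scale. The function names are made up for illustration and are not taken from llama.cpp or kt-kernel. Codes are passed as unpacked 4-bit integers because nibble packing order varies between implementations.

```python
# E2M1 magnitudes, indexed by the low 3 bits of a 4-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements sharing one scale in an MXFP4 block


def decode_e2m1(code):
    """Decode one 4-bit FP4 (E2M1) code: 1 sign bit, 3 magnitude bits."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def decode_e8m0(scale_byte):
    """Decode the shared scale: a bare 8-bit exponent with bias 127."""
    if scale_byte == 0xFF:  # reserved NaN encoding
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequantize_block(codes, scale_byte):
    """Dequantize one block of 32 unpacked 4-bit codes to Python floats."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]


if __name__ == "__main__":
    block = [0b0010] * BLOCK_SIZE        # every element encodes +1.0
    print(dequantize_block(block, 128))  # scale byte 128 -> 2**1
```

A real CPU backend would do this with vectorized lookups and fuse it into the matmul, and FP8 weights need a separate decode path, which is part of why supporting both is extra work.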

KingRandomGuy@lemmy.world · 1 day ago

To be fair, the raw FLOPs count doesn’t tell the whole story. On a lot of workloads (including token generation during LLM inference), you’re bound by memory bandwidth rather than compute throughput. On H100/H200, keeping the tensor cores fully occupied is surprisingly difficult, and that’s with 3+ TB/s of memory bandwidth. And I believe those cards have much higher throughput than Ascend, at least at FP8 (Ascend wins at FP4, since H100/H200 don’t support it).

The Ascend 950PR units have far lower memory bandwidth, reportedly 1.4 TB/s. Compare that to Blackwell, which has something like 8 TB/s. I believe Huawei is manufacturing its own kind of HBM, so that’s still really impressive considering this is a fairly recent push into manufacturing accelerators. But I’m a bit skeptical it actually outperforms NVIDIA at scale.
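As a rough illustration of why bandwidth dominates token generation: at batch size 1, every active weight has to be read once per token, so bandwidth divided by bytes read per token gives an upper bound on tokens per second. The sketch below uses the bandwidth figures quoted above. The 40B active-parameter count is a hypothetical round number, and the bound ignores KV cache reads, batching, and multi-chip setups.

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s, active_params, bytes_per_param):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    ACTIVE_PARAMS = 40e9  # hypothetical model size, for illustration only
    FP8_BYTES = 1.0       # one byte per weight at FP8

    # Bandwidths as quoted above (reported figures, not measurements).
    for name, bandwidth in [("1.4 TB/s", 1.4e12), ("8 TB/s", 8.0e12)]:
        bound = max_decode_tokens_per_s(bandwidth, ACTIVE_PARAMS, FP8_BYTES)
        print(f"{name}: at most ~{bound:.0f} tokens/s per stream")
```

Batching raises arithmetic intensity and shifts the balance back toward compute, so this mostly says something about low-batch decoding, not throughput at scale.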