• KingRandomGuy@lemmy.world
    13 hours ago

    Yeah I can believe their interconnect is better, given their extensive history in networking.

    W.r.t. TFLOPS, let me clarify what I meant. Even on traditionally compute-bound workloads (attention, etc.), on H200 it’s actually surprisingly difficult to saturate the card’s tensor-core throughput before hitting HBM bandwidth limits. Tensor core throughput has grown a lot faster than memory bandwidth has.
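    A back-of-the-envelope roofline calculation shows the imbalance. The spec numbers below are approximate published H200 figures (my assumptions, not from the comment above): roughly 989 TFLOPS dense FP16/BF16 tensor throughput and roughly 4.8 TB/s of HBM3e bandwidth.

    ```python
    # Rough roofline sketch: when does an H200 kernel become compute-bound?
    # Spec figures are approximate published numbers (assumptions).
    PEAK_FLOPS = 989e12   # ~989 TFLOPS dense FP16/BF16 tensor throughput
    PEAK_BW = 4.8e12      # ~4.8 TB/s HBM3e bandwidth, in bytes/s

    # A kernel is compute-bound only if its arithmetic intensity
    # (FLOPs performed per byte of DRAM traffic) exceeds this machine balance:
    machine_balance = PEAK_FLOPS / PEAK_BW  # ~206 FLOPs per byte

    # Example: an M x N x K GEMM in FP16 (2 bytes/element) performs 2*M*N*K
    # FLOPs and, at minimum, moves each of the three matrices once.
    def gemm_intensity(m, n, k, bytes_per_elem=2):
        flops = 2 * m * n * k
        traffic = bytes_per_elem * (m * k + k * n + m * n)
        return flops / traffic

    print(f"machine balance: {machine_balance:.0f} FLOPs/byte")
    print(f"4096^3 GEMM: {gemm_intensity(4096, 4096, 4096):.0f} FLOPs/byte")
    print(f"512^2 x 4096 GEMM: {gemm_intensity(512, 512, 4096):.0f} FLOPs/byte")
    ```

    A big square GEMM clears the ~206 FLOPs/byte bar comfortably, but skinnier shapes (common in inference) fall below it and stall on memory, which is consistent with the difficulty described above.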

    I’ve never written a kernel for Huawei chips, so I have no idea if they have the same problem. But it exists on many datacenter-class NVIDIA chips, which is why NVIDIA keeps introducing features (TMA, TMEM, etc.) to reduce the time kernels spend waiting on memory.