LLM workloads may be about to say goodbye to the era of doing matrix multiplication in 16-bit floating point.
The Tensor Cores in the Blackwell architecture support FP4, which can benefit both training and inference of MoE models.
To supercharge inference of MoE models, Blackwell Tensor Cores add new precisions, including new community-defined microscaling formats, giving high accuracy and ease of replacement for larger precisions. The Blackwell Transformer Engine utilizes fine-grain scaling techniques called micro-tensor scaling, to optimize performance and accuracy enabling 4-bit floating point (FP4) AI. This doubles the performance and size of next-generation models that memory can support while maintaining high accuracy.
The microscaling format mentioned here should be the method proposed in the paper below, whose authors come from Microsoft, AMD, Intel, Meta, NVIDIA, and Qualcomm, which suggests the format has already reached a degree of industry consensus. Previous FP8 recipes applied a single scale to an entire tensor, but tensor-level scaling has been shown to be insufficient for formats below 8 bits because of their limited dynamic range. The scaling granularity therefore has to drop to smaller sub-blocks of a tensor, which is what the Microscaling Data Formats provide.
Microscaling Data Formats for Deep Learning
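To make the idea concrete, here is a minimal NumPy sketch of block-wise microscaling in the spirit of the MX formats: a block of 32 elements shares one power-of-two scale, and each element is rounded to the FP4 (E2M1) value grid. This is only an illustration of the concept under those assumptions, not NVIDIA's or the paper's reference implementation, and the rounding details are simplified.

```python
# Minimal sketch of block-wise "microscaling" quantization (MX-style):
# block size 32, one shared power-of-two scale per block, FP4 E2M1 elements.
# Illustration only -- not a reference implementation of the MX spec.
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mx_block(block: np.ndarray):
    """Quantize one block with a shared power-of-two scale."""
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Shared scale: 2^(floor(log2(amax)) - emax_elem), emax of E2M1 is 2.
    # Elements still above the grid after scaling saturate to 6.0.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_E2M1_GRID[idx]
    return q, scale

def fake_quantize_mx(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Fake-quantize a 1-D tensor: quantize per block, then dequantize."""
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        q, scale = quantize_mx_block(x[start:start + block_size])
        out[start:start + block_size] = q * scale
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=128).astype(np.float32)
    print("mean abs error:", np.abs(x - fake_quantize_mx(x)).mean())
```

The key difference from tensor-level FP8 scaling is that each 32-element block gets its own scale, so one outlier only distorts its own block instead of shrinking the effective range of the whole tensor.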
In addition, cross-node collective communication can support FP8 via the SHARP protocol.
The NVIDIA NVLink Switch Chip enables 130TB/s of GPU bandwidth in one 72-GPU NVLink domain (NVL72) and delivers 4X bandwidth efficiency with NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ FP8 support. The NVIDIA NVLink Switch Chip supports clusters beyond a single server at the same impressive 1.8TB/s interconnect. Multi-server clusters with NVLink scale GPU communications in balance with the increased computing, so NVL72 can support 9X the GPU throughput than a single eight-GPU system.
Using FP8 gradients for communication is something people at MSRA have already validated to some extent.
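The pattern is roughly: agree on a scale across ranks, cast the gradient to FP8 as the wire format, reduce, then dequantize. The sketch below is a conceptual illustration only and assumes a torch.distributed process group is already initialized; it falls back to a bf16 reduction because a plain NCCL all-reduce typically cannot sum float8 tensors, whereas in-network SHARP FP8 reduction would do the sum on the switch. The actual MSRA work handles scaling and overflow far more carefully.

```python
# Conceptual sketch of FP8 gradient communication -- not the MSRA or NVIDIA
# implementation. Assumes torch.distributed is initialized with NCCL.
import torch
import torch.distributed as dist

def allreduce_grad_fp8(grad: torch.Tensor) -> torch.Tensor:
    """All-reduce a gradient tensor through an FP8 (E4M3) wire format."""
    # Agree on a single scale across ranks so quantized values are comparable.
    amax = grad.abs().max()
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    scale = (amax / 448.0).clamp(min=1e-12)   # 448 = max normal value of E4M3

    # Quantize to FP8: this is the compressed representation sent over the wire.
    grad_fp8 = (grad / scale).to(torch.float8_e4m3fn)

    # NCCL cannot reduce float8 directly, so this sketch sums in bf16;
    # SHARP-style in-network FP8 reduction would avoid this extra hop.
    buf = grad_fp8.to(torch.bfloat16)
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)

    # Dequantize and average back in the gradient's original precision.
    return (buf.to(grad.dtype) * scale) / dist.get_world_size()
```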
Transformer Engine 2.0 will support FP4; I'm looking forward to the first large MoE model trained with FP4.
Someone in the comments pointed out that the parameters are still stored in high precision. To expand on the mixed-precision training concept here: although the matrix multiplications are computed in a low-precision data format, the gradient results are accumulated into high-precision parameters, and some non-matmul computations still run in high precision. For the details, see my Zhihu article linked above.
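Here is a minimal PyTorch sketch of that pattern, assuming a plain SGD update: the matmul runs on a low-precision cast of the weights (bf16 as a stand-in for the FP8/FP4 Tensor Core paths), while the gradients land in and are applied to FP32 master weights, and the loss math stays in FP32. Real frameworks such as Transformer Engine add loss scaling, per-tensor scaling, and fused kernels on top of this.

```python
# Minimal sketch of mixed-precision training with FP32 master weights.
# bf16 stands in for the low-precision matmul path; illustration only.
import torch

class MixedPrecisionLinear:
    def __init__(self, in_features: int, out_features: int):
        # Master weights kept in FP32; this is what the optimizer updates.
        self.master_w = torch.randn(out_features, in_features,
                                    dtype=torch.float32, requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The matmul itself runs in low precision; the cast is differentiable,
        # so gradients flow back into the FP32 master copy.
        w_low = self.master_w.to(torch.bfloat16)
        return x.to(torch.bfloat16) @ w_low.t()

    def sgd_step(self, lr: float = 1e-3) -> None:
        # The update accumulates into the FP32 master weights.
        with torch.no_grad():
            self.master_w -= lr * self.master_w.grad
            self.master_w.grad = None

layer = MixedPrecisionLinear(16, 8)
x = torch.randn(4, 16)
loss = layer.forward(x).float().square().mean()  # loss kept in FP32
loss.backward()
layer.sgd_step()
```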