
NVIDIA has released its most powerful AI chip yet, the B200, with a claimed 30x performance improvement. What are the product's features, and what does this mean for the AI field?

大熊猫

LLM workloads may be about to leave behind the era of doing matrix multiplication in 16-bit floating point.

The Tensor Cores in the Blackwell architecture support FP4, which can benefit both training and inference of MoE models.

To supercharge inference of MoE models, Blackwell Tensor Cores add new precisions, including new community-defined microscaling formats, giving high accuracy and ease of replacement for larger precisions. The Blackwell Transformer Engine utilizes fine-grain scaling techniques called micro-tensor scaling, to optimize performance and accuracy enabling 4-bit floating point (FP4) AI. This doubles the performance and size of next-generation models that memory can support while maintaining high accuracy.

The microscaling format mentioned here should be the method proposed in the paper below, whose authors come from Microsoft, AMD, Intel, Meta, NVIDIA, and Qualcomm, which suggests the format has already reached a degree of industry consensus. Previously, FP8 applied a single scale to the entire tensor, but tensor-level scaling has been shown to be insufficient for formats below 8 bits, because their dynamic range is limited. So the scaling granularity has to drop to smaller sub-blocks of tensors, which is what the Microscaling Data Formats provide.

Microscaling Data Formats for Deep Learning
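
To make the block-scaling idea concrete, here is a minimal NumPy sketch of MX-style quantization: every block of 32 elements shares one power-of-two scale, and the elements are rounded to the FP4 (E2M1) value grid. The helper name `mx_quantize_dequantize`, the pure-software emulation, and the choice of the smallest non-clipping power-of-two scale are my own simplifications; real MX hardware stores the shared scale as E8M0 and follows the exact scale-selection and rounding rules in the paper.

```python
# Minimal sketch of microscaling (MX) style quantization, emulated in NumPy.
# Block size 32 and the FP4 E2M1 value grid follow the MX paper; the scale
# selection here is simplified (smallest power of two that avoids clipping).
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def mx_quantize_dequantize(x, block_size=32):
    """Quantize a 1-D array to FP4 with one shared power-of-two scale per block,
    then dequantize, returning the lossy reconstruction."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        # Shared power-of-two scale so the block's max magnitude fits under FP4's max (6.0).
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
        scaled = blk / scale
        # Round each scaled element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID).argmin(axis=1)
        out[i] = np.sign(scaled) * FP4_E2M1_GRID[idx] * scale
    return out.reshape(-1)[: len(x)]

w = np.random.randn(1024).astype(np.float32)
w_hat = mx_quantize_dequantize(w)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The point of the per-block scale is visible here: a single outlier only degrades the 32 elements that share its block, instead of shrinking the effective range of the whole tensor.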

In addition, cross-node collective communication can also support FP8 via the SHARP protocol.

The NVIDIA NVLink Switch Chip enables 130TB/s of GPU bandwidth in one 72-GPU NVLink domain (NVL72) and delivers 4X bandwidth efficiency with NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ FP8 support. The NVIDIA NVLink Switch Chip supports clusters beyond a single server at the same impressive 1.8TB/s interconnect. Multi-server clusters with NVLink scale GPU communications in balance with the increased computing, so NVL72 can support 9X the GPU throughput of a single eight-GPU system.


Using FP8 gradients for communication is something researchers at MSRA have already validated to some extent.
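
As an illustration of the numerics only (the real FP8 reduction happens inside the NVLink Switch via SHARP, and MSRA's scheme has its own details), here is a single-process PyTorch sketch: each "rank" quantizes its local gradient to FP8 E4M3 with a per-tensor scale, the FP8 payloads stand in for what would travel over the wire, and the reduction is accumulated in FP32. `torch.float8_e4m3fn` needs a recent PyTorch (2.1+); the helper names and the fake world are mine.

```python
# Single-process sketch of FP8 gradient communication (no real NCCL/SHARP here).
import torch

FP8_MAX = 448.0  # largest finite value of E4M3

def to_fp8(grad: torch.Tensor):
    # Per-tensor scale mapping the gradient's max magnitude onto FP8's range.
    scale = FP8_MAX / grad.abs().max().clamp(min=1e-12)
    payload = (grad * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return payload, scale

def from_fp8(payload: torch.Tensor, scale: torch.Tensor):
    return payload.to(torch.float32) / scale

fake_world = [torch.randn(1 << 20) for _ in range(8)]   # one local gradient per "rank"
payloads = [to_fp8(g) for g in fake_world]              # what would go over the wire
reduced = sum(from_fp8(p, s) for p, s in payloads)      # reduction accumulated in FP32

exact = sum(fake_world)
print("max abs error vs FP32 all-reduce:", (reduced - exact).abs().max().item())
```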


Transformer Engine 2.0 will support FP4. Looking forward to the first large MoE model trained with FP4.
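
For reference, this is roughly what Transformer Engine's existing FP8 interface looks like today (it requires the `transformer_engine` package and an FP8-capable GPU); whether the FP4 path in Transformer Engine 2.0 will reuse the same recipe-based API is only my guess.

```python
# Transformer Engine FP8 usage, following its current recipe-based interface.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe: E4M3 for forward, E5M2 for backward.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

model = te.Linear(768, 3072, bias=True)
inp = torch.randn(2048, 768, device="cuda")

# Matmuls inside the context run in FP8; parameters stay in high precision.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.sum().backward()
```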

Someone in the comments pointed out that the parameters are still stored in high precision. Let me add a bit more on the concept of mixed-precision training here: although the matrix multiplications are computed in a low-precision data format, the gradient results are accumulated into high-precision parameters, and some non-matmul computations still run in high precision. For details, refer to my Zhihu article above.
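
A toy PyTorch loop of that idea, with bf16 standing in for the low-precision matmul format and an FP32 master copy of the weights receiving the updates (all names here are illustrative, not any particular framework's API):

```python
# Mixed-precision sketch: low-precision matmul, FP32 master weights and accumulation.
import torch

torch.manual_seed(0)
master_w = torch.randn(512, 512)          # FP32 master weights
x = torch.randn(64, 512)
target = torch.randn(64, 512)
lr = 1e-3

for step in range(10):
    w_lp = master_w.to(torch.bfloat16).requires_grad_()   # low-precision working copy
    out = (x.to(torch.bfloat16) @ w_lp).float()           # matmul in low precision
    loss = torch.nn.functional.mse_loss(out, target)      # loss in FP32
    loss.backward()
    master_w -= lr * w_lp.grad.float()                    # accumulate update into FP32 master
    print(step, loss.item())
```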

Edited on 2024-03-19 16:54 · IP location: Shanghai
方佳瑞