LLM workloads may be about to say goodbye to the era of doing matrix multiplication in 16-bit floating point.
The Tensor Cores in the Blackwell architecture support FP4, which can benefit both training and inference of MoE models.
To supercharge inference of MoE models, Blackwell Tensor Cores add new precisions, including new community-defined microscaling formats, giving high accuracy and ease of replacement for larger precisions. The Blackwell Transformer Engine utilizes fine-grain scaling techniques called micro-tensor scaling, to optimize performance and accuracy enabling 4-bit floating point (FP4) AI. This doubles the performance and size of next-generation models that memory can support while maintaining high accuracy.
The microscaling format mentioned here should be the method proposed in the paper below, whose authors come from Microsoft, AMD, Intel, Meta, NVIDIA, and Qualcomm, which suggests the format has already reached a degree of industry consensus. Previous FP8 recipes applied a single scale to an entire tensor, but tensor-level scaling has been shown to be insufficient for formats below 8 bits because of their limited dynamic range. The scaling granularity therefore has to drop to smaller sub-blocks of a tensor, which is what the Microscaling Data Formats provide.
Microscaling Data Formats for Deep Learning
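To make the idea concrete, here is a minimal NumPy sketch of block-wise microscaling in the spirit of the MX formats: a block of 32 elements shares one power-of-two scale, and each element is rounded to the FP4 (E2M1) value grid. This is only an illustration of the concept under those assumptions, not NVIDIA's or the paper's reference implementation, and the rounding details are simplified.

```python
# Minimal sketch of block-wise "microscaling" quantization (MX-style):
# block size 32, one shared power-of-two scale per block, FP4 E2M1 elements.
# Illustration only -- not a reference implementation of the MX spec.
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mx_block(block: np.ndarray):
    """Quantize one block with a shared power-of-two scale."""
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Shared scale: 2^(floor(log2(amax)) - emax_elem), emax of E2M1 is 2.
    # Elements still above the grid after scaling saturate to 6.0.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_E2M1_GRID[idx]
    return q, scale

def fake_quantize_mx(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Fake-quantize a 1-D tensor: quantize per block, then dequantize."""
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        q, scale = quantize_mx_block(x[start:start + block_size])
        out[start:start + block_size] = q * scale
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=128).astype(np.float32)
    print("mean abs error:", np.abs(x - fake_quantize_mx(x)).mean())
```

The key difference from tensor-level FP8 scaling is that each 32-element block gets its own scale, so one outlier only distorts its own block instead of shrinking the effective range of the whole tensor.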
In addition, cross-node collective communication can support FP8 via the SHARP protocol.
The NVIDIA NVLink Switch Chip enables 130TB/s of GPU bandwidth in one 72-GPU NVLink domain (NVL72) and delivers 4X bandwidth efficiency with NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ FP8 support. The NVIDIA NVLink Switch Chip supports clusters beyond a single server at the same impressive 1.8TB/s interconnect. Multi-server clusters with NVLink scale GPU communications in balance with the increased computing, so NVL72 can support 9X the GPU throughput than a single eight-GPU system.
Using FP8 gradients for communication is something people at MSRA have already validated to some extent.
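The pattern is roughly: agree on a scale across ranks, cast the gradient to FP8 as the wire format, reduce, then dequantize. The sketch below is a conceptual illustration only and assumes a torch.distributed process group is already initialized; it falls back to a bf16 reduction because a plain NCCL all-reduce typically cannot sum float8 tensors, whereas in-network SHARP FP8 reduction would do the sum on the switch. The actual MSRA work handles scaling and overflow far more carefully.

```python
# Conceptual sketch of FP8 gradient communication -- not the MSRA or NVIDIA
# implementation. Assumes torch.distributed is initialized with NCCL.
import torch
import torch.distributed as dist

def allreduce_grad_fp8(grad: torch.Tensor) -> torch.Tensor:
    """All-reduce a gradient tensor through an FP8 (E4M3) wire format."""
    # Agree on a single scale across ranks so quantized values are comparable.
    amax = grad.abs().max()
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    scale = (amax / 448.0).clamp(min=1e-12)   # 448 = max normal value of E4M3

    # Quantize to FP8: this is the compressed representation sent over the wire.
    grad_fp8 = (grad / scale).to(torch.float8_e4m3fn)

    # NCCL cannot reduce float8 directly, so this sketch sums in bf16;
    # SHARP-style in-network FP8 reduction would avoid this extra hop.
    buf = grad_fp8.to(torch.bfloat16)
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)

    # Dequantize and average back in the gradient's original precision.
    return (buf.to(grad.dtype) * scale) / dist.get_world_size()
```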
Transformer Engine 2.0 will support FP4; I'm looking forward to the first large MoE model trained with FP4.
Someone in the comments pointed out that the parameters are still stored in high precision. To expand on the mixed-precision training concept here: although the matrix multiplications are computed in a low-precision data format, the gradient results are accumulated into high-precision parameters, and some non-matmul computations still run in high precision. For the details, see my Zhihu article linked above.
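Here is a minimal PyTorch sketch of that pattern, assuming a plain SGD update: the matmul runs on a low-precision cast of the weights (bf16 as a stand-in for the FP8/FP4 Tensor Core paths), while the gradients land in and are applied to FP32 master weights, and the loss math stays in FP32. Real frameworks such as Transformer Engine add loss scaling, per-tensor scaling, and fused kernels on top of this.

```python
# Minimal sketch of mixed-precision training with FP32 master weights.
# bf16 stands in for the low-precision matmul path; illustration only.
import torch

class MixedPrecisionLinear:
    def __init__(self, in_features: int, out_features: int):
        # Master weights kept in FP32; this is what the optimizer updates.
        self.master_w = torch.randn(out_features, in_features,
                                    dtype=torch.float32, requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The matmul itself runs in low precision; the cast is differentiable,
        # so gradients flow back into the FP32 master copy.
        w_low = self.master_w.to(torch.bfloat16)
        return x.to(torch.bfloat16) @ w_low.t()

    def sgd_step(self, lr: float = 1e-3) -> None:
        # The update accumulates into the FP32 master weights.
        with torch.no_grad():
            self.master_w -= lr * self.master_w.grad
            self.master_w.grad = None

layer = MixedPrecisionLinear(16, 8)
x = torch.randn(4, 16)
loss = layer.forward(x).float().square().mean()  # loss kept in FP32
loss.backward()
layer.sgd_step()
```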