When combining data parallelism (DP, via fully_shard here) with tensor parallelism (TP), TP may be applied to only a subset of layers, leaving the remaining layers DP-only; the DP-only parameters for the same DP ...