When using data parallelism (DP, here via fully_shard) together with tensor parallelism (TP), if TP is applied to only a subset of layers, so that the remaining layers have only DP applied, the DP-only parameters for the same DP ...
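To make the setup concrete, here is a minimal pure-Python sketch of the rank layout, not the actual fully_shard or TP code. It assumes a 2D mesh of shape (DP, TP) with ranks laid out row-major. The sizes, the layer names, and every helper function are hypothetical and chosen only for illustration.

```python
# Hypothetical mesh sizes, for illustration only.
DP_SIZE = 2
TP_SIZE = 2


def mesh_coords(rank, tp_size=TP_SIZE):
    """Return (dp_index, tp_index) for a rank in a row-major (dp, tp) mesh."""
    return divmod(rank, tp_size)


def tp_group(rank, tp_size=TP_SIZE):
    """Ranks that share this rank's DP index (one row of the mesh)."""
    dp_idx, _ = mesh_coords(rank, tp_size)
    return [dp_idx * tp_size + t for t in range(tp_size)]


def dp_group(rank, dp_size=DP_SIZE, tp_size=TP_SIZE):
    """Ranks that share this rank's TP index (one column of the mesh)."""
    _, tp_idx = mesh_coords(rank, tp_size)
    return [d * tp_size + tp_idx for d in range(dp_size)]


# Hypothetical layer names; only "attn" and "mlp" receive TP in this sketch.
LAYERS = ["embed", "attn", "mlp", "norm"]
TP_LAYERS = {"attn", "mlp"}


def parallelism_plan(layers=LAYERS, tp_layers=TP_LAYERS):
    """Label each layer by the parallelism applied to it."""
    return {name: ("DP+TP" if name in tp_layers else "DP only") for name in layers}


if __name__ == "__main__":
    for rank in range(DP_SIZE * TP_SIZE):
        print(rank, mesh_coords(rank), "tp:", tp_group(rank), "dp:", dp_group(rank))
    print(parallelism_plan())
```

Under these assumptions, layers labeled "DP only" are the ones the post is concerned with: each rank holds them without any TP sharding, while its TP group still spans several ranks with the same DP index.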