If you want to run your training code with `accelerate` in fp8, you need to install `transformer_engine` or `MS-AMP`. Both packages are hard to install because they depend on specific CUDA/cuDNN versions. After an afternoon of effort, I gave up and switched to the Docker image `nvcr.io/nvidia/pytorch:24.04-py3`, which already ships with Transformer Engine.

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size="16g" \
  --network host \
  nvcr.io/nvidia/pytorch:24.04-py3
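
Inside the container, a quick check confirms that the FP8 stack is actually present before touching `accelerate` (a minimal sanity script; the file name and exact output are only illustrative):

# check_fp8_stack.py -- run inside the container
import torch
import transformer_engine.pytorch as te  # preinstalled in the NGC PyTorch images

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("transformer_engine at:", te.__file__)
# FP8 needs a Hopper (compute capability 9.0) or Ada (8.9) GPU
print("compute capability:", torch.cuda.get_device_capability())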

After entering the container with the command above, I still had to install `accelerate` with `python3 -m pip install accelerate`. In `accelerate config`, I chose `fp8` with the `E4M3` format (an in-code version of this setup is sketched after the patch below). The training process then reported an error in Transformer Engine's LayerNorm, so I manually modified the code (the change may not be correct, but it works):

# transformer_engine/pytorch/module/layernorm.py (excerpt: the modified _LayerNorm.forward)

class _LayerNorm(torch.autograd.Function):
    """functional LayerNorm"""

    @staticmethod
    def forward(
        ctx,
        inp: torch.Tensor,
        ln_weight: torch.Tensor,
        ln_bias: torch.Tensor,
        eps: float,
        fwd_ln_sm_margin: int,
        bwd_ln_sm_margin: int,
        zero_centered_gamma: bool,
        is_grad_enabled: bool,
        activation_dtype: torch.dtype,
    ) -> torch.Tensor:
        # Make sure input dimensions are compatible
        in_features = ln_weight.numel()
        assert inp.is_cuda, "TransformerEngine needs CUDA."
        # Workaround: if the feature dim is not last (e.g. NCHW activations),
        # temporarily move it to the end so the shape check passes, then
        # restore the original layout before flattening. Only handles 4-D input.
        permute = False
        if inp.shape[-1] != in_features:
            inp = inp.permute(0, 2, 3, 1)
            permute = True
        assert inp.shape[-1] == in_features, "LayerNorm not possible"
        if permute:
            inp = inp.permute(0, 3, 1, 2)
        inputmat = inp.reshape((-1, in_features))

        # Cast for native AMP
        inputmat = cast_if_needed(inputmat, activation_dtype)
        ln_weight = cast_if_needed(ln_weight, activation_dtype)
        ln_bias = cast_if_needed(ln_bias, activation_dtype)

        if is_grad_enabled:
            ln_out, mu, rsigma = tex.layernorm_fwd(inputmat, ln_weight,
                ln_bias, eps, fwd_ln_sm_margin, zero_centered_gamma)
            ctx.save_for_backward(inputmat, ln_weight, mu, rsigma)
            ctx.inp_shape = inp.shape
            ctx.bwd_ln_sm_margin = bwd_ln_sm_margin
            ctx.zero_centered_gamma = zero_centered_gamma
        else:
            ln_out, mu, rsigma = layernorm_fwd_inf(inputmat, ln_weight,
                ln_bias, eps, zero_centered_gamma), None, None
        return ln_out.view_as(inp)
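
As an aside, the `accelerate` part of this setup can also be expressed in code instead of going through `accelerate config`. The sketch below is what I understand to be equivalent; it assumes a recent `accelerate` where `FP8RecipeKwargs` is available in `accelerate.utils`, so check your version's documentation before copying:

# fp8_accelerate_sketch.py -- assumed in-code equivalent of choosing fp8 / E4M3 in `accelerate config`
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# backend="te" routes FP8 through transformer_engine; fp8_format="E4M3" matches the config choice above
fp8_kwargs = FP8RecipeKwargs(backend="te", fp8_format="E4M3")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])

# then prepare your own model/optimizer/dataloader as usual:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)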

With the LayerNorm patch in place, training finally ran properly. But the speed was the same as with bf16…
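
If you hit the same wall, it helps to benchmark a single Transformer Engine layer in isolation to see whether the FP8 GEMMs are faster at all for your shapes: FP8 only applies to layers that actually go through Transformer Engine (`te.Linear`, `te.LayerNormLinear`, …), the matmuls have to be large enough (and meet TE's dimension-alignment requirements) for the FP8 tensor cores to outweigh the extra scaling and amax bookkeeping, and everything else still runs in bf16. A rough micro-benchmark along these lines can show whether FP8 wins on a single layer (a sketch only; no warm-up, so treat the numbers as indicative):

# bench_fp8_linear.py -- rough timing of a single te.Linear in bf16 vs fp8
import time
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def bench(use_fp8: bool, steps: int = 50, hidden: int = 4096, tokens: int = 8192) -> float:
    layer = te.Linear(hidden, hidden, bias=True, params_dtype=torch.bfloat16).cuda()
    x = torch.randn(tokens, hidden, device="cuda", dtype=torch.bfloat16)
    fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        with te.fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
            out = layer(x)
        out.sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / steps

print("bf16:", bench(False), "s/step")
print("fp8 :", bench(True), "s/step")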