If you want to run your training code with `accelerate` in fp8, you need to install `transformer_engine` or `MS-AMP`. But these two packages are hard to install because they depend on specific CUDA/cuDNN versions. After one afternoon's effort, I finally gave up and switched to using the Docker image `nvcr.io/nvidia/pytorch:24.04-py3` directly.
```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --shm-size="16g" \
    --network host \
    nvcr.io/nvidia/pytorch:24.04-py3
```
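The point of this image is that it already ships a `transformer_engine` build matched to its CUDA/cuDNN stack. A minimal sanity check inside the container (just confirming the preinstalled build imports cleanly):

```python
# Run inside the container: if this import fails, the FP8 backend
# is not usable and accelerate's fp8 mode will error out.
import transformer_engine.pytorch as te

print("transformer_engine loaded:", te.__name__)
```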
After entering the container with the command above, I still needed to install `accelerate` via `python3 -m pip install accelerate`. In `accelerate config`, I chose `fp8` with the `E4M3` format. But the training process then reported an error about LayerNorm, so I manually modified the code (the patch may not be correct, but it works):
```python
# transformer_engine/pytorch/module/layernorm.py

class _LayerNorm(torch.autograd.Function):
    """functional LayerNorm"""

    @staticmethod
    def forward(
        ctx,
        inp: torch.Tensor,
        ln_weight: torch.Tensor,
        ln_bias: torch.Tensor,
        eps: float,
        fwd_ln_sm_margin: int,
        bwd_ln_sm_margin: int,
        zero_centered_gamma: bool,
        is_grad_enabled: bool,
        activation_dtype: torch.dtype,
    ) -> torch.Tensor:
        # Make sure input dimensions are compatible
        in_features = ln_weight.numel()
        assert inp.is_cuda, "TransformerEngine needs CUDA."

        # Manual patch: for channels-first (e.g. NCHW) activations the feature
        # dim is not last, so temporarily move it to the end to satisfy the
        # shape assertion, then restore the original layout.
        permute = False
        if inp.shape[-1] != in_features:
            inp = inp.permute(0, 2, 3, 1)
            permute = True
        assert inp.shape[-1] == in_features, "LayerNorm not possible"
        if permute:
            inp = inp.permute(0, 3, 1, 2)

        inputmat = inp.reshape((-1, in_features))

        # Cast for native AMP
        inputmat = cast_if_needed(inputmat, activation_dtype)
        ln_weight = cast_if_needed(ln_weight, activation_dtype)
        ln_bias = cast_if_needed(ln_bias, activation_dtype)

        if is_grad_enabled:
            ln_out, mu, rsigma = tex.layernorm_fwd(
                inputmat, ln_weight, ln_bias, eps,
                fwd_ln_sm_margin, zero_centered_gamma,
            )
            ctx.save_for_backward(inputmat, ln_weight, mu, rsigma)
            ctx.inp_shape = inp.shape
            ctx.bwd_ln_sm_margin = bwd_ln_sm_margin
            ctx.zero_centered_gamma = zero_centered_gamma
        else:
            ln_out, mu, rsigma = (
                layernorm_fwd_inf(inputmat, ln_weight, ln_bias, eps,
                                  zero_centered_gamma),
                None,
                None,
            )
        return ln_out.view_as(inp)
```
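A caveat, from my reading of the patch rather than of the TE internals: because the tensor is permuted back before the `reshape`, rows of `inputmat` for channels-first inputs are flattened slices of the original NCHW layout rather than per-position feature vectors, so the normalization is probably computed over the wrong groupings. That would explain why it "works" (the shape assertion passes) without necessarily being numerically correct.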
Finally, the training ran properly. But the speed was the same as bf16…
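If FP8 shows no speedup over bf16, one thing worth checking is whether the recipe actually reaches `transformer_engine` at all; FP8 GEMMs also only pay off on supported GPUs and with large enough layers, so small models tend to see little benefit. As a minimal sketch, assuming the `FP8RecipeKwargs` handler shipped in `accelerate` releases of this era, the same recipe can be set explicitly in code instead of via `accelerate config`:

```python
import torch
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Assumption: FP8RecipeKwargs as in accelerate ~0.2x; "TE" selects the
# transformer_engine backend, "E4M3" matches the format chosen above.
fp8_handler = FP8RecipeKwargs(backend="TE", fp8_format="E4M3")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_handler])

# FP8 kernels only kick in for supported layers with large-enough dims.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() is where accelerate hands eligible modules to transformer_engine.
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(32, 1024, device=accelerator.device)
loss = model(x).square().mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```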