Issues with FSDP and DeepSpeed During Distributed Training for Gemma

#30
by anandhperumal - opened

I'm trying to train Gemma with LoRA using FSDP and DeepSpeed. My context size is around 7,000 tokens, so the model doesn't fit on a single GPU, making distributed training a necessity. With FSDP I've run into strange behavior: training simply freezes, as if the process hangs with no error. It works fine without FSDP, and the same code works perfectly with FSDP for LLaMA, Mistral, Mixtral, Phi, and other models.
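
For context, here is a minimal sketch of the kind of setup I mean, wrapping the Gemma decoder layers with FSDP around a LoRA-adapted model (the model id, LoRA targets, and wrap policy here are illustrative, not my exact script):

```python
# Minimal sketch of LoRA + FSDP for Gemma (illustrative only; assumes
# torch.distributed.init_process_group has already run, e.g. via torchrun).
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.gemma.modeling_gemma import GemmaDecoderLayer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b", torch_dtype=torch.bfloat16
)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)

# Shard at the decoder-layer level so each FSDP unit fits in GPU memory.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GemmaDecoderLayer},
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    use_orig_params=True,  # needed because LoRA leaves most parameters frozen
)
```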

For DeepSpeed, it goes out of memory even with Stage 3.

```json
{
  "zero_force_ds_cpu_optimizer": false,
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": "auto",
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
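
For reference, this config is passed to the HF Trainer through TrainingArguments roughly as in the sketch below (the output directory, config path, and batch sizes are placeholders):

```python
# Sketch of hooking the ZeRO-3 config above into the HF Trainer
# (output dir, config path, and batch sizes are placeholders).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",       # the ZeRO-3 config shown above
    bf16=True,                        # matches "bf16": {"enabled": true}
    gradient_checkpointing=True,      # reduces activation memory for ~7k-token contexts
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # fills in the "auto" value in the config
)
```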

Has anyone else encountered this problem during distributed training? How did you resolve it?

I have the same issue. Is there a workaround for this problem?

same here

I noticed the same issue while using Pipeline to generate on a large number of prompts. You may need to add torch_empty_cache_steps=1 as an argument if you are using TRL trainers. Hope this helps.
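
For example, something like this, assuming your transformers version is new enough to include the flag (other values are placeholders):

```python
# Sketch: ask the trainer to call torch.cuda.empty_cache() every step
# (torch_empty_cache_steps requires a recent transformers release).
from trl import SFTConfig

config = SFTConfig(
    output_dir="out",
    torch_empty_cache_steps=1,
)
```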

The latest docs say this parameter exists on SFTConfig, but the installed code does not have it.

@OSalem99 But that's for the generate function; I'm doing distributed training. Did you try it with distributed training?
