Lazy mode is now deprecated for RoBERTa Large; the recommended way is to use torch.compile.
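For reference, here is a minimal Python sketch of the same setup built programmatically instead of via CLI flags. It assumes the GaudiTrainingArguments API of a recent optimum-habana release; the field names are taken to mirror the run_qa.py flags shown below.

# Sketch only: torch.compile on HPU instead of lazy mode, mirroring the CLI flags below.
# PT_HPU_LAZY_MODE=0 still has to be set in the environment before launching the process.
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="/tmp/squad/",
    do_train=True,
    do_eval=True,
    use_habana=True,
    use_lazy_mode=False,
    torch_compile=True,
    torch_compile_backend="hpu_backend",
    gaudi_config_name="Habana/roberta-large",
    per_device_train_batch_size=12,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=2,
    bf16=True,
    throughput_warmup_steps=3,
)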

@astachowicz When running

PT_HPU_LAZY_MODE=0 python run_qa.py \
  --model_name_or_path roberta-large \
  --gaudi_config_name Habana/roberta-large \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --output_dir /tmp/squad/ \
  --use_habana \
  --torch_compile_backend hpu_backend \
  --torch_compile \
  --use_lazy_mode false \
  --throughput_warmup_steps 3 \
  --bf16

I get the following error during training:

Traceback (most recent call last):
  File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 732, in <module>
    main()
  File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 678, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 545, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 910, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 167, in send_to_device
    {
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 168, in <dictcomp>
    k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 186, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
synNodeCreateWithId failed for node: identity with synStatus 1 [Invalid argument]. .
[Rank:0] Habana exception raised from add_node at graph.cpp:481

This was on Gaudi2 with SynapseAI v1.16.0.

I get the same error if I leave /tmp/squad in place with a checkpoint from a different model. If I remove the /tmp/squad directory, the error is gone.
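That behaviour is consistent with the checkpoint auto-resume in the example scripts: any checkpoint-* folder left in output_dir is picked up and passed to trainer.train(resume_from_checkpoint=...), even if it came from a different model. Below is a rough, simplified sketch of that logic; the helper name is made up and the exact run_qa.py code may differ between versions.

# Simplified sketch of the auto-resume behaviour in the example scripts
# (hypothetical helper; the real script inlines this logic in main()).
import os
from transformers.trainer_utils import get_last_checkpoint

def find_resume_checkpoint(output_dir: str, do_train: bool, overwrite_output_dir: bool):
    # A leftover /tmp/squad/checkpoint-* directory from an earlier run, possibly
    # produced by a different model, would be returned here and resumed from.
    if do_train and os.path.isdir(output_dir) and not overwrite_output_dir:
        return get_last_checkpoint(output_dir)  # None if no checkpoint-* folders exist
    return None

print(find_resume_checkpoint("/tmp/squad/", do_train=True, overwrite_output_dir=False))

So clearing /tmp/squad before rerunning (or passing --overwrite_output_dir, or pointing --output_dir at a fresh directory) avoids resuming from a stale checkpoint.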

LGTM!

regisss changed pull request status to merged
