Description
Hello, sorry to bother you. Could you take a moment to look at this issue?

Problem: running the train.sh script raises an error.

To debug, I added print statements inside `training_step` (transformers/trainer.py, line 2548) to inspect the tensor dtypes:

Input tensor input_ids dtype: torch.int64
Input tensor attention_mask dtype: torch.int64
Input tensor labels dtype: torch.int64
Loss tensor dtype: torch.float32
TypeError: output tensor must have the same type as input tensor
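The dtype prints above can be collected with a small helper like this (a sketch for debugging only; `report_dtypes` is a made-up name, not part of transformers or this repo):

```python
import torch

def report_dtypes(tensors):
    """Return one line per entry describing its dtype, mirroring the
    prints added inside training_step."""
    return [f"{name} dtype: {t.dtype}" for name, t in tensors.items()]

# Example usage with a batch-like dict:
inputs = {"input_ids": torch.zeros(2, 4, dtype=torch.int64)}
for line in report_dtypes(inputs):
    print(line)  # -> input_ids dtype: torch.int64
```

Calling it on both the batch and the loss makes it easy to confirm, as above, that the inputs themselves are consistent and the mismatch happens later, inside the DeepSpeed all-gather.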
Environment:
torch 2.6.0, Python 3.11.11
GPUs: 4 × NVIDIA RTX 4090
Script (note: the `zeros` option was removed from `--quantization_bit 4` because it raised an error):

deepspeed --num_gpus 4 dbgpt_hub_sql/train/sft_train.py \
    --deepspeed dbgpt_hub_sql/configs/ds_config_stage3.json \
    --quantization_bit 4 \
    --model_name_or_path codellama/CodeLlama-7B-Instruct-hf \
    --do_train \
    --dataset example_text2sql_train \
    --max_source_length 1024 \
    --max_target_length 512 \
    --template llama2 \
    --finetuning_type lora \
    --lora_rank 64 \
    --lora_alpha 32 \
    --lora_target q_proj,v_proj \
    --output_dir dbgpt_hub_sql/output/adapter/llama2-13b-qlora_1024_epoch1_debug1008_withDeepseed_mulitCard \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine_with_restarts \
    --logging_steps 25 \
    --save_steps 20 \
    --learning_rate 2e-4 \
    --num_train_epochs 0.1 \
    --plot_loss \
    --bf16 2>&1 | tee ${train_log}
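Since the error points at mismatched tensor dtypes during the all-gather, it may help to audit which dtypes the model's parameters actually hold after quantized LoRA loading. A sketch of such an audit (the helper name is made up; run it on the model right before handing it to the Trainer):

```python
import torch
from collections import defaultdict

def group_params_by_dtype(model):
    """Group parameter names by dtype. More than one dtype in the result
    is the mixed-dtype condition that can break a coalesced all-gather."""
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        groups[param.dtype].append(name)
    return dict(groups)

# Example: a model whose first layer was cast to bf16 while the rest
# stayed fp32 shows up with two dtype groups.
m = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Linear(2, 2))
m[0].to(torch.bfloat16)
for dtype, names in group_params_by_dtype(m).items():
    print(dtype, names)
```

If this reports several floating-point dtypes for the real model (e.g. some fp32 LoRA weights alongside bf16 base weights), that would be consistent with the failure below.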
Error traceback:

File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/transformers/trainer.py", line 3745, in training_step
    self.accelerator.backward(loss, **kwargs)
File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/accelerate/accelerator.py", line 2321, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 275, in backward
    self.engine.step()
File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2249, in step
    self._take_model_step(lr_kwargs)
(intermediate frames omitted)
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1530, in _all_gather
[rank0]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1840, in _allgather_params_coalesced
[rank0]:     h = dist.all_gather_into_tensor(allgather_params[param_idx],
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank0]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 220, in all_gather_into_tensor
[rank0]:     return self.all_gather_function(output_tensor=output_tensor,
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]: File "/home/desir/anaconda3/envs/dbgpt_hub/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3798, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: TypeError: output tensor must have the same type as input tensor
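One possibility (an assumption, not confirmed here): `all_gather_into_tensor` requires the output buffer and the input partition to share a dtype, so combining `--bf16` with `--quantization_bit 4` may leave some parameters in a different floating-point dtype than the gather buffer ZeRO-3 allocates. A workaround sketch that casts every floating-point parameter to one dtype before training (`unify_floating_dtypes` is a hypothetical helper, not part of this repo; integer/quantized tensors are left untouched):

```python
import torch

def unify_floating_dtypes(model, dtype=torch.bfloat16):
    """Cast all floating-point parameters to a single dtype so that a
    coalesced all-gather sees uniform inputs. Integer parameters
    (e.g. quantization state) are deliberately skipped."""
    for param in model.parameters():
        if param.is_floating_point() and param.dtype != dtype:
            param.data = param.data.to(dtype)
    return model
```

If the dtype audit above shows mixed floating-point dtypes, applying this to the model before it reaches the Trainer is one thing to try; whether it is safe for the quantized base weights depends on how the repo loads them.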