
finetuning on 3090, is it possible? #73

Open
yfliao opened this issue Mar 17, 2023 · 2 comments

yfliao commented Mar 17, 2023

Is it possible to fine-tune the 7B model using 8x 3090s?
I set:

--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \

but still got OOM:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 127.56 MiB free; 22.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My script is as follows:

torchrun --nproc_per_node=4 --master_port=12345 train.py \
--model_name_or_path ../llama-7b-hf \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir ./output \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
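
For what it's worth, the allocator hint in the OOM message can also be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching; it only mitigates fragmentation and is unlikely to recover the ~22 GiB already taken by the model weights and optimizer state:

# allocator setting suggested by the OOM message itself; then rerun the same torchrun command as above
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128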

Wojx commented Mar 20, 2023

Try using the 8-bit AdamW optimizer from:
https://github.com/TimDettmers/bitsandbytes/tree/ec5fbf4cc44324829307138a4c17fd88dddd9803
After installation, just add this flag to the script call:
--optim adamw_bnb_8bit
The current Transformers version natively supports bitsandbytes.

With the 8-bit AdamW optimizer I run training on 4x Quadro RTX 8000 48 GB. The RTX 8000 isn't an Ampere GPU, so instead of bf16 and tf32 low precision I use fp16.
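
A minimal sketch of the adjusted launch, assuming the same train.py and paths as in the script above, with fp16 substituted for bf16/tf32 (the non-Ampere case described here); flags not shown keep the values from the original command:

# adamw_bnb_8bit selects the 8-bit AdamW optimizer from bitsandbytes;
# --fp16 True replaces --bf16 True / --tf32 True on non-Ampere GPUs
torchrun --nproc_per_node=4 --master_port=12345 train.py \
--model_name_or_path ../llama-7b-hf \
--data_path ./alpaca_data.json \
--output_dir ./output \
--fp16 True \
--optim adamw_bnb_8bit \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer'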

@otto-dev

> With the 8-bit AdamW optimizer I run training on 4x Quadro RTX 8000 48 GB

That doesn't help us fine-tune on a single 24 GB RTX 3090, though, does it?
