
move peft imports to avoid RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase #30

Merged
2 commits merged on Mar 15, 2024

Conversation

geronimi73 (Contributor)

It seems that something in autoawq causes a RuntimeError in train.py if the package is imported before process forking. Starting with version 0.9, peft imports autoawq. This PR moves the peft imports to after process forking, thereby preventing the RuntimeError with peft>=0.9.

Related issue: #28
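For context, the fix amounts to the deferred-import pattern sketched below. This is a minimal illustration using the stdlib multiprocessing module (which torch.multiprocessing wraps) with the spawn start method; the function names are illustrative, and the `import json` is a harmless stand-in for the real peft imports:

```python
import multiprocessing as mp  # torch.multiprocessing is a thin wrapper around this


def fsdp_main(rank, queue):
    # Deferred import: this runs inside each spawned child, after that
    # process has finished its bootstrapping phase. In train.py this is
    # where the peft imports now live (peft>=0.9 imports autoawq at
    # import time, which is what broke bootstrapping).
    import json  # stand-in for the heavy/problematic import
    queue.put((rank, "import ok"))


def launch(nprocs=2):
    ctx = mp.get_context("spawn")  # same start method mp.spawn() uses
    queue = ctx.Queue()
    procs = [ctx.Process(target=fsdp_main, args=(r, queue)) for r in range(nprocs)]
    for p in procs:
        p.start()
    # Drain the queue before joining so children never block on a full pipe.
    results = sorted(queue.get() for _ in range(nprocs))
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(launch())
```

Because no problematic module is imported at the top level, each child can finish bootstrapping before the import runs.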

@geronimi73 (Contributor, Author)

@johnowhitaker

@johnowhitaker johnowhitaker merged commit d7818ec into AnswerDotAI:main Mar 15, 2024
@johnowhitaker (Contributor)

Thank you @geronimi73 much appreciated :)

@geronimi73 geronimi73 deleted the fix_ProcessExitedException branch March 15, 2024 17:11
@iseesaw commented Apr 24, 2024

Sorry, I still hit this problem using the merged code:

  File "/root/miniconda3/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/root/kyzhang/llms/UltraMedical/llm_train/train_qdora.py", line 1086, in main
    mp.spawn(fsdp_main,
  File "/root/miniconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 193, in start_processes
    process.start()
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/root/miniconda3/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/root/miniconda3/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/root/miniconda3/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

@geronimi73 (Contributor, Author)

try again after pip uninstall autoawq

@iseesaw commented Apr 24, 2024

try again after pip uninstall autoawq

Thanks for your response; autoawq is not installed on my server:

(base) root@b575798d621b:~/kyzhang/llms/UltraMedical# pip uninstall autoawq
WARNING: Skipping autoawq as it is not installed.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

@iseesaw commented Apr 24, 2024

pip list

accelerate                0.29.3
bitsandbytes              0.43.1
datasets                  2.14.6
huggingface-hub           0.20.3
llama-recipes             0.0.1
peft                      0.10.0
safetensors               0.4.2      
tokenizers                0.19.1
torch                     2.1.2
transformers              4.40.0
cupy-cuda12x              12.1.0
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105

8xA6000 48G, CUDA Version: 12.2

@geronimi73 (Contributor, Author)


Can't reproduce with those packages. Please give me the entire pip list.

@iseesaw commented Apr 24, 2024

Thanks for your patience! Here is the complete list:
requirements.txt

@geronimi73 (Contributor, Author)

Check if this works:

In train.py, insert the following code at line 1036.

code to insert:

    if __name__ != '__main__':
        return

your code starting from line 1034 should look like this:

    entity: str = None, # For wandb logging
):
    if __name__ != '__main__':
        return

    # Set world size
    if world_size == -1:
        world_size = torch.cuda.device_count()
    print(f"World size: {world_size}")

then try again

@iseesaw commented Apr 24, 2024

Thanks! I'll give this a try later.

@iseesaw commented Apr 24, 2024

I successfully ran the code! Thank you very much. This project is wonderful!

@geronimi73 (Contributor, Author)

I successfully ran the code! Thank you very much. This project is wonderful!

👍

which OS are you on, windows?

@iseesaw commented Apr 24, 2024

which OS are you on, windows?

Ubuntu 22.04.2 LTS in Docker

@geronimi73 (Contributor, Author)

@iseesaw could you please check if this runs or throws the same error:

import torch.multiprocessing as mp
from fastcore.script import call_parse

print(f"script. {__name__}")

def do_something(inp):
    print('do_something')

@call_parse
def main():
    print('main')

    mp.spawn(
        do_something,
        nprocs=2,
        join=True,
    )
    print('Finished')

@iseesaw commented Apr 25, 2024

@iseesaw could you please check if this runs or throws the same error:

I tested the code, and it executed successfully without any errors. Here is the output I observed:

script. __main__
main
script. __mp_main__
do_something
script. __mp_main__
do_something
Finished

@geronimi73 (Contributor, Author)

I'm still trying to understand why this error happens.

Are you using the original train.py from this repo, or did you modify the code? Are you by any chance using the HF datasets lib with `import datasets` (inside train.py), or something similar?

@iseesaw commented Apr 25, 2024

The error may be related to the use of multiprocessing for dataset processing.

To adapt to different model chat templates, I modified the get_dataloader() function in train.py. Additionally, I've imported LazySupervisedDataset from the FastChat repository; see train_with_template.py#L258 and train_with_template.py#L209.

My apologies for any confusion caused. This modification could be the source of the problem.
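If dataset preprocessing is indeed the trigger, the general rule under the spawn start method is that everything at module level runs again in every child process, so any work that itself starts processes (for example a datasets `.map(num_proc=...)` call, or LazySupervisedDataset construction) must stay under the main guard or inside the worker. A minimal sketch of that layout, with hypothetical function names:

```python
# Module level: keep this limited to cheap, side-effect-free definitions.
# Under the spawn start method, every child re-imports this module, so any
# top-level work (dataset tokenization, chat-template formatting, anything
# that starts its own worker pool) would run again in each child and can
# trip the "bootstrapping phase" RuntimeError.


def prepare_data():
    # Hypothetical stand-in for tokenization / chat-template formatting.
    return ["sample-0", "sample-1"]


if __name__ == "__main__":
    data = prepare_data()  # safe: runs only once, in the parent process
    print(len(data))
```

The spawned children see `__name__ == "__mp_main__"` (as the test script above showed), so the guarded block never re-executes in them.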
