
RuntimeError #111

Open · Chao86 opened this issue Feb 6, 2023 · 7 comments

Chao86 commented Feb 6, 2023

Hi @FabianIsensee,
I'm using the example "multithreaded_with_batches.ipynb" to generate my own batch data; however, the following RuntimeError appeared:
[screenshot of the RuntimeError]
Can you offer me some hint to solve this?
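
For reference, this is roughly the pattern I'm following from the notebook. A minimal sketch; DummyLoader and the random array are placeholders for my real data:

    import numpy as np
    from batchgenerators.dataloading.data_loader import SlimDataLoaderBase
    from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

    class DummyLoader(SlimDataLoaderBase):
        # placeholder loader; generate_train_batch must return a dict of arrays
        def generate_train_batch(self):
            # draw batch_size random samples from the array stored in self._data
            idx = np.random.choice(len(self._data), self.batch_size)
            return {'data': self._data[idx]}

    if __name__ == '__main__':
        data = np.random.rand(100, 1, 64, 64).astype(np.float32)  # fake dataset (b, c, x, y)
        loader = DummyLoader(data, batch_size=4)
        # transform=None skips augmentation; the workers only fetch batches
        mt = MultiThreadedAugmenter(loader, None, num_processes=4)
        batch = next(mt)  # this is where the RuntimeError surfaces for me
        print(batch['data'].shape)
        mt._finish()  # shut down the worker processes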

@FabianIsensee (Member)

Hi, there must be another error message somewhere in your output. Can you look for it?

@vcasellesb

Hi!

Apologies, Fabian and Chao, for hijacking this issue, but I am having a similar issue to yours, and maybe I can clarify the error Chao was getting.

For context, I was trying to run nnUNet with modified code. I changed max_num_epochs from 1000 to 400 and the lr threshold from 1e-6 to 5e-3. Furthermore, I was getting stuck at validation, so I implemented the change mentioned in MIC-DKFZ/nnUNet#902. As you can see, I commented out the original code and replaced it with the following (line 662 of nnUNetTrainer):

                # changed by vicent 09/03/22 to speed up validation according to github issue #902
                # results.append(export_pool.starmap_async(save_segmentation_nifti_from_softmax,
                #                                          ((softmax_pred, join(output_folder, fname + ".nii.gz"),
                #                                            properties, interpolation_order, self.regions_class_order,
                #                                            None, None,
                #                                            softmax_fname, None, force_separate_z,
                #                                            interpolation_order_z),
                #                                           )
                #                                          )
                #                )

                save_segmentation_nifti_from_softmax(softmax_pred, join(output_folder, fname + ".nii.gz"),
                                                     properties, interpolation_order, self.regions_class_order,
                                                     None, None,
                                                     softmax_fname, None, force_separate_z,
                                                     interpolation_order_z)
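
(For clarity: this swaps the asynchronous export via export_pool.starmap_async for a plain blocking call, so each case is written out sequentially in the main process instead of in background worker processes.)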

I don't believe this is the problem, though, since my error happens at the very beginning of training, and I believe this change mainly affects validation.

Anyway, this is the error message I got. As you can see, there is no useful message to understand what is going on, only that the exception happened in thread 4.

loading dataset
loading all case properties
unpacking dataset
done
2023-03-09 21:32:24.056818: lr: 0.01
using pin_memory on device 0
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness. 
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full
Traceback (most recent call last):
  File "/home/vcaselles/anaconda3/envs/dents/bin/nnUNet_train", line 8, in <module>
    sys.exit(main())
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/run/run_training.py", line 180, in main
    trainer.run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainerV2_epoch400_lr_thr_0005.py", line 441, in run_training
    ret = super().run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainer_modvalidation.py", line 317, in run_training
    super(nnUNetTrainer_modvalidation, self).run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/network_trainer.py", line 418, in run_training
    _ = self.tr_gen.next()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 182, in next
    return self.__next__()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
    item = self.__get_next_item()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
    raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full
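
One way to dig for the real error (just a sketch, not something I have verified; my_loader and my_transform stand in for the loader and transform the trainer normally hands to MultiThreadedAugmenter) would be to run one batch single-threaded, so the failing call raises in the main process with its original traceback:

    from batchgenerators.dataloading.single_threaded_augmenter import SingleThreadedAugmenter

    # my_loader / my_transform: stand-ins for the trainer's actual generator and transform
    debug_gen = SingleThreadedAugmenter(my_loader, my_transform)
    batch = next(debug_gen)  # any exception now propagates with its full traceback
    print(batch['data'].shape)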

Thank you very much for your attention, and apologies for the long and dense message; I hope I was clear enough.

Best regards,

Vicent Caselles

PS: To get run_training.py to work, I also had to change main() to accept my modified trainer in the workflow. I don't think that is the issue, though.

@FabianIsensee (Member)

Is that all the text output you got? Can you please share everything? Usually there is an error message hidden somewhere.

@FabianIsensee (Member)

Why not just use nnUNet_train?

@vcasellesb

Hi Fabian, thank you very much for your response. Regarding your questions:

  1. Yes, that was unfortunately all the error output I got.
  2. I created a new custom nnUNet trainer class with my custom max_num_epochs and lr threshold, both set in the class's __init__ (see the sketch after this list). Did I make a mistake doing that?
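
Roughly, the subclass looks like the sketch below. I am basing it on nnUNetTrainerV2 and the standard v1 constructor signature for simplicity (my actual class sits on top of my modified nnUNetTrainer_modvalidation, as the traceback shows):

    from nnunet.training.network_training.nnUNetTrainerV2 import nnUNetTrainerV2

    class nnUNetTrainerV2_epoch400_lr_thr_0005(nnUNetTrainerV2):
        def __init__(self, plans_file, fold, output_folder=None, dataset_directory=None,
                     batch_dice=True, stage=None, unpack_data=True, deterministic=True,
                     fp16=False):
            super().__init__(plans_file, fold, output_folder, dataset_directory,
                             batch_dice, stage, unpack_data, deterministic, fp16)
            self.max_num_epochs = 400  # default is 1000
            self.lr_threshold = 5e-3   # default is 1e-6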

I honestly think the error was caused by a lack of RAM, since I was using a server with a great GPU but terrible RAM (~2 GB or so), so odds are that was the issue.
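
If anyone wants to verify this on their own setup, a quick sketch (using psutil, which is a separate package and not part of nnUNet) is to watch available memory from a second shell while training starts:

    import time
    import psutil  # third-party: pip install psutil

    # if 'available' collapses toward zero right before the crash, the data
    # loading workers were most likely killed for running out of memory
    while True:
        vm = psutil.virtual_memory()
        print(f"available: {vm.available / 1e9:.2f} GB / total: {vm.total / 1e9:.2f} GB")
        time.sleep(2)

On Linux, the kernel log (dmesg) usually also shows whether the OOM killer terminated a process.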

Thanks again for your time!!

Vicent Caselles

@FabianIsensee (Member)

Yeah, sounds like it. Are you certain about 2 GB? That's year-2000 levels of RAM.

@vcasellesb

Yes, it was the cheapest AWS server with CUDA... It was 4 GB tops.

Vicent
