
Minor issues and questions running code (salsa+salsa_lite) #2

Closed
andres-fr opened this issue Feb 1, 2022 · 32 comments

@andres-fr
Contributor

andres-fr commented Feb 1, 2022

Hi! Many congratulations on this outstanding line of work, and thanks a lot for sharing it.

I am running this repo on Ubuntu 20.04 + CUDA and gathered a few notes on the process, in the hope that they are helpful to others.
I also encountered a few minor issues, and I have some open questions that I couldn't answer from the paper and docs; I was wondering if someone could take a look at them.

As for the changes I propose, I'll be happy to submit a PR if appropriate.
Cheers!
Andres


Installation:

Although mentioned in the README, there is no pip-compatible requirements.txt file, and the requirements.yml imposes more constraints than needed. The following minimal list worked for me:

# requirements.txt
scipy==1.5.2
pandas==1.1.3
scikit-learn==0.23.2
h5py==2.10.0
librosa==0.8.0
tqdm==4.54.1
torch==1.7.0+cu110
torchvision==0.8.1+cu110
pytorch-lightning==1.1.6
tensorboardx==2.1
pyyaml==5.3.1
munch==2.5.0
fire==0.3.1
ipython==7.19.0

Then, the environment can be initialized as follows (inside of <REPO_ROOT>):

conda create -n salsa python=3.7
conda activate salsa
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Precomputing SALSA features:

The README specifies that the dataset should be found inside <REPO_ROOT>/dataset/data. For that reason, we can get rid of the absolute paths in the config files and replace them with the following relative paths. In tnsse2021_salsa_feature_config.yml:

data_dir: 'dataset/data'
feature_dir: 'dataset/features'

Then, running make salsa from <REPO_ROOT> (with the env activated) works perfectly and yields results inside <REPO_ROOT>/dataset/features. In my case, both data and features were symlinks to an external drive, and it still worked fine.

Computing the SALSA features for the 600 MIC wav files (1 minute each, 4 channels, 24 kHz, 6.9 GB total) on an [email protected] CPU took ca. 35 minutes and produced 21.5 GB with the default settings:

  n_fft: 512
  win_len: 512
  hop_len: 300  # 300 for 12.5ms for n_fft = 512; 150 for n_fft = 256
  fmin_doa: 50
  fmax_doa: 4000  # 'foa': 9000; 'mic': 4000

Precomputing SALSA-Lite features:

Analogous remarks to SALSA apply. Computation took 2 minutes and produced 20.5 GB.
Here, the question is how the dedicated SALSA-Lite repo interacts with this one. Will both be maintained, or is this the "main" one and the other was for publication purposes?


Training:

Question: are any pretrained models available? I couldn't find any upon a brief search.

Regarding config, here we can also replace absolute with relative paths:

feature_root_dir: 'dataset/features/salsa/mic/24000fs_512nfft_300nhop_5cond_4000fmaxdoa'  # for SALSA
feature_root_dir: 'dataset/features/salsa_lite/mic/24000fs_512nfft_300nhop_2000fmaxdoa'  # for SALSA-Lite
gt_meta_root_dir: 'dataset/data'

The training setup had a couple of minor issues:

In the README we can currently see the following instruction:

For TAU-NIGENS Spatial Sound Events 2021 dataset, please move wav files from subfolders dev_train, dev_val, dev_test to outer folder.

This should be updated, because the training script also expects the metadata .csv files to be in the outer folder, so those have to be moved as well. Otherwise, we get a file-not-found error.

Side note for future readers: the "train/val split" information gets lost when mixing all files, but the repo actually has this information in the form of CSV files, stored at dataset/meta/dcase2021/original. So mixing is fine; still, it is probably not a bad idea to make a backup of the original dev metadata before mixing everything together (it is not very large).
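For reference, a minimal Python sketch of that reorganization step could look as follows (the subfolder names mic_dev and metadata_dev are assumptions based on the dataset layout, so adjust them to your setup; and back up the metadata first, as mentioned above):

import shutil
from pathlib import Path

# Assumed layout: dataset/data/<root>/{dev_train,dev_val,dev_test}/ containing wav or csv files.
for root in [Path("dataset/data/mic_dev"), Path("dataset/data/metadata_dev")]:
    for split in ["dev_train", "dev_val", "dev_test"]:
        split_dir = root / split
        if not split_dir.is_dir():
            continue
        for f in split_dir.iterdir():
            shutil.move(str(f), str(root / f.name))  # move wav/csv up to the outer folder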

As for make train, it is currently hardcoded to SALSA, and the instructions to train on SALSA-Lite didn't work for me. I changed the "Training and inference" section in the Makefile to the following, so that we can train on both via either make train-salsa or make train-salsa-lite:

# Training and inference
CONFIG_DIR=./experiments/configs
OUTPUT=./outputs   # Directory to save output
EXP_SUFFIX=_test   # the experiment name = CONFIG_NAME + EXP_SUFFIX
RESUME=False
GPU_NUM=0  # Set to -1 if there is no GPU

.PHONY: train-salsa
train-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.PHONY: inference-salsa
inference-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/seld.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

.PHONY: train-salsa-lite
train-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.PHONY: inference-salsa-lite
inference-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/seld_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

After a few epochs, the models seemed to converge well, so I believe all the above modifications were successful. Let me know if I am missing something!

@kwatcharasupat kwatcharasupat added the enhancement New feature or request label Feb 3, 2022
@kwatcharasupat kwatcharasupat pinned this issue Feb 3, 2022
@thomeou
Owner

thomeou commented Feb 22, 2022

Hi Andres,

Thank you very much for your detailed notes on running the code in this repo. I am terribly sorry for replying to you this late.

For SALSA-Lite, the other repo was for publication purposes. This current repo will be used to maintain both SALSA and SALSA-Lite.

Regarding the relative paths: I should have put train.py, inference.py, and evaluate.py outside the experiments folder. Since I put them inside the experiments folder, using relative paths gave me some errors when running the Makefile from VS Code, but no problems when using PyCharm. Which IDE are you using?

For the installation, I exported my current Anaconda environment, and it was bloated as I kept installing more packages ^^. Thanks for fixing it.

For the pretrained models, would you like to upload any of your models? My previously trained models have some extra components that I experimented with but that do not seem to help. I will try to upload a pretrained model soon.

We are happy to receive your PR to merge the changes that you have addressed. If you do not mind, please also feel free to edit the README to include the information on the running time and storage size.

I am very happy that this repo seems to be useful for you. Thanks again.

@dududukekeke

Hi Andres, when I run this instruction: pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html, I get an error like this: ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'. Could you tell me how to solve it? Thank you very much!

@kwatcharasupat
Collaborator

Hi @xieyin666, it looks like you might not be in the correct directory. Please make sure you are in the directory where the requirement file is present, or you can call pip install -r path/to/requirements.txt instead.

@dududukekeke

Hi @xieyin666, it looks like you might not be in the correct directory. Please make sure you are in the directory where the requirement file is present, or you can call pip install -r path/to/requirements.txt instead.

Thank you for your reply. I just downloaded your original code, but after unzipping it, the file requirements.txt is not in it.

@Peter-72

Hi @xieyin666, it looks like you might not be in the correct directory. Please make sure you are in the directory where the requirement file is present, or you can call pip install -r path/to/requirements.txt instead.

Thank you for your reply. I just downloaded your original code, but after unzipping it, the file requirements.txt is not in it.

That is because there is no "requirements.txt". As I understand it, you have to create one yourself and add the dependencies that @andres-fr mentioned worked for him.

@andres-fr
Contributor Author

@xieyin666 @karnwatcharasupat @Peter-72
Hi everyone! Sorry, I meant to submit a PR but forgot; I will try to do it in the next few weeks. For the moment, Peter is right: you need to create requirements.txt from my previous post. Let me know by mentioning me with @ if there are any further issues and you want my input (otherwise I don't get notified). Cheers!

@dududukekeke

Hi @xieyin666, it looks like you might not be in the correct directory. Please make sure you are in the directory where the requirement file is present, or you can call pip install -r path/to/requirements.txt instead.

Thank you for your reply. I just downloaded your original code, but after unzipping it, the file requirements.txt is not in it.

That is because there is no "requirements.txt". As I understand it, you have to create one yourself and add the dependencies that @andres-fr mentioned worked for him.

OK, thank you!

@dududukekeke

@xieyin666 @karnwatcharasupat @Peter-72 Hi everyone! Sorry, I meant to submit a PR but forgot; I will try to do it in the next few weeks. For the moment, Peter is right: you need to create requirements.txt from my previous post. Let me know by mentioning me with @ if there are any further issues and you want my input (otherwise I don't get notified). Cheers!

@andres-fr
Hi, thank you so much for sharing! I do have a question that I need your help with: what code do I need to change for multi-GPU training?

@andres-fr
Contributor Author

What code do I need to change for multi-GPU training?

@xieyin666 I believe that is an enhancement that is going to require some effort and can't be addressed here.

In issue #4 I have provided code for on-the-fly, parallelized computation. You could use that as a basis, and then check e.g. the DataParallel API in PyTorch, or Ray, to distribute the batches across GPUs.
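For illustration only, here is a minimal, generic DataParallel sketch (not tied to this repo's training code, which is built on PyTorch Lightning; the model and batch below are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 12))  # placeholder model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model and split each batch across visible GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(32, 128, device=device)  # dummy batch
y = model(x)  # forward pass is scattered across GPUs; outputs are gathered on device 0

Since this repo uses PyTorch Lightning, the more idiomatic route is probably the Trainer's own multi-GPU options (e.g. the gpus/accelerator arguments in the 1.1.x API), but I haven't checked how train.py builds its Trainer, so treat that as an assumption.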

If you manage to do it, it would be great if you can share your process and results in a separate issue. Cheers

@Peter-72

Peter-72 commented Apr 25, 2022

Hi @thomeou and @karnwatcharasupat,

I would like to congratulate and thank you for your amazing work on this project. I am a computer science student at the German University in Cairo. I am currently in my bachelor semester, and my bachelor project is about one of the approaches to the DCASE 2021 challenge, namely "Spectrotemporally-Aligned Features and Long Short-Term Memory for Sound Event Localization and Detection", which is yours. My task is to understand the challenge and your approach and, if possible, help improve it. As I am new to the world of deep learning and PyTorch, I am having problems running/training the model. Thanks to @andres-fr I have progressed a lot in setting the stage, but I am still not able to run the model. So here is what I have done; please help fill in the gaps if possible:

  1. Applied the following commands:
conda create -n salsa python=3.7
conda activate salsa
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
  2. For the TAU-NIGENS Spatial Sound Events 2021 dataset, I moved the wav files from the subfolders dev_train, dev_val, dev_test to the outer folder foa_dev or mic_dev,

    and copied the dev.csv, test.csv, train.csv, and val.csv to foa_dev. @andres-fr I am not sure that I understood that correctly; would you please confirm?

  3. The README specifies that the dataset should be found inside <REPO_ROOT>/dataset/data. For that reason, we can get rid of the absolute paths in the config files and replace them with the following relative paths. In tnsse2021_salsa_feature_config.yml:

data_dir: 'dataset/data'
feature_dir: 'dataset/features'
  4. Regarding config, here we can also replace absolute with relative paths:

feature_root_dir: 'dataset/features/salsa/mic/24000fs_512nfft_300nhop_5cond_4000fmaxdoa'  # for SALSA
feature_root_dir: 'dataset/features/salsa_lite/mic/24000fs_512nfft_300nhop_2000fmaxdoa'  # for SALSA-Lite
gt_meta_root_dir: 'dataset/data'
  5. I also applied the Makefile changes suggested by @andres-fr:

As for make train, it is currently hardcoded to SALSA, and the instructions to train on SALSA-Lite didn't work for me. I changed the "Training and inference" section in the Makefile to the following, so that we can train on both via either make train-salsa or make train-salsa-lite:

# Training and inference
CONFIG_DIR=./experiments/configs
OUTPUT=./outputs   # Directory to save output
EXP_SUFFIX=_test   # the experiment name = CONFIG_NAME + EXP_SUFFIX
RESUME=False
GPU_NUM=0  # Set to -1 if there is no GPU

.PHONY: train-salsa
train-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.PHONY: inference-salsa
inference-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/seld.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

.PHONY: train-salsa-lite
train-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.PHONY: inference-salsa-lite
inference-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/seld_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

I understand that this is a useful option, but I don't understand whether I should replace the whole file with those lines or just a specific part?

These are the things I did. After this, as I understand it, I should execute the make salsa command, but I don't know where to do so. I tried running it in the VS Code terminal in the root folder, but it does not recognize "make". Also, I want to ask how much VRAM it will consume. I have a 6 GB NVIDIA RTX 2060 mobile GPU; if my VRAM isn't enough, what parameters should I change in the code to decrease the VRAM usage, and what will be the impact of doing so?

That's it.
Looking forward for your replies.
Thanks in advance.

@dududukekeke

These are the things I did. After this, as I understand it, I should execute the make salsa command, but I don't know where to do so. I tried running it in the VS Code terminal in the root folder, but it does not recognize "make". Also, I want to ask how much VRAM it will consume. I have a 6 GB NVIDIA RTX 2060 mobile GPU; if my VRAM isn't enough, what parameters should I change in the code to decrease the VRAM usage, and what will be the impact of doing so?

@andres-fr
This is really difficult; this version of PyTorch Lightning is new and there is very little information about it available online.

@dududukekeke


@Peter-72 Everything you've done should be right. However, I haven't found a better way to solve the problem of GPU memory possibly not being enough. All I can think of so far is multi-GPU training, but I don't know how to change the code (for the PyTorch Lightning code).
@andres-fr @karnwatcharasupat @thomeou I'd like to hear your ideas, too.

@andres-fr
Contributor Author


@Peter-72 Everything you've done should be right. However, I haven't found a better way to solve the problem of GPU memory possibly not being enough. All I can think of so far is multi-GPU training, but I don't know how to change the code (for the PyTorch Lightning code).
@andres-fr @karnwatcharasupat @thomeou I'd like to hear your ideas, too.

Try a smaller batch size

@Peter-72

@karnwatcharasupat, @thomeou, @andres-fr, @xieyin666
Hi all,

  1. I noticed that there are more directories that no one mentioned should be changed. There are output_dir and exp_group_dir, which contain the author's personal paths. Shouldn't they be changed?
  2. There is a problem that I have been trying to fix for a while, but I couldn't at all. Every time I issue a make command, it fails and says:
$make train
process_begin: CreateProcess(NULL, pwd, ...) failed.
PYTHONPATH= CUDA_VISIBLE_DEVICES="0  " python experiments/train.py --exp_config="./experiments/configs/seld.yml" --exp_group_dir=./outputs    --exp_suffix=_test    --resume=False
'PYTHONPATH' is not recognized as an internal or external command,
operable program or batch file.
make: *** [train] Error 1

Each make target has a slightly different message in terms of the PYTHONPATH= CUDA_VISIBLE_DEVICES="0  " ... part, but it always ends with:

'PYTHONPATH' is not recognized as an internal or external command, operable program or batch file.
make: *** [train] Error 1

I have checked everything in my settings, including the environment variables and the VS Code Python interpreter path, and both point to the salsa env created for the project. I have run out of solutions; any help, please? Also, is there something in the code that targets this issue?

@andres-fr
Contributor Author

@Peter-72 Try removing the space after PYTHONPATH=. Sorry, I don't have a computer at the moment.

@Peter-72

@Peter-72 Try removing the space after PYTHONPATH=. Sorry, I don't have a computer at the moment.

@andres-fr
No problem.
Unfortunately, it didn't work:

process_begin: CreateProcess(NULL, pwd, ...) failed.
PYTHONPATH=CUDA_VISIBLE_DEVICES="0  " python experiments/train.py --exp_config="./experiments/configs/seld.yml" --exp_group_dir=./outputs    --exp_suffix=_test    --resume=False
'PYTHONPATH' is not recognized as an internal or external command,
operable program or batch file.
make: *** [train] Error 1

@Peter-72

@andres-fr @thomeou

I switched to PyCharm, as many people online suggested, but the problem still persists.

@andres-fr
Contributor Author

Dear @Peter-72,

To your original questions:

  1. I don't remember well; maybe I didn't test all corner cases, but I got it to train.

  2. You can add my four phony entries to the Makefile; they are purposely named differently. Then use them instead of the others. Make is a relevant tool, so try to install it.

To reduce GPU RAM usage, reduce the batch size in the config files (look for "conf" inside the repo).

Again, sorry, no PC here. Cheers!

@dududukekeke

@andres-fr @thomeou @karnwatcharasupat @Peter-72
Hi, the results when I replicate the code are not stable; they go up and down a lot. Do you have the same situation?

@dududukekeke

@andres-fr I used the author's code to train three times in a row and tested the best model each time, but the three sets of results were very different. That is, the model is not stable.

@Peter-72

@thomeou @karnwatcharasupat @andres-fr
When training the model using the mic files, the training gets killed after 3-5 minutes. In the task manager, I can see that the model is eating up all the RAM (surprisingly, not the GPU RAM). However, I don't think that alone is the problem: the model uses around 90-98% of my 16 GB of memory (around 12 GB, and this range includes my normal programs' usage), but it never reaches 100%, which would explain the process being killed. I gradually changed the train_batch_size and val_batch_size variables in the seld.yml file from 32 all the way down to 5, hoping that it would work, but it didn't. As for GPU usage, it may spike from 0% to ~40% for a second and then return to 0%. I can't identify the problem. Here is the terminal output:

$ make train
PYTHONPATH=/mnt/d/GUC/Semester-8/SALSA CUDA_VISIBLE_DEVICES="0  " python experiments/train.py --exp_config="./experiments/configs/seld.yml" --exp_group_dir=./outputs    --exp_suffix=_test    --resume=False
**********************************************************
****** Start new experiment ******************************
**********************************************************

Timestamp: 2022-05-11-20-41-52
Log file is created in ./outputs/crossval/mic/salsa/seld_test/logs.
Write yaml config file to ./outputs/crossval/mic/salsa/seld_test/configs
Finish parsing config file: ./experiments/configs/seld.yml.
Global seed set to 2021
Load feature from dataset/features/salsa/mic/24000fs_512nfft_300nhop_5cond_4000fmaxdoa.
train_chunk_len = 640, train_chunk_hop_len = 40
test_chunk_len = 4800, test_chunk_hop_len = 4808
Create DataModule using train val split at dataset/meta/dcase2021/original.
Number of files in split train is 400
Killed
make: *** [Makefile:38: train] Error 137

Notice that the numbers don't change no matter how many times I change the chunk size.

@andres-fr
Contributor Author

Hi @Peter-72

Glad to see you made it to training! Maybe the others have some ideas, but let me give you my two cents:

What you describe sounds like a memory leak: in Python we usually don't worry about memory management because the garbage collector takes care of "removing" unused objects, but leaks that lead to RAM bloating can still happen, mainly when our code constantly creates new objects and keeps references to them even after they aren't needed anymore (e.g. some inner variable in a for loop is being added to a data structure).

This happens mainly through two patterns: either we do it explicitly, by using deep recursion or accumulating some data structure (rather unusual), or we use a library that implicitly collects our computations into a global memory scope without telling us. Whenever I have leaks, the latter is usually the case: e.g. if you are generating large matplotlib renderings using the global context (things like plt.plot, plt.imshow), old plots will end up being stored in RAM and the garbage collector can't get rid of them. See here.
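As a hypothetical illustration of that matplotlib pattern (this is not code from the repo), explicitly creating and closing figures avoids the global-context accumulation:

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for rendering in a loop
import matplotlib.pyplot as plt
import numpy as np

for i in range(100):
    fig, ax = plt.subplots()
    ax.imshow(np.random.rand(256, 256))
    fig.savefig(f"frame_{i}.png")
    plt.close(fig)  # without this, figures pile up in pyplot's global state and RAM keeps growing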

Maybe the others can chip in with a better idea (since I didn't write the original code), but to find a leak I'd comment out the whole loop and then add/remove parts of the code until you find the ones that lead to RAM bloating.

Cheers!
Andres

@andres-fr
Contributor Author

@Peter-72 After looking more carefully, it could be that you just don't have enough RAM.
Depending on your OS and settings, when you get close to 100% usage, some systems re-allocate data to the so-called "swap" region of persistent memory, precisely to avoid running completely out of RAM. For that, the OS has to guess which things are least necessary, since persistent memory is slower than RAM and you don't want to be constantly accessing it.

If I recall correctly, the script tries to load all 400 chunks at the same time, which would mean you need to make some fundamental changes in the code and/or here.

@Peter-72

Peter-72 commented May 11, 2022

@andres-fr Hey man, thanks for replying.
Yes, I have come a long way.
I like your suggestion; I did it and found the line responsible. In database.py, from line 231, starting here:

if len(features_list) > 0:
    features = np.concatenate(features_list, axis=1)
    sed_targets = np.concatenate(sed_targets_list, axis=0)
    doa_targets = np.concatenate(doa_targets_list, axis=0)
    sed_chunk_idxes = np.concatenate(feature_idxes_list, axis=0)
    gt_chunk_idxes = np.concatenate(gt_idxes_list, axis=0)
    test_batch_size = len(feature_idxes)  # to load all chunks of the same file
    return features, sed_targets, doa_targets, sed_chunk_idxes, gt_chunk_idxes, filename_list, test_batch_size
else:
    return None, None, None, None, None, None, None

I made sure it reached that condition. But after getting inside the condition, it stays on and keeps computing the line
features = np.concatenate(features_list, axis=1) until it is killed. Now what? I understand that concatenation can be disastrous, but does it need that much RAM? How much RAM do you guys have in the first place? @thomeou @karnwatcharasupat

@kwatcharasupat
Collaborator

Hi @Peter-72, @dududukekeke, would you mind creating separate issues for each of your concerns so we can address them more cleanly? Thank you!

As for the memory issues from @Peter-72: our version was trained on a very large server (a few hundred GB of RAM), so we admittedly didn't quite optimize that part of the code. I will have to take a closer look and do a proper optimization, but a quick duct-tape fix you can try now would be the following:

  • create a helper function called load_chunk_data_mini or something
  • replace load_chunk_data with that function in

    SALSA/dataset/database.py

    Lines 157 to 159 in 3c03140

    features, sed_targets, doa_targets, feature_chunk_idxes, gt_chunk_idxes, filename_list, test_batch_size = \
        self.load_chunk_data(split_filenames=split_filenames, split_sed_feature_dir=split_sed_feature_dir,
                             gt_meta_dir=gt_meta_dir)
  • within load_chunk_data_mini, break split_filenames into a few smaller lists and call load_chunk_data on each of them.
  • call .contiguous on those arrays
  • concat the outputs of those minichunks

This should help reduce the RAM requirement during the concat ops. I will write a proper edit to that code when I have more time.
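Untested sketch of what I mean (the names and return order are taken from the snippet above; the index-offset handling is an assumption that should be verified against the actual code, and dropped if the chunk indexes are per-file):

import numpy as np

def load_chunk_data_mini(self, split_filenames, split_sed_feature_dir, gt_meta_dir, n_parts=8):
    """Hypothetical helper: call load_chunk_data on smaller file lists to cap peak RAM."""
    feats, seds, doas, feat_idxes, gt_idxes, fnames = [], [], [], [], [], []
    feat_offset, gt_offset, test_batch_size = 0, 0, None
    for part in np.array_split(np.asarray(split_filenames), n_parts):
        if len(part) == 0:
            continue
        f, s, d, fi, gi, fl, tbs = self.load_chunk_data(
            split_filenames=list(part),
            split_sed_feature_dir=split_sed_feature_dir,
            gt_meta_dir=gt_meta_dir)
        if f is None:
            continue
        feats.append(np.ascontiguousarray(f))
        seds.append(s)
        doas.append(d)
        # Assumption: chunk indexes point into the concatenated arrays, so shift them
        # by what has been accumulated so far.
        feat_idxes.append(fi + feat_offset)
        gt_idxes.append(gi + gt_offset)
        fnames.extend(fl)
        test_batch_size = tbs
        feat_offset += f.shape[1]   # features are stacked along axis 1 (time frames)
        gt_offset += s.shape[0]     # targets are stacked along axis 0
    if not feats:
        return None, None, None, None, None, None, None
    return (np.concatenate(feats, axis=1), np.concatenate(seds, axis=0),
            np.concatenate(doas, axis=0), np.concatenate(feat_idxes, axis=0),
            np.concatenate(gt_idxes, axis=0), fnames, test_batch_size)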

@andres-fr
Contributor Author

Shameless plug:

Whenever you are dealing with large amounts of numerical data that don't fit in RAM, the HDF5 format is a very flexible way of storing them in persistent memory, while allowing for relatively fast read/write operations and leaving most of your RAM untouched.

I developed this (IMO extremely useful) class to create and parse incremental HDF5 files in Python, which I've been using in PyTorch dataloaders on large datasets. If the "chunk size" is properly chosen, this can lead to huge speedups as compared to regularly loading from filesystem + precomputing: https://gist.github.com/andres-fr/3ed6b080adafd72ee03ed01518106a15

Cheers
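For anyone curious, the underlying h5py mechanism is roughly this (a generic sketch with made-up shapes, unrelated to the repo's actual feature format):

import h5py
import numpy as np

# Write: a dataset that grows along its first axis, stored on disk instead of in RAM.
with h5py.File("features.h5", "w") as h5f:
    dset = h5f.create_dataset("salsa", shape=(0, 7, 200), maxshape=(None, 7, 200),
                              dtype="float32", chunks=(1024, 7, 200))
    for _ in range(10):  # e.g. one iteration per precomputed file
        block = np.random.rand(3000, 7, 200).astype("float32")  # dummy feature block
        dset.resize(dset.shape[0] + block.shape[0], axis=0)
        dset[-block.shape[0]:] = block  # appended incrementally, never all in memory at once

# Read: a dataloader can later slice arbitrary chunks without loading the whole file.
with h5py.File("features.h5", "r") as h5f:
    chunk = h5f["salsa"][5000:5640]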

@Peter-72

Peter-72 commented May 14, 2022

Hey @karnwatcharasupat @andres-fr,

Thanks for both of your replies.

@karnwatcharasupat I have implemented the helper method that you suggested, and it worked like a charm. But I don't understand the importance of calling .contiguous on the arrays; I wrote the method without it. I have also added a separate issue here. If acceptable, I can open a PR from a separate branch that contains:

  1. The new helper method to minimize the RAM usage
  2. Adjusted paths from absolute to relative
  3. Requirements.txt file added
  4. And possibly, in the future, a more detailed README file

@thomeou @karnwatcharasupat I want to ask about your choice of n_mels variable value in the seld.yml; isn't 200 a bit too much?

@andres-fr Cheers for your work on the class you created. Could you tell me how to add/use it in this project? :)

@kwatcharasupat
Collaborator

@Peter-72 Thanks a lot. We can continue the discussion in #7.

With regards to n_mels, that's a little bit of a misnomer; the same key is used for both the mel- and linear-frequency modes. We actually use a linear-frequency log-magnitude spectrogram for SALSA, so the 200 is actually the number of linear frequency bins (not mel bins).

Anyway, you can read up more about using HDF5 here: https://docs.h5py.org/en/stable/. They have quite a decent explanation. Lots of other resources on the internet as well.

@000yl

000yl commented Mar 23, 2023

@Peter-72 Everything you've done should be right. However, I haven't found a better way to solve the problem of GPU memory possibly not being enough. All I can think of so far is multi-GPU training, but I don't know how to change the code (for the PyTorch Lightning code).
@andres-fr @karnwatcharasupat @thomeou I'd like to hear your ideas, too.

Try a smaller batch size

Did you solve this in the end? Shrinking the batch_size still doesn't work for me; what should I do?

@000yl

000yl commented Mar 23, 2023

Notice that the numbers don't change no matter how many times I change the chunk size.

How did you solve this in the end? Which parts of the code did you modify? @andres-fr @Peter-72

@muuda

muuda commented Apr 21, 2023

Hello, I have a question about how phase_vector is calculated.
Why is it phase_vector = np.angle(X[:, :, 1:] * np.conj(X[:, :, 0, None]))?
Thank you very much! @andres-fr

@andres-fr
Contributor Author

andres-fr commented Apr 21, 2023

Hi everyone,

If you have a specific question that hasn't been tackled yet, kindly open a new issue, making sure that you provide us with all the necessary info to help you. This may include:

  1. Make sure you read the existing documentation (or the paper, for explanations of the math).
  2. If you still have a question, please write in English, since more people can help you that way.
  3. Provide context for the code or elements you want to discuss (ideally with links to the impacted lines of code).
  4. If discussing a bug, provide minimal working examples.

Note that these are not compulsory, but they vastly improve the chances that we are able to help you.

For example, regarding the question above: normalized complex numbers differ only in their angles, or "phases". Conjugating a complex number amounts to multiplying its angle by -1, and multiplying complex numbers causes their angles to add up. So this was probably a way to obtain the phase differences of all channels with respect to a reference channel. But it's been a long time, and without a link to the code I don't recall the context... so please follow the guidelines above for more info. In the meantime I will close this issue.
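To make that concrete, here is a tiny self-contained check of the identity being used (plain NumPy, independent of the repo; assuming the last axis of X indexes the microphone channels, the expression gives the phase of each channel relative to channel 0 at every time-frequency bin):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5) + 1j * rng.standard_normal(5)  # stand-in for STFT bins of channel m
b = rng.standard_normal(5) + 1j * rng.standard_normal(5)  # stand-in for STFT bins of channel 0

# angle(a * conj(b)) equals angle(a) - angle(b), wrapped to (-pi, pi].
lhs = np.angle(a * np.conj(b))
rhs = np.angle(np.exp(1j * (np.angle(a) - np.angle(b))))  # explicitly wrapped difference
print(np.allclose(lhs, rhs))  # True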

Cheers!
Andres
