
Optimizing the RAM consumption when preparing data for training #7

Status: Open
Peter-72 opened this issue on May 14, 2022 · 1 comment
Labels: enhancement (New feature or request), will-take-a-while (It will likely be a while before a proper fix is made; please do not expect an immediate fix)


Peter-72 commented May 14, 2022

The load_chunk_data method consumes a huge amount of RAM when concatenating np arrays.

I am currently trying to implement something that will reduce the RAM consumption.
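For illustration, one way to cut the peak usage is to preallocate the final array once and fill it in place, instead of letting repeated np.concatenate calls re-copy everything accumulated so far. This is only a sketch; the function name, arguments, and shapes are placeholders, not the project's actual API:

```python
import numpy as np

# Sketch only: preallocate the output once, then copy each chunk into its
# slice. Repeated np.concatenate would instead allocate a new array and
# re-copy all previous data on every call, inflating peak RAM.
def load_chunk_data_preallocated(chunk_paths, n_channels, chunk_len, n_feats):
    out = np.empty((n_channels, chunk_len * len(chunk_paths), n_feats),
                   dtype=np.float32)
    for i, path in enumerate(chunk_paths):
        chunk = np.load(path)  # assumed shape: (n_channels, chunk_len, n_feats)
        out[:, i * chunk_len:(i + 1) * chunk_len, :] = chunk
    return out
```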

@karnwatcharasupat @thomeou I am happy to open a PR when I am done, if that's acceptable to you.

PS: I noticed that the previous method never worked, and I apologize for not properly testing it; I am trying something new now.

@karnwatcharasupat The splitting idea didn't work, even after I fixed it to actually concatenate the chunks, because in the end I would still be concatenating np arrays that eventually reach a shape of (7, 1920000, 200), which is unmanageable anyway. I had an idea to not concatenate them at all, but to export them to db_data in the get_split method, like this for example:

```python
db_data = {
    'features': features,        # the features, split into 4 chunks
    'features_2': features_2,    # instead of one giant concatenated array
    'features_3': features_3,
    'features_4': features_4,
    'sed_targets': sed_targets,
    'doa_targets': doa_targets,
    'feature_chunk_idxes': feature_chunk_idxes,
    'gt_chunk_idxes': gt_chunk_idxes,
    'filename_list': filename_list,
    'test_batch_size': test_batch_size,
    'feature_chunk_len': self.chunk_len,
    'gt_chunk_len': self.chunk_len // self.label_upsample_ratio
}
```
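For illustration, the four chunks could be produced without ever materializing the full array, along these lines (feature_list and the grouping are placeholders, not the repository's actual code):

```python
import numpy as np

# Hypothetical: per-file feature arrays of shape (7, T_i, 200), loaded from
# disk before any global concatenation, e.g.:
# feature_list = [np.load(p) for p in feature_paths]

# Divide the files into 4 groups and concatenate each group separately along
# the time axis, so the full (7, 1920000, 200) array never exists at once.
groups = np.array_split(np.arange(len(feature_list)), 4)
features, features_2, features_3, features_4 = (
    np.concatenate([feature_list[i] for i in idxs], axis=1) for idxs in groups
)
```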

Here features, features_2, features_3, and features_4 are just features split into 4 chunks, produced roughly as in the sketch above. The use of features across the whole project would then be adjusted to include the other features sequentially. I have already developed such a method to export 4 arrays, but I am still exploring the code to better understand it before changing how it works. Currently, I can see that the get_split method is called during training in datamodule.py, specifically in

```python
train_db = self.feature_db.get_split(split=self.train_split, split_meta_dir=self.split_meta_dir, stage='fit')
```

and in

```python
val_db = self.feature_db.get_split(split=self.val_split, split_meta_dir=self.split_meta_dir, stage='inference')
```

The call that assigns the train_db variable is currently my problem.
If you have an idea how to add the chunks part to the code, please let me know.
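For the "use the chunks sequentially" part, this is roughly what I have in mind; the keys follow the db_data dict above, everything else is illustrative:

```python
# Hypothetical helper: yield one feature chunk at a time so downstream code
# can process and release each chunk before the next one is touched.
FEATURE_KEYS = ('features', 'features_2', 'features_3', 'features_4')

def iter_feature_chunks(db_data):
    for key in FEATURE_KEYS:
        yield db_data[key]
```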

kwatcharasupat (Collaborator) commented:

Hi @Peter-72, you are welcome to create a PR. Thanks!

kwatcharasupat added the enhancement label on May 17, 2022
kwatcharasupat added the will-take-a-while label on May 24, 2022