Migrate to HF datasets for hidden state storage #61

Closed
norabelrose opened this issue Feb 15, 2023 · 2 comments · Fixed by #95
Labels: bug, enhancement, refactor

Comments

@norabelrose
Member

Storing hidden states in a HuggingFace dataset is a cleaner and more scalable solution: it makes it easier to store metadata alongside the hidden states and to load subsets of the hidden states into memory.

norabelrose added the enhancement label on Feb 15, 2023
norabelrose added this to the PyPI 0.1 Release milestone on Feb 15, 2023
@norabelrose
Member Author

norabelrose commented Feb 15, 2023

Currently, trying to extract hidden states from EleutherAI/pythia-12b-deduped on all of IMDB leads to an out-of-memory error because of how we're handling storage right now, so this is also sort of a bug.

norabelrose added the bug and refactor labels on Feb 15, 2023
@AlexWan0
Collaborator

Added a branch, multiprocessing. It's pretty rough right now; the three biggest things missing are 1) specifying Features in the dataset (right now it just pushes hiddens and labels), 2) fixing an annoying multiprocessing bug, and 3) making sure the Dataset object returned by extract_hiddens still works with everything else.

To elaborate a bit on (2): datasets forks (os.fork) to create new processes, but most people recommend using the spawn start method instead, since otherwise you can get this error when you try to move your model onto the GPU: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
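For reference, this is roughly what "use spawn" means with Python's standard multiprocessing module. It is only a sketch of the start-method mechanism, not of how datasets creates its workers; `spawn_context` and `run_in_spawned_workers` are hypothetical helper names:

```python
import multiprocessing as mp


def spawn_context():
    """Return a context bound to the 'spawn' start method.

    get_context leaves the global default start method untouched, so other
    libraries in the same process keep whatever method they expect.
    """
    return mp.get_context("spawn")


def run_in_spawned_workers(fn, items, num_workers=2):
    """Map fn over items in freshly spawned (not forked) worker processes.

    Spawned workers start from a clean interpreter, so they inherit no CUDA
    state from the parent. fn must be a picklable, module-level function.
    """
    with spawn_context().Pool(processes=num_workers) as pool:
        return pool.map(fn, items)
```

Since spawned children re-import the main module, a call like `run_in_spawned_workers(abs, [-1, -2])` should sit under an `if __name__ == "__main__":` guard in a script.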

One option would be to force Datasets to use the spawn start method when creating new processes, but that might involve messing with the internals of their implementation (I think), which would not be ideal, especially because you can actually use CUDA with os.fork: my barebones implementation of saving hidden states works fine with multiple processes + os.fork + CUDA. It's when I integrate it into elk that it breaks. My guess right now is that somewhere before extract_hiddens is called, some part of the model is secretly being moved onto GPU, and CUDA ends up freaking out when a forked process tries to reinitialize it after it was already initialized in the main process.
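One way this guess could be checked, assuming PyTorch is installed: torch.cuda.is_initialized() reports whether the current process has already initialized PyTorch's CUDA state. The `assert_cuda_uninitialized` helper below is hypothetical; calling it at a few points before extract_hiddens might narrow down where the GPU is first touched:

```python
import torch


def assert_cuda_uninitialized(where: str) -> None:
    """Raise if this process has already initialized CUDA.

    Intended to be called in the parent process before any forking happens;
    `where` is a free-form label that ends up in the error message.
    """
    if torch.cuda.is_initialized():
        raise RuntimeError(
            f"CUDA was already initialized in the parent process ({where}); "
            "forked workers will not be able to re-initialize it."
        )
```

If the check passes right up until the fork, the problem is more likely somewhere else.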


3 participants