Migrate to HF datasets for hidden state storage #61

norabelrose · 2023-02-15T00:08:04Z

Storing hidden states in a HuggingFace dataset is a cleaner and more scalable solution: it makes it easier to store metadata alongside the hidden states and to load only a subset of the hidden states into memory.

norabelrose · 2023-02-15T00:10:24Z

Currently, trying to extract hiddens from EleutherAI/pythia-12b-deduped on all of IMDB leads to an out-of-memory error because of how we're handling hidden state storage right now, so this is also sort of a bug.

AlexWan0 · 2023-02-18T09:06:15Z

Added the branch multiprocessing. It's pretty rough right now. The three biggest things missing are 1) specifying Features in the dataset (right now it just pushes hiddens and labels), 2) fixing an annoying multiprocessing bug, and 3) making sure the Dataset object returned by extract_hiddens still works with everything else.

To elaborate a bit on (2): datasets uses os.fork to create new processes, but most people recommend the spawn start method instead, as otherwise you can get this error when you try to move your model onto the GPU: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

One option would be to force Datasets to use spawn when creating new processes, but that might involve messing with the internals of their implementation (I think), which would not be ideal, especially because you can actually use CUDA with os.fork: my barebones implementation of saving hidden states works fine with multiple processes + os.fork + CUDA. It's when I integrate it into elk that it breaks. My guess right now is that somewhere before extract_hiddens is called, some part of the model is secretly being moved onto the GPU, and CUDA then refuses to re-initialize in a forked process because it was already initialized in the main process.
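For reference, the fork-versus-spawn choice can be sketched with the standard library alone, without touching Datasets internals. This is only an illustration of the idea: `choose_start_method`, `worker`, and `run_workers` are hypothetical names, and the worker is a placeholder for the step that would move a model onto a GPU.

```python
import multiprocessing as mp


def choose_start_method(cuda_initialized_in_parent: bool) -> str:
    """Hypothetical helper: pick a start method that is safe for CUDA workers.

    A forked child inherits the parent's CUDA state and cannot re-initialize
    it, so fall back to spawn whenever the parent has already touched CUDA
    (or when fork is unavailable on this platform).
    """
    if cuda_initialized_in_parent or "fork" not in mp.get_all_start_methods():
        return "spawn"
    return "fork"


def worker(rank: int) -> int:
    # Placeholder: a real worker would load the model and move it to its GPU here.
    return rank


def run_workers(num_workers: int, cuda_initialized_in_parent: bool) -> list:
    # get_context returns an isolated context, so the global start method
    # used by other libraries is left untouched.
    ctx = mp.get_context(choose_start_method(cuda_initialized_in_parent))
    with ctx.Pool(num_workers) as pool:
        return pool.map(worker, range(num_workers))
```

When using spawn, call `run_workers` from under an `if __name__ == "__main__":` guard, since each spawned child re-imports the main module.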

norabelrose added the enhancement (New feature or request) label Feb 15, 2023

norabelrose added this to the PyPI 0.1 Release milestone Feb 15, 2023

norabelrose assigned AlexTMallen Feb 15, 2023

norabelrose added the bug (Something isn't working) and refactor (Code change for clarity/extensibility/etc.) labels Feb 15, 2023

norabelrose assigned AlexWan0 Feb 16, 2023

norabelrose mentioned this issue Feb 25, 2023

Migrate to HF datasets for storing extracted hidden states #95

Merged

norabelrose linked a pull request Feb 25, 2023 that will close this issue

Migrate to HF datasets for storing extracted hidden states #95

Merged

norabelrose closed this as completed in #95 Mar 2, 2023
