WIP: Implement FSDP, drop usage of GeneratorBuilder, DIY caching #221

Closed
wants to merge 168 commits into main from refactor-datasets-usage

Conversation

@thejaminator (Collaborator) commented Apr 29, 2023

Unfortunately this is pretty big. I had to change a few things to get FSDP working.

  • We now call an InferenceServer which serves the model predictions. It works for both FSDP and one-model-per-GPU serving.
  • We no longer depend on Hugging Face's dataset builder. Previously we used it for multiprocessing with one model per process, but now our InferenceServer manages that, and you can't use multiprocessing to call the InferenceServer from separate processes.
  • Instead we just manually collect the results and build our dataset ourselves.
  • Because of that, we need to DIY our own cache.
  • And to make sure the workers in the InferenceServer are fully utilized, we call the inference server from multiple threads. The InferenceServer is designed to be thread-safe, so hopefully it works — see the sketch after this list.
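As a rough illustration of the threaded client side (the method name `infer` and the exact interface are assumptions here, not the real InferenceServer API), the shape of it is something like:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: `server.infer` stands in for whatever method the
# InferenceServer actually exposes; the real signature may differ.
def run_extraction(server, batches, num_threads: int = 8):
    """Feed batches to the (thread-safe) InferenceServer from several threads
    so its GPU workers stay busy while each thread blocks on its result."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map preserves input order, so outputs line up with `batches`
        return list(pool.map(server.infer, batches))
```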

You can try it out with:

elk elicit huggyllama/llama-{7b,13b,30b,65b} imdb --fsdp_enabled --num_gpus {2-8}

Issues

Figuring out the memory required

The --min_gpu_mem flag can be passed. It indicates the memory required for the whole model.

--min_gpu_mem {memory_required_for_whole_model}
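As a rough rule of thumb (my own back-of-the-envelope sketch, not the project's exact accounting), the whole-model figure is roughly parameter count times bytes per parameter, plus some headroom:

```python
# Back-of-the-envelope estimate; the real requirement also depends on
# activations, buffers, and dtype, so treat this as a lower bound with headroom.
def estimate_model_bytes(num_params: float, bytes_per_param: int = 2, overhead: float = 1.2) -> int:
    return int(num_params * bytes_per_param * overhead)

# llama-7b in fp16: ~7e9 params * 2 bytes ≈ 14 GB, ~16.8 GB with 20% headroom
print(estimate_model_bytes(7e9))
```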

mkl

You may encounter an error like this (there's a related GitHub issue):

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

To fix it, run this before launching. I'm still figuring out why this happens; it's supposed to be fixed in the latest mkl package, but it isn't for me.

export MKL_THREADING_LAYER=GNU
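Alternatively (as the error message itself hints), you can force the threading layer from Python before numpy or torch gets imported; a minimal sketch:

```python
import os

# Must run before numpy (or anything that imports it, e.g. torch) is loaded,
# otherwise MKL has already chosen its threading layer.
os.environ.setdefault("MKL_THREADING_LAYER", "GNU")

import numpy as np  # noqa: E402  (deliberately imported after setting the env var)
```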

too many open files


Sometimes it'll complain about too many open files. Increase the ulimit:

ulimit -n 4048
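If you'd rather not rely on the shell, the same soft limit can be raised from inside the Python process on Linux/macOS (a sketch, capped at whatever hard limit the OS allows):

```python
import resource

# In-process equivalent of `ulimit -n`: raise the soft limit on open file
# descriptors, but never above the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4048, hard), hard))
```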

❤️ QA instructions

Check out this branch: refactor-datasets-usage.
Run elicit with huggyllama llama-7b with the variations below.
For each run, check that the eval.csv files are roughly the same, and let me know if it crashes.
Note that we disable the extraction cache here. Otherwise subsequent elicit runs won't actually run the extraction with llama, they will just reuse the cached results.

With FSDP. This shards the model across the devices.

elk elicit huggyllama/llama-7b imdb --fsdp_enabled --num_gpus 2 --disable_cache

Without FSDP, but multi-GPU. This duplicates the model on each device.

elk elicit huggyllama/llama-7b imdb --num_gpus 2 --disable_cache

Now compare this to the main branch. Is llama-7b significantly slower?

If the above works without crashing, and you're feeling ambitious, you can merge the latest changes into this branch and fix the conflicts. It may be confusing though.

@thejaminator force-pushed the refactor-datasets-usage branch 2 times, most recently from fbf58dd to 7fabb36 on April 29, 2023 at 20:11
Review comment on this snippet:

```python
result_queue_dict=self._result_queues,
    )
)
return result[0]
```
@thejaminator (Collaborator, Author): here

Review comment on this snippet:

```python
raise ValueError(f"Invalid token_loc: {token_loc}")
converted_hiddens = pytree_map(lambda x: x.cpu().share_memory_(), hiddens)

return SmallerOutput(lm_logits=returned_logits, hidden_states=converted_hiddens)
```
@thejaminator (Collaborator, Author) commented Apr 30, 2023:


This is the function to run on the server. Note that you need to move the shared tensor to the worker's device to run it, and then remember to bring the result back to CPU shared memory so it can be added back to the queue.
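A minimal sketch of that round trip (the surrounding details are assumptions about the real worker function; `pytree_map` and `SmallerOutput` are the names from the snippet above):

```python
import torch

def run_on_worker(model, shared_batch: dict, worker_device: torch.device):
    # 1. Move the CPU shared-memory tensors onto this worker's device.
    batch = {k: v.to(worker_device) for k, v in shared_batch.items()}

    with torch.no_grad():
        outputs = model(**batch, output_hidden_states=True)

    # 2. Move results back to CPU shared memory so they can go on the result queue.
    hiddens = pytree_map(lambda x: x.cpu().share_memory_(), outputs.hidden_states)
    logits = outputs.logits.cpu().share_memory_()
    return SmallerOutput(lm_logits=logits, hidden_states=hiddens)
```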
