
Running on a single GPU #612

Closed
huey2531 opened this issue Apr 24, 2022 · 22 comments
Labels
bug Something isn't working

Comments

@huey2531
Contributor

I tried merging the checkpoints as described for single-GPU use:
python tools/merge20b.py --input_dir ./20B_checkpoints --output_dir ./20B_checkpoints_merged

However, I'm getting this error when generating:
RuntimeError: Error(s) in loading state_dict for EmbeddingPipe:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50432, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

How can I adjust the current model to match size 50432, or is it the other way around?

huey2531 added the bug (Something isn't working) label Apr 24, 2022
@StellaAthena
Member

@zphang

@StellaAthena
Member

@huey2531 Does #613 solve your problem?

@huey2531
Contributor Author

@StellaAthena It does not. I had to uncomment those lines, or it wouldn't even merge the layer checkpoints.

@huey2531
Contributor Author

Before running merge20b.py, the error was:

RuntimeError: Error(s) in loading state_dict for EmbeddingPipe:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([25216, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

@HughPH
Contributor

HughPH commented May 5, 2022

I would suggest redownloading the slim weights into a new directory, to be sure that you're starting from a known point.

@StellaAthena
Member

I’ve been trying to reproduce your issue and failing… I concur with @HughPH that it’s probably worth deleting everything and starting again.

@HughPH
Contributor

HughPH commented May 11, 2022

@huey2531 Did you make any progress with this?

@huey2531
Contributor Author

OK, I will delete everything and start from scratch.

@StellaAthena
Member

@huey2531 I can confirm that another individual got this running last week without hitting that error.

@HughPH
Contributor

HughPH commented May 24, 2022

@huey2531 Did you get it working?

@igor0

igor0 commented May 25, 2022

This looks like a tokenizer mismatch to me:

  • Checkpoint assumes 50432 tokens
  • Model assumes 50304 tokens

Do you have the right tokenizer for the 20B model configured when running generate.py? It should be something like this:

    "tokenizer_type": "HFTokenizer",
    "vocab-file": "/mnt/data/20B_tokenizer.json",

@HughPH
Contributor

HughPH commented Jun 7, 2022

@huey2531 are you still trying to get this going?

@StellaAthena It's been 3 weeks; I'd suggest this could probably be closed if another week passes without activity.

@StellaAthena
Member

Closing due to inactivity.

@huey2531
Contributor Author

huey2531 commented Jun 8, 2022

I'm still working on this. I'm stuck on some dependency issues and need to reinstall the OS...

@huey2531
Contributor Author

huey2531 commented Jun 8, 2022

This looks like a tokenizer mismatch to me:

  • Checkpoint assumes 50432 tokens
  • Model assumes 50304 tokens

Do you have the right tokenizer for the 20B model configured when running generate.py? It should be something like this:

    "tokenizer_type": "HFTokenizer",
    "vocab-file": "/mnt/data/20B_tokenizer.json",

Yes, I have the correct tokenizer.

@zphang
Contributor

zphang commented Jun 8, 2022

Do you have make_vocab_size_divisible_by set to 50432 in the config?
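
(For illustration only: that would be a line in the model config alongside the tokenizer settings quoted earlier, something like the snippet below; match the key spelling used by the rest of your config. Forcing the divisor to the full padded size should make the model pad the embedding to 50432 regardless of the model-parallel degree.)

    "make_vocab_size_divisible_by": 50432,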

@huey2531
Contributor Author

huey2531 commented Jun 8, 2022

Do you have make_vocab_size_divisible_by set to 50432 in the config?

Probably not. I simply changed the path in my original config to 20B_checkpoints_merged.
I will test it after I reinstall the OS.

StellaAthena reopened this Jun 8, 2022
@huey2531
Contributor Author

huey2531 commented Jun 8, 2022

Now I'm stuck at #628 during installation. I did not encounter this issue a few weeks ago.

@StellaAthena
Member

@huey2531 This seems to be something that broke recently inside of Triton. I can't install on a fresh machine, but my previously existing installations (from a couple of weeks ago) work fine.

@huey2531
Contributor Author

huey2531 commented Jun 9, 2022

@StellaAthena What version of Triton do you have? What does pip show triton say?

@StellaAthena
Member

It says

Name: triton
Version: 1.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/ptillet/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: UNKNOWN
Location: /home/mchorse/.local/lib/python3.8/site-packages
Requires: torch
Required-by: deepspeed

@StellaAthena
Member

Closing due to inactivity
