
Some model checkpoints are expired or not available #14

Open
vishaal27 opened this issue Jan 23, 2024 · 4 comments

Comments

@vishaal27
Contributor

Hey,
I was running model evaluations on my own custom data-split for all models in the registry using:

python eval.py --gpus 0 --models <model> --eval-settings custom_dataset

where <model> comes from all the models in the registry (python db.py --list-models-registry).
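
Concretely, the sweep is driven by a small wrapper along these lines (the wrapper is my own harness, not part of the testbed, and the eval-setting name is my custom one):

# Rough sketch of my sweep harness: list every model in the registry, then run
# eval.py once per model and record the ones whose evaluation fails.
import subprocess

listing = subprocess.run(
    ["python", "db.py", "--list-models-registry"],
    capture_output=True, text=True, check=True,
)
models = listing.stdout.split()  # assumes one model name per whitespace-separated token

failed = []
for model_name in models:
    result = subprocess.run(
        ["python", "eval.py", "--gpus", "0",
         "--models", model_name, "--eval-settings", "custom_dataset"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        failed.append(model_name)

print("models whose evaluation failed:", failed)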
However, for many of the models, I see a pickling error due to the checkpoint not being loaded correctly. See stack-trace below:

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "imagenet-testbed/src/inference.py", line 64, in main_worker
    model = py_model.generate_classifier(py_eval_setting)
  File "imagenet-testbed/src/models/model_base.py", line 76, in generate_classifier
    self.classifier = self.classifier_loader()
  File "imagenet-testbed/src/models/low_accuracy.py", line 100, in load_resnet
    load_model_state_dict(net, model_name)
  File "imagenet-testbed/src/mldb/utils.py", line 98, in load_model_state_dict
    state_dict = torch.load(bio, map_location=f'cpu')
  File "/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

I see that this error happens for all of the low-resource models like resnet18_100k_x_epochs, resnet18_50k_x_epochs, etc. To make sure this is not an artefact of my own custom data-split, I also tested on the imagenet-val split and hit the same error.
Are the low-resource models not available as checkpoints from the server?
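
For what it's worth, the EOFError above is what pickle raises when torch.load is handed an empty byte stream, so the failure looks like the checkpoint download returning no bytes rather than anything in my data-split. A minimal standalone sketch that reproduces the same tail of the traceback (exact behaviour may differ on newer torch versions):

# torch.load on an empty buffer fails the same way as an empty/expired checkpoint
# download: pickle raises "EOFError: Ran out of input".
import io
import torch

empty_checkpoint = io.BytesIO(b"")  # stands in for a download that returned no bytes
try:
    torch.load(empty_checkpoint, map_location="cpu")
except EOFError as err:
    print("empty checkpoint bytes:", err)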

Separately, I get another set of errors because some checkpoints are still stored on the vasa endpoint, see:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://vasa.millennium.berkeley.edu:9000/robustness-eval/checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes"

Are some of the checkpoints not migrated fully yet?
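
A quick way to confirm that the endpoint itself is unreachable (rather than just individual keys) is to probe the URL from the timeout above directly:

# Connectivity probe against the vasa URL from the error message (3-second timeout).
import urllib.error
import urllib.request

url = ("https://vasa.millennium.berkeley.edu:9000/robustness-eval/"
       "checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes")
try:
    urllib.request.urlopen(url, timeout=3)
    print("endpoint reachable")
except (urllib.error.URLError, TimeoutError) as err:
    print("endpoint unreachable:", err)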

Sorry for the long, verbose issue, but I hope we can get this resolved :)

@rtaori
Collaborator

rtaori commented Jan 23, 2024

Hi @vishaal27,
Unfortunately, some checkpoints are not online: they are on vasa and have not been migrated to the gcloud bucket yet. I'm not sure if or when they'll come online, as the path to migrating them is not straightforward now that I've lost my Berkeley access :)

@vishaal27
Contributor Author

Hey @rtaori, thanks for your blazingly fast response! Is there anyone else with access who would be able to check this?

@rtaori
Collaborator

rtaori commented Jan 23, 2024

Potentially; let me check. But if you don't hear back within the next week, then there's probably no way to get these checkpoints :(

@vishaal27
Contributor Author

Sure, thanks for checking; really appreciate it :)
