
Some model checkpoints are expired or not available #14

Open
vishaal27 opened this issue Jan 23, 2024 · 4 comments

Comments

@vishaal27
Contributor

Hey,
I was running model evaluations on my own custom data-split for all models in the registry using:

python eval.py --gpus 0 --models <model> --eval-settings custom_dataset

where <model> comes from all the models in the registry (python db.py --list-models-registry).
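
Concretely, the sweep is driven by a small wrapper along these lines (the wrapper is my own harness, not part of the testbed, and the eval-setting name is my custom one):

# Rough sketch of my sweep harness: list every model in the registry, then run
# eval.py once per model and record the ones whose evaluation fails.
import subprocess

listing = subprocess.run(
    ["python", "db.py", "--list-models-registry"],
    capture_output=True, text=True, check=True,
)
models = listing.stdout.split()  # assumes one model name per whitespace-separated token

failed = []
for model_name in models:
    result = subprocess.run(
        ["python", "eval.py", "--gpus", "0",
         "--models", model_name, "--eval-settings", "custom_dataset"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        failed.append(model_name)

print("models whose evaluation failed:", failed)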
However, for many of the models, I see a pickling error due to the checkpoint not being loaded correctly. See stack-trace below:

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "imagenet-testbed/src/inference.py", line 64, in main_worker
    model = py_model.generate_classifier(py_eval_setting)
  File "imagenet-testbed/src/models/model_base.py", line 76, in generate_classifier
    self.classifier = self.classifier_loader()
  File "imagenet-testbed/src/models/low_accuracy.py", line 100, in load_resnet
    load_model_state_dict(net, model_name)
  File "imagenet-testbed/src/mldb/utils.py", line 98, in load_model_state_dict
    state_dict = torch.load(bio, map_location=f'cpu')
  File "/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

I see that this error happens for all of the low-resource models like resnet18_100k_x_epochs, resnet18_50k_x_epochs, etc. To make sure this is not an artefact of my own custom data-split, I also tested on the imagenet-val split and hit the same error.
Are the low-resource models not available as checkpoints from the server?
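
For what it's worth, the EOFError above is what pickle raises when torch.load is handed an empty byte stream, so the failure looks like the checkpoint download returning no bytes rather than anything in my data-split. A minimal standalone sketch that reproduces the same tail of the traceback (exact behaviour may differ on newer torch versions):

# torch.load on an empty buffer fails the same way as an empty/expired checkpoint
# download: pickle raises "EOFError: Ran out of input".
import io
import torch

empty_checkpoint = io.BytesIO(b"")  # stands in for a download that returned no bytes
try:
    torch.load(empty_checkpoint, map_location="cpu")
except EOFError as err:
    print("empty checkpoint bytes:", err)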

Separately, I get another set of errors because some checkpoints are still stored on the vasa endpoint, see:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://vasa.millennium.berkeley.edu:9000/robustness-eval/checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes"

Are some of the checkpoints not migrated fully yet?
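
A quick way to confirm that the endpoint itself is unreachable (rather than just individual keys) is to probe the URL from the timeout above directly:

# Connectivity probe against the vasa URL from the error message (3-second timeout).
import urllib.error
import urllib.request

url = ("https://vasa.millennium.berkeley.edu:9000/robustness-eval/"
       "checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes")
try:
    urllib.request.urlopen(url, timeout=3)
    print("endpoint reachable")
except (urllib.error.URLError, TimeoutError) as err:
    print("endpoint unreachable:", err)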

Sorry for the long, verbose issue, but I hope we can get this resolved :)

@rtaori
Collaborator

rtaori commented Jan 23, 2024

Hi @vishaal27,
Unfortunately, some checkpoints are not online: they are on vasa and have not been migrated to the gcloud bucket yet. I'm not sure if or when they'll come online, as the path to migrating them is not straightforward now that I've lost my Berkeley access :)

@vishaal27
Contributor Author

Hey @rtaori, thanks for your blazingly fast response! Is there anyone else with access who would be able to check this?

@rtaori
Collaborator

rtaori commented Jan 23, 2024

Potentially; let me check. But if you don't hear back within the next week, then there's probably no way to get these checkpoints :(

@vishaal27
Contributor Author

Sure, thanks for checking; really appreciate it :)
