Enable entropy on multihost CPUs. #631

Merged
merged 1 commit into from
Apr 30, 2024

RoshaniN
Collaborator

@RoshaniN RoshaniN commented Apr 30, 2024

  • The standalone checkpointer was not working well on multihost CPUs. The recommendation was to access JAX arrays inside a with jax.spmd_mode('allow_all'): block.
  • Simplified entropy addition.
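
A minimal sketch of the pattern described in the first bullet above, not the code from this PR. It assumes that "entropy" means perturbing the state values so that successive checkpoints differ. The helper names allow_all_spmd and add_entropy and the toy params dictionary are hypothetical, and the cosine transform is only an illustrative choice. Because jax.spmd_mode may not be exposed by every JAX release, the sketch falls back to a no-op context when it is missing.

```python
import contextlib

import jax
import jax.numpy as jnp


def allow_all_spmd():
    """Return jax.spmd_mode('allow_all') if this JAX version exposes it.

    Falls back to a no-op context manager otherwise (assumption: newer
    releases may no longer provide jax.spmd_mode).
    """
    spmd_mode = getattr(jax, "spmd_mode", None)
    if spmd_mode is None:
        return contextlib.nullcontext()
    return spmd_mode("allow_all")


def add_entropy(params):
    """Hypothetical helper: map every leaf to a new value so checkpoints differ."""
    with allow_all_spmd():
        # Eager (non-jitted) operations on the arrays happen inside the context.
        return jax.tree_util.tree_map(lambda x: jnp.cos(1000.0 * x), params)


# Toy stand-in for model state; a real run would pass the training state.
params = {"w": jnp.zeros((2, 2)), "b": jnp.ones((2,))}
perturbed = add_entropy(params)
```

On a single host this runs the same with or without the context. The context is intended for the multihost case, where eager operations touch arrays sharded across processes.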

Thanks for debugging with me, @gobbleturk, and thanks @raymondzouu for reporting this!

Tested using:

python3 xpk.py workload create --command "JAX_PLATFORMS=cpu python3 MaxText/standalone_checkpointer.py MaxText/configs/base.yml \
base_output_directory=gs://maxtext-experiments-multipod dataset_path=gs://max-datasets-rogue steps=120 checkpoint_period=30 enable_checkpointing=True async_checkpointing=False per_device_batch_size=1 ici_data_parallelism=1 ici_fsdp_parallelism=64 hardware=cpu;" \
--cluster test2-n2-standard-32-32 \
--docker-image=gcr.io/tpu-prod-env-multipod/roshanin_runner \
--num-slices=1 \
--device-type=n2-standard-32-64 \
--zone=us-central1-b \
--workload=roshanin-ps-12

@RoshaniN
Collaborator Author

RoshaniN commented Apr 30, 2024

Seeing unrelated test failure https://screenshot.googleplex.com/zf7vN5462q6aU4T

@RoshaniN
Collaborator Author

> Seeing unrelated test failure https://screenshot.googleplex.com/zf7vN5462q6aU4T

#629 should fix the error.

@copybara-service copybara-service bot merged commit 92f3abf into main Apr 30, 2024
8 checks passed
@copybara-service copybara-service bot deleted the roshanin_entropy_2 branch April 30, 2024 22:55