
Failure with setup-ssh on Amazon Linux 2023 runners #129152

Closed
huydhn opened this issue Jun 20, 2024 · 1 comment
Labels
ci: sev critical failure affecting PyTorch CI

Comments

huydhn (Contributor)

huydhn commented Jun 20, 2024

Earlier today, we rolled out Amazon Linux 2023 runners, which caused setup-ssh to start failing across the board. The change has been reverted (https://github.com/pytorch-labs/pytorch-gha-infra/pull/417) and Amazon Linux 2 runners are being rolled out again, so please retry your failed CI jobs.

Root cause

The action https://github.com/pytorch/test-infra/blob/main/setup-ssh/src/main.ts#L89-L92 gets the runner's current public IP by calling getEC2metadata; if that fails, it falls back to getIPs, which queries api64.ipify.org for the same information. On Amazon Linux 2023, IMDSv2 is the only way to query EC2 metadata (https://aws.amazon.com/blogs/security/get-the-full-benefits-of-imdsv2-and-disable-imdsv1-across-your-aws-infrastructure/), so getEC2metadata always fails and the api64.ipify.org fallback goes from rarely used to always used, which hits its rate limit. Testing on canary didn't reveal anything because canary didn't generate enough traffic to api64.ipify.org to be throttled.
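The failure mode can be sketched as follows (a minimal Python illustration of the fallback logic, not the actual TypeScript in main.ts; the fetcher names are hypothetical stand-ins for getEC2metadata and getIPs):

```python
def get_public_ip(imds_fetch, external_fetch):
    """Return the runner's public IP, preferring the EC2 metadata service.

    imds_fetch and external_fetch are hypothetical stand-ins for the
    action's getEC2metadata and getIPs calls.
    """
    try:
        return imds_fetch()
    except Exception:
        # On Amazon Linux 2023 only IMDSv2 is enabled, so a plain IMDSv1
        # request always fails here -- every runner falls through to the
        # external service, which then rate-limits the fleet.
        return external_fetch()
```

With IMDSv1 available the fallback is almost never exercised, which is why the canary rollout (low traffic) looked healthy while the full fleet got throttled.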

@huydhn huydhn added the ci: sev critical failure affecting PyTorch CI label Jun 20, 2024
huydhn (Contributor, Author)

huydhn commented Jun 21, 2024

Another anecdote from the Amazon Linux 2023 upgrade: I suspect that some distributed NCCL tests are failing there, e.g. https://github.com/pytorch/pytorch/actions/runs/9600703357/job/26478744763#step:20:3099. A quick look at the code https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L4023-L4028 suggests that a timed-out process now returns None instead of SIGABRT (6).
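For context, the exit-code semantics at play can be reproduced with Python's multiprocessing (a standalone sketch under assumed POSIX behavior, not the PyTorch test code): a child that is still alive when join() times out reports an exitcode of None, while one killed by SIGABRT reports the negative signal number, -6.

```python
import multiprocessing as mp
import os
import signal
import time


def _sleep_forever():
    time.sleep(60)


def exitcode_after(timeout, kill=False):
    """Start a child, join it, and report its exitcode.

    With kill=True the child is sent SIGABRT and we wait for it to die;
    otherwise we join with the given timeout and read whatever exitcode
    is available at that point.
    """
    p = mp.Process(target=_sleep_forever)
    p.start()
    if kill:
        os.kill(p.pid, signal.SIGABRT)
        p.join()
    else:
        p.join(timeout)
    code = p.exitcode  # None if the child is still running
    if p.is_alive():
        p.terminate()
        p.join()
    return code


if __name__ == "__main__":
    print(exitcode_after(0.2))              # None: join timed out, child alive
    print(exitcode_after(None, kill=True))  # -6: child killed by SIGABRT
```

So a test that asserts `exitcode == -signal.SIGABRT` would fail with None if, on the new runners, the process is still alive (or not yet reaped) when the exitcode is read.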

cc @jeanschmidt in case you see the failure on canary.

@huydhn huydhn closed this as completed Jun 21, 2024