[k8s] Remove SSH jump pod for port-forward mode #3657

Merged: 16 commits merged on Jun 30, 2024

@romilbhardwaj romilbhardwaj commented Jun 11, 2024

Closes #3566. SSH jump pod is not required when using port-forward mode. This PR directly kubectl port-forwards to the head pod.
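
For readers unfamiliar with the approach, the direct port-forward can be sketched roughly as below. This is an illustrative sketch only, not the code in this PR: the helper name `port_forward_argv`, the pod name `test-head`, and the context name `my-context` are hypothetical. It relies only on standard kubectl behavior, where `port-forward pod/NAME :22` picks a random free local port and forwards it to port 22 in the pod.

```python
import shlex
from typing import List, Optional


def port_forward_argv(pod_name: str,
                      namespace: str = "default",
                      context: Optional[str] = None,
                      remote_port: int = 22) -> List[str]:
    """Build a kubectl argv that forwards a random local port to the pod.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    argv = ["kubectl"]
    if context is not None:
        # Pinning the context means the command does not depend on
        # whichever kubecontext is currently active.
        argv += ["--context", context]
    argv += [
        "--namespace", namespace,
        "port-forward", f"pod/{pod_name}",
        f":{remote_port}",  # empty local port: kubectl picks a free one
    ]
    return argv


if __name__ == "__main__":
    print(shlex.join(port_forward_argv("test-head", context="my-context")))
    # kubectl --context my-context --namespace default port-forward pod/test-head :22
```

Because the command targets the head pod directly, no intermediate jump pod has to be scheduled or kept alive for SSH to work.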

Also removes the sleep in our ProxyCommand. The sleep was previously required for thread-safe concurrent SSH connections when SkyPilot was using SSHCommandRunner for Kubernetes (#2628), but since #3157 SkyPilot no longer uses SSH there. Together with removing the jump pod, this cuts mean SSH connection latency from about 3.47 s to 1.80 s in the benchmarks below. Up to 5 concurrent SSH connection requests work fine without the sleep, which should be enough for most SSH usage outside of SkyPilot.

Also lays the groundwork for easy switching between kubecontexts/kubeconfigs while retaining SSH functionality (requested by user).

Benchmarks

======= This branch - After removing SSH Jump pod and sleep =======

1: multitime -n 5 ssh test ls
            Mean        Std.Dev.    Min         Median      Max
real        1.801       0.113       1.732       1.751       2.027
user        0.021       0.002       0.019       0.021       0.024
sys         0.008       0.001       0.007       0.008       0.010

======= Master - with SSH Jump pod =======

1: multitime -n 5 ssh test ls
            Mean        Std.Dev.    Min         Median      Max
real        3.466       0.123       3.278       3.500       3.605
user        0.024       0.004       0.019       0.022       0.029
sys         0.008       0.002       0.006       0.007       0.010
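
As a quick check on the two tables above, the mean wall-clock ("real") times can be compared directly. The variable names are only for illustration; the numbers are copied from the multitime output.

```python
# Mean "real" times in seconds, copied from the multitime tables above.
branch_mean_s = 1.801  # this branch: no SSH jump pod, no sleep
master_mean_s = 3.466  # master: with SSH jump pod

saved_s = master_mean_s - branch_mean_s
speedup = master_mean_s / branch_mean_s

print(f"saved {saved_s:.3f} s per connection, {speedup:.2f}x faster")
# saved 1.665 s per connection, 1.92x faster
```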

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests: sky launch -c test --num-nodes 2 --cloud kubernetes, followed by ssh test ls and ssh test-worker1 ls
  • Backward compatibility tests
  • Kubernetes smoke tests

@Michaelvll Michaelvll left a comment

This is awesome @romilbhardwaj! This should improve the robustness of our Kubernetes support. The code looks mostly good to me.
Can we test the backward compatibility for existing clusters launched in master?

sky/provision/kubernetes/network_utils.py (outdated, resolved)

romilbhardwaj commented Jun 19, 2024

Thanks @Michaelvll! Ran manual backward compatibility tests by launching from master -> switching to this branch -> try ssh on master cluster -> launch new cluster -> verify ssh, sky exec for new and old cluster.

Running smoke tests.

romilbhardwaj commented

Ran into an issue with custom images that use a default username other than sky. For such images, the SSH proxy command fails because authentication.py hardcodes the sky user in the user@ip target used for jumping. This is slightly tricky since the proxy command is populated before the pod is even started. Looking into a solution for this...

romilbhardwaj commented Jun 21, 2024

Fixed the custom image support by dynamically updating ProxyCommand once the ssh_user is determined and updated in the cluster handle.
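
The idea can be illustrated with a small two-phase sketch: render the SSH config entry with the default user first, then render it again once the real user is known. This is not the PR's implementation; `ClusterHandle` here is a hypothetical stand-in for the real cluster handle, and `render_ssh_entry` is an invented helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterHandle:
    """Hypothetical stand-in for the cluster handle; not SkyPilot's class."""
    cluster_name: str
    ssh_user: Optional[str] = None  # unknown until the pod is running


def render_ssh_entry(handle: ClusterHandle, default_user: str = "sky") -> str:
    """Render an SSH config stanza, falling back to the default user."""
    user = handle.ssh_user or default_user
    return f"Host {handle.cluster_name}\n  User {user}\n"


handle = ClusterHandle("test")
before = render_ssh_entry(handle)  # written before the pod has started

handle.ssh_user = "ubuntu"         # discovered from the custom image
after = render_ssh_entry(handle)   # re-rendered with the real user
```

Re-rendering from the handle, rather than patching the earlier text, keeps the handle as the single source of truth for the SSH user.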

Running smoke tests.

  • Kubernetes smoke tests

@Michaelvll Michaelvll left a comment

Thanks for updating the PR @romilbhardwaj! LGTM with a minor comment.

sky/authentication.py (outdated, resolved)
Comment on lines +3068 to +3071
auth_config = backend_utils.ssh_credential_from_yaml(
    handle.cluster_yaml,
    ssh_user=handle.ssh_user,
    docker_user=handle.docker_user)

Collaborator

Could we test this for other clouds with image_id specified with docker:xxx, just to make sure changing this will not affect ssh for those?

Or, if we have passed in the ssh_user and docker_user here, should we remove the argument of handle.docker_user and handle.ssh_user in the add_cluster function below?

Collaborator Author

Good point - tested with pytest tests/test_smoke.py::test_job_queue_with_docker --gcp.

@romilbhardwaj romilbhardwaj added this to the v0.6.1 milestone Jun 25, 2024
romilbhardwaj commented

Thanks! Tested:

  • pytest tests/test_smoke.py::test_job_queue_with_docker --gcp
  • pytest tests/test_smoke.py --kubernetes

@romilbhardwaj romilbhardwaj merged commit 7633d2e into master Jun 30, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_sshjump_remove_v2 branch June 30, 2024 17:50
Development

Successfully merging this pull request may close these issues.

[k8s] Remove SSH Jump Pod for networking: port-forward mode
2 participants