
Fix spot tpu bug #1717

Merged: 21 commits from fix-spot-tpu into master on Mar 16, 2023
Conversation

@infwinston (Member) commented Feb 23, 2023

Fixes

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@infwinston changed the title from "Fix spot tpu" to "Fix spot tpu bug" on Feb 23, 2023
@concretevitamin (Member) left a comment

Thanks @infwinston, some comments. Also:

For removing colorama.init(): should we call this function somewhere, per https://github.com/tartley/colorama#usage? After that, perhaps ask @romilbhardwaj to check whether colors still look OK on Windows?
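
For context, a minimal sketch of the usage pattern described in that README (a standalone example, not code from this PR):

import colorama

# init() wraps stdout/stderr so ANSI escape sequences are translated to
# Win32 calls on Windows; elsewhere it is effectively a no-op unless
# strip/convert options are passed.
colorama.init()

print('\x1b[31m' + 'this text should render in red' + '\x1b[0m')

# deinit() restores the original stdout/stderr wrappers.
colorama.deinit()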

Comment on lines 122 to 123
logger.info('wait for 30 seconds and retry...')
time.sleep(30)
Member

remnant?

Member Author

Removed.

Comment on lines 119 to 121
import traceback # pylint: disable=import-outside-toplevel
logger.error(f' Detailed exception: {e}')
logger.info(f' Traceback: {traceback.format_exc()}')
Member

IIRC, the existing usage of format_exc in this module didn't print out the stacktrace. Could we double check?

Member Author

I think we do need format_exc to print out the full trace. I tried a simple example below:

def func():
    raise ValueError("test")
    
try:
    func()
except Exception as e:
    print(f'{type(e)}: {str(e)}')
    print('===full trace===')
    import traceback
    print(traceback.format_exc())

and got the following output:

<class 'ValueError'>: test
===full trace===
Traceback (most recent call last):
  File "sky/test_err.py", line 5, in <module>
    func()
  File "sky/test_err.py", line 2, in func
    raise ValueError("test")
ValueError: test

Member

Could we try a real spot launch example? IIRC, that didn't produce a usable trace.

@infwinston (Member Author) commented Feb 24, 2023

I commented out this line, and the entire trace then shows:

sys.tracebacklimit = 0

import sky

try:
    t = sky.Task(run='ls')
    t.set_resources(sky.Resources(sky.GCP()))
    sky.launch(task=t, cluster_name='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
except Exception as e:
    print(f'{type(e)}: {str(e)}')
    print('===full trace===')
    import traceback
    print(traceback.format_exc())

Output:

...
<class 'sky.exceptions.InvalidClusterNameError'>: Cluster name 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' has 60 chars; maximum length is 35 chars on GCP.
===full trace===
Traceback (most recent call last):
  File "test_err.py", line 11, in <module>
    sky.launch(task=t, cluster_name='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
  File "/home/gcpuser/sky_workdir/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/sky_workdir/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/sky_workdir/skypilot/sky/execution.py", line 411, in launch
    _execute(
  File "/home/gcpuser/sky_workdir/skypilot/sky/execution.py", line 260, in _execute
    handle = backend.provision(task,
  File "/home/gcpuser/sky_workdir/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/sky_workdir/skypilot/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/sky_workdir/skypilot/sky/backends/backend.py", line 49, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/home/gcpuser/sky_workdir/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2129, in _provision
    to_provision_config = self._check_existing_cluster(
  File "/home/gcpuser/sky_workdir/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/sky_workdir/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3231, in _check_existing_cluster
    task_cloud.check_cluster_name_is_valid(cluster_name)
  File "/home/gcpuser/sky_workdir/skypilot/sky/clouds/cloud.py", line 408, in check_cluster_name_is_valid
    raise exceptions.InvalidClusterNameError(
sky.exceptions.InvalidClusterNameError: Cluster name 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' has 60 chars; maximum length is 35 chars on GCP.

Was it because of sys.tracebacklimit = 0 that it didn't produce a trace?
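
For reference, a minimal standalone sketch (not code from this PR) showing how sys.tracebacklimit = 0 suppresses the stack frames that traceback.format_exc() would otherwise include:

import sys
import traceback

sys.tracebacklimit = 0  # suppress stack frames in formatted tracebacks

def func():
    raise ValueError('test')

try:
    func()
except Exception:
    # With tracebacklimit set to 0, format_exc() omits the stack frames and
    # keeps only the exception type and message.
    print(traceback.format_exc())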

Comment on lines 1411 to 1412
ip = (endpoint.get('ipAddress', None) if get_internal_ips else
      endpoint['accessConfig'].get('externalIp', None))
Member

Could we add a comment on why we need this safeguard?

Member Author

Added!

returncode = runner.run('ray status', stream_logs=False)
if returncode:
use_spot = handle.launched_resources.use_spot
# if cluster is not spot, we can determine its health by "ray status".
Member

This change makes sense to me. Cc @Michaelvll to take another look.

Can we expand the comment here? E.g., mentioning the following:

  • For non-spot clusters, this is an optimization: we call external_ips() and/or SSH into the cluster to run 'ray status' in various cases to determine cluster health, because these may be faster than querying the true node statuses from the cloud provider.
  • For spot clusters, the above can be unsafe. Therefore we directly query the cloud provider.

Member Author

Good point. I've updated the comments.

@Michaelvll self-requested a review on March 3, 2023 at 02:27
# the true node statuses from the cloud provider.
# For spot clusters, the above can be unsafe.
# Therefore we directly query the cloud provider.
if not use_spot:
Collaborator

We should check the ray status on the spot clusters as well. It is possible that the ray cluster is not running but the cloud provider shows UP, e.g., when the user calls ray stop manually. Our job queue depends on the health of the ray cluster; if the ray cluster is not running, we should consider the cluster INIT.

Collaborator

That is to say, we previously assumed that a healthy ray cluster implies the cloud provider shows UP, but based on your point both should be checked for spot clusters.

@infwinston (Member Author) commented Mar 13, 2023

That's a good point. Looking at the behavior of the current codebase, it seems to me that if the ray cluster is not running, we also determine the cluster status by just querying the cloud provider; and if it is running, we set the state to UP.

returncode = runner.run('ray status', stream_logs=False)
if returncode:
    raise exceptions.FetchIPError(
        reason=exceptions.FetchIPError.Reason.HEAD)

Do you mean we should also set INIT for on-demand clusters if ray is not running?

@Michaelvll (Collaborator) commented Mar 13, 2023

No, we will never set the state to UP after failing to get the IPs. The code will go through the following code path and set the cluster to INIT. We should probably maintain that behavior.

if is_abnormal:
    backend = get_backend_from_handle(handle)
    if isinstance(backend,
                  backends.CloudVmRayBackend) and record['autostop'] >= 0:
        if not backend.is_definitely_autostopping(handle,
                                                  stream_logs=False):
            # Reset the autostopping as the cluster is abnormal, and may
            # not correctly autostop. Resetting the autostop will let
            # the user know that the autostop may not happen to avoid
            # leakages from the assumption that the cluster will autostop.
            try:
                backend.set_autostop(handle, -1, stream_logs=False)
            except (Exception, SystemExit) as e:  # pylint: disable=broad-except
                logger.debug(f'Failed to reset autostop. Due to '
                             f'{common_utils.format_exception(e)}')
            global_user_state.set_cluster_autostop_value(
                handle.cluster_name, -1, to_down=False)
        else:
            ux_utils.console_newline()
            operation_str = 'autodowning' if record[
                'to_down'] else 'autostopping'
            logger.info(
                f'Cluster {cluster_name!r} is {operation_str}. Setting to '
                'INIT status; try refresh again in a while.')
    # If the user starts part of a STOPPED cluster, we still need a status
    # to represent the abnormal status. For spot cluster, it can also
    # represent that the cluster is partially preempted.
    # TODO(zhwu): the definition of INIT should be audited/changed.
    # Adding a new status UNHEALTHY for abnormal status can be a choice.
    global_user_state.add_or_update_cluster(cluster_name,
                                            handle,
                                            requested_resources=None,
                                            ready=False,
                                            is_launch=False)

Member Author

Ah sorry, I misread the code. Then this makes sense. Let me add this case to the change.

Member Author

I modified the code to also check ray cluster health for spot clusters. Now the state is set to UP only when ray_cluster_up is true and all the nodes are running. Does this look okay?
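
Roughly, the intended check is something like this standalone sketch (illustrative names, not the actual PR code):

from enum import Enum
from typing import List

class ClusterStatus(Enum):
    INIT = 'INIT'
    UP = 'UP'

def derive_status(ray_cluster_up: bool, node_states: List[str]) -> ClusterStatus:
    # UP only if the ray cluster is healthy AND the cloud provider reports
    # every node as running; anything else falls back to INIT here.
    if ray_cluster_up and all(s == 'running' for s in node_states):
        return ClusterStatus.UP
    return ClusterStatus.INIT

print(derive_status(True, ['running', 'running']))   # ClusterStatus.UP
print(derive_status(True, ['running', 'stopped']))   # ClusterStatus.INIT
print(derive_status(False, ['running']))             # ClusterStatus.INIT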

@Michaelvll (Collaborator) left a comment

Thanks for adding the fix @infwinston! The main concern is that we should check the health of the ray cluster for spot clusters as well. Please see the comment below for details.

sky/backends/cloud_vm_ray_backend.py (outdated review comment, resolved)
sky/spot/recovery_strategy.py (outdated review comment, resolved)
sky/spot/recovery_strategy.py (outdated review comment, resolved)
@Michaelvll (Collaborator) left a comment

Thank you for the fix @infwinston! Left several comments.

Comment on lines 1457 to 1458
ip = (endpoint.get('ipAddress', None) if get_internal_ips else
      endpoint['accessConfig'].get('externalIp', None))
Collaborator

nit: having the expanded if...else... may be easier to read?
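
For illustration, the expanded form could look roughly like this (a runnable sketch; the endpoint dict shape is assumed for the example):

# Hypothetical endpoint data shaped like a GCP TPU endpoint entry, for
# illustration only.
endpoint = {'ipAddress': '10.0.0.2',
            'accessConfig': {'externalIp': '203.0.113.7'}}
get_internal_ips = False

# Expanded form of the quoted conditional expression.
if get_internal_ips:
    ip = endpoint.get('ipAddress', None)
else:
    ip = endpoint['accessConfig'].get('externalIp', None)
print(ip)  # 203.0.113.7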

Member Author

Fixed!

sky/backends/backend_utils.py (outdated review comment, resolved)
sky/backends/backend_utils.py (outdated review comment, resolved)
Comment on lines 1791 to 1793
# If we get node ips correctly, the cluster is UP. It is safe to
# set the status to UP, as the `handle.external_ips` function uses ray
# to fetch IPs and starting ray is the final step of sky launch.
Collaborator

Actually, it may not be true for multi-node either, as ray get-node-ips only uses the tags on the cloud without actually checking the ray cluster status. We may want to make L1778 check the status of all the nodes instead.

@infwinston (Member Author) commented Mar 15, 2023

Ah, I see. I removed the if, and now it also runs ray status for the multi-node case. Just want to make sure that's what we want, right?

Collaborator

I was thinking about something similar to the following code, which checks that the number of ready workers matches the expected number via ray status.

Can we refactor this part of the code out and reuse it?

rc, output, stderr = runner.run('ray status',
                                log_path=log_path,
                                stream_logs=False,
                                require_outputs=True,
                                separate_stderr=True)
subprocess_utils.handle_returncode(
    rc, 'ray status', 'Failed to run ray status on head node.',
    stderr)
logger.debug(output)
# Workers that are ready
ready_workers = 0
# On-prem/local case is handled differently.
# `ray status` produces different output for local case, and
# we poll for number of nodes launched instead of counting for
# head and number of worker nodes separately (it is impossible
# to distinguish between head and worker node for local case).
if is_local_cloud:
    result = _LAUNCHED_LOCAL_WORKER_PATTERN.findall(output)
    # In the local case, ready_workers mean the total number
    # of nodes launched, including head.
    ready_workers = len(result)
else:
    result = _LAUNCHED_WORKER_PATTERN.findall(output)
    if len(result) == 0:
        ready_workers = 0
    else:
        assert len(result) == 1, result
        ready_workers = int(result[0])
result = _LAUNCHED_HEAD_PATTERN.findall(output)
ready_head = 0
if result:
    assert len(result) == 1, result
    ready_head = int(result[0])
    assert ready_head <= 1, ready_head
worker_status.update('[bold cyan]'
                     f'{ready_workers} out of {num_nodes - 1} '
                     'workers ready')
# In the local case, ready_head=0 and ready_workers=num_nodes. This
# is because there is no matching regex for _LAUNCHED_HEAD_PATTERN.
if ready_head + ready_workers == num_nodes:
    # All nodes are up.
    break

Member Author

Ahh got it. Parsing the output from ray status is needed. I just updated the code. PTAL.
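
To illustrate the parsing approach, here is a standalone sketch with hypothetical patterns and sample output (the real _LAUNCHED_*_PATTERN regexes live in sky/backends/backend_utils.py and may differ):

import re

# Hypothetical patterns mirroring the _LAUNCHED_*_PATTERN regexes referenced above.
_HEAD_PATTERN = re.compile(r'(\d+) ray[._]head[._]default')
_WORKER_PATTERN = re.compile(r'(\d+) ray[._]worker[._]default')

# Sample node-status section of `ray status` output (format assumed here).
sample_output = """
Healthy:
 1 ray.head.default
 2 ray.worker.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
"""

def count_ready(pattern, output):
    # Return the node count captured by the pattern, or 0 if absent.
    result = pattern.findall(output)
    return int(result[0]) if result else 0

ready_head = count_ready(_HEAD_PATTERN, sample_output)
ready_workers = count_ready(_WORKER_PATTERN, sample_output)
print(ready_head, ready_workers)  # prints: 1 2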

Comment on lines 1793 to 1796
# For spot clusters, the above can be unsafe because the Ray cluster
# may remain healthy for a while before the cloud completely
# terminates the VMs.
# Additionally, we query the VM state from the cloud provider.
@Michaelvll (Collaborator) commented Mar 16, 2023

Is the fix of checking the ray status for multiple nodes already enough? The previous problem might be that the worker VM was preempted but its IP could still be obtained with ray get-node-ips.

Did the problem happen for a user with a single node or with multiple nodes?

Member Author

The problem happened to a user with a single node (tpu-v2-8), I believe.

Comment on lines 1780 to 1804
# Check if ray cluster status is healthy.
ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml)
runner = command_runner.SSHCommandRunner(external_ips[0],
                                         **ssh_credentials)
rc, output, _ = runner.run('ray status',
                           stream_logs=False,
                           require_outputs=True,
                           separate_stderr=True)
if rc:
    raise exceptions.FetchIPError(
        reason=exceptions.FetchIPError.Reason.HEAD)

def get_ready_nodes(pattern, output):
    result = pattern.findall(output)
    if len(result) == 0:
        return 0
    assert len(result) == 1, result
    return int(result[0])

ready_workers = get_ready_nodes(_LAUNCHED_WORKER_PATTERN, output)
ready_head = get_ready_nodes(_LAUNCHED_HEAD_PATTERN, output)
assert ready_head <= 1, f'#head node should be <=1 (Got {ready_head}).'

if ready_head + ready_workers == handle.launched_nodes:
    ray_cluster_up = True
@Michaelvll (Collaborator) commented Mar 16, 2023

Instead of directly copying the code here, can we refactor it out as a function so it can be reused in both places? For example:

def count_healthy_nodes_with_ray(runner, is_local: bool) -> Tuple[int, int]:
    rc, output, _ = runner.run('ray status',
                               stream_logs=False,
                               require_outputs=True,
                               separate_stderr=True)
    if rc:
        raise exceptions.FetchIPError(
            reason=exceptions.FetchIPError.Reason.HEAD)

    def get_ready_nodes(pattern, output):
        result = pattern.findall(output)
        if len(result) == 0:
            return 0
        assert len(result) == 1, result
        return int(result[0])

    ready_workers = get_ready_nodes(_LAUNCHED_WORKER_PATTERN, output)
    ready_head = get_ready_nodes(_LAUNCHED_HEAD_PATTERN, output)
    return ready_head, ready_workers

Member Author

I actually refactored it a bit since they differ in some places, but yeah, I can try taking it out as a function.

@infwinston (Member Author) commented Mar 16, 2023

I just refactored it a bit. I tried to incorporate runner into the function, but it looks like the output from ray status is also used here:

if '(no pending nodes)' in output and '(no failures)' in output:

so I ended up taking the output as the argument instead of the runner. PTAL, thanks!

@Michaelvll (Collaborator) left a comment

Thanks for the quick fix and refactoring @infwinston! The changes look good to me now. Thanks for the great effort.

Comment on lines 1009 to 1033
def get_ready_nodes(pattern, output, local=False):
    result = pattern.findall(output)
    # On-prem/local case is handled differently.
    # `ray status` produces different output for local case, and
    # we poll for number of nodes launched instead of counting for
    # head and number of worker nodes separately (it is impossible
    # to distinguish between head and worker node for local case).
    if local:
        # In the local case, ready_workers mean the total number
        # of nodes launched, including head.
        return len(result)
    if len(result) == 0:
        return 0
    assert len(result) == 1, result
    return int(result[0])

if is_local_cloud:
    ready_workers = get_ready_nodes(_LAUNCHED_LOCAL_WORKER_PATTERN,
                                    output,
                                    local=True)
else:
    ready_workers = get_ready_nodes(_LAUNCHED_WORKER_PATTERN,
                                    output,
                                    local=False)
ready_head = get_ready_nodes(_LAUNCHED_HEAD_PATTERN, output)
Collaborator

nit:

Suggested change (have get_ready_nodes check is_local_cloud directly instead of taking a local parameter, and set ready_head to 0 in the local case):

def get_ready_nodes(pattern, output):
    result = pattern.findall(output)
    # On-prem/local case is handled differently.
    # `ray status` produces different output for local case, and
    # we poll for number of nodes launched instead of counting for
    # head and number of worker nodes separately (it is impossible
    # to distinguish between head and worker node for local case).
    if is_local_cloud:
        # In the local case, ready_workers mean the total number
        # of nodes launched, including head.
        return len(result)
    if len(result) == 0:
        return 0
    assert len(result) == 1, result
    return int(result[0])

if is_local_cloud:
    ready_head = 0
    ready_workers = get_ready_nodes(_LAUNCHED_LOCAL_WORKER_PATTERN, output)
else:
    ready_head = get_ready_nodes(_LAUNCHED_HEAD_PATTERN, output)
    ready_workers = get_ready_nodes(_LAUNCHED_WORKER_PATTERN, output)

Member Author

Added!

@@ -1001,6 +1001,40 @@ def get_timestamp_from_run_timestamp(run_timestamp: str) -> float:
run_timestamp.partition('-')[2], '%Y-%m-%d-%H-%M-%S-%f').timestamp()


def count_healthy_nodes_from_ray(output: str,
Collaborator

nit: make this function private?

Member Author

Fixed!

sky/spot/recovery_strategy.py (outdated review comment, resolved)
@infwinston (Member Author) commented Mar 16, 2023

I just re-ran the smoke tests and they passed. Merging now. Thanks a lot for the reviews!

@infwinston merged commit 322ffad into master on Mar 16, 2023
@infwinston deleted the fix-spot-tpu branch on March 16, 2023 at 07:31