
Actively monitor nodes status for DaemonSet deployments #881

Merged
merged 2 commits from daemonset-deploy-fix into master on Mar 31, 2022

Conversation

@wayt (Contributor) commented Mar 30, 2022

This PR completes the original implementation in #580.

In the original implementation, the list of nodes is fetched once when the deployment starts and is then fixed. If one of the nodes goes away or becomes unschedulable while DaemonSets are rolling out, the deploy keeps waiting indefinitely.

This PR aims to make the node list dynamic.

It also now filters out unschedulable and not-ready nodes.

This should replace the required-rollout PR #878.
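In sketch form, the dynamic lookup amounts to re-filtering the cached node list on every poll. A minimal sketch, assuming the sync-cache API exactly as it appears in the diff quoted below (`cache.get_all`, `Node.kind`, `Node.new(definition:)`), and not necessarily the final code:

def find_nodes(cache)
  # Re-read nodes from the sync cache on every poll, so nodes that appear
  # or disappear mid-rollout are picked up, instead of relying on a list
  # fixed at deploy start.
  cache.get_all(Node.kind).each_with_object([]) do |node_data, nodes|
    # Skip cordoned nodes (spec.unschedulable is set).
    next if node_data.dig('spec', 'unschedulable').to_s.downcase == 'true'
    # Skip nodes whose Ready condition is present and not "True".
    cond = node_data.dig('status', 'conditions')&.find { |c| c['type'].downcase == 'ready' }
    next if cond && cond['status'].downcase != 'true'
    nodes << Node.new(definition: node_data)
  end
end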

@wayt requested review from dalehamel and KnVerey March 30, 2022 15:06
@wayt requested a review from a team as a code owner March 30, 2022 15:06
@wayt requested review from JamesOwenHall, peiranliushop, timothysmith0609 and guidowb and removed request for a team March 30, 2022 15:06
@wayt changed the title from "WIP daemonset deploy" to "Actively monitor nodes status for DaemonSet deployments" Mar 30, 2022
@KnVerey (Contributor) left a comment

I too prefer this over #878 -- Nice work finding a solution that makes Krane success detection more robust instead of less!

def find_nodes(cache)
all_nodes = cache.get_all(Node.kind)
all_nodes = all_nodes.select { |node_data| node_data.dig('spec', 'unschedulable').to_s.downcase != 'true' }
@KnVerey (Contributor):

We're looping over the same data three times here. Can we do it just once instead? My ruby might be rusty, but something like:

def find_nodes(cache)
  all_nodes = cache.get_all(Node.kind)
  all_nodes.each_with_object([]) do |node_data, relevant_nodes|
    # A node with no Ready condition is treated as ready.
    cond = node_data.dig('status', 'conditions').find { |c| c['type'].downcase == 'ready' }
    ready = cond.nil? ? true : cond['status'].downcase == 'true'
    schedulable = node_data.dig('spec', 'unschedulable').to_s.downcase != 'true'
    if schedulable && ready
      relevant_nodes << Node.new(definition: node_data)
    end
  end
end

@wayt (Contributor Author):

Your Ruby is less rusty than mine, apparently.

We definitely can.
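(An aside on the pattern suggested above: `each_with_object` threads a single accumulator through the loop, which is what collapses the three passes into one. A toy illustration with made-up data, unrelated to the real node payloads:)

nums = [1, 2, 3, 4]
# The [] accumulator is yielded to every iteration and returned at the end,
# so filtering and transforming happen in a single pass.
evens_doubled = nums.each_with_object([]) do |n, acc|
  acc << n * 2 if n.even?
end
# evens_doubled == [4, 8]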

ds = build_synced_ds(ds_template: ds_template, pod_templates: pod_templates, node_templates: node_templates)
refute_predicate(ds, :deploy_succeeded?)

node_added_status = {
@KnVerey (Contributor):

This is the same as above, and we're not adding a node in this section. Unneeded/misnamed variable?

@wayt (Contributor Author):

That's expected: we have the same Pod status, but now one of the nodes has become unschedulable, so it should not block anymore.

node_templates[2]['spec']['unschedulable'] = 'true'
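Spelled out as a sketch, reusing the helpers quoted in this review (`build_synced_ds` and the minitest predicate assertions); which node gets cordoned is taken from the fixture index above:

node_templates[2]['spec']['unschedulable'] = 'true' # cordon one of the nodes
ds = build_synced_ds(ds_template: ds_template, pod_templates: pod_templates, node_templates: node_templates)
# With that node excluded, the same Pod status now covers every remaining
# schedulable node, so the rollout is considered done.
assert_predicate(ds, :deploy_succeeded?)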

"numberReady": 2,
}
ds_template = build_ds_template(filename: 'daemon_set.yml', status: node_added_status)
pod_templates = load_fixtures(filenames: ['daemon_set_pods.yml'])
@KnVerey (Contributor):

Similarly, do we need to reload these for some reason, or could we just use the ones above?

"numberReady": 2,
}
ds_template = build_ds_template(filename: 'daemon_set.yml', status: node_added_status)
pod_templates = load_fixtures(filenames: ['daemon_set_pods.yml'])
@KnVerey (Contributor):

Same question as above about these three variables/loads.

@wayt (Contributor Author) commented Mar 30, 2022

@KnVerey thanks for your feedback, I've updated the PR accordingly.

I'm still tracking down the failing test; not sure yet if it's related or not.

@KnVerey (Contributor) left a comment

LGTM. Test failure looks like an unrelated flake.

@wayt (Contributor Author) commented Mar 31, 2022

I tested it locally with ci-us-west1-1 / node-local-dns; works great!

@wayt merged commit 78a0855 into master Mar 31, 2022
@wayt mentioned this pull request Mar 31, 2022
@shopify-shipit bot temporarily deployed to rubygems March 31, 2022 16:12
@wayt deleted the daemonset-deploy-fix branch March 31, 2022 16:59