Actively monitor nodes status for DaemonSet deployments #881
Conversation
0be5d4c to 7b6ba59
I too prefer this over #878 -- Nice work finding a solution that makes Krane success detection more robust instead of less!
def find_nodes(cache)
  all_nodes = cache.get_all(Node.kind)
  all_nodes = all_nodes.select { |node_data| node_data.dig('spec', 'unschedulable').to_s.downcase != 'true' }
We're looping over the same data three times here. Can we do it just once instead? My ruby might be rusty, but something like:
def find_nodes(cache)
  all_nodes = cache.get_all(Node.kind)
  all_nodes.each_with_object([]) do |node_data, relevant_nodes|
    cond = node_data.dig('status', 'conditions').find { |c| c['type'].downcase == 'ready' }
    # If no Ready condition is reported, treat the node as ready
    ready = cond.nil? ? true : cond['status'].downcase == 'true'
    schedulable = node_data.dig('spec', 'unschedulable').to_s.downcase != 'true'
    if schedulable && ready
      relevant_nodes << Node.new(definition: node_data)
    end
  end
end
Your ruby is less rusty than mine, apparently.
We definitely can.
ds = build_synced_ds(ds_template: ds_template, pod_templates: pod_templates, node_templates: node_templates)
refute_predicate(ds, :deploy_succeeded?)

node_added_status = {
This is the same as above and we're not adding a node in this section. Unneeded/misnamed var?
That's expected: we have the same Pod status, but now one of the nodes has become unschedulable, so it should not block anymore.
node_templates[2]['spec']['unschedulable'] = 'true'
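To spell out the test step being described, a minimal sketch using the helpers quoted in this diff (build_synced_ds, deploy_succeeded?); the final assert_predicate call is my assumption of how the earlier refute_predicate check flips, not a quote from the PR:

# Sketch only: mark one node unschedulable, rebuild the synced DaemonSet, and
# expect the rollout to count as succeeded even though that node's pod is not ready.
node_templates[2]['spec']['unschedulable'] = 'true'
ds = build_synced_ds(ds_template: ds_template, pod_templates: pod_templates, node_templates: node_templates)
assert_predicate(ds, :deploy_succeeded?)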
"numberReady": 2, | ||
} | ||
ds_template = build_ds_template(filename: 'daemon_set.yml', status: node_added_status) | ||
pod_templates = load_fixtures(filenames: ['daemon_set_pods.yml']) |
Similarly, do we need to reload these for some reason, or could we just use the ones above?
"numberReady": 2, | ||
} | ||
ds_template = build_ds_template(filename: 'daemon_set.yml', status: node_added_status) | ||
pod_templates = load_fixtures(filenames: ['daemon_set_pods.yml']) |
Same question as above about these three variables/loads.
@KnVerey thanks for your feedback, I've updated the PR accordingly. I'm still tracking down the failing test; not sure yet if it's related or not.
LGTM. Test failure looks like an unrelated flake.
I tested it locally with a
This PR is a completion of the original implementation in #580.
In the original implementation, the list of nodes is fetched when the deployment starts and is then fixed. If one of the nodes goes away or becomes unschedulable while DaemonSets are rolling out, the deploy keeps waiting indefinitely.
This PR aims to make the node list dynamic.
It also now filters out unschedulable / not-ready nodes.
This should replace the required-rollout PR #878.
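As a rough sketch of what "dynamic" means here, based on the find_nodes snippet quoted in the review; the sync method shape below is an assumption about where find_nodes gets called, not the PR's exact code:

# Sketch only: re-read the node list from the sync cache on every poll, so nodes
# that disappear or become unschedulable/not ready drop out of the success check.
def sync(cache)
  super
  # find_nodes filters unschedulable and not-ready nodes, as shown above
  @nodes = find_nodes(cache)
end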