-
Notifications
You must be signed in to change notification settings - Fork 18.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSI volume bugs in Docker Swarm #47974
Comments
A possible solution to the bit shifting bug above without repeated calculations or magic values is below: waitFor = baseRetryInterval << attempt
if waitFor > maxRetryInterval || waitFor < baseRetryInterval {
waitFor = maxRetryInterval
} In this solution |
Thank you for debugging this. This is a nightmare bug, I don't know that I would have ever caught it. |
I have found a third bug that is due to missing internal state in volume publishing and unpublishing, when controller client is unset here permalink: // PublishVolume calls ControllerPublishVolume to publish the given Volume to
// the Node with the given swarmkit ID. It returns a map, which is the
// PublishContext for this Volume on this Node.
func (p *plugin) PublishVolume(ctx context.Context, v *api.Volume, nodeID string) (map[string]string, error) {
if !p.publisher {
return nil, nil
}
csiNodeID := p.swarmToCSI[nodeID]
if csiNodeID == "" {
log.L.Errorf("CSI node ID not found for given Swarm node ID. Plugin: %s , Swarm node ID: %s", p.name, nodeID)
return nil, status.Error(codes.FailedPrecondition, "CSI node ID not found for given Swarm node ID")
}
req := p.makeControllerPublishVolumeRequest(v, nodeID)
c, err := p.Client(ctx)
if err != nil {
return nil, err
}
resp, err := c.ControllerPublishVolume(ctx, req)
if err != nil {
return nil, err
}
return resp.PublishContext, nil
}
// UnpublishVolume calls ControllerUnpublishVolume to unpublish the given
// Volume from the Node with the given swarmkit ID. It returns an error if the
// unpublish does not succeed
func (p *plugin) UnpublishVolume(ctx context.Context, v *api.Volume, nodeID string) error {
if !p.publisher {
return nil
}
req := p.makeControllerUnpublishVolumeRequest(v, nodeID)
c, err := p.Client(ctx)
if err != nil {
return err
}
// response of the RPC intentionally left blank
_, err = c.ControllerUnpublishVolume(ctx, req)
return err
} On the first call to both of these methods, the value of |
I have found a fourth bug that is due to a race condition and is a little harder to reason about. It happens when the I was able to determine how this happens by logging changes in internal state in the task scheduler and found the specific area of concern in the scheduler event loop permalink: schedule := func() {
if len(s.pendingPreassignedTasks) > 0 {
s.processPreassignedTasks(ctx)
}
if tickRequired {
s.tick(ctx)
tickRequired = false
}
} I believe that tasks that are scheduled within schedule := func() {
if tickRequired {
s.tick(ctx)
tickRequired = false
}
if len(s.pendingPreassignedTasks) > 0 {
s.processPreassignedTasks(ctx)
}
} This makes intuitive sense that the current batch of tasks should be committed as soon as possible in the event loop. |
Description
I have found a few critical bugs while testing the CSI volumes feature. The first bug results in an infinite loop that causes dockerd logs to fill up the local disk:
This is due to the following code for calculating a timer in the volume queue, where the bit shifting calculation can overflow permalink:
For example, see tests of the following
waitFor
calculations with higher attempts:To fix this, we could check against a maximum number of
attempt
such as 16, 32, or even calculate the exactattempt
that exceedsmaxRetryInterval
.The second bug relates to early returns in the CSI node plugin code that causes subsequent logic to be skipped:
In addition, this early return In the same file:
I have already opened a PR for the second bug here moby/swarmkit#3178.
Reproduce
Expected behavior
Bugs for CSI volumes fixed, so swarm is able to handle CSI volume drivers with potential errors gracefully.
docker version
Client: Docker Engine - Community Version: 24.0.7 API version: 1.43 Go version: go1.20.10 Git commit: afdd53b Built: Thu Oct 26 09:08:17 2023 OS/Arch: linux/amd64 Context: default Server: Docker Engine - Community Engine: Version: 24.0.7 API version: 1.43 (minimum version 1.12) Go version: go1.20.10 Git commit: 311b9ff Built: Thu Oct 26 09:08:17 2023 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.6.25 GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f runc: Version: 1.1.10 GitCommit: v1.1.10-0-g18a0cb0 docker-init: Version: 0.19.0 GitCommit: de40ad0
docker info
Additional Info
The CSI volume plugin supports staging and the node agent is calling
NodeUnpublishVolume
andNodeUnstageVolume
.The text was updated successfully, but these errors were encountered: