Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cmd/opampsupervisor]: Persist the instance ID between restarts #32618

Conversation

BinaryFissionGames
Copy link
Contributor

Description:

  • Adds the ability to persist the instance ID on disk and load it on startup.

Link to tracking Issue: Closes #21073

Testing:

  • Add unit tests for persistence functionality
  • Add e2e tests for persistence (testing that initial id is persisted, and that server specified ID is persisted)

Documentation: N/A

Comment on lines 28 to 49
// DirectoryOrDefault returns the configured storage directory if it was configured,
// otherwise it returns the system default.
func (s Storage) DirectoryOrDefault() string {
if s.Directory == "" {
switch runtime.GOOS {
case "windows":
// Windows default is "%ProgramData%\Otelcol\Supervisor"
// If the ProgramData environment variable is not set,
// it falls back to C:\ProgramData
programDataDir := os.Getenv("ProgramData")
if programDataDir == "" {
programDataDir = `C:\ProgramData`
}
return filepath.Join(programDataDir, "Otelcol", "Supervisor")
default:
// Default for non-windows systems
return "/var/lib/otelcol/supervisor"
}
}

return s.Directory
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make sense to move this to a separate PR, I think we'll need to create these directories if they don't exist too.

Comment on lines +824 to +899
case <-s.doneChan:
err := s.commander.Stop(context.Background())
if err != nil {
s.logger.Error("Could not stop agent process", zap.Error(err))
}
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is due to a race condition I was bumping into with the e2e tests. Basically, the shutdown would happen while the command was in the middle of starting, so this goroutine would never exit (the command never exited), then Supervisor.Shutdown would never exit as a consequence.

Instead, we just handle the full lifecycle of the command all in the same goroutine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I agree with handling the full lifecycle in a single goroutine.

Comment on lines +882 to 956
s.supervisorWG.Wait()

if s.healthCheckTicker != nil {
s.healthCheckTicker.Stop()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was another race condition, the health check ticker is recreated by the agent command goroutine, so we need to make sure that goroutine is exited before we try to stop it (otherwise we may leak a non-stopped ticker)

@evan-bradley
Copy link
Contributor

Thanks for opening this and fixing a few race conditions. Do you want a review while this is still a draft, or would you like to wait until it's marked "ready for review"?

Do you think there's any overlap with #30807?

@BinaryFissionGames BinaryFissionGames force-pushed the feat/opamp-supervisor-persist-instance-id branch from d9cad2c to f1508a0 Compare April 23, 2024 13:57
@BinaryFissionGames BinaryFissionGames marked this pull request as ready for review April 23, 2024 13:58
@BinaryFissionGames BinaryFissionGames requested a review from a team April 23, 2024 13:58
@BinaryFissionGames
Copy link
Contributor Author

@evan-bradley It's ready for review now, I tend to open things in draft and come back to them with a fresh mind to look back over the PR, turns out it was a good idea because I caught a few weird naming things here 😆

I do think there's a small bit of overlap in #30807 - but I think it's only in the configuration. I think it makes sense to persist those statuses in different files, just because they are raw byte payloads that probably wouldn't be so useful in a structured yaml file like this.

.chloggen/feat_opamp-supervisor-persist-instance-id.yaml Outdated Show resolved Hide resolved
cmd/opampsupervisor/supervisor/persistence.go Show resolved Hide resolved
Comment on lines +824 to +899
case <-s.doneChan:
err := s.commander.Stop(context.Background())
if err != nil {
s.logger.Error("Could not stop agent process", zap.Error(err))
}
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I agree with handling the full lifecycle in a single goroutine.

@BinaryFissionGames BinaryFissionGames force-pushed the feat/opamp-supervisor-persist-instance-id branch from 4e70be0 to c891001 Compare May 3, 2024 14:19
@BinaryFissionGames BinaryFissionGames force-pushed the feat/opamp-supervisor-persist-instance-id branch from 2672219 to b314426 Compare May 7, 2024 14:39
@bogdandrutu
Copy link
Member

@tigrannajaryan is this ready to be merged from your perspective?

@tigrannajaryan
Copy link
Member

@tigrannajaryan is this ready to be merged from your perspective?

User-visible functionality LGTM. I didn't review the code, but @evan-bradley did so I believe it should be good to merge.

@djaglowski djaglowski merged commit 125ff49 into open-telemetry:main May 16, 2024
162 checks passed
@github-actions github-actions bot added this to the next release milestone May 16, 2024
evan-bradley added a commit that referenced this pull request May 16, 2024
evan-bradley pushed a commit that referenced this pull request May 16, 2024
… failure (#33095)

**Description:** <Describe what has changed.>
fixes test failure observed in #33093 (due to merge of #32819 and
#32618)

**Link to tracking Issue:** Closes #33093

**Testing:** run supervisor e2e tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[cmd/opampsupervisor] Persist Collector instance ID
5 participants