Skip to content

Commit

Permalink
[extension/opamp]: Add mechanism to detect whether the collector has …
Browse files Browse the repository at this point in the history
…been orphaned (open-telemetry#32564)

**Description:** <Describe what has changed.>
* Allows the process to monitor a passed in ppid, which should be the
parent process ID for the collector. When the parent process ID exits,
the extension emits a fatal error event, which triggers a collector
shutdown.

**Link to tracking Issue:** This is part of open-telemetry#32189 - It does not resolve
this issue because the supervisor still needs changes to pass the its
PID in.

**Testing:**
Added some unit tests.

I've manually tested it on my macbook with this PR:
observIQ#4715
(running supervisor, kill -9 the supervisor, and take a look at the
agent logs to see it shut down).

I've tried testing this out on Windows, but the supervisor doesn't get
past bootstrapping (the Commander's Stop function does not work on
windows), so I wasn't able to fully test it.

**Documentation:** 
Added new parameter to README
  • Loading branch information
BinaryFissionGames authored and cparkins committed Jul 11, 2024
1 parent 95749ce commit 4d7aba9
Show file tree
Hide file tree
Showing 13 changed files with 254 additions and 23 deletions.
13 changes: 13 additions & 0 deletions .chloggen/feat_opamp-extension-monitor-ppid.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: bug_fix

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: opampextension

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add a new `ppid` parameter that can be used to enable orphan detection for the supervisor.

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [32189]
4 changes: 2 additions & 2 deletions cmd/otelcontribcol/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -648,8 +648,8 @@ require (
github.com/tidwall/wal v1.1.7 // indirect
github.com/tilinna/clock v1.1.0 // indirect
github.com/tinylib/msgp v1.1.9 // indirect
github.com/tklauser/go-sysconf v0.3.12 // indirect
github.com/tklauser/numcpus v0.6.1 // indirect
github.com/tklauser/go-sysconf v0.3.14 // indirect
github.com/tklauser/numcpus v0.8.0 // indirect
github.com/valyala/fastjson v1.6.4 // indirect
github.com/vincent-petithory/dataurl v1.0.0 // indirect
github.com/vishvananda/netlink v1.1.1-0.20201029203352-d40f9887b852 // indirect
Expand Down
6 changes: 4 additions & 2 deletions cmd/otelcontribcol/go.sum

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions extension/opampextension/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,8 @@ The following settings are optional:
- `reports_effective_config`: Whether to enable the OpAMP ReportsEffectiveConfig capability. Default is `true`.
- `agent_description`: Setting that modifies the agent description reported to the OpAMP server.
- `non_identifying_attributes`: A map of key value pairs that will be added to the [non-identifying attributes](https://github.com/open-telemetry/opamp-spec/blob/main/specification.md#agentdescriptionnon_identifying_attributes) reported to the OpAMP server. If an attribute collides with the default non-identifying attributes that are automatically added, the ones specified here take precedence.
- `ppid`: An optional process ID to monitor. When this process is no longer running, the extension will emit a fatal error, causing the collector to exit. This is meant to be set by the Supervisor or some other parent process, and should not be configured manually.
- `ppid_poll_interval`: The poll interval between check for whether `ppid` is still alive or not. Defaults to 5 seconds.

### Example

Expand Down
10 changes: 10 additions & 0 deletions extension/opampextension/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ package opampextension // import "github.com/open-telemetry/opentelemetry-collec
import (
"errors"
"net/url"
"time"

"github.com/oklog/ulid/v2"
"github.com/open-telemetry/opamp-go/client"
Expand All @@ -29,6 +30,15 @@ type Config struct {

// Agent descriptions contains options to modify the AgentDescription message
AgentDescription AgentDescription `mapstructure:"agent_description"`

// PPID is the process ID of the parent for the collector. If the PPID is specified,
// the extension will continuously poll for the status of the parent process, and emit a fatal error
// when the parent process is no longer running.
// If unspecified, the orphan detection logic does not run.
PPID int32 `mapstructure:"ppid"`

// PPIDPollInterval is the time between polling for whether PPID is running.
PPIDPollInterval time.Duration `mapstructure:"ppid_poll_interval"`
}

type AgentDescription struct {
Expand Down
3 changes: 3 additions & 0 deletions extension/opampextension/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ package opampextension
import (
"path/filepath"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
Expand Down Expand Up @@ -40,6 +41,7 @@ func TestUnmarshalConfig(t *testing.T) {
Capabilities: Capabilities{
ReportsEffectiveConfig: true,
},
PPIDPollInterval: 5 * time.Second,
}, cfg)
}

Expand All @@ -60,6 +62,7 @@ func TestUnmarshalHttpConfig(t *testing.T) {
Capabilities: Capabilities{
ReportsEffectiveConfig: true,
},
PPIDPollInterval: 5 * time.Second,
}, cfg)
}

Expand Down
4 changes: 3 additions & 1 deletion extension/opampextension/factory.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ package opampextension // import "github.com/open-telemetry/opentelemetry-collec

import (
"context"
"time"

"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/extension"
Expand All @@ -27,9 +28,10 @@ func createDefaultConfig() component.Config {
Capabilities: Capabilities{
ReportsEffectiveConfig: true,
},
PPIDPollInterval: 5 * time.Second,
}
}

func createExtension(_ context.Context, set extension.CreateSettings, cfg component.Config) (extension.Extension, error) {
return newOpampAgent(cfg.(*Config), set.Logger, set.BuildInfo, set.Resource)
return newOpampAgent(cfg.(*Config), set)
}
10 changes: 9 additions & 1 deletion extension/opampextension/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -6,13 +6,13 @@ require (
github.com/google/uuid v1.6.0
github.com/oklog/ulid/v2 v2.1.0
github.com/open-telemetry/opamp-go v0.14.0
github.com/shirou/gopsutil/v3 v3.24.3
github.com/stretchr/testify v1.9.0
go.opentelemetry.io/collector/component v0.99.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/config/configopaque v1.6.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/config/configtls v0.99.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/confmap v0.99.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/extension v0.99.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/pdata v1.6.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/collector/semconv v0.99.1-0.20240503164040-109173d9cf84
go.opentelemetry.io/otel/metric v1.26.0
go.opentelemetry.io/otel/trace v1.26.0
Expand All @@ -30,21 +30,29 @@ require (
github.com/fsnotify/fsnotify v1.7.0 // indirect
github.com/go-logr/logr v1.4.1 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-ole/go-ole v1.2.6 // indirect
github.com/go-viper/mapstructure/v2 v2.0.0-alpha.1 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/gorilla/websocket v1.5.1 // indirect
github.com/knadh/koanf/maps v0.1.1 // indirect
github.com/knadh/koanf/providers/confmap v0.1.0 // indirect
github.com/knadh/koanf/v2 v2.1.1 // indirect
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
github.com/mitchellh/copystructure v1.2.0 // indirect
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect
github.com/prometheus/client_golang v1.19.0 // indirect
github.com/prometheus/client_model v0.6.1 // indirect
github.com/prometheus/common v0.53.0 // indirect
github.com/prometheus/procfs v0.12.0 // indirect
github.com/rogpeppe/go-internal v1.11.0 // indirect
github.com/shoenig/go-m1cpu v0.1.6 // indirect
github.com/tklauser/go-sysconf v0.3.14 // indirect
github.com/tklauser/numcpus v0.8.0 // indirect
github.com/yusufpapurcu/wmi v1.2.4 // indirect
go.opentelemetry.io/collector/config/configtelemetry v0.99.1-0.20240503164040-109173d9cf84 // indirect
go.opentelemetry.io/collector/pdata v1.6.1-0.20240503164040-109173d9cf84 // indirect
go.opentelemetry.io/otel v1.26.0 // indirect
go.opentelemetry.io/otel/exporters/prometheus v0.48.0 // indirect
go.opentelemetry.io/otel/sdk v1.26.0 // indirect
Expand Down
38 changes: 38 additions & 0 deletions extension/opampextension/go.sum

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

40 changes: 40 additions & 0 deletions extension/opampextension/monitor_ppid.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package opampextension // import "github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampextension"

import (
"context"
"fmt"
"time"

"github.com/shirou/gopsutil/v3/process"
"go.opentelemetry.io/collector/component"
)

// monitorPPID polls for the existence of ppid.
// If the specified ppid no longer exists, a fatal error event is reported via the passed in reportStatus function.
func monitorPPID(ctx context.Context, interval time.Duration, ppid int32, reportStatus func(*component.StatusEvent)) {
for {
exists, err := process.PidExistsWithContext(ctx, ppid)
if err != nil {
statusErr := fmt.Errorf("collector was orphaned, failed to find process with pid %d: %w", ppid, err)
status := component.NewFatalErrorEvent(statusErr)
reportStatus(status)
return
}

if !exists {
statusErr := fmt.Errorf("collector was orphaned, process with pid %d does not exist", ppid)
status := component.NewFatalErrorEvent(statusErr)
reportStatus(status)
return
}

select {
case <-time.After(interval): // OK; Poll again to make sure PID exists
case <-ctx.Done():
return
}
}
}
Loading

0 comments on commit 4d7aba9

Please sign in to comment.