
[probabilistic sampling processor] encoded sampling probability (support OTEP 235) #31894

Merged 99 commits into open-telemetry:main on Jun 13, 2024

Conversation

jmacd
Contributor

@jmacd jmacd commented Mar 21, 2024

Description: Creates new sampler modes named "equalizing" and "proportional". Preserves existing functionality under the mode named "hash_seed".

Fixes #31918

This is the final step in a sequence; the work as a whole was factored into 3+ PRs, including the new pkg/sampling package and the previous step, #31946. The two new sampler modes enable mixing OTel sampling SDKs with Collectors in a consistent way.

The existing hash_seed mode is also a consistent sampling mode, which makes it possible to have a 1:1 mapping between its decisions and the OTEP 235 randomness and threshold values. Specifically, the 14-bit hash value and sampling probability are mapped into 56-bit R-value and T-value encodings, so that all sampling decisions in all modes include threshold information.
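The 14-bit-to-56-bit mapping described above can be sketched as follows. This is a minimal illustration, not the processor's exact code; the constant names are assumptions, though `numHashBuckets` and the shift by 42 (56 − 14) match the excerpt reviewed later in this conversation:

```go
package main

import "fmt"

const (
	numHashBuckets = 0x4000 // 2^14, the hash_seed mode's bucket space
	shift          = 42     // 56 - 14: scale a 14-bit value into 56 bits
)

// acceptToReject56 converts a 14-bit acceptance threshold (hash_seed
// semantics: sample when bucket < accept) into a 56-bit OTEP 235
// rejection threshold (sample when randomness >= reject).
func acceptToReject56(accept uint64) uint64 {
	return (numHashBuckets - accept) << shift
}

func main() {
	// 25% sampling: accept 4096 of the 16384 buckets.
	accept := uint64(numHashBuckets / 4)
	fmt.Printf("reject threshold: 0x%014x\n", acceptToReject56(accept))
}
```

With 25% sampling this yields threshold 0xc0000000000000, i.e. the top quarter of the 56-bit randomness space is accepted, matching p = 1/4.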

This implements the semantic conventions of open-telemetry/semantic-conventions#793, namely the sampling.randomness and sampling.threshold attributes used for logs where there is no tracestate.

The default sampling mode remains HashSeed. We consider a future change of default to Proportional to be desirable, because:

  1. Sampling probability is the same, only the hashing algorithm changes
  2. Proportional respects and preserves information about earlier sampling decisions, which HashSeed can't do, so it has greater interoperability with OTel SDKs which may also adopt OTEP 235 samplers.
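As a rough sketch of the difference between the two new modes (illustrative only: the names and float arithmetic are assumptions; the real implementation works on 56-bit integer thresholds from pkg/sampling):

```go
package main

import "fmt"

const scale = 1 << 56 // OTEP 235 thresholds occupy a 56-bit space

// thresholdFromProb returns the rejection threshold for probability p:
// an item passes when its 56-bit randomness value is >= the threshold.
func thresholdFromProb(p float64) uint64 { return uint64((1 - p) * scale) }

func probFromThreshold(t uint64) float64 { return 1 - float64(t)/scale }

func main() {
	incoming := thresholdFromProb(0.5) // upstream already sampled at 50%

	// Equalizing: force the configured probability (25%) regardless of
	// what upstream did, by setting the threshold to match it.
	equalized := thresholdFromProb(0.25)

	// Proportional: multiply the incoming probability by the local
	// ratio, preserving upstream decisions (0.5 * 0.25 = 0.125).
	proportional := thresholdFromProb(probFromThreshold(incoming) * 0.25)

	fmt.Println(probFromThreshold(equalized), probFromThreshold(proportional))
}
```

The key property: proportional composes with whatever sampling already happened, while equalizing overrides it (and, as noted in the review below, a threshold can only be raised, never lowered).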

Link to tracking Issue:

Draft for open-telemetry/opentelemetry-specification#3602.
Previously #24811, see also open-telemetry/oteps#235
Part of #29738

Testing: New testing has been added.

Documentation:

Member

@jpkrohling jpkrohling left a comment


Apart from small details, I think this looks good. I would also ask you to think about which telemetry you'd want from this component in case of a bug report, and add metrics for them. Do we need to keep a histogram of Rs and Ts?

processor/probabilisticsamplerprocessor/README.md (review comments, resolved)
processor/probabilisticsamplerprocessor/sampler_mode.go (review comments, resolved)
```go
logger.Debug(description, zap.Error(err))
var se samplerError
if errors.As(err, &se) {
	logger.Info(description, zap.Error(err))
```
Member


this is part of the hot path, I'd rather have a metric recording the number of errors that happened, with samplerError being a dimension of the metric (error=sampler, error=other).

Member


Let me suggest something different: let them all be (debug) logs for this PR, but then convert the metrics for this component as a whole from OC to OTel, using mdatagen. On that PR, you could add the new metric for the error conditions.

Contributor Author


I like the idea of making the logs into Debug() and Info()-level. I also do not expect any of the Info()-level log statements to be common, for they require a misbehaving SDK or some other form of data corruption, as opposed to misconfigured sampling. Because these messages signal corruption of some kind, I think they should be logs. My expectation is that logs are sampled at runtime (or they should be) to avoid impacting the hot path.

With that said, I feel that a new metric is not helpful -- it just means the customer has to monitor something new when we have metric signals already. There is a single metric output by this code specifically, which will have a "policy=missing_randomness" when these errors arise.

We also have (or at least desire) standard pipeline metrics, which ought to provide a standard way to count how many spans succeed or fail. If the sampler is configured with FailClosed=true and these missing_randomness conditions are happening, the result will be loss of spans. I do not want the user to have to discover a new metric for this, because there ought to be a standard metric for rejected items. All processors should be able to count the number of items that are rejected for malformed data.

```go
if errors.Is(err, sampling.ErrInconsistentSampling) {
	// This is working-as-intended. You can't lower
	// the threshold, it's illogical.
	logger.Debug(description, zap.Error(err))
```
Member


same comment here about the metric vs. log

Contributor Author


To summarize: I think the existing metric's "policy=missing_randomness" is a useful-enough signal. Personally, I want standard pipeline metrics so that every component doesn't have to invent a bespoke metric definition.

```go
	}
}
if err := carrier.reserialize(); err != nil {
	logger.Info(description, zap.Error(err))
```
Member


and same here :-)

Contributor Author


This is some sort of grievous corruption, and I personally do not want every component inventing new metrics to monitor for things we never expect to happen.

Contributor Author

@jmacd jmacd left a comment


(partial)

processor/probabilisticsamplerprocessor/README.md (review comments, resolved)
```diff
@@ -18,6 +18,7 @@ require (
	go.opentelemetry.io/otel/metric v1.27.0
	go.opentelemetry.io/otel/trace v1.27.0
	go.uber.org/goleak v1.3.0
	go.uber.org/multierr v1.11.0
```
Contributor Author


Thanks, now using errors.Join. 95ecbae

processor/probabilisticsamplerprocessor/sampler_mode.go (review comments, resolved)
@jpkrohling
Member

@jmacd, please let me know once this is ready for another round.

jmacd and others added 2 commits June 11, 2024 08:27
@jmacd
Contributor Author

jmacd commented Jun 11, 2024

I would also ask you to think about which telemetry you'd want from this component in case of a bug report, and add metrics for them. Do we need to keep a histogram of Rs and Ts?

My personal opinion is that we should not invent new metrics to monitor for bugs; that is what logs are good at. If we think logs do not perform well enough, we should reconsider: metrics are not clearly better for performance, compared with sampled logs.

Moreover, I want us to encourage use of standard pipeline metrics. Practically all real processors are going to encounter errors that would cause data to be dropped or malformed in some way, and we shouldn't need new metrics for every one of them.

For your question about histograms of R and T: there is something I would recommend, but not a histogram of T or R values (both could be high cardinality), and probably not at default verbosity. What we do care about, and I'm open to suggestions, is that the sum of adjusted counts after sampling is expected to equal the sum of adjusted counts before sampling. This is a probabilistic argument, so the match is not exact. We should have a metric that counts items weighted by their adjusted count. I would argue that such a metric should be standardized and comparable with standard pipeline metrics: if otelcol_incoming_items and otelcol_outgoing_items are the standard pipeline metrics, sampling processors could emit otelcol_incoming_items_adjusted and otelcol_outgoing_items_adjusted. There would be some additional expense and code complexity to this.
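The expectation described here can be illustrated with a deterministic sketch: over a uniform grid of randomness values (standing in for real trace randomness), each kept item weighted by 1/p sums back to the original count. Names and setup are hypothetical, not from the processor:

```go
package main

import "fmt"

func main() {
	const n = 10000
	const p = 0.25 // sampling probability
	adjusted := 0.0
	for i := 0; i < n; i++ {
		r := float64(i) / n // stand-in for 56-bit randomness, uniform
		// rejection-threshold semantics: keep when randomness >= 1-p
		if r >= 1-p {
			adjusted += 1 / p // each kept item represents 1/p items
		}
	}
	fmt.Println(adjusted) // sums back to the original item count, 10000
}
```

With real (random) inputs the totals match only in expectation, which is why an exact equality check cannot serve as an alert condition.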

Another direction to take this question is that the span-to-metrics connector should be able to use the adjusted counts so that it counts the number of representative spans, not the number of actual spans. This sounds more useful to me than a pipeline metric of adjusted count, and anyway I do not prefer to use metrics as a debugging signal. If there are bugs, I would recommend users connect a debugging exporter and review the output data.

@jmacd
Contributor Author

jmacd commented Jun 12, 2024

These test failures are independent, #33520.

Comment on lines +440 to +454

```go
// Convert the accept threshold to a reject threshold,
// then shift it into 56-bit value.
reject := numHashBuckets - scaledSamplerate
reject56 := uint64(reject) << 42

threshold, _ := sampling.UnsignedToThreshold(reject56)

return &hashingSampler{
	tvalueThreshold: threshold,
	hashSeed:        cfg.HashSeed,

	// Logs specific:
	logsTraceIDEnabled:            cfg.AttributeSource == traceIDAttributeSource,
	logsRandomnessSourceAttribute: cfg.FromAttribute,
}
```
Contributor Author

@jmacd jmacd Jun 12, 2024


Sampler SIG, see here. Reference: open-telemetry/semantic-conventions#793 (comment)


### Equalizing

This mode uses the same randomness mechanism as the propotional
Member


Suggested change:
- This mode uses the same randomness mechanism as the propotional
+ This mode uses the same randomness mechanism as the proportional

```diff
@@ -105,16 +106,16 @@ func Test_tracesamplerprocessor_SamplingPercentageRange(t *testing.T) {
	},
	numBatches: 1e5,
	numTracesPerBatch: 2,
	acceptableDelta: 0.01,
```
Member


If these adjustments are to get a faster but less-flaky test, what do you think of the idea of adding a loop? You can try a few times to get a result in the acceptable range; it will finish as soon as it sees an acceptable result.

This is the idea behind the "Eventually" function in the stretchr/testify library (we don't use it here but the concept is sound).

- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.
- `sampling_precision` (integer, optional, default = 4): Determines the number of hexadecimal digits used to encode the sampling threshold.
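An illustrative configuration using these fields might look like the following (hypothetical values; the field names come from the parameter list above):

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 25
    # hash_seed is used by the hash_seed mode; keep it identical
    # across all collectors behind the same load balancer.
    hash_seed: 22
    fail_closed: true
    sampling_precision: 4
```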
Member


maybe include the range of valid values here (1-14)?

```diff
@@ -45,6 +75,14 @@ type Config struct {
	// despite errors using priority.
	FailClosed bool `mapstructure:"fail_closed"`

	// SamplingPrecision is how many hex digits of sampling
	// threshold will be encoded, from 1 up to 14. Default is 4.
	// 0 is treated as full precision.
```
Member


Not according to invalid_zero.yaml.

```diff
@@ -230,46 +376,82 @@ func consistencyCheck(rnd randomnessNamer, _ samplingCarrier) error {
 //
 // Extending this logic, we round very small probabilities up to the
 // minimum supported value(s) which varies according to sampler mode.
-func makeSampler(cfg *Config) dataSampler {
+func makeSampler(cfg *Config, isLogs bool) dataSampler {
```
Member


please add an explanation of 'isLogs' to the comment

Member

@kentquirk kentquirk left a comment


My comments are mainly cosmetic; in general, we need this; approving to start to push the train out of the station.

Member

@jpkrohling jpkrohling left a comment


Now that older and newer collectors can co-exist and come to the same decisions, I think this is ready to go. The logs vs. metrics matter can be addressed later, and changed if we think it's easier to respond to possible bug reports.

@jpkrohling jpkrohling merged commit 9f0a3db into open-telemetry:main Jun 13, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 13, 2024
t00mas pushed a commit to t00mas/opentelemetry-collector-contrib that referenced this pull request Jun 18, 2024
(commit message duplicates the PR description above)
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Jul 11, 2024
…ation, prepare for OTEP 235 support (open-telemetry#31946)

**Description:**

Refactors the probabilistic sampling processor to prepare it for more
OTEP 235 support.

This clarifies existing inconsistencies between tracing and logging
samplers, see the updated README. The tracing priority mechanism applies
a 0% or 100% sampling override (e.g., "1" implies 100% sampling),
whereas the logging sampling priority mechanism supports
variable-probability override (e.g., "1" implies 1% sampling).

This pins down cases where no randomness is available, and organizes the
code to improve readability. A new type called `randomnessNamer` carries
the randomness information (from the sampling package) and the name of
the policy that derived it. When sampling priority causes the effective
sampling probability to change, the value "sampling.priority" replaces
the source of randomness, which is currently limited to "trace_id_hash"
or the name of the randomness-source attribute, for logs.

While working on open-telemetry#31894, I discovered that some inputs fall through to
the hash function with zero bytes of input randomness. The hash
function, computed on an empty input (for logs) or on 16 bytes of zeros
(which OTel calls an invalid trace ID), would produce a fixed random
value. So, for example, when logs are sampled and there is no TraceID
and there is no randomness attribute value, the result will be sampled
at approximately 82.9% and above.

In the refactored code, an error is returned when there is no input
randomness. A new boolean configuration field determines the outcome
when there is an error extracting randomness from an item of telemetry.
By default, items of telemetry with errors will not pass through the
sampler. When `FailClosed` is set to false, items of telemetry with
errors will pass through the sampler.
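The described behavior reduces to a small guard. This is a sketch under assumed names; in the actual processor this decision is wired through the shared per-item logic shown below:

```go
package main

import (
	"errors"
	"fmt"
)

var errMissingRandomness = errors.New("missing randomness")

// shouldSample reports whether an item passes when randomness
// extraction failed. With failClosed=true (the default), erroneous
// items are dropped; with failClosed=false they pass through.
func shouldSample(extractErr error, failClosed bool) bool {
	if extractErr != nil {
		return !failClosed
	}
	return true // no error: defer to the normal threshold comparison
}

func main() {
	fmt.Println(shouldSample(errMissingRandomness, true))  // false: dropped
	fmt.Println(shouldSample(errMissingRandomness, false)) // true: passes
}
```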

The original hash function, which uses 14 bits of information, is
structured as an "acceptance threshold", ultimately the test for
sampling translated into a positive decision when `Randomness <
AcceptThreshold`. In the OTEP 235 scheme, thresholds are rejection
thresholds--this PR modifies the original 14-bit accept threshold into a
56-bit reject threshold, using Threshold and Randomness types from the
sampling package. Reframed in this way, in the subsequent PR (i.e.,
open-telemetry#31894) the effective sampling probability will be seamlessly conveyed
using OTEP 235 semantic conventions.

Note, both traces and logs processors are now reduced to a function like
this:

```
				return commonSamplingLogic(
					ctx,
					l,
					lsp.sampler,
					lsp.failClosed,
					lsp.sampler.randomnessFromLogRecord,
					lsp.priorityFunc,
					"logs sampler",
					lsp.logger,
				)
```

which is a generic function that handles the common logic on a per-item
basis and ends in a single metric event. This structure makes it clear
how traces and logs are processed differently and have different
prioritization schemes, currently. This structure also makes it easy to
introduce new sampler modes, as shown in open-telemetry#31894. After this and open-telemetry#31940
merge, the changes in open-telemetry#31894 will be relatively simple to review as the
third part in a series.

**Link to tracking Issue:**

Depends on open-telemetry#31940.
Part of open-telemetry#31918.

**Testing:** Added. Existing tests already cover the exact random
behavior of the current hashing mechanism. Even more testing will be
introduced with the last step of this series. Note that
open-telemetry#32360
is added ahead of this test to ensure refactoring does not change
results.

**Documentation:** Added.

---------

Co-authored-by: Kent Quirk <[email protected]>
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Jul 11, 2024
(commit message duplicates the PR description above)
Labels
pkg/sampling processor/probabilisticsampler Probabilistic Sampler processor
Successfully merging this pull request may close these issues.

Update probabilistic sampler processor with OTEP 235 support
4 participants