Status | |
---|---|
Stability | beta: traces |
Distributions | contrib, aws, grafana, observiq, splunk, sumo |
Issues | |
Code Owners | @jpkrohling |
The tail sampling processor samples traces based on a set of defined policies. All spans for a given trace MUST be received by the same collector instance for effective sampling decisions.
Before performing sampling, spans will be grouped by trace_id. Therefore, the tail sampling processor can be used directly without the need for the groupbytraceprocessor.
This processor must be placed in pipelines after any processors that rely on context, e.g. k8sattributes. It reassembles spans into new batches, causing them to lose their original context.
Please refer to config.go for the config spec.
The following configuration options are required:
- policies (no default): Policies used to make a sampling decision
Multiple policies exist today and it is straightforward to add more. These include:
- always_sample: Sample all traces
- latency: Sample based on the duration of the trace. The duration is determined by looking at the earliest start time and latest end time, without taking into consideration what happened in between. Supplying no upper bound will result in a policy sampling anything greater than threshold_ms.
- numeric_attribute: Sample based on number attributes (resource and record)
- probabilistic: Sample a percentage of traces. Read a comparison with the Probabilistic Sampling Processor.
- status_code: Sample based upon the status code (OK, ERROR or UNSET)
- string_attribute: Sample based on string attributes (resource and record) value matches; both exact and regex value matches are supported
- trace_state: Sample based on TraceState value matches
- rate_limiting: Sample based on rate
- span_count: Sample based on the minimum and/or maximum number of spans, inclusive. If the sum of all spans in the trace is outside the range threshold, the trace will not be sampled.
- boolean_attribute: Sample based on boolean attribute (resource and record).
- ottl_condition: Sample based on given boolean OTTL condition (span and span event).
- and: Sample based on multiple policies, creates an AND policy
- composite: Sample based on a combination of above samplers, with ordering and rate allocation per sampler. Rate allocation allocates certain percentages of spans per policy order. For example, if we have set max_total_spans_per_second to 100, then we can set rate_allocation as follows:
  - test-composite-policy-1 = 50 % of max_total_spans_per_second = 50 spans_per_second
  - test-composite-policy-2 = 25 % of max_total_spans_per_second = 25 spans_per_second
  - To ensure remaining capacity is filled, use always_sample as one of the policies
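The rate allocation arithmetic for the composite policy can be checked with a few lines of Python. This is only a sketch of the percentages described above, not the processor's implementation; the function name split_rate_allocation is hypothetical.

```python
def split_rate_allocation(max_total_spans_per_second, rate_allocation):
    """Turn percentage allocations into spans-per-second budgets.

    rate_allocation maps a policy name to its percentage of the total.
    Returns the per-policy budgets and the capacity left unallocated.
    """
    budgets = {
        policy: max_total_spans_per_second * percent / 100
        for policy, percent in rate_allocation.items()
    }
    # Whatever is not explicitly allocated remains for other policies
    # in the policy order, such as a trailing always_sample.
    remaining = max_total_spans_per_second - sum(budgets.values())
    return budgets, remaining


budgets, remaining = split_rate_allocation(
    100, {"test-composite-policy-1": 50, "test-composite-policy-2": 25}
)
print(budgets, remaining)
```

With a total of 100 spans per second, the two policies receive 50 and 25 spans per second, and 25 spans per second remain for a catch-all policy.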
The following configuration options can also be modified:
- decision_wait (default = 30s): Wait time since the first span of a trace before making a sampling decision
- num_traces (default = 50000): Number of traces kept in memory.
- expected_new_traces_per_sec (default = 0): Expected number of new traces (helps in allocating data structures)
Each policy will result in a decision, and the processor will evaluate them to make a final decision:
- When there's an "inverted not sample" decision, the trace is not sampled;
- When there's a "sample" decision, the trace is sampled;
- When there's an "inverted sample" decision and no "not sample" decisions, the trace is sampled;
- In all other cases, the trace is NOT sampled
An "inverted" decision is one made based on the "invert_match" attribute, such as the one from the string_attribute policy.
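The four rules above can be read as a small precedence function. The sketch below encodes them in Python for illustration only; the Decision enum and final_decision function are hypothetical names, and the processor's own code may be organized differently.

```python
from enum import Enum


class Decision(Enum):
    SAMPLED = "sample"
    NOT_SAMPLED = "not sample"
    INVERT_SAMPLED = "inverted sample"
    INVERT_NOT_SAMPLED = "inverted not sample"


def final_decision(decisions):
    """Combine per-policy decisions; True means the trace is sampled."""
    seen = set(decisions)
    # Rule 1: an "inverted not sample" decision always wins.
    if Decision.INVERT_NOT_SAMPLED in seen:
        return False
    # Rule 2: otherwise any "sample" decision samples the trace.
    if Decision.SAMPLED in seen:
        return True
    # Rule 3: "inverted sample" counts only if nothing said "not sample".
    if Decision.INVERT_SAMPLED in seen and Decision.NOT_SAMPLED not in seen:
        return True
    # Rule 4: everything else is not sampled.
    return False


print(final_decision([Decision.SAMPLED, Decision.INVERT_NOT_SAMPLED]))
print(final_decision([Decision.INVERT_SAMPLED, Decision.NOT_SAMPLED]))
```

Both calls print False: in the first, the "inverted not sample" decision overrides the "sample" decision; in the second, the "not sample" decision cancels the "inverted sample" one.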
Examples:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      [
        {
          name: test-policy-1,
          type: always_sample
        },
        {
          name: test-policy-2,
          type: latency,
          latency: {threshold_ms: 5000, upper_threshold_ms: 10000}
        },
        {
          name: test-policy-3,
          type: numeric_attribute,
          numeric_attribute: {key: key1, min_value: 50, max_value: 100}
        },
        {
          name: test-policy-4,
          type: probabilistic,
          probabilistic: {sampling_percentage: 10}
        },
        {
          name: test-policy-5,
          type: status_code,
          status_code: {status_codes: [ERROR, UNSET]}
        },
        {
          name: test-policy-6,
          type: string_attribute,
          string_attribute: {key: key2, values: [value1, value2]}
        },
        {
          name: test-policy-7,
          type: string_attribute,
          string_attribute: {key: key2, values: [value1, val*], enabled_regex_matching: true, cache_max_size: 10}
        },
        {
          name: test-policy-8,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 35}
        },
        {
          name: test-policy-9,
          type: string_attribute,
          string_attribute: {key: http.url, values: [\/health, \/metrics], enabled_regex_matching: true, invert_match: true}
        },
        {
          name: test-policy-10,
          type: span_count,
          span_count: {min_spans: 2, max_spans: 20}
        },
        {
          name: test-policy-11,
          type: trace_state,
          trace_state: { key: key3, values: [value1, value2] }
        },
        {
          name: test-policy-12,
          type: boolean_attribute,
          boolean_attribute: {key: key4, value: true}
        },
        {
          name: test-policy-13,
          type: ottl_condition,
          ottl_condition: {
            error_mode: ignore,
            span: [
              "attributes[\"test_attr_key_1\"] == \"test_attr_val_1\"",
              "attributes[\"test_attr_key_2\"] != \"test_attr_val_1\"",
            ],
            spanevent: [
              "name != \"test_span_event_name\"",
              "attributes[\"test_event_attr_key_2\"] != \"test_event_attr_val_1\"",
            ]
          }
        },
        {
          name: and-policy-1,
          type: and,
          and: {
            and_sub_policy:
              [
                {
                  name: test-and-policy-1,
                  type: numeric_attribute,
                  numeric_attribute: { key: key1, min_value: 50, max_value: 100 }
                },
                {
                  name: test-and-policy-2,
                  type: string_attribute,
                  string_attribute: { key: key2, values: [ value1, value2 ] }
                },
              ]
          }
        },
        {
          name: composite-policy-1,
          type: composite,
          composite:
            {
              max_total_spans_per_second: 1000,
              policy_order: [test-composite-policy-1, test-composite-policy-2, test-composite-policy-3],
              composite_sub_policy:
                [
                  {
                    name: test-composite-policy-1,
                    type: numeric_attribute,
                    numeric_attribute: {key: key1, min_value: 50, max_value: 100}
                  },
                  {
                    name: test-composite-policy-2,
                    type: string_attribute,
                    string_attribute: {key: key2, values: [value1, value2]}
                  },
                  {
                    name: test-composite-policy-3,
                    type: always_sample
                  }
                ],
              rate_allocation:
                [
                  {
                    policy: test-composite-policy-1,
                    percent: 50
                  },
                  {
                    policy: test-composite-policy-2,
                    percent: 25
                  }
                ]
            }
        },
      ]
Refer to tail_sampling_config.yaml for detailed examples on using the processor.
Imagine that you wish to configure the processor to implement the following rules:
- Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.
- Rule 2: Sample only 0.1 percent of Readiness/liveness probes
- Rule 3: service-1 has a noisy endpoint /v1/name/{id}. Sample only 1 percent of such traces.
- Rule 4: Other traces from service-1 should be sampled at 100 percent.
- Rule 5: Sample all traces if there is an error in any span in the trace.
- Rule 6: Add an escape hatch. If there is an attribute called app.force_sample in the span, then sample the trace at 100 percent.
Here is what the configuration would look like:
tail_sampling:
  decision_wait: 10s
  num_traces: 100
  expected_new_traces_per_sec: 10
  policies: [
      {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      },
      # BEGIN: policies for team_a
      {
        # Rule 2: low sampling for readiness/liveness probes
        name: team_a-probe,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  # filter by service name
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values: [service-1, service-2, service-3],
                    },
                },
                {
                  # filter by route
                  name: route-live-ready-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/live, /ready],
                      enabled_regex_matching: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 0.1 },
                },
              ],
          },
      },
      {
        # Rule 3: low sampling for a noisy endpoint
        name: team_a-noisy-endpoint-1,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    { key: service.name, values: [service-1] },
                },
                {
                  # filter by route
                  name: route-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/v1/name/.+],
                      enabled_regex_matching: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 1 },
                },
              ],
          },
      },
      {
        # Rule 4: high sampling for other endpoints
        name: team_a-service-1,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    { key: service.name, values: [service-1] },
                },
                {
                  # invert match - apply to all routes except the ones specified
                  name: route-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/v1/name/.+],
                      enabled_regex_matching: true,
                      invert_match: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 100 },
                },
              ],
          },
      },
      {
        # Rule 5: always sample if there is an error
        name: team_a-status-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                    },
                },
                {
                  name: trace-status-policy,
                  type: status_code,
                  status_code: { status_codes: [ERROR] },
                },
              ],
          },
      },
      {
        # Rule 6:
        # always sample if the force_sample attribute is set to true
        name: team_a-force-sample,
        type: boolean_attribute,
        boolean_attribute: { key: app.force_sample, value: true },
      },
      # END: policies for team_a
    ]
This processor requires all spans for a given trace to be sent to the same collector instance for the correct sampling decision to be derived. When scaling the collector, you'll then need to ensure that all spans for the same trace are reaching the same collector. You can achieve this by having two layers of collectors in your infrastructure: one with the load balancing exporter, and one with the tail sampling processor.
While it's technically possible to have one layer of collectors with two pipelines on each instance, we recommend separating the layers in order to have better failure isolation.
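The idea behind the first layer is that the destination is a pure function of the trace ID, so every span of a trace lands on the same second-layer collector. The following sketch illustrates this with a plain hash-and-modulo choice; it is not the load balancing exporter's actual algorithm, and pick_backend and the backend hostnames are hypothetical.

```python
import hashlib


def pick_backend(trace_id, backends):
    """Choose a backend collector deterministically from a hex trace ID."""
    digest = hashlib.sha256(bytes.fromhex(trace_id)).digest()
    # Use the first 8 bytes of the hash as an integer and map it to a backend.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["collector-1:4317", "collector-2:4317", "collector-3:4317"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Every span of the trace carries the same trace ID, so all of them
# are routed to the same tail sampling collector.
first = pick_backend(trace_id, backends)
second = pick_backend(trace_id, backends)
print(first == second)
```

Note that a modulo scheme like this one reshuffles most traces whenever the list of backends changes, which is one reason a real deployment would rely on the exporter rather than on a hand-written router.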
Probabilistic Sampling Processor compared to the Tail Sampling Processor with the Probabilistic policy
The probabilistic sampling processor and the probabilistic tail sampling processor policy work very similarly: based upon a configurable sampling percentage, they will sample a fixed ratio of received traces. Depending on the overall processing pipeline, however, you should prefer one over the other.
As a rule of thumb, if you want to add probabilistic sampling and...
...you are not using the tail sampling processor already: use the probabilistic sampling processor. Running the probabilistic sampling processor is more efficient than the tail sampling processor. The probabilistic sampling policy makes its decision based upon the trace ID, so waiting until more spans have arrived will not influence its decision.
...you are already using the tail sampling processor: add the probabilistic sampling policy. You are already incurring the cost of running the tail sampling processor, so the added cost of the probabilistic policy will be negligible. Additionally, using the policy within the tail sampling processor will ensure traces that are sampled by other policies will not be dropped.
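To see why waiting for more spans does not change a trace-ID-based decision, consider the sketch below. It assumes a SHA-256 hash and a resolution of 0.01 percent purely for illustration; the hashing used by the actual processor and policy is not described here, and sample_by_trace_id is a hypothetical name.

```python
import hashlib


def sample_by_trace_id(trace_id, sampling_percentage):
    """Decide from the hex trace ID alone whether a trace is sampled."""
    digest = hashlib.sha256(bytes.fromhex(trace_id)).digest()
    # Map the trace ID to one of 10,000 buckets (0.01 percent each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = round(sampling_percentage * 100)
    return bucket < threshold


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# The decision depends only on the trace ID and the percentage, so it is
# the same for the first span of a trace as for the last one.
print(sample_by_trace_id(trace_id, 10) == sample_by_trace_id(trace_id, 10))
```

Because the input is the trace ID alone, the decision can be made as soon as the first span arrives, which is what makes the standalone processor cheaper than buffering whole traces.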
Q. Why am I seeing high values for the error metric sampling_trace_dropped_too_early?
A. This is likely a load issue. If the collector is processing more traces in-memory than the num_traces configuration option allows, some will have to be dropped before they can be sampled. Increasing the value of num_traces can help resolve this error, at the expense of increased memory usage.
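The effect of a num_traces limit can be illustrated with a bounded buffer. The sketch below assumes that the oldest trace is evicted when the buffer is full; that eviction order, the TraceBuffer class and its dropped_too_early counter are assumptions made for illustration, not the processor's implementation.

```python
from collections import OrderedDict


class TraceBuffer:
    """Holds at most num_traces traces while they wait for a decision."""

    def __init__(self, num_traces):
        self.num_traces = num_traces
        self.traces = OrderedDict()  # trace_id -> spans, oldest trace first
        self.dropped_too_early = 0

    def add_span(self, trace_id, span):
        if trace_id not in self.traces:
            if len(self.traces) >= self.num_traces:
                # Buffer full: the oldest trace is evicted before it
                # could be evaluated by any policy.
                self.traces.popitem(last=False)
                self.dropped_too_early += 1
            self.traces[trace_id] = []
        self.traces[trace_id].append(span)

    def take(self, trace_id):
        """Remove and return a trace's spans, or None if it was evicted."""
        return self.traces.pop(trace_id, None)


buffer = TraceBuffer(num_traces=2)
for trace_id in ["a", "b", "c"]:
    buffer.add_span(trace_id, "span-of-" + trace_id)
print(buffer.dropped_too_early, buffer.take("a"))
```

With room for two traces and three arriving, trace "a" is evicted before its decision is due; a larger num_traces keeps it in memory at the cost of holding more spans.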