[grafana-sampling] add sampling helm chart (grafana#2918)
* add sampling helm chart

Signed-off-by: Robbie Lankford <[email protected]>

* wire metrics generation toggle

Signed-off-by: Robbie Lankford <[email protected]>

* add simplified sampling policies

Signed-off-by: Robbie Lankford <[email protected]>

* set 2 replicas and disable autoscaling by default

Signed-off-by: Robbie Lankford <[email protected]>

* set back to 1 replicas by default to pass ci tests

Signed-off-by: Robbie Lankford <[email protected]>

* use kubernetes resolver for loadbalancing exporter

Signed-off-by: Robbie Lankford <[email protected]>

* add README.md

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs

Signed-off-by: Robbie Lankford <[email protected]>

* update helm-docs; add decision wait

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs and fix typo

Signed-off-by: Robbie Lankford <[email protected]>

* quote decision_wait

Signed-off-by: Robbie Lankford <[email protected]>

* add transform to drop unneeded resource attributes for spanmetrics

Signed-off-by: Robbie Lankford <[email protected]>

* more doc updates

Signed-off-by: Robbie Lankford <[email protected]>

* more doc updates

Signed-off-by: Robbie Lankford <[email protected]>

* move sampling to grafana-sampling

Signed-off-by: Robbie Lankford <[email protected]>

* additional docs updates

Signed-off-by: Robbie Lankford <[email protected]>

* remove sample file

Signed-off-by: Robbie Lankford <[email protected]>

* shorten names to pass tests

Signed-off-by: Robbie Lankford <[email protected]>

* update png and metrics pipeline order based on PR review

Signed-off-by: Robbie Lankford <[email protected]>

* remove k8s.pod.name from default dimensions

Signed-off-by: Robbie Lankford <[email protected]>

---------

Signed-off-by: Robbie Lankford <[email protected]>
rlankfo committed Apr 4, 2024
1 parent a50f643 commit 3761a1f
Showing 24 changed files with 723 additions and 0 deletions.
23 changes: 23 additions & 0 deletions charts/grafana-sampling/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
9 changes: 9 additions & 0 deletions charts/grafana-sampling/Chart.lock
@@ -0,0 +1,9 @@
dependencies:
- name: grafana-agent
  repository: https://grafana.github.io/helm-charts
  version: 0.36.0
- name: grafana-agent
  repository: https://grafana.github.io/helm-charts
  version: 0.36.0
digest: sha256:6d04a55dce2c09c4c250c6453e0d58f7280750bf04fce51027b4e235062413e5
generated: "2024-03-11T15:41:30.921516-07:00"
18 changes: 18 additions & 0 deletions charts/grafana-sampling/Chart.yaml
@@ -0,0 +1,18 @@
apiVersion: v2
name: grafana-sampling
description: A Helm chart for a layered OTLP tail sampling and metrics generation pipeline.
type: application
version: 0.1.0
appVersion: "v0.40.2"
sources:
  - https://github.com/grafana/agent
  - https://grafana.com/docs/grafana-cloud/monitor-applications/application-observability/setup/sampling/tail/
dependencies:
  - name: grafana-agent
    version: 0.36.0
    repository: https://grafana.github.io/helm-charts
    alias: grafana-agent-deployment
  - name: grafana-agent
    version: 0.36.0
    repository: https://grafana.github.io/helm-charts
    alias: grafana-agent-statefulset
124 changes: 124 additions & 0 deletions charts/grafana-sampling/README.md
@@ -0,0 +1,124 @@
# grafana-sampling

![Version: 0.1.0](https://img.shields.io/badge/Version-0.1.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v0.40.2](https://img.shields.io/badge/AppVersion-v0.40.2-informational?style=flat-square)

A Helm chart for a layered OTLP tail sampling and metrics generation pipeline.

This chart deploys the following architecture to your environment:
![Photo of sampling architecture](./sampling-architecture.png)

Note: by default, only OTLP traces are accepted at the load balancing layer.

## Chart Repo

Add the following repo to use the chart:

```console
helm repo add grafana https://grafana.github.io/helm-charts
```
## Installing the Chart

Use the following command to install the chart with the release name `my-release`. Make sure to populate the required values.

```console
helm install my-release grafana/grafana-sampling --values - <<EOF | less
grafana-agent-statefulset:
  agent:
    extraEnv:
      - name: GRAFANA_CLOUD_API_KEY
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_PROMETHEUS_URL
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_PROMETHEUS_USERNAME
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_TEMPO_ENDPOINT
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_TEMPO_USERNAME
        value: <REQUIRED>
      # This is required for adaptive metric deduplication in Grafana Cloud
      - name: POD_UID
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.uid
EOF
```

## Uninstalling the Chart

To uninstall/delete the `my-release` deployment:

```console
helm delete my-release
```

The command removes all the Kubernetes components associated with the chart and deletes the release.

## Upgrading

A major chart version change indicates an incompatible, breaking change that requires manual action.

## Values

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| grafana-agent-deployment.agent.configMap.create | bool | `false` | |
| grafana-agent-deployment.agent.extraPorts[0].name | string | `"otlp-grpc"` | |
| grafana-agent-deployment.agent.extraPorts[0].port | int | `4317` | |
| grafana-agent-deployment.agent.extraPorts[0].protocol | string | `"TCP"` | |
| grafana-agent-deployment.agent.extraPorts[0].targetPort | int | `4317` | |
| grafana-agent-deployment.agent.extraPorts[1].name | string | `"otlp-http"` | |
| grafana-agent-deployment.agent.extraPorts[1].port | int | `4318` | |
| grafana-agent-deployment.agent.extraPorts[1].protocol | string | `"TCP"` | |
| grafana-agent-deployment.agent.extraPorts[1].targetPort | int | `4318` | |
| grafana-agent-deployment.agent.resources.requests.cpu | string | `"1"` | |
| grafana-agent-deployment.agent.resources.requests.memory | string | `"2G"` | |
| grafana-agent-deployment.controller.autoscaling.enabled | bool | `false` | Creates a HorizontalPodAutoscaler for controller type deployment. |
| grafana-agent-deployment.controller.autoscaling.maxReplicas | int | `5` | The upper limit for the number of replicas to which the autoscaler can scale up. |
| grafana-agent-deployment.controller.autoscaling.minReplicas | int | `2` | The lower limit for the number of replicas to which the autoscaler can scale down. |
| grafana-agent-deployment.controller.autoscaling.targetCPUUtilizationPercentage | int | `0` | Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling. |
| grafana-agent-deployment.controller.autoscaling.targetMemoryUtilizationPercentage | int | `80` | Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling. |
| grafana-agent-deployment.controller.replicas | int | `1` | |
| grafana-agent-deployment.controller.type | string | `"deployment"` | |
| grafana-agent-deployment.nameOverride | string | `"deployment"` | Do not change this. |
| grafana-agent-statefulset.agent.configMap.create | bool | `false` | |
| grafana-agent-statefulset.agent.extraEnv[0].name | string | `"GRAFANA_CLOUD_API_KEY"` | |
| grafana-agent-statefulset.agent.extraEnv[0].value | string | `"<REQUIRED>"` | |
| grafana-agent-statefulset.agent.extraEnv[1].name | string | `"GRAFANA_CLOUD_PROMETHEUS_URL"` | |
| grafana-agent-statefulset.agent.extraEnv[1].value | string | `"<REQUIRED>"` | |
| grafana-agent-statefulset.agent.extraEnv[2].name | string | `"GRAFANA_CLOUD_PROMETHEUS_USERNAME"` | |
| grafana-agent-statefulset.agent.extraEnv[2].value | string | `"<REQUIRED>"` | |
| grafana-agent-statefulset.agent.extraEnv[3].name | string | `"GRAFANA_CLOUD_TEMPO_ENDPOINT"` | |
| grafana-agent-statefulset.agent.extraEnv[3].value | string | `"<REQUIRED>"` | |
| grafana-agent-statefulset.agent.extraEnv[4].name | string | `"GRAFANA_CLOUD_TEMPO_USERNAME"` | |
| grafana-agent-statefulset.agent.extraEnv[4].value | string | `"<REQUIRED>"` | |
| grafana-agent-statefulset.agent.extraEnv[5].name | string | `"POD_UID"` | |
| grafana-agent-statefulset.agent.extraEnv[5].valueFrom.fieldRef.apiVersion | string | `"v1"` | |
| grafana-agent-statefulset.agent.extraEnv[5].valueFrom.fieldRef.fieldPath | string | `"metadata.uid"` | |
| grafana-agent-statefulset.agent.extraPorts[0].name | string | `"otlp-grpc"` | |
| grafana-agent-statefulset.agent.extraPorts[0].port | int | `4317` | |
| grafana-agent-statefulset.agent.extraPorts[0].protocol | string | `"TCP"` | |
| grafana-agent-statefulset.agent.extraPorts[0].targetPort | int | `4317` | |
| grafana-agent-statefulset.agent.resources.requests.cpu | string | `"1"` | |
| grafana-agent-statefulset.agent.resources.requests.memory | string | `"2G"` | |
| grafana-agent-statefulset.controller.autoscaling.enabled | bool | `false` | Creates a HorizontalPodAutoscaler for controller type deployment. |
| grafana-agent-statefulset.controller.autoscaling.maxReplicas | int | `5` | The upper limit for the number of replicas to which the autoscaler can scale up. |
| grafana-agent-statefulset.controller.autoscaling.minReplicas | int | `2` | The lower limit for the number of replicas to which the autoscaler can scale down. |
| grafana-agent-statefulset.controller.autoscaling.targetCPUUtilizationPercentage | int | `0` | Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling. |
| grafana-agent-statefulset.controller.autoscaling.targetMemoryUtilizationPercentage | int | `80` | Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling. |
| grafana-agent-statefulset.controller.replicas | int | `1` | |
| grafana-agent-statefulset.controller.type | string | `"statefulset"` | |
| grafana-agent-statefulset.nameOverride | string | `"statefulset"` | Do not change this. |
| grafana-agent-statefulset.rbac.create | bool | `false` | |
| grafana-agent-statefulset.service.clusterIP | string | `"None"` | |
| grafana-agent-statefulset.serviceAccount.create | bool | `false` | |
| metricsGeneration.dimensions | list | `["service.namespace","service.version","deployment.environment","k8s.cluster.name"]` | Additional dimensions to add to generated metrics. |
| metricsGeneration.enabled | bool | `true` | Toggle generation of spanmetrics and servicegraph metrics. |
| sampling.decisionWait | string | `"15s"` | Wait time since the first span of a trace before making a sampling decision. |
| sampling.enabled | bool | `true` | Toggle tail sampling. |
| sampling.extraPolicies | string | A policy to sample long requests is added by default. | User-defined policies in river format. |
| sampling.failedRequests.percentage | int | `50` | Percentage of failed requests to sample. |
| sampling.failedRequests.sample | bool | `false` | Toggle sampling failed requests. |
| sampling.successfulRequests.percentage | int | `10` | Percentage of successful requests to sample. |
| sampling.successfulRequests.sample | bool | `true` | Toggle sampling successful requests. |
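
The keys in the table can be overridden in a values file. The following is a minimal sketch, not a recommended configuration: the file name `values-override.yaml` and all numbers are illustrative assumptions, while the keys and their defaults are taken from the table above.

```yaml
# values-override.yaml (hypothetical file name)
# Keys come from the Values table; the chosen numbers are illustrative only.
sampling:
  enabled: true
  decisionWait: "30s"      # default "15s"; quoted so it stays a string
  failedRequests:
    sample: true           # default false
    percentage: 100        # default 50
  successfulRequests:
    sample: true
    percentage: 5          # default 10
metricsGeneration:
  enabled: true
  dimensions:              # overrides the full default list
    - service.namespace
    - deployment.environment
grafana-agent-statefulset:
  controller:
    replicas: 2            # default 1
```

Helm replaces lists rather than merging them, so `metricsGeneration.dimensions` must name every dimension you want to keep. The required `extraEnv` entries from the install command still need to be supplied alongside these overrides.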

63 changes: 63 additions & 0 deletions charts/grafana-sampling/README.md.gotmpl
@@ -0,0 +1,63 @@
{{ template "chart.header" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

{{ template "chart.description" . }}

This chart deploys the following architecture to your environment:
![Photo of sampling architecture](./sampling-architecture.png)

Note: by default, only OTLP traces are accepted at the load balancing layer.


## Chart Repo

Add the following repo to use the chart:

```console
helm repo add grafana https://grafana.github.io/helm-charts
```
## Installing the Chart

Use the following command to install the chart with the release name `my-release`. Make sure to populate the required values.

```console
helm install my-release grafana/grafana-sampling --values - <<EOF | less
grafana-agent-statefulset:
  agent:
    extraEnv:
      - name: GRAFANA_CLOUD_API_KEY
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_PROMETHEUS_URL
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_PROMETHEUS_USERNAME
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_TEMPO_ENDPOINT
        value: <REQUIRED>
      - name: GRAFANA_CLOUD_TEMPO_USERNAME
        value: <REQUIRED>
      # This is required for adaptive metric deduplication in Grafana Cloud
      - name: POD_UID
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.uid
EOF
```

## Uninstalling the Chart

To uninstall/delete the `my-release` deployment:

```console
helm delete my-release
```

The command removes all the Kubernetes components associated with the chart and deletes the release.

## Upgrading

A major chart version change indicates an incompatible, breaking change that requires manual action.

{{ template "chart.valuesSection" . }}

@@ -0,0 +1,5 @@
{{- define "agent.config.deployment" -}}
{{- include "deployment.receiver.otlp" . }}
{{- include "deployment.processor.batch" . }}
{{- include "deployment.exporter.loadbalancing" . }}
{{- end -}}
@@ -0,0 +1,18 @@
{{- define "agent.config.statefulset" -}}
{{- include "statefulset.receiver.otlp" . }}
{{- if .Values.metricsGeneration.enabled -}}
{{- include "statefulset.connector.spanmetrics" . }}
{{- include "statefulset.processor.transform.drop_unneeded_resource_attributes" . }}
{{- include "statefulset.processor.transform.use_grafana_metric_names" . }}
{{- include "statefulset.processor.filter" . }}
{{- include "statefulset.connector.servicegraph" . }}
{{- include "statefulset.exporter.prometheus" . }}
{{- include "statefulset.prometheus.remote_write" . }}
{{- end -}}
{{- if .Values.sampling.enabled -}}
{{- include "statefulset.processor.tail_sampling" . }}
{{- end -}}
{{- include "statefulset.processor.batch" . }}
{{- include "exporter.otlp" . }}
{{- include "auth.basic" . }}
{{- end -}}
9 changes: 9 additions & 0 deletions charts/grafana-sampling/templates/_helpers.tpl
@@ -0,0 +1,9 @@
{{/* use the release name as the serviceAccount name for deployment and statefulset agents */}}
{{- define "grafana-agent.serviceAccountName" -}}
{{- default .Release.Name }}
{{- end }}

{{/* Calculate name of image ID to use for "grafana-agent". */}}
{{- define "grafana-agent.imageId" -}}
{{- printf ":%s" .Chart.AppVersion }}
{{- end }}
@@ -0,0 +1,8 @@
{{- define "auth.basic" -}}
otelcol.auth.basic "grafana_cloud_tempo" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_TEMPO_USERNAME")
  password = env("GRAFANA_CLOUD_API_KEY")
}

{{ end }}
@@ -0,0 +1,20 @@
{{- define "statefulset.connector.servicegraph" -}}
otelcol.connector.servicegraph "default" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.connector.servicegraph/
  dimensions = [
    {{- range $.Values.metricsGeneration.dimensions }}
    {{ . | quote }},
    {{- end }}
  ]
  latency_histogram_buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]

  store {
    ttl = "2s"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

{{ end }}
@@ -0,0 +1,26 @@
{{- define "statefulset.connector.spanmetrics" -}}
otelcol.connector.spanmetrics "default" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.connector.spanmetrics/
  {{- range $.Values.metricsGeneration.dimensions }}
  dimension {
    name = {{ . | quote }}
  }
  {{- end }}

  namespace = "traces.spanmetrics"

  histogram {
    unit = "s"

    explicit {
      buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
    }
  }

  output {
    metrics = [otelcol.processor.filter.drop_unneeded_span_metrics.input]
  }
}


{{ end }}
@@ -0,0 +1,22 @@
{{- define "deployment.exporter.loadbalancing" -}}
otelcol.exporter.loadbalancing "default" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.exporter.loadbalancing/
  resolver {

    kubernetes {
      service = "{{ .Release.Name }}-statefulset.{{ .Release.Namespace }}"
    }
  }

  protocol {
    otlp {
      client {
        tls {
          insecure = true
        }
      }
    }
  }
}

{{ end }}
10 changes: 10 additions & 0 deletions charts/grafana-sampling/templates/_otelcol_exporter_otlp.river.txt
@@ -0,0 +1,10 @@
{{- define "exporter.otlp" -}}
otelcol.exporter.otlp "grafana_cloud_tempo" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.exporter.otlp/
  client {
    endpoint = env("GRAFANA_CLOUD_TEMPO_ENDPOINT")
    auth = otelcol.auth.basic.grafana_cloud_tempo.handler
  }
}

{{ end }}
@@ -0,0 +1,8 @@
{{- define "statefulset.exporter.prometheus" -}}
otelcol.exporter.prometheus "grafana_cloud_prometheus" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.exporter.prometheus/
  add_metric_suffixes = false
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]
}

{{ end }}
@@ -0,0 +1,22 @@
{{- define "deployment.processor.batch" -}}
otelcol.processor.batch "default" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.processor.batch/
  output {
    traces = [otelcol.exporter.loadbalancing.default.input]
  }
}

{{ end }}

{{- define "statefulset.processor.batch" -}}
otelcol.processor.batch "default" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.processor.batch/
  output {
    {{ if .Values.metricsGeneration.enabled }}
    metrics = [otelcol.exporter.prometheus.grafana_cloud_prometheus.input]
    {{ end }}
    traces = [otelcol.exporter.otlp.grafana_cloud_tempo.input]
  }
}

{{ end }}