
error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe #32371

Open
zhulei-pacvue opened this issue Apr 15, 2024 · 7 comments
Labels
bug (Something isn't working) · exporter/prometheus · needs triage (New item requiring triage)

Comments

@zhulei-pacvue

Component(s)

exporter/prometheus

Describe the issue you're reporting

When I use the Prometheus exporter, otelcol frequently reports errors like the following:

2024-04-15T01:39:07.597Z error [email protected]/log.go:23 error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe
{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
github.com/prometheus/[email protected]/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
github.com/prometheus/[email protected]/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
go.opentelemetry.io/collector/config/[email protected]/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:225
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:83
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
net/http/server.go:3137
net/http.(*conn).serve
net/http/server.go:2039

Version:
otelcol:0.97.0

Important configuration:

exporters:
  otlp:
    endpoint: 'jaeger-collector:4317'
    tls:
      insecure: true  
  prometheus:
    endpoint: ${env:MY_POD_IP}:8889
    namespace:
    send_timestamps: false
    metric_expiration: 10m
    add_metric_suffixes: false   
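
For context, the log line comes from client_golang's promhttp handler: when writing the scrape response fails (for example because the scraper closed the TCP connection before the response was fully written), the handler reports it through its ErrorLog hook, which the Prometheus exporter plugs its logger into (that is the promLogger.Println frame in the stack trace above). Below is a minimal standalone sketch of that wiring under stated assumptions: it uses plain promhttp.HandlerFor rather than the HandlerForTransactional seen in the trace, a hypothetical registry and counter, and the :8889 port from the config above.

package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical registry and metric, standing in for the exporter's
	// internal collector; not part of the original report.
	reg := prometheus.NewRegistry()
	demo := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_requests_total",
		Help: "hypothetical counter for illustration only",
	})
	reg.MustRegister(demo)

	// promhttp reports encode/write failures via ErrorLog; any value with a
	// Println(...interface{}) method satisfies promhttp.Logger, which is how
	// the exporter's promLogger gets wired in.
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorLog:      log.New(os.Stderr, "", log.LstdFlags),
		ErrorHandling: promhttp.ContinueOnError,
	})

	http.Handle("/metrics", handler)
	// If the scraper disconnects while the response body is still being
	// written, the write fails with "broken pipe" and surfaces through
	// ErrorLog as "error encoding and sending metric family: ...".
	log.Fatal(http.ListenAndServe(":8889", nil))
}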
zhulei-pacvue added the needs triage (New item requiring triage) label on Apr 15, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 added the bug (Something isn't working) label on Apr 15, 2024
@crobert-1
Member

Hello @zhulei-pacvue, did this error happen on startup, or after the collector had been running for some time? Have you seen this happen repeatedly when running the collector, or was this only one time? Can you share more about what kind of environment the collector was running in?

@zhulei-pacvue
Author

@crobert-1 Thank you! This error usually occurs after the collector has been running for some time. After the service is restarted, it runs normally again.

@AndreasPetersen

We're experiencing the same issue:

error	[email protected]/log.go:23	error encoding and sending metric family: write tcp 100.78.45.37:8889->100.78.6.13:46202: write: broken pipe
	{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
	github.com/prometheus/[email protected]/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
	github.com/prometheus/[email protected]/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
	net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
	go.opentelemetry.io/collector/config/[email protected]/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
	go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:214
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
	go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:72
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
	go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
	net/http/server.go:3137
net/http.(*conn).serve
	net/http/server.go:2039

Usually there are about 10 or 20 such log lines around the same time. Some of them occur at exactly the same timestamp, while others are a few ms apart.

In the last week alone, this happens about once or twice a day:

[screenshot of the log occurrences over the past week]

We're running this on OpenShift using Kubernetes v1.26.13+8f85140.

The OpenTelemetry Collector runs as a container in the same pod as a Quarkus service, which sends metrics and traces to the Collector.

Here is the Kubernetes Deployment resource:

kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    # ...
  name: service
  namespace: my-namespace
  labels:
    # ...
spec:
  replicas: 2
  selector:
    matchLabels:
      # ...
  template:
    metadata:
      labels:
        # ...
      annotations:
        # ...
        prometheus.io/path: /metrics
        prometheus.io/port: '8889'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: otel-config
          configMap:
            name: otel-config
            defaultMode: 420
        - name: opa-conf
          configMap:
            name: opa
            defaultMode: 420
      containers:
        - resources:
            limits:
              cpu: 500m
              memory: 1536Mi
            requests:
              cpu: 10m
              memory: 1536Mi
          readinessProbe:
            httpGet:
              path: /q/health/ready
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '5'
          name: service
          livenessProbe:
            httpGet:
              path: /q/health/live
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            # ...
          ports:
            - name: container-port
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: my-service
        - resources:
            limits:
              cpu: 10m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: otel-collector
          livenessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: otel-health
              containerPort: 13133
              protocol: TCP
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
            - name: jaeger-grpc
              containerPort: 14260
              protocol: TCP
            - name: jaeger-http
              containerPort: 14268
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: otel-config
              mountPath: /etc/config/otel
          terminationMessagePolicy: File
          # We use RedHat's ubi8-minimal as a base image, and install OpenTelemetry otelcol_VERSION_linux_amd64.rpm from https://github.com/open-telemetry/opentelemetry-collector-releases/releases onto it
          image: >-
            my-repo/opentelemetry-collector:0.99.0
          args:
            - otelcol
            - '--config=/etc/config/otel/otel-collector-config.yaml'
        - resources:
            limits:
              cpu: 50m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /health?bundle=true
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: opa
          livenessProbe:
            httpGet:
              path: /health
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: opa-port
              containerPort: 8181
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: opa-conf
              readOnly: true
              mountPath: /opa-conf
          terminationMessagePolicy: File
          image: >-
            opa:0.63.0
          args:
            - run
            - '--ignore=.*'
            - '--server'
            - '--log-level'
            - error
            - '--config-file'
            - /opa-conf/opa-conf.yaml
            - '--watch'
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

The OpenTelemetry Collector config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "opa"
          scrape_interval: 15s
          metrics_path: "/metrics"
          static_configs:
            - targets:
                - "0.0.0.0:8181"
  otlp:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:4317
      http:
        endpoint:
          0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:14260
      thrift_http:
        endpoint:
          0.0.0.0:14268
processors:
  batch:
    timeout: 1s
exporters:
  prometheus:
    add_metric_suffixes: false
    endpoint:
      0.0.0.0:8889
  otlp:
    endpoint: "ose-jaeger-collector-headless.ose-jaeger.svc.cluster.local:4317"
    tls:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
      insecure: false
      min_version: "1.2"
extensions:
  health_check:
    endpoint:
      0.0.0.0:13133
service:
  extensions: [ health_check ]
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ prometheus ]
    traces:
      receivers: [ otlp, jaeger ]
      processors: [ batch ]
      exporters: [ otlp ]
  telemetry:
    metrics:
      address: 0.0.0.0:8888

CPU and memory usage of the OpenTelemetry Collector shows nothing abnormal when the error occurs:

[screenshot of collector CPU and memory usage around the time of the errors]
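
For what it is worth, this kind of server-side "broken pipe" can be provoked from the client side whenever the scraper goes away mid-response. Below is a hypothetical reproduction sketch against a collector exposing the Prometheus exporter on 127.0.0.1:8889 (the address, the read size, and the timing are assumptions, and the metric payload has to be large enough that the exporter is still writing when the connection drops).

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Connect to the exporter endpoint (assumed address and port).
	conn, err := net.Dial("tcp", "127.0.0.1:8889")
	if err != nil {
		panic(err)
	}

	// Send a plain scrape request, as a Prometheus scraper would.
	fmt.Fprint(conn, "GET /metrics HTTP/1.1\r\nHost: localhost\r\nAccept: text/plain\r\n\r\n")

	// Read only a small prefix of the response, then hang up, mimicking a
	// scraper that times out or is restarted while the response is still
	// being written. With a large payload the exporter's subsequent writes
	// fail with "write: broken pipe".
	buf := make([]byte, 1024)
	_, _ = conn.Read(buf)
	_ = conn.Close()

	// Give the server a moment to hit the failed write before we exit.
	time.Sleep(200 * time.Millisecond)
}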

@L3o-pold

Same issue since we upgraded from 0.88.0 to 0.100.0

@OfekCyberX

OfekCyberX commented Jul 3, 2024

Hi @crobert-1, is there any ongoing effort to fix or mitigate this issue?
We have a deployment of the otel-collector in k8s and we are facing the exact same issue, with the same stack trace from the prometheusexporter error:

github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/log.go:23

We also tried upgrading to the latest release, 0.97.0, but it didn't change anything.

We have the otel-collector running on a dedicated node without a hard limit, only requests:

resources:
  requests:
    cpu: 32
    memory: 128Gi

Memory and CPU usage are well below the requested resources; memory is at 12% and CPU at 6%.

It seems to start happening when we increase the metrics ingest volume. We have clients sending metrics every minute, and the error started appearing once we increased the number of clients.
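
That observation fits the general shape of the error: more clients means more series, so the scrape response gets larger and takes longer to gather and encode, which widens the window in which a scraper timeout or disconnect lands mid-write. A rough, hypothetical way to see how payload size and encode time grow with series count (the metric name and the client count are made up):

package main

import (
	"fmt"
	"io"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

func main() {
	reg := prometheus.NewRegistry()
	gauge := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "demo_gauge", Help: "one series per simulated client"},
		[]string{"client"},
	)
	reg.MustRegister(gauge)

	// Simulate a growing number of clients, each contributing one series.
	for i := 0; i < 200000; i++ {
		gauge.WithLabelValues(fmt.Sprintf("client-%d", i)).Set(1)
	}

	// Gather and encode to the text exposition format, discarding the output,
	// to get a feel for how large and slow the scrape response becomes.
	start := time.Now()
	families, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	total := 0
	for _, mf := range families {
		n, _ := expfmt.MetricFamilyToText(io.Discard, mf)
		total += n
	}
	fmt.Printf("encoded %d bytes from %d families in %v\n", total, len(families), time.Since(start))
}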

Is there any fix or mitigation for this issue, for example via a parameter or configuration change?

Thanks

@crobert-1
Copy link
Member

My apologies, I'm largely unfamiliar with this component and its functionality. I'm not aware of any ongoing effort to address this. @Aneurysm9 do you have any suggestions here?
