
calls_total metric count not matching the number of traces seen in Jaeger #33857

Open
samuelchrist opened this issue Jul 2, 2024 · 10 comments
Labels: bug (Something isn't working) · connector/spanmetrics · needs triage (New item requiring triage)

Comments

@samuelchrist

Component(s)

connector/spanmetrics

What happened?

Description

In our current setup, we have otel-collector-agent running as a DaemonSet; the DaemonSet pods forward traffic to otel-collector-gateway, and the gateway forwards it on to Prometheus and Jaeger.

I noticed that the number of traces in Jaeger is consistently higher than the count reported by calls_total. I checked:

  • A large time window
  • Different aggregation levels (e.g. service level, environment level)
  • The result is still the same; I need some help understanding why.

Steps to Reproduce

Expected Result

The calls_total count should match the trace count in Jaeger.

Actual Result

The calls_total count does not match the Jaeger trace count.

Collector version

0.100.0

Environment information

Environment

Running on k8s as pods

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-agent
        scrape_interval: 15s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
        metric_relabel_configs:
        - action: labeldrop
          regex: "service_instance_id|service_name|http_scheme|net_host_port"
  hostmetrics/agent:
      collection_interval: 15s
      scrapers:
        cpu:
          metrics:
            system.cpu.logical.count:
              enabled: true
        memory:
          metrics:
            system.memory.utilization:
              enabled: true
            system.memory.limit:
              enabled: true
        load:
        disk:
        filesystem:
          metrics:
            system.filesystem.utilization:
              enabled: true
        network:
        paging:
        processes:
        process:
          mute_process_user_error: true
          metrics:
            process.cpu.utilization:
              enabled: true
            process.memory.utilization:
              enabled: true
            process.threads:
              enabled: true
            process.paging.faults:
              enabled: true
  kubeletstats:
    collection_interval: 15s
    auth_type: "serviceAccount"
    endpoint: "https://${env:NODE_IP}:10250"
    insecure_skip_verify: true
    k8s_api_config:
      auth_type: serviceAccount
    metric_groups:
      - node
      - pod
      - container
      - volume
    metrics:
      container.cpu.usage:
        enabled: true
      k8s.container.cpu_limit_utilization:
        enabled: true
      k8s.container.cpu_request_utilization:
        enabled: true
      container.uptime:
        enabled: true
      k8s.container.memory_limit_utilization:
        enabled: true
      k8s.container.memory_request_utilization:
        enabled: true
      k8s.pod.cpu_limit_utilization:
        enabled: true
      k8s.pod.cpu_request_utilization:
        enabled: true
      k8s.pod.memory_limit_utilization:
        enabled: true
      k8s.pod.memory_request_utilization:
        enabled: true
      k8s.pod.uptime:
        enabled: true
      k8s.node.uptime:
        enabled: true


  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    include_file_name: false

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: url.path
      - name: http.response.status_code
      - name: http.request.method
    exemplars:
      enabled: true
    exclude_dimensions: []
    dimensions_cache_size: 5000000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 1m
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name


exporters:
  otlp:
    endpoint: "otel-collector-gateway.{{.Values.namespace}}.svc.cluster.local:4317"
    tls:
      insecure: true
    sending_queue:
      num_consumers: 4
      queue_size: 15000
    retry_on_failure:
      enabled: true

  logging:
    loglevel: debug
  debug:
    verbosity: detailed

processors:
  batch: {}
  memory_limiter:
    # 80% of maximum memory up to 2G
    limit_mib: 800
    # 25% of limit up to 2G
    spike_limit_mib: 100
    check_interval: 1s
  k8sattributes:
    auth_type: 'serviceAccount'
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
  resourcedetection/eks:
    detectors: [env, eks]
    timeout: 15s
    override: false



extensions:
  health_check:
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
      level: detailed
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resourcedetection/eks, resourcedetection/system , memory_limiter, batch]
      exporters: [otlp, spanmetrics]
    metrics:
      receivers:
      - kubeletstats
      - spanmetrics
      - hostmetrics/agent
      - otlp
      - prometheus
      processors:
      - k8sattributes
      - resourcedetection/system
      - resourcedetection/eks
      - memory_limiter
      - batch
      exporters:
      - otlp
    logs:
      receivers: [otlp, filelog]
      processors: [k8sattributes, resourcedetection/system,resourcedetection/eks , memory_limiter, batch]
      exporters: [otlp]

Log output

No response

Additional context

No response

@samuelchrist added the bug and needs triage labels on Jul 2, 2024
Contributor

github-actions bot commented Jul 2, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

Hi, @samuelchrist, can you provide more context about your issue? Screenshots or more detailed content are welcome to help us understand your question.

@samuelchrist
Author

samuelchrist commented Jul 2, 2024

Hi @Frapschen ,

I am using otel-collector-agent (running as a DaemonSet on all nodes) and otel-collector-gateway (running as a Deployment with a single pod). The applications (mainly Java) are instrumented with the OpenTelemetry SDK 1.37.0.

All the telemetry data from the applications is sent to the otel-agent and forwarded to the otel-collector-gateway over OTLP. The gateway sends the data to different backends: Prometheus for metrics and Jaeger for traces. Data is exposed to Prometheus via the Prometheus exporter (not remote write), and Prometheus scrapes it every 15s.
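
For illustration, the gateway side is roughly shaped like the sketch below (a simplified outline rather than the full gateway config; the endpoint is a placeholder):

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"   # scrape target for Prometheus (15s interval)

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]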

I am using the spanmetrics connector to get metrics about the traces. The number of traces checked directly in Jaeger matches what we see in the Datadog dashboard (which will be removed), but the calls_total metric produces lower counts, which makes it unreliable.

Below is a screenshot where I queried the data for the same time window, same service_name, and same url/endpoints (all tags identical); the total number of hits to the endpoints for the same time range is lower in calls_total than the count in Jaeger.
[screenshot]
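
Roughly, the increase-based query I am comparing against the Jaeger count looks like the sketch below (label values are placeholders, not the real service or endpoint):

    # total calls over the selected window, filtered on the same tags used in the Jaeger search
    sum(increase(calls_total{service_name="my-service", url_path="/my/endpoint", http_request_method="GET"}[1h]))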

What I tried

  1. Using only the few dimensions in spanmetrics that I actually need (the same ones I use in the queries):
        dimensions:
          - name: url.path
          - name: http.response.status_code
          - name: http.request.method
  2. Increasing the dimensions cache (originally 1000):
        dimensions_cache_size: 500000
  3. Increasing the metrics flush interval from 15s to 1m:
        metrics_flush_interval: 1m

Nothing has worked so far. Any input is much appreciated.

@samuelchrist
Author

Hi @Frapschen,

Any inputs? For example, which other spanmetrics connector config options do you think I could try changing to help me debug the root cause?

@swar8080
Contributor

swar8080 commented Jul 3, 2024

@samuelchrist try debugging by graphing sum(calls_total{...}) to check whether the raw series have the expected count. increase has some limitations that make it inaccurate. For example, the first count pushed for a new series is ignored by increase, since the counter doesn't start at zero.
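
For example (placeholder label values), comparing these two query shapes side by side can show whether increase is dropping counts:

    # raw cumulative counter samples, no rate extrapolation
    sum(calls_total{service_name="my-service"})

    # windowed increase, which misses the first sample of each newly created series
    sum(increase(calls_total{service_name="my-service"}[1h]))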

@samuelchrist
Author

samuelchrist commented Jul 3, 2024

@swar8080, thanks for the input. I did try sum(calls_total{...}) - sum(calls_total{...} offset 1h), but its value was very close to the increase-based PromQL I had been using.

This held even for larger time windows.

Edit: [screenshot]

@Frapschen
Contributor

Frapschen commented Jul 3, 2024

@samuelchrist I suspect that the "Total number rows: 69" contains not only the spans with the desired tags but also external spans sharing the same trace_id, whereas sum(calls_total{...}) only returns the count you want.

If you can, try writing a query in SQL (or whatever query language applies) against the span storage directly.

@samuelchrist
Author

Hi @swar8080 and @Frapschen,

Is there any detailed documentation on how spanmetrics works under the hood, and on what each parameter of the spanmetrics connector does and how it behaves? I went through the spanmetrics README, but it is still not entirely clear to me.

@samuelchrist
Author

samuelchrist commented Jul 3, 2024

Hi @swar8080 and @Frapschen ,

I found another thing that might be causing the count difference. I added the url.path tag as a dimension along with the other dimensions, and I noticed that a url_path belonging to one client is showing up under a different client. I am not sure what is causing this; I went through the spanmetrics documentation but could not find the root cause.

I am working on a tight timeline, so any suggestion is highly appreciated.

@samuelchrist
Author

@samuelchrist I suspect that the "Total number rows: 69" contains not only the spans with the desired tags but also external spans sharing the same trace_id, whereas sum(calls_total{...}) only returns the count you want.

If you can, try writing a query in SQL (or whatever query language applies) against the span storage directly.

@Frapschen, I have verified that it is counting only the desired traces, not spans from other traces.

The issue is that labels are being attached incorrectly: as mentioned above, the url_path of client A is getting attached to client B.
