
calls_total metric count not matching the number of traces seen in Jaeger #33857

Open
samuelchrist opened this issue Jul 2, 2024 · 10 comments
Labels: bug (Something isn't working) · connector/spanmetrics · needs triage (New item requiring triage)

Comments

@samuelchrist

Component(s)

connector/spanmetrics

What happened?

Description

In our current setup, we have otel-collector-agent running as a DaemonSet; the DaemonSet pods forward traffic to otel-collector-gateway, and the gateway forwards it on to Prometheus and Jaeger.

I noticed that the number of traces in Jaeger is consistently higher than the count reported by calls_total. I checked:

  • A large time window
  • Different aggregation levels (e.g. service level, environment level)
  • The result is still the same; I need some help understanding why.

Steps to Reproduce

Expected Result

The calls_total count should match the trace count in Jaeger.

Actual Result

The calls_total count does not match the Jaeger trace count.

Collector version

0.100.0

Environment information

Environment

Running on k8s as pods

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-agent
        scrape_interval: 15s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
        metric_relabel_configs:
        - action: labeldrop
          regex: "service_instance_id|service_name|http_scheme|net_host_port"
  hostmetrics/agent:
      collection_interval: 15s
      scrapers:
        cpu:
          metrics:
            system.cpu.logical.count:
              enabled: true
        memory:
          metrics:
            system.memory.utilization:
              enabled: true
            system.memory.limit:
              enabled: true
        load:
        disk:
        filesystem:
          metrics:
            system.filesystem.utilization:
              enabled: true
        network:
        paging:
        processes:
        process:
          mute_process_user_error: true
          metrics:
            process.cpu.utilization:
              enabled: true
            process.memory.utilization:
              enabled: true
            process.threads:
              enabled: true
            process.paging.faults:
              enabled: true
  kubeletstats:
    collection_interval: 15s
    auth_type: "serviceAccount"
    endpoint: "https://${env:NODE_IP}:10250"
    insecure_skip_verify: true
    k8s_api_config:
      auth_type: serviceAccount
    metric_groups:
      - node
      - pod
      - container
      - volume
    metrics:
      container.cpu.usage:
        enabled: true
      k8s.container.cpu_limit_utilization:
        enabled: true
      k8s.container.cpu_request_utilization:
        enabled: true
      container.uptime:
        enabled: true
      k8s.container.memory_limit_utilization:
        enabled: true
      k8s.container.memory_request_utilization:
        enabled: true
      k8s.pod.cpu_limit_utilization:
        enabled: true
      k8s.pod.cpu_request_utilization:
        enabled: true
      k8s.pod.memory_limit_utilization:
        enabled: true
      k8s.pod.memory_request_utilization:
        enabled: true
      k8s.pod.uptime:
        enabled: true
      k8s.node.uptime:
        enabled: true


  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    include_file_name: false

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: url.path
      - name: http.response.status_code
      - name: http.request.method
    exemplars:
      enabled: true
    exclude_dimensions: []
    dimensions_cache_size: 5000000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 1m
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name


exporters:
  otlp:
    endpoint: "otel-collector-gateway.{{.Values.namespace}}.svc.cluster.local:4317"
    tls:
      insecure: true
    sending_queue:
      num_consumers: 4
      queue_size: 15000
    retry_on_failure:
      enabled: true

  logging:
    loglevel: debug
  debug:
    verbosity: detailed

processors:
  batch: {}
  memory_limiter:
    # 80% of maximum memory up to 2G
    limit_mib: 800
    # 25% of limit up to 2G
    spike_limit_mib: 100
    check_interval: 1s
  k8sattributes:
    auth_type: 'serviceAccount'
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
  resourcedetection/eks:
    detectors: [env, eks]
    timeout: 15s
    override: false



extensions:
  health_check:
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
      level: detailed
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resourcedetection/eks, resourcedetection/system , memory_limiter, batch]
      exporters: [otlp, spanmetrics]
    metrics:
      receivers:
      - kubeletstats
      - spanmetrics
      - hostmetrics/agent
      - otlp
      - prometheus
      processors:
      - k8sattributes
      - resourcedetection/system
      - resourcedetection/eks
      - memory_limiter
      - batch
      exporters:
      - otlp
    logs:
      receivers: [otlp, filelog]
      processors: [k8sattributes, resourcedetection/system,resourcedetection/eks , memory_limiter, batch]
      exporters: [otlp]

Log output

No response

Additional context

No response

@samuelchrist added the bug and needs triage labels on Jul 2, 2024
Contributor

github-actions bot commented Jul 2, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

Hi, @samuelchrist, can you provide more context about your issue? Screenshots or more detailed content are welcome to help us understand your question.

@samuelchrist
Author

samuelchrist commented Jul 2, 2024

Hi @Frapschen ,

I am using otel-collector-agent (running as a DaemonSet on all nodes) and otel-collector-gateway (running as a Deployment with a single pod). The applications (mainly Java) are instrumented with the OpenTelemetry SDK 1.37.0.

All the telemetry data from the applications is sent to the otel-agent and forwarded to the otel-collector-gateway over OTLP. The gateway sends the data to different backends: Prometheus for metrics and Jaeger for traces. Data is exposed to Prometheus via the Prometheus exporter (not remote write), and Prometheus scrapes it every 15s.
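
For illustration, the gateway side is roughly shaped like the sketch below (a simplified outline rather than the full gateway config; the endpoint is a placeholder):

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"   # scrape target for Prometheus (15s interval)

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]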

I am using the spanmetrics connector to get metrics about the traces. The number of traces checked directly in Jaeger matches what we see in the Datadog dashboard (which will be removed), but the calls_total metric produces lower counts, which makes it unreliable.

Below is a screenshot where I queried the data for the same time window, same service_name, and same url/endpoints (all tags identical); the total number of hits to the endpoints for the same time range is lower in calls_total than the count in Jaeger.
[screenshot]
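
Roughly, the increase-based query I am comparing against the Jaeger count looks like the sketch below (label values are placeholders, not the real service or endpoint):

    # total calls over the selected window, filtered on the same tags used in the Jaeger search
    sum(increase(calls_total{service_name="my-service", url_path="/my/endpoint", http_request_method="GET"}[1h]))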

What I tried

  1. Using only the few dimensions in spanmetrics that I actually need (the same ones I use in the queries):
        dimensions:
          - name: url.path
          - name: http.response.status_code
          - name: http.request.method
  2. Increasing the dimensions cache (originally 1000):
        dimensions_cache_size: 500000
  3. Increasing the metrics flush interval from 15s to 1m:
        metrics_flush_interval: 1m

Nothing has worked so far. Any input is much appreciated.

@samuelchrist
Author

Hi @Frapschen,

Any inputs? For example, which other spanmetrics connector config options do you think I could try changing to help me debug the root cause?

@swar8080
Contributor

swar8080 commented Jul 3, 2024

@samuelchrist try debugging by graphing sum(calls_total{...}) to check whether the raw series have the expected count. increase has some limitations that make it inaccurate. For example, the first count pushed for a new series is ignored by increase, since the counter doesn't start at zero.
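
For example (placeholder label values), comparing these two query shapes side by side can show whether increase is dropping counts:

    # raw cumulative counter samples, no rate extrapolation
    sum(calls_total{service_name="my-service"})

    # windowed increase, which misses the first sample of each newly created series
    sum(increase(calls_total{service_name="my-service"}[1h]))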

@samuelchrist
Author

samuelchrist commented Jul 3, 2024

@swar8080, thanks for the input. I did try sum(calls_total{...}) - sum(calls_total{...} offset 1h), but its value was very close to the increase-based PromQL I had been using.

This held even for larger time windows.

Edit: [screenshot]

@Frapschen
Contributor

Frapschen commented Jul 3, 2024

@samuelchrist I suspect that the "Total number rows: 69" contains not only the spans with the desired tags but also external spans sharing the same trace_id, whereas sum(calls_total{...}) only returns the count you want.

If you can, try writing a query in SQL (or whatever query language applies) against the span storage directly.

@samuelchrist
Author

Hi @swar8080 and @Frapschen,

Is there any detailed documentation on how spanmetrics works under the hood, and on what each parameter of the spanmetrics connector does and how it behaves? I went through the spanmetrics README, but it is still not entirely clear to me.

@samuelchrist
Author

samuelchrist commented Jul 3, 2024

Hi @swar8080 and @Frapschen ,

I found another thing that might be causing the count difference. I added the url.path tag as a dimension along with the other dimensions, and I noticed that a url_path belonging to one client is showing up under a different client. I am not sure what is causing this; I went through the spanmetrics documentation but could not find the root cause.

I am working on a tight timeline, so any suggestion is highly appreciated.

@samuelchrist
Author

@samuelchrist I suspect that the "Total number rows: 69" contains not only the spans with the desired tags but also external spans sharing the same trace_id, whereas sum(calls_total{...}) only returns the count you want.

If you can, try writing a query in SQL (or whatever query language applies) against the span storage directly.

@Frapschen, I have verified that it is counting only the desired traces, not spans from other traces.

The issue is that labels are being attached incorrectly: as mentioned above, the url_path of client A is getting attached to client B.
