
[receiver/k8scluster] Use newer v2 HorizontalPodAutoscaler for Kubernetes 1.26 #20480

Closed

jvoravong opened this issue Mar 29, 2023 · 20 comments

Labels: bug (Something isn't working), receiver/k8scluster

@jvoravong
Contributor

jvoravong commented Mar 29, 2023

Component(s)

receiver/k8scluster

What happened?

Description

Right now we only support the v2beta2 HPA API. To support Kubernetes v1.26, we need to add support for the v2 HPA API.
Kubernetes v1.26 was released in December 2022. The version is still new, but distributions such as AKS, EKS, OpenShift, and GKE will start offering it soon, if they do not already.
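
For illustration, here is a minimal client-go sketch (not the receiver's actual code, and assuming in-cluster service-account auth) of listing HorizontalPodAutoscalers through the newer autoscaling/v2 API instead of the deprecated v2beta2 one:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, matching auth_type: serviceAccount in the receiver config.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// AutoscalingV2() targets autoscaling/v2; the deprecated call would be AutoscalingV2beta2().
	hpas, err := client.AutoscalingV2().HorizontalPodAutoscalers(metav1.NamespaceAll).
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, hpa := range hpas.Items {
		fmt.Printf("%s/%s: current=%d desired=%d\n",
			hpa.Namespace, hpa.Name, hpa.Status.CurrentReplicas, hpa.Status.DesiredReplicas)
	}
}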

Related Startup Log Warning Message:
autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler

Steps to Reproduce

Spin up a Kubernetes 1.25 cluster.
Deploy the k8scluster receiver to your cluster.
Follow the startup logs of the collector and you will notice the warning log mentioned above.

Expected Result

The k8scluster receiver can monitor v2 HorizontalPodAutoscaler objects.

Actual Result

In Kubernetes 1.25, you get a warning within the collector logs.
In Kubernetes 1.26, you will get an error in the logs, and users might notice that expected HPA metrics are missing.

Collector version

v0.72.0

Environment information

Environment

Affects all Kubernetes 1.26 clusters.
I tested and found the related log warnings in ROSA 4.12 (OpenShift 4.12, Kubernetes 1.25).

OpenTelemetry Collector configuration

---
# Source: https://github.com/signalfx/splunk-otel-collector-chart/blob/main/examples/collector-cluster-receiver-only/rendered_manifests/configmap-cluster-receiver.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-splunk-otel-collector-otel-k8s-cluster-receiver
  labels:
    app.kubernetes.io/name: splunk-otel-collector
    helm.sh/chart: splunk-otel-collector-0.72.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: default
    app.kubernetes.io/version: "0.72.0"
    app: splunk-otel-collector
    chart: splunk-otel-collector-0.72.0
    release: default
    heritage: Helm
data:
  relay: |
    exporters:
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://api.CHANGEME.signalfx.com
        ingest_url: https://ingest.CHANGEME.signalfx.com
        timeout: 10s
      splunk_hec/o11y:
        disable_compression: true
        endpoint: https://ingest.CHANGEME.signalfx.com/v1/log
        log_data_enabled: true
        profiling_data_enabled: false
        token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
    extensions:
      health_check: null
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    processors:
      batch: null
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: metric_source
          value: kubernetes
        - action: upsert
          key: k8s.cluster.name
          value: CHANGEME
      resource/add_collector_k8s:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/k8s_cluster:
        attributes:
        - action: insert
          key: receiver
          value: k8scluster
      resourcedetection:
        detectors:
        - env
        - system
        override: true
        timeout: 10s
      transform/add_sourcetype:
        log_statements:
        - context: log
          statements:
          - set(resource.attributes["com.splunk.sourcetype"], Concat(["kube:object:",
            attributes["k8s.resource.name"]], ""))
    receivers:
      k8s_cluster:
        auth_type: serviceAccount
        metadata_exporters:
        - signalfx
      k8sobjects:
        auth_type: serviceAccount
        objects:
        - field_selector: status.phase=Running
          interval: 15m
          label_selector: environment in (production),tier in (frontend)
          mode: pull
          name: pods
        - group: events.k8s.io
          mode: watch
          name: events
          namespaces:
          - default
      prometheus/k8s_cluster_receiver:
        config:
          scrape_configs:
          - job_name: otel-k8s-cluster-receiver
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${K8S_POD_IP}:8889
    service:
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        logs/objects:
          exporters:
          - splunk_hec/o11y
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          - transform/add_sourcetype
          receivers:
          - k8sobjects
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource
          - resource/k8s_cluster
          receivers:
          - k8s_cluster
        metrics/collector:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_collector_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/k8s_cluster_receiver
      telemetry:
        metrics:
          address: 0.0.0.0:8889

Log output

W0329 15:21:31.802913       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
W0329 15:29:19.805634       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler

Additional context

Related to: signalfx/splunk-otel-collector#2457

jvoravong added the bug (Something isn't working) and needs triage (New item requiring triage) labels Mar 29, 2023
jvoravong added a commit to jvoravong/opentelemetry-collector-contrib that referenced this issue Mar 29, 2023
atoulme added the receiver/k8scluster label and removed the needs triage (New item requiring triage) label Mar 29, 2023
@github-actions
Contributor

Pinging code owners for receiver/k8scluster: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@AchimGrolimund

AchimGrolimund commented Apr 6, 2023

This also occurs on Collector version v0.73.0,

and it is not only the HPA; it also affects v1beta1.CronJob.

See the example from my log file:
splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log

@jvoravong
Contributor Author

jvoravong commented Apr 7, 2023

@AchimGrolimund can you please provide more details about your Kubernetes environment?

I didn't see this issue in my kOps-created Kubernetes 1.25 cluster. We already support batchv1.CronJob, so I'm wondering how this is happening.

@AchimGrolimund

AchimGrolimund commented Apr 7, 2023 via email

@iblancasa
Contributor

I can help with supporting HorizontalPodAutoscaler v2.

@AchimGrolimund

AchimGrolimund commented May 3, 2023

@jvoravong
Sorry for my late reply.

We are currently using the following version:
https://github.com/signalfx/splunk-otel-collector/releases/tag/v0.76.0

$ oc version
Client Version: 4.12.0-202303081116.p0.g846602e.assembly.stream-846602e
Kustomize Version: v4.5.7
Server Version: 4.12.11
Kubernetes Version: v1.25.7+eab9cc9

and here are the logs:

...
2023-05-03T10:45:44.563Z info service/service.go:129 Starting otelcol... {"Version": "v0.76.0", "NumCPU": 16}
....
W0503 10:45:48.056292 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:48.056337 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:49.019103 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:49.019186 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:53.008856 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:53.008902 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:53.133807 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:53.133863 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.810228 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:59.810287 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.818576 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:59.818624 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:46:16.106509 1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:46:16.106555 1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource

Can we expect a solution soon?

@salapatt

salapatt commented May 3, 2023

batchv1.CronJob is what is supported, but the question is whether v1beta1.CronJob and v2beta1.HorizontalPodAutoscaler are taken care of in the code.

Please provide an ETA.

@AchimGrolimund

Here is some additional information:

$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}' | sort
1.25    cronjobs.v1beta1.batch
1.25    horizontalpodautoscalers.v2beta1.autoscaling
1.26    horizontalpodautoscalers.v2beta2.autoscaling

@jvoravong
Contributor Author

Looking into this, will get back here soon.

@salapatt

salapatt commented May 4, 2023

Thanks @jvoravong. I am the support engineer on this case, 3182925; I appreciate your help on this.

@jvoravong
Contributor Author

I did miss adding a watcher for the HPA v2 code and have started a fix for it. I verified that k8s.hpa.* and k8s.job.* metrics are exported in Kubernetes 1.25 and 1.26.
Couldn't get the HPA warnings to stop on 1.25 even with this latest fix; I think it's due to how we watch both versions of HPA.
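
For illustration, here is a minimal client-go sketch (not the actual receiver implementation, with error handling omitted) of watching autoscaling/v2 HPAs through a shared informer; also registering the deprecated v2beta2 informer, as in the commented-out line, is what keeps the deprecation warning appearing on 1.25 clusters that serve both versions:

package main

import (
	"fmt"
	"time"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling omitted for brevity
	client, _ := kubernetes.NewForConfig(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Informer for the current autoscaling/v2 API.
	hpaInformer := factory.Autoscaling().V2().HorizontalPodAutoscalers().Informer()
	hpaInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			hpa := obj.(*autoscalingv2.HorizontalPodAutoscaler)
			fmt.Printf("observed HPA %s/%s\n", hpa.Namespace, hpa.Name)
		},
	})

	// Also starting the deprecated informer keeps older clusters covered, but it is
	// what triggers the apiserver deprecation warning seen in the logs above:
	// factory.Autoscaling().V2beta2().HorizontalPodAutoscalers().Informer()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // block; a real component would tie this to its Start/Shutdown lifecycle
}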

@dmitryax
Member

dmitryax commented May 5, 2023

Couldn't get the HPA warnings to stop on 1.25 even with this latest fix; I think it's due to how we watch both versions of HPA.

That's fine. We have the same behavior for jobs when both versions are supported by the Kubernetes API.

@dmitryax
Member

dmitryax commented May 5, 2023

Closing as resolved by #21497

dmitryax closed this as completed May 5, 2023
@dmitryax
Member

dmitryax commented May 5, 2023

@AchimGrolimund, looking at the log output splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log, it seems like the errors are coming from smartagent/openshift-cluster, not from the k8scluster receiver. Do you have the k8scluster receiver enabled in the collector pipelines?

@AchimGrolimund

AchimGrolimund commented May 5, 2023

Hey @dmitryax
Here is our ConfigMap:

---
# Source: splunk-otel-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: splunk-otel-collector-agent-configmap
  namespace: xxxxxxxx-splunk-otel-collector
  labels:
    app: splunk-otel-collector-agent
data:
  relay: |
    exporters:
      sapm:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        endpoint: https://xxxxxx:443/ingest/v2/trace
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://xxxxxxx:443/api/
        correlation: null
        ingest_url: https://xxxxxxx:443/ingest/
        sync_host_metadata: true
    extensions:
      health_check: null
      k8s_observer:
        auth_type: serviceAccount
        node: ${K8S_NODE_NAME}
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
      zpages: null
    processors:
      batch: null
      filter/logs:
        logs:
          exclude:
            match_type: strict
            resource_attributes:
            - key: splunk.com/exclude
              value: "true"
      groupbyattrs/logs:
        keys:
        - com.splunk.source
        - com.splunk.sourcetype
        - container.id
        - fluent.tag
        - istio_service_name
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
      k8sattributes:
        extract:
          annotations:
          - from: pod
            key: splunk.com/sourcetype
          - from: namespace
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: pod
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: namespace
            key: splunk.com/index
            tag_name: com.splunk.index
          - from: pod
            key: splunk.com/index
            tag_name: com.splunk.index
          labels:
          - key: app
          metadata:
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - container.id
          - container.image.name
          - container.image.tag
        filter:
          node_from_env_var: K8S_NODE_NAME
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: ip
        - sources:
          - from: connection
        - sources:
          - from: resource_attribute
            name: host.name
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: upsert
          key: k8s.cluster.name
          value: HCP-ROSA-PROD1
      resource/add_agent_k8s:
        attributes:
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/logs:
        attributes:
        - action: upsert
          from_attribute: k8s.pod.annotations.splunk.com/sourcetype
          key: com.splunk.sourcetype
        - action: delete
          key: k8s.pod.annotations.splunk.com/sourcetype
        - action: delete
          key: splunk.com/exclude
      resourcedetection:
        detectors:
        - env
        - ec2
        - system
        override: true
        timeout: 10s
    receivers:
      smartagent/openshift-cluster:
        type: openshift-cluster
        alwaysClusterReporter: true
        kubernetesAPI:
          authType: serviceAccount
        datapointsToExclude:
        - dimensions:
          metricNames:
            - '*appliedclusterquota*'
            - '*clusterquota*'
        extraMetrics:
          - kubernetes.container_cpu_request
          - kubernetes.container_memory_request
          - kubernetes.job.completions
          - kubernetes.job.active
          - kubernetes.job.succeeded
          - kubernetes.job.failed
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
          paging: null
          processes: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 10s
        endpoint: ${K8S_NODE_IP}:10250
        extra_metadata_labels:
        - container.id
        metric_groups:
        - container
        - pod
        - node
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus/agent:
        config:
          scrape_configs:
          - job_name: otel-agent
            scrape_interval: 10s
            static_configs:
            - targets:
              - 127.0.0.1:8889
      receiver_creator:
        receivers:
          smartagent/coredns:
            config:
              extraDimensions:
                metric_source: k8s-coredns
              port: 9154
              skipVerify: true
              type: coredns
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && namespace == "openshift-dns" && name contains "dns"
          smartagent/kube-controller-manager:
            config:
              extraDimensions:
                metric_source: kubernetes-controller-manager
              port: 10257
              skipVerify: true
              type: kube-controller-manager
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "kube-controller-manager" && labels["kube-controller-manager"]
              == "true"
          smartagent/kubernetes-apiserver:
            config:
              extraDimensions:
                metric_source: kubernetes-apiserver
              skipVerify: true
              type: kubernetes-apiserver
              useHTTPS: true
              useServiceAccount: true
            rule: type == "port" && port == 6443 && pod.labels["app"] == "openshift-kube-apiserver"
              && pod.labels["apiserver"] == "true"
          smartagent/kubernetes-proxy:
            config:
              extraDimensions:
                metric_source: kubernetes-proxy
              #port: 29101
              port: 9101
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
              type: kubernetes-proxy
            rule: type == "pod" && labels["app"] == "sdn"
          smartagent/kubernetes-scheduler:
            config:
              extraDimensions:
                metric_source: kubernetes-scheduler
              # port: 10251
              port: 10259
              type: kubernetes-scheduler
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "openshift-kube-scheduler" && labels["scheduler"]
              == "true"
        watch_observers:
        - k8s_observer
      signalfx:
        endpoint: 0.0.0.0:9943
      smartagent/signalfx-forwarder:
        listenAddress: 0.0.0.0:9080
        type: signalfx-forwarder
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - health_check
      - k8s_observer
      - memory_ballast
      - zpages
      pipelines:
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          receivers:
          - hostmetrics
          - kubeletstats
          - otlp
          - receiver_creator
          - signalfx
          - smartagent/openshift-cluster
        metrics/agent:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_agent_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/agent
        traces:
          exporters:
          - sapm
          - signalfx
          processors:
          - memory_limiter
          - k8sattributes
          - batch
          - resourcedetection
          - resource
          receivers:
          - otlp
          - jaeger
          - smartagent/signalfx-forwarder
          - zipkin
      telemetry:
        metrics:
          address: 127.0.0.1:8889

Best Regards Achim

@dmitryax
Member

dmitryax commented May 5, 2023

@AchimGrolimund Thank you. This is coming from smartagent/openshift-cluster, so it's unrelated to this issue and has to be solved separately. @jvoravong, can you please follow up on this? I'm not sure if we have an OTel-native receiver to replace it with.

@dmitryax
Member

dmitryax commented May 5, 2023

It looks like the k8scluster receiver supports scraping additional OpenShift metrics (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sclusterreceiver#openshift), but it should be run separately as a 1-replica deployment. @AchimGrolimund, did you try it by chance?

@Borrelworst

Borrelworst commented May 16, 2023

Just to add: in the case of Azure, you will not be able to upgrade from 1.25.* to 1.26.* while the agent is still querying the v2beta2 autoscaler API. Because Azure prevents upgrades while deprecated APIs are still being used, the upgrade fails. You either have to force the upgrade, or remove the signalfx agent, wait for 12 hours, and then try again.

It would be nice if the agent checked the Kubernetes version and, when it is higher than 1.25, did not monitor the /apis/autoscaling/v2beta2/horizontalpodautoscalers API endpoint (see the sketch below).
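
A hedged sketch of that idea (illustrative only, not the receiver's real logic), using the discovery API to ask the server which autoscaling versions it actually serves and only falling back to v2beta2 when v2 is unavailable:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// preferredHPAVersion returns "v2" when the cluster serves autoscaling/v2,
// otherwise "v2beta2" if that is still served, or an error if neither is available.
func preferredHPAVersion(dc discovery.DiscoveryInterface) (string, error) {
	for _, version := range []string{"v2", "v2beta2"} {
		resources, err := dc.ServerResourcesForGroupVersion("autoscaling/" + version)
		if err != nil {
			continue // this group/version is not served by the API server
		}
		for _, r := range resources.APIResources {
			if r.Name == "horizontalpodautoscalers" {
				return version, nil
			}
		}
	}
	return "", fmt.Errorf("no supported HorizontalPodAutoscaler API version found")
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	version, err := preferredHPAVersion(dc)
	if err != nil {
		panic(err)
	}
	fmt.Println("watching HorizontalPodAutoscaler via autoscaling/" + version)
}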

@salapatt

salapatt commented May 22, 2023

The customer xxx updated the Splunk OTC agent to version 0.77.0 and still gets the same error messages.

W0522 06:11:24.226426 1 reflector.go:533] k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0522 06:11:24.226454 1 reflector.go:148] k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource

@jvoravong
Contributor Author

jvoravong commented Sep 12, 2023

Update on Deprecated Endpoint Removal:

  • A community contribution in release v0.85.0 removed the scanning of deprecated endpoints.
  • Impact: Clusters running Kubernetes v1.22 and below will no longer be able to collect metrics from these endpoints when using collector version v0.85.0 and above.

Additional Context:

  • Kubernetes Support: Core Kubernetes ceased active support for version 1.22 on August 28, 2022 (source).
  • EKS Support: Amazon EKS, known for a broader version support matrix, discontinued active support for version 1.22 on June 4, 2023 (source).
