kubernetes = null #1691

Closed
mflococo opened this issue Oct 29, 2019 · 22 comments

mflococo commented Oct 29, 2019

Bug Report

Describe the bug

Intermittently, across multiple instances of our AWS EKS clusters, logs arrive with the Kubernetes enrichment and annotations missing; the entire kubernetes key is null.

We have tried restarting the fluent-bit pods, deleting the DaemonSet, and creating new clusters. Sometimes it works (logs arrive with the Kubernetes enrichment and annotations) and other times it does not.

To Reproduce

  • Example log message if applicable:
   @i: 33b65867
   @m: Waiting 1499 seconds to renew Vault token.
   @t: 2019-10-29T14:25:31.9918803Z
   delaySeconds: 1499
   environment: Union
   kubernetes: null
   stream: stdout
   time: 2019-10-29T14:25:31.992070193Z
   version: 0.2.2215.24190
  • Steps to reproduce the problem:

We haven't been able to pinpoint the issue or reproduce it consistently.

Expected behavior

When we define the input, filter, and parser to extract Kubernetes metadata and annotations from our logs and the API, the metadata is added to every log record.

Screenshots

Your Environment

spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: fluent-bit
      containers:
      - name: aws-for-fluent-bit
        image: amazon/aws-for-fluent-bit:1.3.2
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 100Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    app.kubernetes.io/name: fluentbit
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File  parsers.conf
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_PORT    2020
    [INPUT]
        Name              tail
        # namespace_name, container_name, pod_name, docker_id must be present in the tag
        # as they are required by the kubernetes filter
        # This regex cribbed from https://github.com/fluent/fluent-bit/blob/v1.0.2/plugins/filter_kubernetes/kube_regex.h#L25
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Tag               k8s_containers.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        DB.Sync           Normal
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   2M
        Mem_Buf_Limit     256MB
        Skip_Long_Lines   On
        Refresh_Interval  2
        Docker_Mode       On
    [FILTER]
        Name           kubernetes
        Match          *
        Kube_URL       https://kubernetes.default.svc:443
        Merge_Log      On
        Keep_Log       Off
        Annotations    On
        tls.verify     Off
        Regex_Parser   k8s-custom-tag
        Kube_Tag_Prefix k8s_containers.

    # Add logging to cloudwatch by looping through Namespaces
    [OUTPUT]
        Alias kube-system
        Name cloudwatch
        Match k8s_containers.kube-system*
        region us-east-1
        log_group_name /eks/test/kube-system
        log_stream_prefix kube-system-
        auto_create_group true

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%Z
        Time_Keep    On
        # Command      |  Decoder | Field | Optional Action
        # =============|==================|=================
        # Decode_Field_As   escaped_utf8    log    do_next
        Decode_Field_As   json       log
    [PARSER]
        # Kubernetes filter must be able to parse these fields out of the tag to work
        Name    k8s-custom-tag
        Format  regex
        Regex   (?<namespace_name>[^_]+)\.(?<container_name>.+)\.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)\.(?<docker_id>[a-z0-9]{64})-$
---
# Service added for ServiceMonitor (Prometheus)
apiVersion: v1
kind: Service
metadata:
  name: fluentbit
  namespace: kube-system
  labels:
    app: fluentbit
spec:
  selector:
    name: fluentbit
  ports:
  - port: 2020
    protocol: TCP
    targetPort: 2020
    name: metrics

  • Environment name and version (e.g. Kubernetes? What version?): AWS EKS Kubernetes version 1.13
  • Server type and version: AWS EKS ec2 t3.medium
  • Operating System and version: amazon-eks-node-1.13-v20190927 (ami-0990970b056c619eb)
  • Filters and plugins: kubernetes

Additional context

Software development is severely impacted because we cannot trace where in the system things are failing. It also affects analytics, since values we rely on in our queries are missing.

@computeracer

Here is another example when outputting fluent-bit logs to a file:

k8s_containers.kube-system.aws-for-fluent-bit.fluentbit-7g649.c311243e64905c4f193c7bca600399fa8c7f8b48abfe15f4a41dfe11e196ea45-: [1572373624.454250, {"log":"2019-10-29T18:27:04.441400237Z stderr F [2019/10/29 18:27:04] [debug] [filter_kube] API Server (ns=kube-system, pod=kube-proxy-lvt7v) http_do=0, HTTP Status: 200","kubernetes":null}]

computeracer commented Oct 30, 2019

Interestingly, when I manually give the DaemonSet AWS credentials and run it in a kind cluster, I still get intermittent "kubernetes":null. I also tried changing the output to a file, and again saw intermittent failures.
Any thoughts @kantica and @edsiper ?

@computeracer

Today we greatly simplified the config and deployed it (kubectl apply -f) in a local kind cluster.

We seem to consistently get no Kubernetes metadata ("kubernetes": null) for the fluent-bit pod itself, but the other pods have all the data in the fluent-bit output. Hope this helps get to the root of the issue.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: pod-log-reader
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: pod-log-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-log-reader
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentbit
  namespace: kube-system
  labels:
    app.kubernetes.io/name: fluentbit
spec:
  selector:
    matchLabels:
      name: fluentbit
  template:
    metadata:
      labels:
        name: fluentbit
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: fluent-bit
      containers:
      - name: aws-for-fluent-bit
        image: amazon/aws-for-fluent-bit:1.3.2
        env:
        - name: FLB_LOG_LEVEL
          value: debug
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
        - name: cache
          mountPath: /fluent-bit/cache
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 100Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt
      - name: cache
        emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    app.kubernetes.io/name: fluentbit
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File  parsers.conf
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_PORT    2020
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     256MB
    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix kube.var.log.containers.
        Merge_Log On
        Merge_Log_Key log_processed
        Kube_meta_preload_cache_dir /fluent-bit/cache

    [OUTPUT]
        Alias kube-system
        Name file
        Match kube.*
        Path /var/log/kube-system.log

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

@jhuntwork
Contributor

This issue has been extremely difficult to consistently reproduce. Only sometimes does the request to the API seem to produce null annotations. And because fluent-bit caches the metadata for a pod/namespace combination, it can take a while to encounter it.

Here's a method to eventually trigger the issue:

  1. Create a kind cluster
kind create cluster
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
  2. Check out the fluent-bit source and make one small adjustment so that metadata is always fetched from the API instead of just using the cached result:
diff --git a/plugins/filter_kubernetes/kube_meta.c b/plugins/filter_kubernetes/kube_meta.c
index 0d3df7e7..e2a59529 100644
--- a/plugins/filter_kubernetes/kube_meta.c
+++ b/plugins/filter_kubernetes/kube_meta.c
@@ -974,7 +974,7 @@ int flb_kube_meta_get(struct flb_kube *ctx,
     ret = flb_hash_get(ctx->hash_table,
                        meta->cache_key, meta->cache_key_len,
                        &hash_meta_buf, &hash_meta_size);
-    if (ret == -1) {
+    if (-1 == -1) {
         /* Retrieve API server meta and merge with local meta */
         ret = get_and_merge_meta(ctx, meta,
                                  &tmp_hash_meta_buf, &hash_meta_size);
  3. Build a Docker image with the above change, and load it into kind:
docker build -t fluent-bit:test .
kind load docker-image fluent-bit:test
  4. Launch fluent-bit via kubectl. The following spec should represent a pretty minimal setup:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: pod-log-reader
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: pod-log-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-log-reader
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentbit
  namespace: kube-system
  labels:
    app.kubernetes.io/name: fluentbit
spec:
  selector:
    matchLabels:
      name: fluentbit
  template:
    metadata:
      labels:
        name: fluentbit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent-bit:test
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    app.kubernetes.io/name: fluentbit
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File  parsers.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Refresh_Interval  2
    [FILTER]
        Name kubernetes
        Kube_Tag_Prefix kube.var.log.containers.
        Match kube.*
        Merge_Log On
    [OUTPUT]
        Alias kube-system
        Name file
        Match kube.*
        Path /var/log/fluent-bit-output.log
  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
  5. Get shell access into the kind control-plane container and tail the output log.
# First grab the container id of the control plane
$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                                  NAMES
cc590a19909f        kindest/node:v1.15.3   "/usr/local/bin/entr…"   15 minutes ago      Up 15 minutes       64283/tcp, 127.0.0.1:64283->6443/tcp   kind-control-plane
# Then docker exec
$ docker exec -it cc590a19909f /bin/bash
# Inside the control-plane container, tail the log
tail -f /var/log/fluent-bit-output.log | grep -iE 'null|filter'
  6. In another terminal session on the same host system, continually loop through creating a pod with some basic output and deleting it.
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
while true; do \
    kubectl run -it --rm --generator=run-pod/v1 --image=alpine alpine -- /bin/sh -c 'for i in 1 2 3 4 5; do date; sleep 1; done'; \
    sleep 1; \
done

Eventually, you will see logs from the output being tailed that look like this:

kube.var.log.containers.alpine_default_alpine-c40b7545867c821b3dfe0b398983e636132c44a98ab9ea8463d83b76206d77cd.log: [1573048310.008641, {"log":"2019-11-06T13:51:49.0348335Z stdout F Wed Nov  6 13:51:49 UTC 2019","kubernetes":null}]
kube.var.log.containers.alpine_default_alpine-c40b7545867c821b3dfe0b398983e636132c44a98ab9ea8463d83b76206d77cd.log: [1573048310.036044, {"log":"2019-11-06T13:51:50.0358045Z stdout F Wed Nov  6 13:51:50 UTC 2019","kubernetes":null}]

@gageorsburn

I seem to be hitting this as well. Not sure what causes it.

kantica (Contributor) commented Nov 8, 2019

I have found what is causing this, and it's inside kube_meta.c: extract_container_hash().

Basically, the part of extract_container_hash() that looks at "containerID" and "imageID" does not work correctly on, for example, a kind cluster. As a result, the Kubernetes msgpack metadata map gets corrupted.

What is going on?
Inside kube_meta.c, the following is defined:

#define FLB_KUBE_META_CONTAINER_STATUSES_KEY "containerStatuses"
#define FLB_KUBE_META_CONTAINER_STATUSES_KEY_LEN \
    (sizeof(FLB_KUBE_META_CONTAINER_STATUSES_KEY) - 1)
#define FLB_KUBE_META_INIT_CONTAINER_STATUSES_KEY "initContainerStatuses"
#define FLB_KUBE_META_INIT_CONTAINER_STATUSES_KEY_LEN \
    (sizeof(FLB_KUBE_META_INIT_CONTAINER_STATUSES_KEY) - 1)
#define FLB_KUBE_META_CONTAINER_ID_PREFIX "docker:https://"
#define FLB_KUBE_META_CONTAINER_ID_PREFIX_LEN \
    (sizeof(FLB_KUBE_META_CONTAINER_ID_PREFIX) - 1)
#define FLB_KUBE_META_IMAGE_ID_PREFIX "docker-pullable:https://"
#define FLB_KUBE_META_IMAGE_ID_PREFIX_LEN \
    (sizeof(FLB_KUBE_META_IMAGE_ID_PREFIX) - 1)

CONTAINER_ID_PREFIX is defined as "docker:https://"
IMAGE_ID_PREFIX is defined as "docker-pullable:https://"

Looking at (kind) cluster metadata from the kube-apiserver, the following can be observed:

# curl -s -k -H "Authorization: Bearer $token" https://kubernetes.default.svc.cluster.local/api/v1/namespaces/default/pods | jq '.items[].status.containerStatuses[] | "\(.containerID) \(.imageID)"'
"containerd:https://1d13ee4a35ca70f86cef420871d4a552ad7b14e251265dcf23595add22508247 docker.io/library/alpine@sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a"
"containerd:https://9d53fd9649c4ff7e67f73e26f7c1e13fc14caa2417bfc3694f567645f5338620 docker.io/library/alpine@sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a"
# curl -s -k -H "Authorization: Bearer $token" https://kubernetes.default.svc.cluster.local/api/v1/namespaces/default/pods | egrep '(containerID|imageID)'
            "imageID": ""
            "imageID": "docker.io/library/alpine@sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a",
            "containerID": "containerd:https://1ed452c3d02d8114e5739375e1d67405fa40c2f78797ec2d8e5bd13740424273"
                "containerID": "containerd:https://1efbaa7537602817fbf902570011db53097fed4b45301017f4c2b5691bb63d47"
            "imageID": "docker.io/library/alpine@sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a",
            "containerID": "containerd:https://1efbaa7537602817fbf902570011db53097fed4b45301017f4c2b5691bb63d47"

Clearly the actual image ID prefix is not "docker-pullable:https://", and the actual container ID prefix is "containerd:https://" instead of the defined "docker:https://".

The relevant code in kube_meta.c is https://github.com/fluent/fluent-bit/blob/master/plugins/filter_kubernetes/kube_meta.c#L385-L400

If you comment out those two else-if blocks, the issue should be gone.
IMHO the IMAGE_ID and CONTAINER_ID prefixes should be configurable with sane defaults, and the relevant code should be rewritten to be more robust.
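
To make the failure mode concrete, here is a minimal standalone sketch. This is an illustration only, not the kube_meta.c source; the variable names and integer types are assumptions. When a fixed prefix length is subtracted from a value that is shorter than the prefix (for example the empty imageID shown in the curl output above), the unsigned length wraps around, and the pointer/length pair used to build the msgpack metadata map no longer describes a valid string.

#include <stdio.h>
#include <string.h>

#define IMAGE_ID_PREFIX     "docker-pullable:https://"
#define IMAGE_ID_PREFIX_LEN (sizeof(IMAGE_ID_PREFIX) - 1)

int main(void)
{
    /* imageID values as actually returned by the API on a kind cluster */
    const char *ids[] = {
        "",
        "docker.io/library/alpine@sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a"
    };

    for (size_t i = 0; i < sizeof(ids) / sizeof(ids[0]); i++) {
        size_t size = strlen(ids[i]);

        /* unguarded strip, as in the buggy code path: wraps around for "" */
        size_t unguarded_len = size - IMAGE_ID_PREFIX_LEN;

        /* guarded strip, as in the patch in the next comment */
        size_t guarded_len = 0;
        if (size >= IMAGE_ID_PREFIX_LEN) {
            guarded_len = size - IMAGE_ID_PREFIX_LEN;
        }

        printf("value_len=%zu unguarded_len=%zu guarded_len=%zu\n",
               size, unguarded_len, guarded_len);

        if (size >= IMAGE_ID_PREFIX_LEN) {
            /* even when the length stays positive, the offset is wrong when
               the value never had the "docker-pullable:https://" prefix */
            printf("  stripped value: %s\n", ids[i] + IMAGE_ID_PREFIX_LEN);
        }
    }
    return 0;
}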

kantica (Contributor) commented Nov 8, 2019

Here is a quick-and-dirty patch which should keep the metadata msgpack map from being corrupted :)

diff --git a/plugins/filter_kubernetes/kube_meta.c b/plugins/filter_kubernetes/kube_meta.c
index 0d3df7e7..790c53ca 100644
--- a/plugins/filter_kubernetes/kube_meta.c
+++ b/plugins/filter_kubernetes/kube_meta.c
@@ -395,8 +395,10 @@ static void extract_container_hash(struct flb_kube_meta *meta,
                                        "imageID",
                                        k2.via.str.size)) {
                         /* Strip "docker-pullable:https://" prefix */
-                        container_hash = v2.ptr + FLB_KUBE_META_IMAGE_ID_PREFIX_LEN;
-                        container_hash_len = v2.size - FLB_KUBE_META_IMAGE_ID_PREFIX_LEN;
+                        if (v2.size >= FLB_KUBE_META_IMAGE_ID_PREFIX_LEN) {
+                            container_hash = v2.ptr + FLB_KUBE_META_IMAGE_ID_PREFIX_LEN;
+                            container_hash_len = v2.size - FLB_KUBE_META_IMAGE_ID_PREFIX_LEN;
+                        }
                     }
                 }
                 if (name_found) {

edsiper (Member) commented Nov 8, 2019

Thanks everyone! What teamwork :) Based on the feedback and the solution provided by @kantica, I have pushed the following test image:

edsiper/fluent-bit-1.3.3-next:3

Please give it a try; if it looks good we can do a release TODAY. Please test it!

computeracer commented Nov 8, 2019

Tested with edsiper/fluent-bit-1.3.3-next:3. Could not get a null. Well done.

I deleted and applied my deployment referenced earlier several times with the new image and could not reproduce a null in the logs of my kind cluster. Thank you!

edsiper added a commit that referenced this issue Nov 8, 2019
The following patch fixes the container hash lookup; there are certain
deployments where the container prefix is not docker:https:// or
docker-pullable:https://, for example when 'kind' is used.

The fix works as a failover mechanism, skipping the prefix and consuming
the string after the ':https://' separator.

This fix is based on the solution provided by @kantica.

Note: kudos to all the people involved in this issue who helped
troubleshoot, find the root cause of the problem, and provide
workarounds.

Signed-off-by: Eduardo Silva <[email protected]>

ivanfoo commented Nov 11, 2019

We are testing the patch too; so far we have not seen any nulls.

kantica added a commit to kantica/fluent-bit that referenced this issue Nov 12, 2019
This fixes the containerID and imageID hash lookup.
There was an error in extracting the hash from containerID or
imageID when the strings do not begin with docker:https:// or docker-pullable:https://.
The consequence is that the Kubernetes metadata structure is
"null" in the output.

In some cases, imageID also comes back as an empty string from the Kubernetes API.

Example cases:
* "imageID": ""
* "imageID": "sha256:eb516548c180f8a6e0235034ccee2428027896af16a509786da13022fe95fe8c"
* "imageID": "docker:https://sha256:2b424891d78a4704aa9d763b274f1be6752766d24463b9e3b5411a7bac2207ca"
* "imageID": "docker.io/library/debian@sha256:41f76363fd83982e14f7644486e1fb04812b3894aa4e396137c3435eaf05de88"
* "imageID": "docker-pullable:https://redis@sha256:4be7fdb131e76a6c6231e820c60b8b12938cf1ff3d437da4871b9b2440f4e385"
* "imageID": "docker-pullable:https://k8s.gcr.io/event-exporter@sha256:16ca66e2b5dc7a1ce6a5aafcb21d0885828b75cdfc08135430480f7ad2364adc"
* "imageID": "docker-pullable:https://kubedb/operator@sha256:063a92a47de7c4b25e3bc382d878513564c65c5313b347bdbcfc60959d082b22"
* "imageID": "docker-pullable:https://quay.io/jetstack/cert-manager-controller@sha256:66eb65ac8ff3505310e850e0486c4dd82b26e6cfdc7afd7456e0fab753b57855"
* "containerID": "docker:https://e8d482336de6b610b9084e9e74b0213c5cc7d399053016448319a4970a16324e"
* "containerID": "containerd:https://1ed452c3d02d8114e5739375e1d67405fa40c2f78797ec2d8e5bd13740424273"
* "containerID": "cri-o:https://fc9f90e123ca14e4d000c9d4c854bb54d43ce96b1624d51bb4b97afd1c728229"

Obviously, the value of containerID depends on the container runtime used by
the Kubernetes cluster.

This patch finds the positions of the last ":" and "/" relative to the
end of the string, so only the hash is extracted and everything
preceding it is discarded.

Signed-off-by: kantica <[email protected]>
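
The lookup this commit message describes can be sketched roughly as follows. This is a compilable illustration, not the fluent-bit source, and the helper name extract_hash is made up: find the last ':' and '/' in the value and keep only what follows, so the hash survives whatever runtime prefix precedes it, and an empty value simply yields an empty hash.

#define _GNU_SOURCE   /* for memrchr() */
#include <stdio.h>
#include <string.h>

/* Return a pointer to the trailing hash inside 'id' and store its length
 * in 'out_len'; everything up to the last ':' or '/' is discarded. */
static const char *extract_hash(const char *id, size_t len, size_t *out_len)
{
    const char *colon = memrchr(id, ':', len);
    const char *slash = memrchr(id, '/', len);
    const char *sep;

    if (colon && slash) {
        sep = (colon > slash) ? colon : slash;
    }
    else {
        sep = colon ? colon : slash;
    }

    if (sep == NULL) {          /* no separator at all: the value is the hash */
        *out_len = len;
        return id;
    }
    *out_len = len - (size_t)(sep + 1 - id);
    return sep + 1;
}

int main(void)
{
    /* containerID / imageID shapes taken from the list above */
    const char *ids[] = {
        "containerd:https://1ed452c3d02d8114e5739375e1d67405fa40c2f78797ec2d8e5bd13740424273",
        "docker-pullable:https://redis@sha256:4be7fdb131e76a6c6231e820c60b8b12938cf1ff3d437da4871b9b2440f4e385",
        "cri-o:https://fc9f90e123ca14e4d000c9d4c854bb54d43ce96b1624d51bb4b97afd1c728229",
        ""
    };

    for (size_t i = 0; i < sizeof(ids) / sizeof(ids[0]); i++) {
        size_t n;
        const char *hash = extract_hash(ids[i], strlen(ids[i]), &n);
        printf("%.*s\n", (int)n, hash);
    }
    return 0;
}

Running it prints only the trailing hashes (and an empty line for the empty value), which matches the hash-only container_hash behavior discussed further down in this thread.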

edsiper added a commit that referenced this issue Nov 12, 2019
kantica added a commit to kantica/fluent-bit that referenced this issue Nov 12, 2019
kantica added a commit to kantica/fluent-bit that referenced this issue Nov 12, 2019
This commit reworks the last patch for this issue and rebases it onto
current master together with the patch from @edsiper for the same issue.
The rework avoids calling memrchr() twice on the same string when
matching "/" (from 5844b9f).

Signed-off-by: kantica <[email protected]>
edsiper pushed a commit that referenced this issue Nov 13, 2019
kantica added a commit to kantica/fluent-bit that referenced this issue Nov 13, 2019
This fixes the tests for the container_hash Kubernetes metadata.
The tests cover the fixes made in fluent#1691 and PR fluent#1731.

Signed-off-by: kantica <[email protected]>
edsiper pushed a commit that referenced this issue Nov 13, 2019
@jhuntwork
Contributor

When we applied just d62bd88 to our custom image, we still eventually saw some nulls. But with current master (now at cc6b7bb) we haven't seen any in the last 2 hours.

edsiper added a commit that referenced this issue Nov 20, 2019
edsiper pushed a commit that referenced this issue Nov 20, 2019
edsiper pushed a commit that referenced this issue Nov 20, 2019

edsiper (Member) commented Nov 20, 2019

The new version is coming out today. Packages are already out, and Docker images will take a few more hours to be ready. Thanks for your patience.

amkartashov (Contributor) commented Nov 21, 2019

@edsiper

With this fix, container_hash contains only the hash, so imageID: docker-pullable:https://bitnami/mongodb@sha256:a93f75fec367ea5393f46f6050dea9747053f694a2659da44a9e9e06cf2cd039 leads to container_hash: "a93f75fec367ea5393f46f6050dea9747053f694a2659da44a9e9e06cf2cd039", while I expected it to strip only docker-pullable:https://.

I think memrchr should be replaced with memchr here https://github.com/fluent/fluent-bit/blob/master/plugins/filter_kubernetes/kube_meta.c#L325 (so that it searches for the leftmost colon).

EDIT: please ignore; @kantica explained that keeping only the hash is the desired change.

edsiper (Member) commented Nov 23, 2019

@kantica can you submit a new PR to fix the use case mentioned above?

It would be good to include this as part of the next v1.3.4 release

kantica (Contributor) commented Nov 25, 2019

Ok, I'll try to do it today.

kantica added a commit to kantica/fluent-bit that referenced this issue Nov 25, 2019
This commit reworks the hash lookup as reported by @gorilych
in fluent#1691 (comment).

Test cases are also updated accordingly.

Signed-off-by: kantica <[email protected]>
kantica added a commit to kantica/fluent-bit that referenced this issue Nov 26, 2019
edsiper pushed a commit that referenced this issue Nov 26, 2019

beckjim commented Feb 3, 2020

As far as I understand, this is not yet in the 1.3 branch. Are there any plans to backport it?
Thanks for working on this!

@eddycharly

Any plan to release a version including the fix?

edsiper (Member) commented Feb 19, 2020

@kantica I tried to rebase on top of 1.3 but there are some conflicts. Is it possible to backport this?

kantica (Contributor) commented Feb 19, 2020

OK, I'll rebase/patch this fix onto the 1.3 branch. I'll get back as soon as I have it there.

@eddycharly

@edsiper is there a nightly build or something we can use until a release including the fix is available?

kantica added a commit to kantica/fluent-bit that referenced this issue Mar 19, 2020

kantica (Contributor) commented Mar 19, 2020

Sorry for the long wait; I could not get to it earlier.

edsiper pushed a commit that referenced this issue Mar 19, 2020
This is a backport of master (51ac9f5).

Signed-off-by: kantica <[email protected]>

edsiper (Member) commented Mar 19, 2020

Thanks @kantica! Just in time for the v1.3.11 release today :)
