This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

ENV 'FLT_METRICS_LIST' and timestamps looks incorrectly parsed #160

Closed
petr-r94 opened this issue Sep 3, 2021 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@petr-r94

petr-r94 commented Sep 3, 2021

Hello!

I have been investigating how pad handles its configuration. I deployed pad as a container (podman) and also in OpenShift, using different methods of setting the environment variables, and I would like to share two concerns. Concern 1 affects both deployment models; Concern 2 affects OpenShift only. I think this is a bug, but if it is expected behaviour, please clarify. I hope this helps to improve the tool.

Here are the test ConfigMap samples I used for the pad Deployment:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pad-config
  namespace: pad-test
data:
  APP_FILE: 'app.py'
  FLT_PROM_URL: 'http:https://victoria-metrics01.cc.local:8428'
  FLT_METRICS_LIST: >-
    node_load15{instance="master0.openshift-46.cc.local", job="node-exporter"}
  FLT_RETRAINING_INTERVAL_MINUTES: '30'
  FLT_DATA_START_TIME: '1d'
  FLT_ROLLING_TRAINING_WINDOW_SIZE: '7d'
  FLT_DEBUG_MODE: 'true'

Or with two metrics using a regex:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pad-config
  namespace: pad-test
data:
  APP_FILE: 'app.py'
  FLT_PROM_URL: 'http:https://victoria-metrics01.cc.local:8428'
  FLT_METRICS_LIST: >-
    node_load1{instance=~"master[0-1]{1}.openshift-46.cc.local", job="node-exporter"};
    node_load15{instance=~"master[0-1]{1}.openshift-46.cc.local", job="node-exporter"}
  FLT_RETRAINING_INTERVAL_MINUTES: '30'
  FLT_DATA_START_TIME: '1d'
  FLT_ROLLING_TRAINING_WINDOW_SIZE: '10d'
  FLT_DEBUG_MODE: 'true'

Only with values like these does pad work OK; it breaks without any warning in the following cases:

1. Quoting of the timestamp variables

  • For instance, FLT_ROLLING_TRAINING_WINDOW_SIZE can be set to '7d', '7', or just 7. This variable must always carry the 'd' suffix, which is a little confusing because other variables take a plain number without a suffix (e.g. FLT_RETRAINING_INTERVAL_MINUTES: '30').
  • Also, in OpenShift the value must always be quoted, otherwise OpenShift throws the error "Cannot convert int64 to string".

Getting this wrong leads to an incorrect rolling training window size (4096 days instead of 7, for example). The bad value shows up in the pad logs, but no error is raised and the web server simply does not start (503). Running curl localhost:8080 inside the container returns connection refused. I also tested without host networking, via the standard bridge mode with port mapping, and got the same results.

So I think these values should be processed in a more consistent way, and perhaps some validation of the variables should be added to the Python script.
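
As a rough illustration of the kind of validation I mean, here is a minimal sketch (my own hypothetical helper, not code from pad) that reads a duration env var like '7d' and fails loudly on anything it cannot understand:

# Hypothetical validation helper -- a sketch only, not code from pad itself.
import os
import re
import sys
from datetime import timedelta

_DURATION_RE = re.compile(r"^(\d+)([smhdw])$")
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

def duration_from_env(name, default):
    """Read an env var like '7d' and exit with a clear message on anything else."""
    raw = os.getenv(name, default).strip().strip("'\"")  # tolerate stray quotes
    match = _DURATION_RE.match(raw)
    if not match:
        sys.exit(f"{name}={raw!r} is not a valid duration (expected e.g. '7d' or '30m')")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: value})

window = duration_from_env("FLT_ROLLING_TRAINING_WINDOW_SIZE", "15d")
print(f"rolling training window: {window}")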

2. Incorrect parsing of the FLT_METRICS_LIST value

If we use the YAML folded style with quotes, as below, pad just starts the web server and performs no calculations, so we cannot see any *_prophet metrics on the web server endpoint.

  FLT_METRICS_LIST: >-
    'node_load1{instance=~"master[0-1]{1}.openshift-46.cc.local", job="node-exporter"}'

logs

---> Running application from Python script (app.py) ...
2021-09-03 11:10:34,461:INFO:configuration: Metric data rolling training window size: 6 days, 23:59:59.895134
2021-09-03 11:10:34,461:INFO:configuration: Model retraining interval: 30 minutes
2021-09-03 11:10:34,548:ERROR:prophet.plot: Importing plotly failed. Interactive plots will not work.
2021-09-03 11:10:34,565:INFO:__main__: Training models using ProcessPool of size:1
2021-09-03 11:10:34,583:INFO:__main__: Initializing Tornado Web App
2021-09-03 11:10:34,594:INFO:__main__: Will retrain model every 30 minutes

And it stays in this state indefinitely, with no errors or further output.

WebServer Output

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 1349.0
python_gc_objects_collected_total{generation="1"} 384.0
python_gc_objects_collected_total{generation="2"} 9.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 368.0
python_gc_collections_total{generation="1"} 35.0
python_gc_collections_total{generation="2"} 3.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="6",version="3.8.6"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 8.47609856e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.1199104e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.6306674336e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.02
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 17.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

But pad works OK only when we do not use the YAML folded style with quotes:

  FLT_METRICS_LIST:  'node_load1{instance=~"master[0-1]{1}.openshift-46.cc.local", job="node-exporter"}'

logs

---> Running application from Python script (app.py) ...
2021-09-03 10:44:11,846:INFO:configuration: Metric data rolling training window size: 9 days, 23:59:59.893033
2021-09-03 10:44:11,846:INFO:configuration: Model retraining interval: 30 minutes
2021-09-03 10:44:11,920:ERROR:prophet.plot: Importing plotly failed. Interactive plots will not work.
2021-09-03 10:44:11,965:INFO:__main__: Training models using ProcessPool of size:1
2021-09-03 10:44:13,639:INFO:model: training data range: 2021-08-24 10:44:21.489000192 - 2021-09-03 10:44:11
Initial log joint probability = -617.279
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      91       92923.1   1.12373e-05       254.608   3.643e-08       0.001      169  LS failed, Hessian reset
      99       92923.1   1.30199e-05       349.222      0.1801      0.1801      180
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     199       92923.4   3.20148e-05       108.114           1           1      295
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     211       92923.5   6.71466e-06       144.878   3.066e-08       0.001      361  LS failed, Hessian reset
     247       92923.5   2.58154e-06       91.6029      0.2417           1      404
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
2021-09-03 10:44:50,806:INFO:__main__: Total Training time taken = 0:00:38.657156, for metric: node_load1 {'cluster': 'FATOWL', 'endpoint': 'https', 'instance': 'master0.openshift-46.cc.local', 'job': 'node-exporter', 'namespace': 'openshift-monitoring', 'pod': 'node-exporter-zcm5l', 'prometheus': 'openshift-monitoring/vmagent-full-cluster', 'service': 'node-exporter'}
2021-09-03 10:44:51,214:INFO:model: training data range: 2021-08-24 10:44:59.908000 - 2021-09-03 10:44:44.844999936
Initial log joint probability = -497.355
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99         97480     0.0019325       964.388      0.0927           1      128
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     199       97481.2   0.000180419       370.443           1           1      244
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     275       97481.4   2.84509e-06       94.2604      0.1849       0.601      339
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
2021-09-03 10:45:23,475:INFO:__main__: Total Training time taken = 0:00:32.454399, for metric: node_load1 {'cluster': 'FATOWL', 'endpoint': 'https', 'instance': 'master1.openshift-46.cc.local', 'job': 'node-exporter', 'namespace': 'openshift-monitoring', 'pod': 'node-exporter-5zqns', 'prometheus': 'openshift-monitoring/vmagent-full-cluster', 'service': 'node-exporter'}
2021-09-03 10:45:23,937:INFO:model: training data range: 2021-08-24 10:45:36.489000192 - 2021-09-03 10:45:21.480999936
Initial log joint probability = -746.684
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99       81232.6   0.000545564       226.483       4.188           1      133
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     199       81233.4   0.000497212       222.155           1           1      261
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     284       81233.7   6.57884e-06       100.964   7.661e-08       0.001      413  LS failed, Hessian reset
     299       81233.7   1.63087e-06       67.7942      0.3194      0.3194      431
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     304       81233.7   4.37307e-06       66.3574      0.2549      0.2549      437
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
2021-09-03 10:46:03,850:INFO:__main__: Total Training time taken = 0:00:40.063231, for metric: node_load15 {'cluster': 'FATOWL', 'endpoint': 'https', 'instance': 'master0.openshift-46.cc.local', 'job': 'node-exporter', 'namespace': 'openshift-monitoring', 'pod': 'node-exporter-zcm5l', 'prometheus': 'openshift-monitoring/vmagent-full-cluster', 'service': 'node-exporter'}
2021-09-03 10:46:04,308:INFO:model: training data range: 2021-08-24 10:46:14.908000 - 2021-09-03 10:45:59.844999936
Initial log joint probability = -1185.87
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
      99       82879.1     0.0025177       285.807           1           1      129
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     185       82880.7   9.29608e-06       111.383   3.551e-08       0.001      279  LS failed, Hessian reset
     199       82881.1   0.000202809       239.588       1.907      0.1907      298
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     299       82882.2   0.000794605       685.626           1           1      424
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     399         82883   0.000535337       285.017           1           1      536
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     499       82886.5   0.000777241       221.831           1           1      651
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     599       82887.5    1.9664e-05       148.715      0.7284      0.7284      773
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     655       82887.6   4.02456e-06       77.3026      0.2421           1      848
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
2021-09-03 10:47:17,452:INFO:__main__: Total Training time taken = 0:01:13.295541, for metric: node_load15 {'cluster': 'FATOWL', 'endpoint': 'https', 'instance': 'master1.openshift-46.cc.local', 'job': 'node-exporter', 'namespace': 'openshift-monitoring', 'pod': 'node-exporter-5zqns', 'prometheus': 'openshift-monitoring/vmagent-full-cluster', 'service': 'node-exporter'}
2021-09-03 10:47:17,829:INFO:__main__: Initializing Tornado Web App
2021-09-03 10:47:18,011:INFO:__main__: Will retrain model every 30 minutes
2021-09-03 10:47:23,304:INFO:tornado.access: 200 GET /metrics (172.19.21.219) 880.53ms
2021-09-03 10:47:52,466:INFO:tornado.access: 200 GET /metrics (172.19.21.219) 52.52ms
2021-09-03 10:48:22,475:INFO:tornado.access: 200 GET /metrics (172.19.21.219) 62.54ms
2021-09-03 10:48:52,472:INFO:tornado.access: 200 GET /metrics (172.19.21.219) 59.23ms
2021-09-03 10:49:22,465:INFO:tornado.access: 200 GET /metrics (172.19.21.219) 52.47ms

I think pad needs to handle this value and strip the quotes if it cannot process them as-is, or change the logic for starting the web server and building the metrics list.
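
For example, a minimal sketch of what I mean by stripping the quotes (hypothetical code, not taken from pad):

# Hypothetical normalisation of FLT_METRICS_LIST -- a sketch, not pad's actual code.
import os

raw = os.getenv("FLT_METRICS_LIST", "")
metrics = []
for item in raw.split(";"):
    item = item.strip().strip("'\"")  # drop quotes left over from YAML folding
    if item:
        metrics.append(item)

if not metrics:
    # refuse to start the web server with nothing to train on,
    # instead of silently serving only the default Python metrics
    raise SystemExit("FLT_METRICS_LIST is empty after parsing")

print(metrics)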

Here is output from inside the pad container showing how the variable's value is stored:

(Working OK)

(app-root) sh-4.4$ echo $FLT_METRICS_LIST
node_load15{instance="master0.openshift-46.cc.local", job="node-exporter"}

(No Prophet calculations from Tornado, only the default Python process metrics, but no errors in the log)

(app-root) sh-4.4$ echo $FLT_METRICS_LIST
'node_load15{instance="master0.openshift-46.cc.local", job="node-exporter"}'
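
This confirms that a YAML folded block scalar (>-) keeps the surrounding single quotes as literal characters in the environment value. A quick local reproduction of that YAML behaviour (PyYAML assumed installed; the instance label is shortened here):

# My own quick check of the YAML behaviour, independent of pad.
import yaml

doc = """
FLT_METRICS_LIST: >-
  'node_load15{instance="master0", job="node-exporter"}'
"""
value = yaml.safe_load(doc)["FLT_METRICS_LIST"]
print(value)
# prints: 'node_load15{instance="master0", job="node-exporter"}'
# The single quotes survive because a folded block scalar (>-) treats them
# as ordinary characters, so they end up inside the container's env var.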

Here is my testing Deployment in OpenShift

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pad
  namespace: pad-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-anomaly-detector
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-anomaly-detector
    spec:
      hostNetwork: true
      containers:
        - name: pad
          image: quay.io/aicoe/prometheus-anomaly-detector:latest
          imagePullPolicy: IfNotPresent
          ports:
            - name: calculations
              containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /
              port: calculations
            initialDelaySeconds: 300
            periodSeconds: 60
            successThreshold: 1
            failureThreshold: 5
          envFrom:
            - configMapRef:
                name: pad-config
          resources:
            requests:
              memory: 1Gi
              cpu: "1"
            limits:
              memory: 2Gi
              cpu: "2"
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      securityContext: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
@petr-r94 petr-r94 changed the title ENV 'FLT_METRICS_LIST' and timestamps looks like incorrectly parsed ENV 'FLT_METRICS_LIST' and timestamps looks incorrectly parsed Sep 3, 2021
@sesheta

sesheta commented Dec 2, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2021
@sesheta

sesheta commented Jan 1, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 1, 2022
@afirth

afirth commented Jan 7, 2022

/remove-lifecycle rotten

@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 7, 2022
@sesheta

sesheta commented Apr 7, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 7, 2022
@sesheta

sesheta commented May 7, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 7, 2022
@sesheta

sesheta commented Jun 6, 2022

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Jun 6, 2022
@sesheta

sesheta commented Jun 6, 2022

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
