This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

TypeError Exception #151

Closed
tiagokrebs opened this issue Jul 10, 2021 · 6 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

tiagokrebs commented Jul 10, 2021

Hi.

I'm having some trouble figuring out the cause of a TypeError exception that apparently happens during a request to the metrics endpoint. If this is not the right place for this, please let me know.

My container is created like this:

containers:
      - name: prometheus-anomaly-detector
        image: quay.io/aicoe/prometheus-anomaly-detector:latest
        ports:
        - containerPort: 8080
        env:
        - name: FLT_PROM_URL
          value: "http:https://xxxxxxxxx:9090"
        - name: FLT_RETRAINING_INTERVAL_MINUTES
          value: "15"
        - name: FLT_METRICS_LIST
          value: "nginx_connections_processed_total:rate1m{app='edg', env='prod', instance='xxx', job='azion_hosts', loc='cgh', loc_pop='cghnjd', pop='njd', service='nginx_exporter', stage='any'};nginx_connections_processed_total:rate5m{app='edg', env='prod', instance='xxx', job='azion_hosts', loc='cgh', loc_pop='cghnjd', pop='njd', service='nginx_exporter', stage='any'};nginx_request_duration_bucket_90th:rate1m{instance='xxx'};nginx_request_duration_bucket_90th:rate5m{instance='xxx'}"
        - name: APP_FILE
          value: "app.py"
        - name: FLT_DATA_START_TIME
          value: "3d"
        - name: FLT_ROLLING_TRAINING_WINDOW_SIZE
          value: "7d"

As you can see, I have two metrics with many labels and two with just one. Also, the two metric types have different amounts of points (rate 1m and 5m). This is just me testing different things; there is no particular reason for my approach.
The values of these metrics are always integers. Currently, I only have about 24h of data to train on.
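
For context, a hypothetical sketch of how each semicolon-separated FLT_METRICS_LIST entry presumably ends up as its own Prometheus query (the URL and instance are placeholders, and PAD's actual code may use a different prometheus_api_client call):

# Hypothetical sketch: every configured metric string is queried individually.
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus.example:9090", disable_ssl=True)

metrics_list = [
    "nginx_connections_processed_total:rate1m{instance='myhost.example.com'}",
    "nginx_request_duration_bucket_90th:rate1m{instance='myhost.example.com'}",
]

for query in metrics_list:
    # Instant query against /api/v1/query, one per configured metric.
    result = prom.custom_query(query=query)
    print(query, "->", len(result), "series returned")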

I have the Prometheus scrape configured to request the PAD endpoint every 15 seconds. The average response time is ~300ms. But sometimes the exception below occurs.

Traceback (most recent call last):
  File "app.py", line 177, in <module>
    schedule.run_pending()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "app.py", line 141, in train_model
    predictor_model.train(
  File "/model.py", line 31, in train
    self.metric += Metric(metric_data)
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/prometheus_api_client/metric.py", line 173, in __add__
    raise TypeError("Cannot Add different metric types. " + error_string)

Some behaviors I noticed after this:

  • The pod is not terminated and keeps running
  • The retraining procedures are not triggered anymore
  • The PAD metrics endpoint keeps responding, but now the value of the metrics is always the same.
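
A minimal, hypothetical sketch of how one uncaught exception in a scheduled job can stop all future retraining while the rest of the process keeps serving (not PAD's code; it only assumes that the metrics endpoint is served outside the loop that calls schedule.run_pending(), as the traceback above suggests):

import threading
import time

import schedule


def train_model():
    # Stand-in for PAD's train_model(); simulate the failing Metric addition.
    raise TypeError("Cannot Add different metric types. Different metric names")


schedule.every(5).seconds.do(train_model)


def scheduler_loop():
    while True:
        # The schedule library does not catch exceptions raised by jobs, so the
        # TypeError propagates here and this loop (and its thread) dies.
        schedule.run_pending()
        time.sleep(1)


threading.Thread(target=scheduler_loop, daemon=True).start()

# The rest of the process (in PAD, the /metrics server) keeps running even
# though no retraining job will ever fire again.
time.sleep(12)
print("process still alive, retraining loop dead")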

[Screenshot: Grafana graph panel of the PAD metrics]

Any help will be much appreciated.

4n4nd (Contributor) commented Jul 12, 2021

@tiagokrebs a few points here.

  1. This error occurs when the code tries to combine two different time series (source); see the sketch after this list.
  2. The scraping interval of 15 seconds is too frequent, since by default the Prophet model only predicts values at a one-minute granularity.
  3. Could you please run the container again with FLT_DEBUG_MODE=True to enable debug mode and post the logs here, so we can see which two metrics are in conflict?
  4. You are correct: the application should raise this error and terminate. The current behavior is improper.
  5. Also, I really like your graph panel, would you be able to share the panel JSON with us?
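
For illustration, here is a minimal, hypothetical sketch (not PAD's code; the series data is made up) of how prometheus_api_client's Metric addition raises this exact TypeError when the two Metric objects carry different metric names:

# Hypothetical sketch: two Metric objects built from made-up Prometheus API
# results. Adding them fails because their __name__ values differ.
from prometheus_api_client import Metric

series_rate1m = {
    "metric": {"__name__": "nginx_connections_processed_total:rate1m", "instance": "myhost"},
    "values": [[1626131700, "10"], [1626131760, "12"]],
}
series_rate5m = {
    "metric": {"__name__": "nginx_connections_processed_total:rate5m", "instance": "myhost"},
    "values": [[1626131700, "11"], [1626131760, "13"]],
}

try:
    combined = Metric(series_rate1m) + Metric(series_rate5m)
except TypeError as exc:
    # Prints: Cannot Add different metric types. Different metric names
    print(exc)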

tiagokrebs (Author) commented Jul 12, 2021

Thanks @4n4nd.

> The scraping interval of 15 seconds is too frequent, since by default the Prophet model only predicts values at a one-minute granularity.

Ok, I should change it to 1 minute then.

> Could you please run the container again with FLT_DEBUG_MODE=True to enable debug mode and post the logs here, so we can see which two metrics are in conflict?

Here are the logs with debug mode on.

# successful prometheus scrape
2021-07-12 23:15:37,824:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate1m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 252
2021-07-12 23:15:37,909:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate5m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 251
2021-07-12 23:15:37,992:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate1m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 193
2021-07-12 23:15:38,071:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate5m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 193
2021-07-12 23:15:38,085:INFO:tornado.access: 200 GET /metrics (10.244.0.149) 337.20ms

# exception during scrape
2021-07-12 23:15:42,992:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate1m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 253
2021-07-12 23:15:43,072:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate5m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 252
2021-07-12 23:15:43,121:DEBUG:schedule: Running job Job(interval=15, unit=minutes, do=train_model, args=(), kwargs={'initial_run': False, 'data_queue': <multiprocessing.queues.Queue object at 0x7f46bcc4ab50>})
2021-07-12 23:15:43,121:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2021-07-12 23:00:43.124236
2021-07-12 23:15:43,121:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2021-07-12 23:15:43.121443
2021-07-12 23:15:43,121:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None
2021-07-12 23:15:43,122:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: nginx_connections_processed_total:rate1m{app='edg',env='prod',instance='myhost.xxx.com',job='azion_hosts',loc='cgh',loc_pop='cghnjd',pop='njd',service='nginx_exporter',stage='any'}
2021-07-12 23:15:43,152:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate1m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 193
2021-07-12 23:15:43,196:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate5m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 589
2021-07-12 23:15:43,203:INFO:tornado.access: 200 GET /metrics (10.244.0.99) 285.91ms
2021-07-12 23:15:43,240:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate1m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D%5B900s%5D&time=1626131743 HTTP/1.1" 200 194
Traceback (most recent call last):
  File "app.py", line 177, in <module>
    schedule.run_pending()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "app.py", line 141, in train_model
    predictor_model.train(
  File "/model.py", line 31, in train
    self.metric += Metric(metric_data)
  File "/opt/conda/envs/prophet-env/lib/python3.8/site-packages/prometheus_api_client/metric.py", line 173, in __add__
    raise TypeError("Cannot Add different metric types. " + error_string)
TypeError: Cannot Add different metric types. Different metric names

# frozen metrics to all scrapes after
2021-07-12 23:15:52,824:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate1m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 252
2021-07-12 23:15:52,912:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_connections_processed_total%3Arate5m%7Bapp%3D%27edg%27%2Cenv%3D%27prod%27%2Cinstance%3D%27myhost.xxx.com%27%2Cjob%3D%27azion_hosts%27%2Cloc%3D%27cgh%27%2Cloc_pop%3D%27cghnjd%27%2Cpop%3D%27njd%27%2Cservice%3D%27nginx_exporter%27%2Cstage%3D%27any%27%7D HTTP/1.1" 200 251
2021-07-12 23:15:52,995:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate1m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 193
2021-07-12 23:15:53,077:DEBUG:urllib3.connectionpool: http:https://pronetheus.xxx:9090 "GET /api/v1/query?query=nginx_request_duration_bucket_90th%3Arate5m%7Binstance%3D%27myhost.xxx.com%27%7D HTTP/1.1" 200 193
2021-07-12 23:15:53,084:INFO:tornado.access: 200 GET /metrics (10.244.0.149) 336.65ms

> Also, I really like your graph panel, would you be able to share the panel JSON with us?

Sure thing, here it is below.

Grafana-Dashboard.json
{
    "annotations": {
      "list": [
        {
          "builtIn": 1,
          "datasource": "-- Grafana --",
          "enable": true,
          "hide": true,
          "iconColor": "rgba(0, 211, 255, 1)",
          "name": "Annotations & Alerts",
          "type": "dashboard"
        },
        {
          "datasource": null,
          "enable": true,
          "iconColor": "red",
          "limit": 100,
          "matchAny": true,
          "name": "pad",
          "tags": [
            "pad"
          ],
          "type": "tags"
        }
      ]
    },
    "editable": true,
    "gnetId": null,
    "graphTooltip": 0,
    "id": 621,
    "iteration": 1626112586233,
    "links": [],
    "panels": [
      {
        "aliasColors": {
          "Anomaly Detected": "semi-dark-red",
          "Original Metric": "semi-dark-green",
          "Predicted": "light-blue",
          "yhat_lower": "text",
          "yhat_upper": "text"
        },
        "bars": false,
        "dashLength": 10,
        "dashes": false,
        "datasource": "prometheus-prod",
        "fieldConfig": {
          "defaults": {
            "unit": "short"
          },
          "overrides": []
        },
        "fill": 5,
        "fillGradient": 7,
        "gridPos": {
          "h": 9,
          "w": 24,
          "x": 0,
          "y": 0
        },
        "hiddenSeries": false,
        "id": 12,
        "legend": {
          "alignAsTable": true,
          "avg": true,
          "current": true,
          "max": true,
          "min": true,
          "rightSide": true,
          "show": true,
          "total": false,
          "values": true
        },
        "lines": true,
        "linewidth": 1,
        "links": [],
        "nullPointMode": "null",
        "options": {
          "alertThreshold": true
        },
        "percentage": false,
        "pluginVersion": "8.0.1",
        "pointradius": 5,
        "points": false,
        "renderer": "flot",
        "seriesOverrides": [
          {
            "$$hashKey": "object:884",
            "alias": "Anomaly Detected",
            "yaxis": 2
          },
          {
            "$$hashKey": "object:903",
            "alias": "/.*yhat.*/",
            "dashes": true,
            "fill": 0,
            "fillGradient": 0,
            "linewidth": 2
          },
          {
            "$$hashKey": "object:926",
            "alias": "Predicted",
            "fill": 0,
            "fillGradient": 0,
            "linewidth": 2
          }
        ],
        "spaceLength": 10,
        "stack": false,
        "steppedLine": false,
        "targets": [
          {
            "exemplar": true,
            "expr": "nginx_connections_processed_total:rate5m_prophet{exported_instance=~\"$node\", service=\"pad_exporter\", value_type=\"yhat\"}",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 1,
            "legendFormat": "Predicted",
            "refId": "A",
            "target": ""
          },
          {
            "exemplar": true,
            "expr": "nginx_connections_processed_total:rate5m_prophet{exported_instance=~\"$node\", service=\"pad_exporter\", value_type=\"yhat_upper\"}",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 1,
            "legendFormat": "yhat_upper",
            "refId": "B",
            "target": ""
          },
          {
            "exemplar": true,
            "expr": "nginx_connections_processed_total:rate5m_prophet{exported_instance=~\"$node\", service=\"pad_exporter\", value_type=\"yhat_lower\"}",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 1,
            "legendFormat": "yhat_lower",
            "refId": "C",
            "target": ""
          },
          {
            "exemplar": true,
            "expr": "nginx_connections_processed_total:rate5m_prophet{exported_instance=~\"$node\", service=\"pad_exporter\", value_type=\"anomaly\"}",
            "hide": false,
            "instant": false,
            "interval": "",
            "legendFormat": "Anomaly Detected",
            "refId": "D"
          },
          {
            "exemplar": true,
            "expr": "nginx_connections_processed_total:rate5m{instance=~\"$node\", stage=\"any\"}",
            "hide": false,
            "interval": "",
            "legendFormat": "Original Metric",
            "refId": "E"
          }
        ],
        "thresholds": [],
        "timeFrom": null,
        "timeRegions": [],
        "timeShift": null,
        "title": "Nginx Requests - RATE 5M - Anomaly Detection PoC",
        "tooltip": {
          "shared": true,
          "sort": 0,
          "value_type": "individual"
        },
        "type": "graph",
        "xaxis": {
          "buckets": null,
          "mode": "time",
          "name": null,
          "show": true,
          "values": []
        },
        "yaxes": [
          {
            "$$hashKey": "object:89",
            "format": "short",
            "label": null,
            "logBase": 1,
            "max": null,
            "min": null,
            "show": true
          },
          {
            "$$hashKey": "object:90",
            "format": "short",
            "label": "",
            "logBase": 1,
            "max": "1.05",
            "min": "0",
            "show": true
          }
        ],
        "yaxis": {
          "align": false,
          "alignLevel": null
        }
      }
    ],
    "refresh": "1m",
    "schemaVersion": 30,
    "style": "dark",
    "tags": [],
    "templating": {
      "list": [
        {
          "allValue": null,
          "current": {
            "selected": false,
            "text": "xxxxx",
            "value": "xxxxx"
          },
          "datasource": "prometheus-prod",
          "definition": "label_values(node_boot_time_seconds{app=\"edg\"}, instance)",
          "description": null,
          "error": null,
          "hide": 0,
          "includeAll": false,
          "label": null,
          "multi": false,
          "name": "node",
          "options": [],
          "query": {
            "query": "label_values(node_boot_time_seconds{app=\"edg\"}, instance)",
            "refId": "StandardVariableQuery"
          },
          "refresh": 2,
          "regex": "",
          "skipUrlSync": false,
          "sort": 0,
          "type": "query"
        }
      ]
    },
    "time": {
      "from": "now-12h",
      "to": "now"
    },
    "timepicker": {},
    "timezone": "",
    "title": "Anomaly Detection PoC",
    "uid": "iaimEAmnz",
    "version": 5
  }

sesheta commented Oct 16, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta added the lifecycle/stale label Oct 16, 2021
sesheta commented Nov 15, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta added the lifecycle/rotten label and removed the lifecycle/stale label Nov 15, 2021
sesheta commented Dec 15, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta closed this as completed Dec 15, 2021
sesheta commented Dec 15, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
