The routing_table_nodes metric format is wrong #1528

Closed
jakubgs opened this issue Apr 4, 2023 · 8 comments
@jakubgs
Member

jakubgs commented Apr 4, 2023

I have been seeing some errors on the metrics backend that started on the 17th of March:

image

Those errors are "duplicate sample for timestamp" errors and are always triggered for the routing_table_nodes metric:

ts=2023-04-04T07:33:20.944Z caller=dedupe.go:112 component=remote level=error remote_name=cortex url=http://host-gateway:19092/api/v1/push msg="non-recoverable error" count=1000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=fake: err: duplicate sample for timestamp. timestamp=2023-04-04T07:33:20.605Z, series={__name__=\"routing_table_nodes\", container=\"nimbus-fluffy-mainnet-master-25\", datacenter=\"he-eu-hel1\", fleet=\"nimbus.fluffy\", group=\",nimbus.fluffy,eth1,nimbus,metrics,\", instance=\"metal-02.he-eu-hel1.nimbus.fluffy\", job=\"nimbus-fluffy-metrics\", source=\"slave-01.he-eu-hel1.metrics.hq\"}"

If we look at the metric itself it appears to be broken:

[email protected]:~ % c 0:9210/metrics | grep routing_table_nodes      
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 181.0
routing_table_nodes_created{state=""} 1680593730.0
routing_table_nodes{state="seen"} 89.0
routing_table_nodes_created{state="seen"} 1680593730.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1680593730.0

As we can see, it's listed twice: once with a state label, which has "" and "seen" values, and once as a metric without any labels.

This looks like a bug that was introduced somewhere around the 17th of March.
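
To check other nodes for the same symptom, something like the following could work. It is only a minimal sketch: it assumes the /metrics output is available as plain text and treats any metric name with more than one "# TYPE" line as suspicious. The function name duplicate_families is made up for illustration, and the sample is the output pasted above.

```python
from collections import Counter


def duplicate_families(exposition: str) -> list[str]:
    """Return metric names that are declared by more than one '# TYPE' line."""
    counts = Counter()
    for line in exposition.splitlines():
        parts = line.split()
        # A type declaration has the shape: "# TYPE <name> <type>"
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            counts[parts[2]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Sample copied from the node output shown in this issue.
SAMPLE = """\
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 181.0
routing_table_nodes_created{state=""} 1680593730.0
routing_table_nodes{state="seen"} 89.0
routing_table_nodes_created{state="seen"} 1680593730.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1680593730.0
"""

if __name__ == "__main__":
    print(duplicate_families(SAMPLE))  # prints ['routing_table_nodes']
```

A metric family that shows up twice in one scrape is what ends up as two samples for the same series and timestamp on the remote-write side.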

@kdeme kdeme added the Fluffy label Apr 4, 2023
@kdeme
Contributor

kdeme commented Jul 18, 2023

This might have to do with the fact that we use the same routing_table code for discv5 and each different Portal sub-network.

Aside from the weird double metric, the metric itself will also be incorrect as it will hold the data for all networks together.

We should find a way to split this routing_table_nodes metric per network / routing table instance.
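
A rough illustration of what a per-network split could look like, written as a toy Python gauge rather than the actual Nim metrics code. The "network" label name and the network values ("discv5", "portal_history") are assumptions chosen for the example, as is the LabelledGauge class itself.

```python
from collections import defaultdict


class LabelledGauge:
    """Toy gauge keeping one value per label set (not a real metrics library)."""

    def __init__(self, name: str, label_names: tuple[str, ...]):
        self.name = name
        self.label_names = label_names
        self._values: dict[tuple[str, ...], float] = defaultdict(float)

    def set(self, value: float, **labels: str) -> None:
        # Build the key in declaration order so label order is stable.
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] = value

    def render(self) -> str:
        # One TYPE line for the whole family, then one sample per label set.
        lines = [f"# TYPE {self.name} gauge"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


if __name__ == "__main__":
    gauge = LabelledGauge("routing_table_nodes", ("network", "state"))
    # Hypothetical network names; each routing table instance reports separately.
    gauge.set(181.0, network="discv5", state="seen")
    gauge.set(43.0, network="portal_history", state="seen")
    print(gauge.render())
```

With a label like this, each routing table instance would write to its own series instead of all networks sharing one value.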

@jakubgs
Member Author

jakubgs commented Jul 19, 2023

Hah, I forgot about this. It would be nice to fix this, but for now I'll just drop this metric:

@jakubgs
Member Author

jakubgs commented Jul 19, 2023

Actually, it appears this issue also exists in Nim-Waku nodes:

[email protected]:~ % grep metrics /docker/nim-waku-v2/docker-compose.yml    
      --metrics-server=True
      --metrics-server-port=8008
      --metrics-server-address=0.0.0.0
[email protected]:~ % c 0:8008/metrics | grep routing_table_nodes         
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 43.0
routing_table_nodes_created{state=""} 1689722335.0
routing_table_nodes{state="seen"} 43.0
routing_table_nodes_created{state="seen"} 1689722335.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1689722327.0

So I've dropped those too:

And opened an issue:

@jakubgs
Member Author

jakubgs commented Oct 31, 2023

Is anyone going to fix it at any point? Hello?

@kdeme
Contributor

kdeme commented Oct 31, 2023

My initial quick assessment at #1528 (comment) was not the actual cause of this (although that work should still be done).

The cause is some import/export pollution from the discv4 routing table code. Although this code is not actually used, it declares a metric with the same name (without the label). The quick fix for now is to rename that one; see status-im/nim-eth#646
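
To make the clash concrete, here is a small Python sketch of a naive exporter that renders every declaration as its own family. It is only an illustration, not how the Nim metrics library is implemented, and the renamed metric name below is a placeholder since the actual new name is not stated in this thread.

```python
def render_families(declarations):
    """Render each declared gauge as its own family, as a naive exporter would."""
    lines = []
    for name, samples in declarations:
        lines.append(f"# TYPE {name} gauge")
        for labels, value in samples:
            lines.append(f"{name}{labels} {value}")
    return lines


def type_lines(lines):
    """Keep only the '# TYPE' declaration lines."""
    return [line for line in lines if line.startswith("# TYPE")]


# Two modules declare the same name: one labelled (discv5), one unlabelled (discv4).
clashing = [
    ("routing_table_nodes", [('{state=""}', 181.0), ('{state="seen"}', 89.0)]),
    ("routing_table_nodes", [("", 0.0)]),
]

# After renaming the unused discv4 declaration (placeholder name, an assumption).
renamed = [
    clashing[0],
    ("discv4_routing_table_nodes", [("", 0.0)]),
]

if __name__ == "__main__":
    print(type_lines(render_families(clashing)))
    print(type_lines(render_families(renamed)))
```

In the clashing case the same TYPE line is emitted twice, which matches the output in the original report; after the rename the two families are distinct.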

@kdeme
Contributor

kdeme commented Oct 31, 2023

Fix in #1874

@jakubgs
Member Author

jakubgs commented Nov 2, 2023

Thank you.

@jakubgs
Member Author

jakubgs commented Nov 2, 2023

It appears the last instance of this error in Prometheus for the ih-eu-mda1 DC was today, shortly after midnight:

[email protected]:~ % zgrep routing_table_nodes /var/log/docker/prometheus-slave/docker.* | tail -n1
/var/log/docker/prometheus-slave/docker.log:2023-11-02T00:05:45.836145+00:00 docker/prometheus-slave[840]: ts=2023-11-02T00:05:45.835Z caller=dedupe.go:112 component=remote level=error remote_name=cortex url=http://host-gateway:19092/api/v1/push msg="non-recoverable error" count=1000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=fake: err: duplicate sample for timestamp. timestamp=2023-11-02T00:05:45.245Z, series={__name__=\"routing_table_nodes\", container=\"nimbus-fluffy-mainnet-master-27\", datacenter=\"ih-eu-mda1\", fleet=\"nimbus.fluffy\", group=\",nimbus.fluffy,eth1,nimbus,metrics,\", instance=\"metal-02.ih-eu-mda1.nimbus.fluffy\", job=\"nimbus-fluffy-metrics\", source=\"slave-01.ih-eu-mda1.metrics.hq\"}"

The graph shows this too:

image

This also matches the build timer run on metal-01.ih-eu-mda1.nimbus.eth1:

[email protected]:~ % j -n2 -u build-nimbus-eth1-goerli-master.service 
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: Finished Build nimbus-eth1-goerli-master.
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: build-nimbus-eth1-goerli-master.service: Consumed 9min 24.656s CPU time.

So I consider this fixed. Thank you @kdeme .

@jakubgs jakubgs closed this as completed Nov 2, 2023