Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to enforce strict monotonicy in metrics #141

Merged
merged 13 commits into from
Mar 25, 2024

Conversation

chrschn
Copy link
Contributor

@chrschn chrschn commented Dec 11, 2023

This change adds the new option force_monotonicy to metric configurations. It is intended for quasi monotonic sources, such as monotonic counters which reset when the sensor is restarted.

When this option is set to true, the source metric value is regularly written to disk. This allows us to detect and compensate counter resets even between restarts. When a reset is detected, the last value before the reset becomes the new offset, which is added to the metric value going forth. The result is a strictly monotonic time series, like an ever increasing counter.

My use-case is measuring the energy production and consumption with Shelly devices. I want to keep track of the total energy ever produced by solar panels or consumed by certain devices in my household. When those time series are truly monotonic, it is much simpler to produce delta graphs over long periods (months, years) in Prometheus.

@hikhvar
Copy link
Owner

hikhvar commented Dec 14, 2023

First of all, thank you for the PR.
From my understanding with prometheus, you don't need that logic in the exporter. All prometheus queries for counters, will handle counter resets. See for example increase(). According to the docs:

Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for

Due to that handling in Prometheus server, I will not merge this. From an architecture point of view, I thing this is something the prometheus server is capable of, and already implemented. I would like this exporter to not write to disk regularly, as many use this exporter on devices like Raspberry Pis with SD-cards.

@chrschn
Copy link
Contributor Author

chrschn commented Dec 15, 2023

First of all, thank you for the PR. From my understanding with prometheus, you don't need that logic in the exporter. All prometheus queries for counters, will handle counter resets. See for example increase(). According to the docs:

Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for

It's correct that Prometheus handles counter resets, but increase() doesn't work well for very long time series (months, years) as the ability to detect resets depends on the sample resolution, which becomes very coarse. By default, the maximum no. of data points per time series in Prometheus is 11,000. For a time period of 1 year, the max. resolution is ~48 minutes. This leaves wide gaps for Prometheus to miss resets.

Due to that handling in Prometheus server, I will not merge this. From an architecture point of view, I thing this is something the prometheus server is capable of, and already implemented. I would like this exporter to not write to disk regularly, as many use this exporter on devices like Raspberry Pis with SD-cards.

I took care to only write data when the force_monotonicy setting is used (defaults to off), so existing users would not notice the difference unless they enable this feature. We could add a warning to that option, if you think it would help.

Actually, I've already implemented the next step to support arithmetic expressions that reuses the disk storage, but I haven't sent out a PR yet (see my branch).

My use-case is measuring the energy produced by my solar panels. I want to be able to see the total energy ever produced over months and years, as well as how much energy was used in my house vs. drained to the grid. My energy meter reports the total energy (kWh) obtained from the grid and it reports the current power usage (W) as a negative value when the solar panels produce more energy than is used. However, it does not integrate the current power to the energy that was "lost" to the grid. I'm currently using a cascade of Prometheus recording rules with several labeling hacks to integrate the values over time:

  - name: aggregation:1m
    interval: 1m
    rules:
      # Solar energy lost to the grid over 1m [kWh].
      - record: energy_loss:1m
        expr: avg_over_time(power_out{power_source="solar"}[1m]) / 60000
        labels:
          power_source: solar
          power_sink: grid

  - name: aggregation:5m
    interval: 5m
    rules:
      # Solar energy lost to the grid over 5m, with additional labels for day, month, year [kWh].
      - record: daily:energy_loss:5m
        expr: label_join(
                label_replace(
                  label_replace(
                    label_replace(
                      sum_over_time(energy_loss:1m[5m:1m]) + ignoring(year, month, day) group_right
                        count_values without() ("year", year(timestamp(
                          count_values without() ("month", month(timestamp(
                            count_values without() ("day", day_of_month(timestamp(
                              sum_over_time(energy_loss:1m[5m:1m])
                            )))
                          )))
                        ))) * 0,
                      "day", "0$1", "day", "^([1-9])$"),
                    "month", "0$1", "month", "^([1-9])$"),
                  "yy", "$1", "year", "^..(..)$"),
                "date", "-", "year", "month", "day"
              ) 

This is good enough to produce graphs like the one below, but this technique does not scale beyond one year. The cleanest solution is to have truly monotonic counters without resets for energy production and usage. That way, I can always see the correct totals and compute a delta without having to take the sampling resolution into account.

grafana

@chrschn
Copy link
Contributor Author

chrschn commented Dec 19, 2023

I rebased my change on top of the current master. I'd be happy to hear your thoughts on my previous response.

@hikhvar
Copy link
Owner

hikhvar commented Jan 13, 2024

Hello,

sorry for the late response. Didn't had the capacity in the past to read your long and good response!
I can see your use case.

Let us integrate this, and improve on it, if users report problems.

alexbakker and others added 12 commits February 5, 2024 21:04
This patch renames the ``received_messages`` metric to
``mqtt2prometheus_received_messages_total``, making it a bit more in
line with conventional Prometheus metric naming.

I also slightly adjusted the descriptions.
JSON needs to be upper case, otherwiese "could not setup a metric extractor	{"error": "unsupported object format: json"} " is thrown.
This change adds the new option `force_monotonicy` to metric
configurations. It is intended for almost-but-not-really monotinic
sources, such as counters which reset when the sensor is restarted.

When this option is set to `true`, the source metric value is regularly
written to disk. This allows us to detect and compensate counter resets
even between restarts. When a reset is detected, the last value before
the reset becomes the new offset, which is added to the metric value
going forth.  The result is a strictly monotonic time series, like an
ever increasing counter.
@chrschn
Copy link
Contributor Author

chrschn commented Feb 5, 2024

Hey there. Sorry, now I was the one who was busy. I changed the code to make sure it works with Go 1.14. LMK if you have any comments.

@chrschn
Copy link
Contributor Author

chrschn commented Mar 21, 2024

Friendly ping. :)

@hikhvar hikhvar merged commit 66d4028 into hikhvar:master Mar 25, 2024
@chrschn chrschn mentioned this pull request Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants