Some incorrect assumptions in timeseries processing. #44
Agreed with the reasoning. Unfortunately, the current behaviour cannot be changed in existing software like VictoriaMetrics, since this will break expectations of many users, who rely on the current behaviour :( On the other hand, VictoriaMetrics can accept an option at startup or at query time, which would tell it to shift the calculation time by @dbaarda , could you file a feature request at VictoriaMetrics issue tracker, so it could be tracked and prioritized? As a temporary workaround, it is possible to add the needed time shift manually via
Thanks for the quick followup. So I'm not sure just adding a time shift is actually going to make things better.
I wish people writing timeseries databases would read and understand all the rrdtool documentation first. That tool contains a huge amount of accumulated knowledge/wisdom about how to handle timeseries efficiently and accurately.
In https://docs.victoriametrics.com/keyconcepts/#instant-query the documentation talks about how queries for timeseries values at a particular time look for the datapoint before that time.
However, in general timeseries datapoints better represent the time period before them, so it's more accurate to use the datapoint after the requested time. For rate timeseries with points at regular intervals, each point represents the average rate over the interval between it and the previous point, so using the rate datapoint after the requested time gives you the best possible instantaneous rate estimate for that time. And if you are querying a rate1m timeseries (points every minute, each representing the average rate over the previous minute) and want the best rate1m at a particular time (the average rate over the minute before that time), then the most accurate estimate you can get is by linearly interpolating between the points after and before the requested time.
This "datapoints represent the time before the datapoint, not after" behaviour is definitely true for rates, but is also generally true for gauges and counters too. In general linearly interpolating between points gives the most accurate estimate.
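To make the difference concrete, here is a minimal Python sketch comparing the last-value-before lookup (what the instant-query docs describe) with linear interpolation between the surrounding points. The helper names and sample data are made up for illustration; this is not VictoriaMetrics code:

```python
from bisect import bisect_left

def value_at(points, t):
    """Estimate the series value at time t by linearly interpolating
    between the datapoints before and after t.
    points: list of (timestamp, value) tuples sorted by timestamp."""
    times = [ts for ts, _ in points]
    i = bisect_left(times, t)
    if i == 0:
        return points[0][1]   # before the first point: no earlier data
    if i == len(points):
        return points[-1][1]  # after the last point: no later data
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def last_before(points, t):
    """Last-value-at-or-before lookup: effectively time-shifts that
    value forward to t."""
    vals = [v for ts, v in points if ts <= t]
    return vals[-1] if vals else None

# A rate series sampled every 60s, ramping linearly from 0 to 120 req/s.
series = [(0, 0.0), (60, 60.0), (120, 120.0)]
print(value_at(series, 90))     # 90.0: the interpolated estimate
print(last_before(series, 90))  # 60.0: a 30s-old value shifted to t=90
```

For this smoothly ramping series the interpolated answer is exact, while the last-before lookup reports a value that is half a sample interval stale.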
When there are missing datapoints, it becomes even more crucial to do this right: you are approximating over longer time intervals, so using the wrong approximation makes the answer even more wrong. Just returning the value of a point before the requested time as if it were at the requested time is time-shifting the value to an arbitrarily later time, by up to your `step` interval. It would be better to provide the actual timestamp in the answer, but I'm not sure frontends would be happy with that. If timeseries lookups time-shift datapoints like this by arbitrary amounts, then any query that combines different timeseries will be using values taken at different times. This can be the source of many problems, like error rates greater than 100%.
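The over-100% error rate is easy to reproduce with two hypothetical series describing the same traffic but scraped out of phase (a sketch with made-up numbers, using a plain last-value-before lookup):

```python
def last_before(points, t):
    """Last-value-at-or-before lookup, as instant queries do."""
    vals = [v for ts, v in points if ts <= t]
    return vals[-1] if vals else None

# Two series for the same traffic, scraped 30s out of phase.
# At every real instant errors <= total (the true ratio peaks at
# 150/200 = 75%), but the series see the jump at different times.
errors = [(0, 10.0), (60, 150.0)]    # error rate jumps at t=60
total  = [(30, 100.0), (90, 200.0)]  # total rate jump seen at t=90

t = 70
ratio = last_before(errors, t) / last_before(total, t)
print(ratio)  # 1.5: a "150% error rate" from mixing values
              # taken 60s and 30s before the query time
```

Each lookup is individually "correct", but the division silently compares a post-jump errors value with a pre-jump total value.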
Doing this correctly would solve many of the problems that https://docs.victoriametrics.com/keyconcepts/#query-latency tries to address.