Configuration capabilities to retry for loading config via URL #8854
sjwang90 added the feature request and area/configuration labels on Feb 12, 2021
next step: investigate design and implications
This is quite a normal requirement considering a power outage at home. The modem and router need time to connect to the Internet, while the Telegraf service with a URL config quickly tries several times and then fails completely.
powersj added a commit to powersj/telegraf that referenced this issue on May 17, 2024
This introduces a new cli option to allow the user to set the number of retry attempts to something other than 3. It also allows the user to set the attempt count to -1 to infinitely retry. fixes: influxdata#8854
Feature Request
Related: #7338
Proposal:
Users should be able to designate the interval and number of retries for loading their config from a URL if their endpoint is down.
Current behavior:
Right now, Telegraf retries three times at 10s intervals when it receives an error while loading config from a URL, as happens when the remote endpoint is down. The current solution does not provide environment variables or flags to change these settings (based on #8803).
Desired behavior:
Users need some way to configure the interval and number of retries settings that determine the behavior of loading the config from a URL.
Use case:
From @schmorgs:
Planning to use Telegraf in production across a large number of servers across the globe, and there are many points where breakages could happen, especially in countries where there is very low bandwidth and old infrastructure. Along with that comes many standards and versions of OS, etc, hence our approach to manage config centrally so that we don't have to navigate the variety of ways of reaching an endpoint.
So if Telegraf starts up and there happened to be a breakage somewhere (NW connectivity, Web Server down, etc), the agent will die. On RHEL7/8 and Windows, we can utilise systemd/SCM to configure infinite retries on the agent so that even if it does die, it will be restarted.
But RHEL6 doesn't have systemd and so we would end up writing some sort of watcher daemon as well which seems a bit overkill if the agent could handle (at least) this condition.
This is important because it will be our primary monitoring agent, so we want to make it as available and robust as possible. We would still implement external controls such as systemd restarts to provide an extra layer of resilience, but the more the agent can do in this area, the better.
In some cases, the window in which the agent is unable to get its config would be fairly small, as the agent only pulls config on startup. But we want the agent to periodically pull its config down so that it can be configured centrally and automatically pulled by the agent. I understand this is part of a longer term strategy for Telegraf, but in the meantime, we HUP the agent periodically as a workaround, so the agent now has a constant reliance on the HTTP endpoint and is therefore more likely to encounter a problem.
Whether a switch, environment variable, config file on the server, etc, I'm happy to see whichever approach works best.