Zero-downtime deployments #194

Open · kevinlul opened this issue Oct 10, 2022 · 6 comments


A new release to the live Bastion can cause up to a minute of downtime, because the bot process is stopped and a new process is started, which reconnects shard-by-shard to the Discord gateway to avoid being rate-limited. For zero downtime, a deployment must start the new process first, ensure that no duplicate responses occur while the two processes overlap, and stop the old process only once the new one has connected all its shards to the gateway. Two things must be implemented for this to happen: container start-before-stop and a lock manager.

Container start before stop

In Swarm, this is configured as deploy.update_config.order: start-first (docs). Compose v1 and v2 do not support this, so using it would require switching production to a single-node Swarm. (Bastion is not yet at the scale where the additional benefit of separating shards into their own processes would be reaped.)
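For reference, a minimal sketch of such a Swarm stack file; the service and image names here are assumptions, the update_config stanza is the point:

```yaml
services:
  bastion:
    image: bastion-bot:latest # hypothetical image name
    deploy:
      update_config:
        order: start-first # start the new task before stopping the old one
```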

A pure Compose solution could be to use a different project name for each deployment, as long as previous project names are tracked so the old stack can be taken down.

Lock manager

Since Bastion is not yet at the scale of sharding across multiple hosts, the fastest solution should be SQLite in write-ahead-log (WAL) mode. Before processing a message or interaction, attempt to INSERT its snowflake into a table. Continue only if this succeeds, as we then hold the lock; if not, a different process has taken the lock. In the general case, this kind of overhead could also guard against Discord's eventual consistency delivering the same event N times, though that has never been a problem in practice. The overhead could also be limited to the deployment window by toggling it upon receiving a certain Unix signal.
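A minimal sketch of that lock, assuming the better-sqlite3 package and a hypothetical tryLock helper name:

```typescript
// Each process attempts to claim an event's snowflake; only the winner proceeds.
import Database from "better-sqlite3";

const db = new Database("/dev/shm/bastion-locks.db"); // hypothetical location
db.pragma("journal_mode = WAL");
db.exec("CREATE TABLE IF NOT EXISTS event_locks (snowflake TEXT PRIMARY KEY)");

const claim = db.prepare("INSERT OR IGNORE INTO event_locks (snowflake) VALUES (?)");

// Returns true if this process won the lock for the given event.
export function tryLock(snowflake: string): boolean {
    // changes === 1 means our INSERT landed; 0 means another process beat us to it.
    return claim.run(snowflake).changes === 1;
}
```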

Caveats

VM memory demands increase since it must support two bot Node.js containers running during the deployment window. The addition of <> card search has already increased memory demands.

In general (not just the zero-downtime case), how button timeouts behave across a redeployment or a bot restart should be considered.

kevinlul commented Nov 2, 2022

There should now be RAM capacity for this.

kevinlul added commits that referenced this issue on Nov 4, 5, and 7, 2022
kevinlul self-assigned this Nov 6, 2022
kevinlul commented Nov 8, 2022

The signal mechanism is insufficient in the case of container restarts, since there will be nothing to signal the EventLocker off. A timeout from boot should be added instead.
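Something like this hypothetical sketch, with an assumed window length:

```typescript
// Disable event locking a fixed interval after boot, instead of waiting for a
// signal that may never arrive after a container restart.
const DEPLOYMENT_WINDOW_MS = 5 * 60 * 1000; // assumed upper bound on a deployment

export let eventLockingEnabled = true;
// unref() so this timer alone does not keep the process alive.
setTimeout(() => { eventLockingEnabled = false; }, DEPLOYMENT_WINDOW_MS).unref();
```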

kevinlul commented Jul 8, 2023

https://github.com/Wowu/docker-rollout — a new script that's very well done. It can be incorporated to start the new container before removing the old one under Compose, without switching to a single-node Swarm. Installation should be pinned to a fixed version with a checksum, like docker-stack-wait for Swarm.

kevinlul commented Jul 8, 2023

(Optional) Add a third CLI parameter that specifies the lock database location; its presence enables the EventLocker. This allows specifying a separate tmpfs, so there is never a disk write. To share the lock database between Docker containers, Docker's tmpfs volume type can't be used, as it can't be shared; bind mount the host's /dev/shm instead.
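A hypothetical Compose fragment for that bind mount:

```yaml
services:
  bastion:
    volumes:
      # Docker tmpfs volumes cannot be shared, so bind-mount the host's /dev/shm
      # to give both containers the same lock database without any disk writes.
      - /dev/shm:/dev/shm
```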

kevinlul commented Jul 8, 2023

To properly assess the state of the containers, a healthcheck is needed. Create an HTTP healthcheck using the built-in node:http module, responding 200 OK if and only if all Discord bot shards are ready. This HTTP server may listen on a TCP port or a Unix socket in /run or /tmp. The Dockerfile should include a HEALTHCHECK for this endpoint. If possible, the additional timeout-based EventLocker shutoff mechanism should only start once all shards are ready.
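A sketch of that healthcheck, assuming discord.js v14 and a hypothetical startHealthcheck entry point:

```typescript
import { createServer } from "node:http";
import { Client, Status } from "discord.js";

// Respond 200 OK if and only if every gateway shard is ready, else 503.
export function startHealthcheck(client: Client, port = 8080): void {
    createServer((_, response) => {
        const allReady = client.ws.shards.every(shard => shard.status === Status.Ready);
        response.writeHead(allReady ? 200 : 503).end();
    }).listen(port); // could instead listen on a Unix socket in /run or /tmp
}
```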

kevinlul removed their assignment Aug 1, 2023

kevinlul commented
Alternate concept based on https://github.com/meister03/discord-hybrid-sharding:

Add a CLI parameter that starts the bot in standby, with the current timestamp as its value. When the bot is started in standby, only a base set of event listeners is registered (warn, error, shard*, ready): https://github.com/DawnbrandBots/bastion-bot/blob/master/src/bot.ts

Since the bot program starts in standby, the new instance will not handle events while the old instance is still up. Once the ready event is emitted, the deployment system can detect this and simultaneously signal the new bot to become active while shutting off the old bot. The new bot becomes active by registering all remaining event listeners (guildCreate, guildDelete, the listeners array containing interaction, messageCreate, messageDelete). At this point, the takeover is complete and so is the deployment.

Should the bot process crash or otherwise be restarted, it can compare its own start time with the timestamp in the CLI parameter. If the difference is too large, it knows it has been restarted and can start up in active mode instead of hanging in standby when there is no deployment system.
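A hypothetical sketch of that startup decision; the argument position, grace period, activation signal, and helper names are all assumptions:

```typescript
function registerBaseListeners(): void {
    // warn, error, shard*, ready — see src/bot.ts
}
function registerRemainingListeners(): void {
    // guildCreate, guildDelete, listeners array (interaction, messageCreate, messageDelete)
}

const STANDBY_GRACE_MS = 60 * 1000; // assumed: how stale the timestamp may be
const standbyTimestamp = Number(process.argv[2]); // assumed position, Unix seconds

// A fresh timestamp means a deployment system launched us and will signal takeover.
const freshDeployment =
    Number.isFinite(standbyTimestamp) &&
    Date.now() - standbyTimestamp * 1000 < STANDBY_GRACE_MS;

registerBaseListeners();
if (freshDeployment) {
    process.once("SIGUSR2", registerRemainingListeners); // assumed signal choice
} else {
    // Crashed or restarted with a stale timestamp: no deployment system is
    // coordinating, so go straight to active mode.
    registerRemainingListeners();
}
```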
