Confusing documentation around cords or a bug #565

Open
olivermt opened this issue Nov 4, 2023 · 8 comments


olivermt commented Nov 4, 2023

Hello,

I have been struggling a lot with getting kamal >1.0.0 working with cords.
I am not 100% sure if there is a bug or if I am doing something wrong.
Either way I think my questions / observations should either lead to a bugfix or to improved documentation text :)

Are cords automatic?

The docs leave me very unsure whether we should handle the cord ourselves and respond with something other than 200 OK in the healthcheck, or whether the healthchecker itself knows how to look for the cord. The release upgrade text reads as if the healthcheck now simply uses the cord, but when we ran our second deploy on kamal 1.1.0 after upgrading from Mrsk it started failing with the very strange "failed healthy unhealthy" or something along those lines.

First of all, this error message could probably be a lot more informative, e.g. "cord is configured but container is still responding healthy". And again, nothing explicitly tells you to check for this file and respond with different statuses to signal to Traefik what's up.

If cord is not automatic, initial deploy will always fail

Since the cord is only there on a rolling deploy, an initial deploy does not get a cord set up. I discovered this after doing a kamal app remove and doing a fresh deploy trying to handle the cord myself. If you are supposed to handle the cord yourself then this is currently broken for any new users and/or new server setups.

I have no idea how to reliably detect if a deploy is an initial one so I can't really do any heuristics to say if a missing cord should result in a 404 or not.

Edit: There is a kamal-cord directory on the fresh install, as I guess it's a default container mount now, but the build logs show no trace of cords being created, nor of the regular touch ..cordstuff../cord

Edit2: cc @djmb as the cord PR author :)

Author

olivermt commented Nov 4, 2023

Also, if a cord check fails with the healthy (unhealthy) thingy, things seem to fully break as it just exits, so the deploy lock is also not released, which I am not sure makes any sense.

Collaborator

djmb commented Nov 6, 2023

Hi @olivermt!

The cord should work automatically, so you shouldn't need to do anything yourself.

It does two things to your docker container:

  1. Mounts a volume into its /tmp directory and copies the "cord file" into that container
  2. Rewrites its healthcheck command to also check for the existence of the cord file

This allows us to delete the file to force the healthcheck to fail before we stop the container. It sounds maybe from your description that the healthcheck is not failing?
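As a rough illustration of the rewrite in step 2, here is a minimal sketch of how an existing healthcheck command could be wrapped so that it also requires the cord file. The `with_cord` helper is hypothetical and is not Kamal's code; it only reproduces the shape of the command that shows up later in this thread, using the example URL and cord path quoted there.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kamal): wrap an existing healthcheck
# command so that it also requires the cord file to exist.
with_cord() {
  # $1 = original healthcheck command, $2 = cord path inside the container
  printf '(%s) && (stat %s > /dev/null || exit 1)\n' "$1" "$2"
}

# Example values taken from the commands quoted in this thread.
rewritten=$(with_cord 'curl -f https://localhost:3000/up -m 5 || exit 1' '/tmp/kamal-cord/cord')
echo "$rewritten"
```

Because both halves are joined with `&&`, the combined check can only pass while the app check passes and the cord file still exists.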

But to help debug could you share the logs from where it fails (redacting anything private from them!)?

Also do you have any custom configuration under the healthcheck key?

And could you extract the actual healthcheck from one of the containers? (via docker inspect)

Author

olivermt commented Nov 14, 2023

Hello!

> It sounds maybe from your description that the healthcheck is not failing?

Correct, it just keeps responding healthy.

I don't quite understand what you mean by the docker inspect, but what I can tell you is that it only fails on a host that has two separate files and deploys pointing at it (separated by labels on the services).

So there is probably something when you run two services on one host (which is supported, but not recommended?).

I am just gonna move this to another small host instead of forcing them onto the same one, but I can hold off if you want me to debug some more for you, if you want to make sure the two(or more)-on-same-host should work.

Collaborator

djmb commented Nov 15, 2023

@olivermt - the cord files are namespaced by the app and destination so two services shouldn't be an issue (but I've not tested that out).

Re: the docker healthcheck - sorry I meant could you run:

$ docker inspect <container_id> -f '{{ .Config.Healthcheck.Test }}'

It should output something like:

[CMD-SHELL (curl -f https://localhost:3000/up -m 5 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)]

Maybe there's something in the healthcheck that means that it always returns a healthy result?

@tsangiotis

I am trying to deploy a django app with kamal. I cannot healthcheck with cord. The container seems unhealthy.

The command requested above gives the following:

$ docker inspect <container_id> -f '{{ .Config.Healthcheck.Test }}'
[CMD-SHELL (curl -f https://localhost:8000/health || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)]

I also added an ls /tmp/kamal-cord on my entrypoint and cannot find the directory. It is as if it is not there.

$ ls /tmp/kamal-cord
ls: cannot access '/tmp/kamal-cord': No such file or directory

@ShubhamPalriwala

Facing the exact same issue with a NextJs App with Kamal + Traefik!

@wenderjean

Do you guys have any updates in this issue? I'm getting the same here, my "replaced" containers never become unhealthy 🤔

Collaborator

djmb commented Mar 5, 2024

The cord file is a workaround to allow us to force the container to be unhealthy while it still can handle requests. Sending docker stop will work the wrong way round - the container will stop handling requests, then it will take Traefik a few seconds to notice - in the meantime it will be serving errors.

How it should work is:

Before starting a new container Kamal creates a new directory on the host and adds an empty file cord in it. This will look something like this in the logs:

Running /usr/bin/env mkdir -p .kamal/cords/my-app-production-<newsha> ; touch .kamal/cords/my-app-production-<newsha>/cord

Then, when starting the new container, it will map the directory into it (by default to /tmp/kamal-cord) and modify the healthcheck to add a check for the existence of that file.

This will look like:

Running docker run --detach --restart unless-stopped --name <container_name> <SNIP> --health-cmd "(<HEALTHCHECK>) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)" --health-interval "1s" --volume $(pwd)/.kamal/cords/my-app-production-<newsha>:/tmp/kamal-cord <SNIP>

Then when that new container has started up and is healthy, Kamal deletes the cord file from the old container, which should cause the healthcheck to fail and the container to be marked as unhealthy.

This will appear as:

Running /usr/bin/env rm -r /local/app/.kamal/cords/my-app-production-<oldsha>
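The sequence above can be imitated locally, without Docker, to see what each container's healthcheck should report after the old cord is removed. This is only a sketch under stated assumptions: a temporary directory stands in for .kamal/cords on the host, "old" and "new" stand in for the version SHAs, and `app_check` is a hypothetical app check that always succeeds.

```shell
#!/bin/sh
# Local imitation of the rollover, no Docker involved. Assumptions: a temp
# directory stands in for .kamal/cords on the host, "old" and "new" stand in
# for the version SHAs, and app_check is a hypothetical always-healthy app.
cords=$(mktemp -d)
app_check() { true; }

# Before a container starts: create its cord directory and cord file.
add_cord() { mkdir -p "$cords/my-app-production-$1" && touch "$cords/my-app-production-$1/cord"; }

# Healthcheck as container $1 would see it: app check AND its own cord file.
healthcheck() {
  (app_check || exit 1) && (stat "$cords/my-app-production-$1/cord" > /dev/null 2>&1 || exit 1)
}

add_cord old
add_cord new

# The new container is healthy, so cut the old container's cord.
rm -r "$cords/my-app-production-old"

healthcheck old; old_status=$?   # app still answers, but the cord is gone
healthcheck new; new_status=$?   # cord untouched
echo "old=$old_status new=$new_status"   # prints "old=1 new=0"

rm -r "$cords"
```

The old check fails even though its app half still succeeds, which is the state a real deploy is expected to reach for the replaced container.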

Since the problem is that the container is not getting marked as unhealthy, that suggests the healthcheck is continuing to succeed, I guess either because the cord file is not deleted, or because we have a healthcheck that doesn't fail when it is deleted.

The way to debug this is to find out exactly what the health check is and then see what happens when you run it manually in the container.

So from the host run:

$ docker inspect <container_name> -f '{{ .Config.Healthcheck.Test }}'
[CMD-SHELL (curl -f https://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)]

Then exec into the container and see why that command doesn't return an error code:

$ docker exec -it <container_name> bash
# (curl -f https://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)
# echo $?
0
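If the combined command returns 0, it can help to run the two halves separately to see which one keeps succeeding. The helper below is a hypothetical sketch, not a Kamal command; the demonstration call uses stand-ins (`true` for a healthy app, a path that does not exist for a deleted cord) so that it runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: run each half of the healthcheck separately and
# report its exit status, so you can see which half keeps succeeding.
check_parts() {
  # $1 = app healthcheck command, $2 = cord path
  sh -c "$1" > /dev/null 2>&1; app_status=$?
  stat "$2" > /dev/null 2>&1; cord_status=$?
  echo "app=$app_status cord=$cord_status"
}

# Self-contained demonstration with stand-ins: `true` plays a healthy app,
# and a path that does not exist plays a deleted cord.
check_parts 'true' '/nonexistent/kamal-cord/cord'
```

Inside the old container you would instead pass the app command and cord path shown by docker inspect, for example `check_parts 'curl -f https://localhost:3000/up -m 10' /tmp/kamal-cord/cord`. A cord status of 0 after the rm step would suggest the cord file is still visible to that container.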
