
Trino pods go down instantly when autoscaling terminates pods, even though terminationGracePeriodSeconds is set to 300 seconds #22483

Closed
hsushmitha opened this issue Jun 24, 2024 · 9 comments

Comments

@hsushmitha

We have set terminationGracePeriodSeconds to 300s on the Trino coordinator and worker nodes. During autoscaling, when the number of worker pods increases and decreases, pods terminate instantly without waiting for the queries running on them to finish.
We have also set shutdown.grace-period=300s on the Trino coordinator and workers.
The expectation is that the Trino worker pods wait up to 300 seconds for their tasks to complete instead of terminating instantly.

In Starburst we have set starburstWorkerShutdownGracePeriodSeconds: 300 (which corresponds to shutdown.grace-period=300s) and deploymentTerminationGracePeriodSeconds: 300 (which corresponds to terminationGracePeriodSeconds), and there the worker pods wait up to 300 seconds for query tasks to run to completion before terminating, as expected.
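
For orientation, the two settings described above act at different layers. A minimal sketch of the intended setup, with illustrative file locations (the exact Helm chart keys depend on the chart version in use):

    # Pod spec of the worker deployment (Kubernetes side)
    spec:
      terminationGracePeriodSeconds: 300   # time Kubernetes allows between SIGTERM and SIGKILL

    # config.properties on the workers (Trino side)
    # shutdown.grace-period controls how long a worker in SHUTTING_DOWN state waits
    # before draining its tasks, and again before exiting
    shutdown.grace-period=300s

Note that Kubernetes sends SIGTERM at the start of the grace period; if the process exits immediately on SIGTERM, the pod terminates right away regardless of the configured grace period.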

@nineinchnick
Member

Is this about the Trino Helm chart? If yes, can you include the values to reproduce this?

@hsushmitha
Author

It is about the Trino Helm chart. Attaching the deployment configs and values file to reproduce the issue.

values.txt
deployment-coordinator.txt
deployment-worker.txt

@nineinchnick
Member

Which chart version are you using? How do you apply the changes you included in the deployment-*.txt files?

In the latest chart version, you have to set coordinator.terminationGracePeriodSeconds and worker.terminationGracePeriodSeconds. See https://trinodb.github.io/charts/charts/trino/
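
A minimal values.yaml sketch of those two settings, using the 300-second value from this issue:

    coordinator:
      terminationGracePeriodSeconds: 300
    worker:
      terminationGracePeriodSeconds: 300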

@hsushmitha
Author

We are using Helm chart version trino-0.8.0. We deploy the changes with helm upgrade trino . -f values.yaml -n trino. The attached files above are YAML files; since we couldn't attach YAML files, we attached them as .txt versions.

@nineinchnick
Member

nineinchnick commented Jun 25, 2024

That's very old. I don't know how the chart was structured back then, and I can't help anymore. Can you try using the latest version?

@hsushmitha
Author

We have upgraded the Helm chart to 0.25.0, and terminationGracePeriodSeconds is set to 300. The Trino pods are still terminating instantly, without staying in the Terminating state for 300s.

@nineinchnick
Member

I checked that the default Trino Docker image entrypoint doesn't handle signals sent to the container in any special way. The Trino server also doesn't do this. To handle graceful shutdown, you have to configure the pod's lifecycle in the worker.lifecycle section. See the Helm chart docs for an example.

@hsushmitha
Author

Hi, we have set the lifecycle preStop hook and terminationGracePeriodSeconds in values.yaml:

  lifecycle:
  # worker.lifecycle -- To enable [graceful
  # shutdown](https://trino.io/docs/current/admin/graceful-shutdown.html),
  # define a lifecycle preStop like below. Set the
  # `terminationGracePeriodSeconds` to a value greater than or equal to the
  # configured `shutdown.grace-period`. Configure `shutdown.grace-period` in
  # `additionalConfigProperties` as `shutdown.grace-period=2m` (default is 2
  # minutes). Also configure `accessControl` because the `default` system
  # access control does not allow graceful shutdowns.
  # @raw
  # Example:
  # ```yaml
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl -v -X PUT -d '\"SHUTTING_DOWN\"' -H \"Content-type: application/json\" -H \"X-Trino-User: trino\" https://localhost:8080/v1/info/state"]
  # ```

  terminationGracePeriodSeconds: 300

We have also set shutdown.grace-period in additionalWorkerConfigProperties:

additionalWorkerConfigProperties:
  - shutdown.grace-period=300s

We still see the worker pods getting terminated abruptly, without staying in the Terminating state for 300s, which causes queries to fail. Is there anything else that needs to be set to make sure the pods shut down gracefully?
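
One item from the chart comment quoted above that is easy to miss: accessControl also has to be configured, because the default system access control does not allow the shutdown request sent by the preStop hook. A hedged sketch of file-based access-control rules that grant the user used in the preStop curl (trino here) write access to system information, per Trino's graceful-shutdown documentation; how these rules are wired into the chart's accessControl value depends on the chart version:

    {
      "catalogs": [
        {"allow": "all"}
      ],
      "system_information": [
        {"user": "trino", "allow": ["read", "write"]}
      ]
    }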

@hashhar
Member

hashhar commented Oct 14, 2024

If any of the tasks take longer than the termination grace period, queries are going to fail.

See docs at https://trino.io/docs/current/admin/graceful-shutdown.html which explain how graceful shutdown works.

The grace period hence needs to be at least as long as the longest tasks (for simplicity, assume queries) that execute on your cluster.
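
Putting the thread together, a sizing sketch with illustrative numbers: per the graceful-shutdown docs, a worker in SHUTTING_DOWN state sleeps for shutdown.grace-period, drains its active tasks, then sleeps for the grace period again before exiting, so the pod's termination grace period has to cover that whole sequence.

    # Illustrative values only -- size them from your longest-running queries.
    additionalWorkerConfigProperties:
      - shutdown.grace-period=300s
    worker:
      terminationGracePeriodSeconds: 700  # roughly 2x shutdown.grace-period plus the longest expected task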

@trinodb trinodb locked and limited conversation to collaborators Oct 14, 2024
@hashhar hashhar converted this issue into discussion #23775 Oct 14, 2024

This issue was moved to a discussion.

You can continue the conversation there.
