Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter out pending stages from killed builds in queue endpoint #2653

Closed
wants to merge 1 commit into from
Closed

Filter out pending stages from killed builds in queue endpoint #2653

wants to merge 1 commit into from

Conversation

MorrisJobke
Copy link

Those pending stages are caused by cancelling a multi-pipeline job while not all stages are started yet. Then the later stages are still in a pending state. This causes trouble with the autoscaler because the autoscaler uses those pending stages to determine the backpressure and starts workers.

Another approach would be to properly kill those pending stages, but my Golang skills are currently too limited to accomplish this. Also further debugging would be needed to determine why it didn't worked in the first place. So this failsafe here is a good additional step to avoid unneeded upscale of agents.

The table looks like this right now:

Full table with pending stages with a query before and after this change
queryUnfinishedMysql from store/stage/stage.go with additional builds table joined
mysql> SELECT
    ->  stage_id
    -> ,stage_build_id
    -> ,stage_number
    -> ,stage_name
    -> ,stage_status
    -> ,stage_machine
    -> ,stage_started
    -> ,stage_stopped
    -> ,stage_created
    -> ,stage_updated
    -> ,b.build_source_repo
    -> ,b.build_status
    -> ,b.build_number
    -> FROM stages s
    -> JOIN builds b
    ->   ON s.stage_build_id = b.build_id
    -> WHERE stage_id IN (SELECT stage_id FROM stages_unfinished)
    ->   AND stage_status IN ('pending','running')
    -> ORDER BY stage_id ASC;
+----------+----------------+--------------+-------------------------------------------------+--------------+----------------+---------------+---------------+---------------+---------------+-------------------+--------------+--------------+
| stage_id | stage_build_id | stage_number | stage_name                                      | stage_status | stage_machine  | stage_started | stage_stopped | stage_created | stage_updated | build_source_repo | build_status | build_number |
+----------+----------------+--------------+-------------------------------------------------+--------------+----------------+---------------+---------------+---------------+---------------+-------------------+--------------+--------------+
|  5345086 |          51636 |           69 | memcache-redis-cluster                          | pending      | drone-worker2  |             0 |             0 |    1554412702 |    1554413411 | nextcloud/server  | killed       |        17541 |
|  5345087 |          51636 |           70 | ui-regression                                   | pending      | drone-worker2  |             0 |             0 |    1554412702 |    1554413412 | nextcloud/server  | killed       |        17541 |
|  5345256 |          51644 |           28 | integration-ocs-v1                              | pending      | drone-worker2  |             0 |             0 |    1554413599 |    1554413665 | nextcloud/server  | killed       |        17549 |
|  5345257 |          51644 |           29 | integration-sharing-v1                          | pending      | drone-worker2  |             0 |             0 |    1554413599 |    1554413665 | nextcloud/server  | killed       |        17549 |
|  5345258 |          51644 |           30 | integration-sharing-v1-part2                    | pending      | drone-worker2  |             0 |             0 |    1554413599 |    1554413666 | nextcloud/server  | killed       |        17549 |
|  5345259 |          51644 |           31 | integration-sharing-v1-part3                    | pending      | drone-worker2  |             0 |             0 |    1554413599 |    1554413667 | nextcloud/server  | killed       |        17549 |
...
|  5348653 |          51856 |           63 | acceptance-app-files-tags                       | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725432 | nextcloud/server  | killed       |        17612 |
|  5348654 |          51856 |           64 | acceptance-app-theming                          | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725432 | nextcloud/server  | killed       |        17612 |
|  5348655 |          51856 |           65 | acceptance-header                               | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725432 | nextcloud/server  | killed       |        17612 |
|  5348656 |          51856 |           66 | acceptance-login                                | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348657 |          51856 |           67 | acceptance-users                                | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348658 |          51856 |           68 | acceptance-apps                                 | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348659 |          51856 |           69 | nodb-codecov                                    | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348660 |          51856 |           70 | db-codecov                                      | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348661 |          51856 |           71 | object-store-s3                                 | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348662 |          51856 |           72 | object-store-azure                              | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348663 |          51856 |           73 | memcache-memcached                              | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348664 |          51856 |           74 | memcache-redis-cluster                          | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5348665 |          51856 |           75 | ui-regression                                   | pending      | drone-worker2  |             0 |             0 |    1554722507 |    1554725433 | nextcloud/server  | killed       |        17612 |
|  5351655 |          51934 |           24 | mysql-php7.0                                    | running      | drone          |    1554758462 |             0 |    1554758462 |    1554758462 |                   | running      |        17648 |
|  5351706 |          51934 |           75 | acceptance-app-files-sharing                    | running      | drone          |    1554758758 |             0 |    1554758462 |    1554758759 |                   | running      |        17648 |
|  5351715 |          51934 |           84 | db-codecov                                      | running      | agent-LtQXALYh |    1554758826 |             0 |    1554758462 |    1554758826 |                   | running      |        17648 |
+----------+----------------+--------------+-------------------------------------------------+--------------+----------------+---------------+---------------+---------------+---------------+-------------------+--------------+--------------+
257 rows in set (0.00 sec)

updated queryUnfinishedMysql from store/stage/stage.go with additional builds table joined to filter out the unneeded pending stages
mysql> SELECT
    ->  stage_id
    -> ,stage_repo_id
    -> ,stage_build_id
    -> ,stage_number
    -> ,stage_name
    -> ,stage_kind
    -> ,stage_type
    -> ,stage_status
    -> ,stage_error
    -> ,stage_errignore
    -> ,stage_exit_code
    -> ,stage_limit
    -> ,stage_os
    -> ,stage_arch
    -> ,stage_variant
    -> ,stage_kernel
    -> ,stage_machine
    -> ,stage_started
    -> ,stage_stopped
    -> ,stage_created
    -> ,stage_updated
    -> ,stage_version
    -> ,stage_on_success
    -> ,stage_on_failure
    -> ,stage_depends_on
    -> ,stage_labels
    -> FROM stages
    ->  JOIN builds
    ->    ON stages.stage_build_id = builds.build_id
    -> WHERE stage_id IN (SELECT stage_id FROM stages_unfinished)
    ->   AND stage_status IN ('pending','running')
    ->   AND build_status <> 'killed'
    -> ORDER BY stage_id ASC;
+----------+---------------+----------------+--------------+------------------------------+------------+------------+--------------+-------------+-----------------+-----------------+-------------+----------+------------+---------------+--------------+----------------+---------------+---------------+---------------+---------------+---------------+------------------+------------------+------------------+--------------+
| stage_id | stage_repo_id | stage_build_id | stage_number | stage_name                   | stage_kind | stage_type | stage_status | stage_error | stage_errignore | stage_exit_code | stage_limit | stage_os | stage_arch | stage_variant | stage_kernel | stage_machine  | stage_started | stage_stopped | stage_created | stage_updated | stage_version | stage_on_success | stage_on_failure | stage_depends_on | stage_labels |
+----------+---------------+----------------+--------------+------------------------------+------------+------------+--------------+-------------+-----------------+-----------------+-------------+----------+------------+---------------+--------------+----------------+---------------+---------------+---------------+---------------+---------------+------------------+------------------+------------------+--------------+
|  5351655 |             1 |          51934 |           24 | mysql-php7.0                 |            |            | running      |             |               0 |               0 |           0 | linux    | amd64      |               |              | drone          |    1554758462 |             0 |    1554758462 |    1554758462 |             3 |                1 |                0 | null             | null         |
|  5351706 |             1 |          51934 |           75 | acceptance-app-files-sharing |            |            | running      |             |               0 |               0 |           0 | linux    | amd64      |               |              | drone          |    1554758758 |             0 |    1554758462 |    1554758759 |             3 |                1 |                0 | null             | null         |
|  5351715 |             1 |          51934 |           84 | db-codecov                   |            |            | running      |             |               0 |               0 |           0 | linux    | amd64      |               |              | agent-LtQXALYh |    1554758826 |             0 |    1554758462 |    1554758826 |             3 |                1 |                0 | null             | null         |
+----------+---------------+----------------+--------------+------------------------------+------------+------------+--------------+-------------+-----------------+-----------------+-------------+----------+------------+---------------+--------------+----------------+---------------+---------------+---------------+---------------+---------------+------------------+------------------+------------------+--------------+
3 rows in set (0.00 sec)

Those pending stages are caused by cancelling a multi-pipeline job while not all stages are started yet. Then the later stages are still in a pending state. This causes trouble with the autoscaler because the autoscaler uses those pending stages to determine the backpressure and starts workers.

Signed-off-by: Morris Jobke <[email protected]>
@MorrisJobke
Copy link
Author

Those pending stages are caused by cancelling a multi-pipeline job while not all stages are started yet.

I just tested this again and it is actually reproducible:

  • have a new CI job triggered with many multi-machine steps
  • when some of them are processed and some are still in pending press the "cancel" button
  • then only the already started ones get into the state "killed" and then pending ones keep in the pending state.

Copy link

@bradrydzewski bradrydzewski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not been able to reproduce the issue yet, but either way I would prefer a solution that does not add a database join. The join is suboptimal and feels like more of a workaround. It will eventually result in database performance issues and records that build up over time. We use optimistic locking in Drone which should prevent these situations. Instead of a workaround I would prefer we identify the root cause and fix the problem at its source.

@MorrisJobke
Copy link
Author

I can reproduce this here with a MySQL DB. Do you have any steps I can do to debug this? Does the debug look help with that?

@genisd
Copy link

genisd commented Jul 4, 2019

we do have this behaviour as well. I have to cleanup the pending stuff on a regular basis to avoid having idle worker servers running endlessly by the autoscaler. Even if this is a workaround, i'd rather not do this endlessly myself :(

Debugging the root cause is not easy. I always find it hard to follow how things communicate (the timings of it for example).
I agree with @MorrisJobke the cause lies somewhere in cancelling running builds.

Edit/Update: We're running this in prod as of 19 juli, so far so good :-)
Just one little detail, this PR only adjusts the MySQL driver query, we're running sqlite so I essnetially took the exact same adjustment for the sqlite query. So if we were to ship something like this we should do the adjustment in the sqlite driver as well

@bradrydzewski
Copy link

while I appreciate the proactive nature of the pull request, it does not address the actual root cause which is how stages can be stuck in a pending state when the parent build is in a finished state (note that I have not experienced this behavior personally). The fix is not filtering these out of the database query, the fix is to prevent this invalid state in the first place. I will accept a pull request that does the latter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants