
Create the build in an async manner to improve response time to webhooks #2870

Closed
wants to merge 2 commits

Conversation

francoispqt

Hi,

Let me give some context for this PR:

We have been working on a configuration plugin to handle our monorepo and the solution we came up with does the following steps:

  • clones the repo based on the build event, in the same way https://github.com/drone/drone-git does it
  • determines which files are affected by the current build (the diff)
  • for each affected file, traverses the repo upward until it finds a drone config file
  • parses these drone config files and passes them to a template engine to compute the root drone file of the repository, then sends back the final config

Leveraging the multi-pipeline feature we have been able to come up with a solution which handles each microservice individually, erasing the drawbacks of monorepos in classic CI environments.

A root drone file using this can look like the following:

kind: pipeline
type: docker
name: "lint"

steps:
- name: lint
  image: golang:1.12
  commands:
  - lint

{{ if .Pipelines }}
{{ .Pipelines | toYaml }}
---
kind: pipeline
type: docker
name: "deploy"

steps:
- name: deploy
  image: debian
  commands:
  - deploy

depends_on:
{{ .PipelinesName | toYaml }}
{{- end }}

What we have done works great and we are planning to make it available to the community.

Yet there is one problem: our monorepo is pretty big, and even though we built a cache mechanism using a mirror repo as a reference, it still takes 5 to 8 seconds to fetch the repository in the configuration plugin. This leads to timeouts in the GitHub webhooks, which have a 10-second limit.

Following the best practices advised by GitHub (https://developer.github.com/v3/guides/best-practices-for-integrators/), we came up with this PR, which makes the handling of a webhook more asynchronous.

With this PR, the triggerer starts a goroutine to do most of the heavy lifting of creating the build; if an error occurs, it creates the build with an error message.

Thanks for the feedback

… with error if something goes wrong - the goal is to improve response time to webhook
@CLAassistant

CLAassistant commented Nov 8, 2019

CLA assistant check
All committers have signed the CLA.

@bradrydzewski

bradrydzewski commented Nov 8, 2019

First, thanks for the pull request 🙂

One concern I have is that by always returning a build, we give the impression a build was created, which will not always be the case. I can see this causing confusion and resulting in a support burden. For example, if the trigger condition is not satisfied the build is ignored, but a build is still returned in the hook body.

For this reason, we would probably be better off always returning a 200 OK with an empty body to avoid any confusion. This would simplify the pull request and would only require changing a few lines of code in the hook handler.

	go triggerer.Trigger(ctx, repo, hook)
	writeStatus(w, 200)

Before making this change we also need to understand whether or not anything useful is being sent back in the hook response on error, because we would be taking that away from users who may be using it to debug today. My gut tells me nothing useful is being returned, but we should verify anyway.

Lastly I had hoped one day to return more details in the response, including reasons a build may have been skipped. This will no longer be an option if we process the build async.

The timeout errors in GitHub can be unsightly, however, they do not prevent the trigger routine from executing and scheduling a build.

For the record I am not opposing this change, and I do think it is probably the right thing to do, but I wanted to document all the pros and cons and things we will need to consider so that others can weigh in.

@francoispqt
Author

francoispqt commented Nov 8, 2019

Thanks for the quick response.

Calling the whole trigger async definitely works for our issue, but I'm not sure it's the best solution.

When writing the PR, I was thinking that if we wanted the response to be useful, we might as well include the correct build number in it; that should be enough to debug, or at least enough to call the Drone API build info endpoint. But I missed the case where the build is skipped, in which case the response sent is invalid. What we could do, though, is create the build with the status "skipped" and a message about why it was skipped.

As for debugging from the webhook responses: we do it, but only for issues with our configuration plugin, because that's the only way. I think it's easier to debug from Drone itself, since not every contributor has access to the webhook responses, but they do have access to their builds in the Drone UI.

@bradrydzewski

bradrydzewski commented Nov 8, 2019

What we could do though is create the build with the status "skipped" and with a message about why it was skipped.

This is not something I would want to do. It would create a lot of noise in the system, create large gaps in build numbers, and increase database size.

@tboerger any thoughts?

Maybe we store the last N webhooks sent to the system in a separate table / S3 and periodically purge them?

@francoispqt
Author

this is not something I would want to do. This would create a lot of noise in the system and would create large gaps in build numbers.

I agree.

Trying to find another solution 🤔

@bradrydzewski

bradrydzewski commented Nov 8, 2019

I think the optimal solution probably looks something like this:

  1. run the trigger async and return a 200 OK with an empty body
  2. store the full hook in a new table that is periodically pruned so that it does not grow too big (with an option to store the full hook payload in S3)
  3. store the build id, error or skip reason inside the same record as the hook
  4. provide an option for the user to view hooks received per-repository

Not all systems provide access to webhook responses (Bitbucket, for example), and it often requires admin access, so hook responses were never a great way to debug. Storing the data in the system and making it accessible via an endpoint should make it much easier for users to debug.

@francoispqt
Author

I like the idea. I think we can still make the response from the webhook useful by doing something simple. What about this?

  1. Store the hook in a table right away and return 200 OK with a payload like:
{
  "hook_id": 1,
  "link": "https://drone.com/api/hooks/1"
}
  2. start a goroutine to process the build; alternatively, a worker could poll the hook table for unprocessed hooks and process them (this adds a delivery guarantee)
  3. update the hook entry along the way with metadata such as build id, error, skip reason, etc.
  4. provide a hooks resource in the Drone API and a hooks tab for each repo in the Drone UI to see hook history
  5. add a repo setting for the hook retention period and have a worker clear each repo's hooks once they are older than the retention period

@bradrydzewski

bradrydzewski commented Nov 9, 2019

yes, we'll have to come up with a more formal RFC (we don't want to rush into this), but I feel pretty good about the ideas discussed so far. Interested to hear from @ashwilliams1 and @tboerger

@tboerger

tboerger commented Nov 9, 2019

I like the idea of storing the webhooks. It's always a pain to find the right webhook within GitHub for debugging.

I also like listing the hooks in the UI: we could show, directly in the list, the attributes that currently require toggling the hook view within GitHub, and we could even filter them.

@francoispqt
Author

francoispqt commented Nov 9, 2019

I don't know what the decision-making process is, but if we go ahead at some point, I'd be happy to work on the implementation :)

@tboerger

tboerger commented Nov 9, 2019

Brad is the project owner, so he is the one who decides on the final solution, but he asked us, the other maintainers and contributors, for our opinions to reach a decision.

@hitesharinga hitesharinga changed the base branch from master to drone October 4, 2023 02:47
@bot2-harness
Collaborator

This PR has been automatically closed due to inactivity.
