-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Dashboard, Repo List, Feed #1234
Comments
I performed benchmarking of In 0.3 we sync a user's repository list with GitHub and persist in the database, using a for reference, the above query drives the dashbaord page only I simulated user accounts that had 250, 500 and 999 repositories with the following results:
The below chart displays the number of queries we are able to return per second (Y-axis) depending on how many repos the user has (X-axis). The red delta helps us visualize the overhead of the When the user has 250 repositories or less there is little or no overhead associated with using an The tests demonstrate that with either approach, we can still serve hundreds of requests per second to users with 999 repositories. This does not simulate the fact that the database will be serving other queries, or may be locked for a write. The likelihood, however, that a private Drone installation would see that kind of volume accessing the dashboard page (which is where this query is executed) is very very low. The increase in overhead for users with a large number of repositories may therefore be justified to remove the complexity of syncing. I still have the following concerns:
|
cc @benschumacher would love to hear your thoughts. I'm guessing that syncing repositories complicated your integrations. |
One major change that comes with 0.4 is that we no longer sync your complete repository list in the database (for reasons described in #776). Instead we use the GitHub API to fetch your repository list and your repository permissions live.
One disadvantage to this approach is that we don't have a simple way (like a database join) to tell Drone to quickly give me a list of all repositories I have access to, or a build feed for all repositories I have access to.
I believe the solution is to fetch the list of all repositories from GitHub and then to create a really big
IN
statement. TheIN
operator will get passed the list of all repository names from GitHub as demonstrated in the following example:I even have a function already stubbed out in the code:
According to the documentation sqlite can hold a maximum of 999 parameters in the
IN
statement, and mysql and postgres [1] can hold tens of thousands of parameters. I remember doing this about 10 years ago to query an Oracle database with a few million rows and it worked quite well. So I'm pretty confident this approach is technically feasible, however, I still want to understand the performance implications on large datasets for the database vendors that we support.Until I have benchmarks that support the above design, we implement the following workarounds:
/user/repos
endpoint returns a list of all repositories to which the user has authored a commit, based on data in thebuilds
table/user/feed
endpoint returns a list of all builds the user has authoredI am proactively logging this issue to explain why the behavior is currently different and might appear broken, to assure everyone this is just a temporary workaround, and that I hope to have the proposed changes in place prior to the final 0.4 release.
[1] consider described optimization for postgres https://www.datadoghq.com/blog/100x-faster-postgres-performance-by-changing-1-line/
The text was updated successfully, but these errors were encountered: