
Improve Dashboard, Repo List, Feed #1234

Closed · bradrydzewski opened this issue Oct 13, 2015 · 2 comments

bradrydzewski commented:

One major change that comes with 0.4 is that we no longer sync your complete repository list in the database (for reasons described in #776). Instead we use the GitHub API to fetch your repository list and your repository permissions live.

One disadvantage of this approach is that we no longer have a simple way (such as a database join) to ask Drone for a list of all repositories I have access to, or a build feed for all repositories I have access to.

I believe the solution is to fetch the full repository list from GitHub and then build one large IN clause. The IN operator is passed the list of repository names fetched from GitHub, as in the following example:

select *
from repos
where repo_full_name IN (?, ?, ?, ..., ?)

I even have a function already stubbed out in the code:

// GetRepoListOf returns the subset of repositories in the database whose
// full names appear in listof (the list fetched from the GitHub API).
func GetRepoListOf(db meddler.DB, listof []string) ([]*Repo, error) {
    var repos = []*Repo{}
    var qs = make([]string, len(listof))      // one "?" placeholder per name
    var in = make([]interface{}, len(listof)) // the bound values themselves
    for i, repo := range listof {
        qs[i] = "?"
        in[i] = repo
    }
    var stmt = "SELECT * FROM repos WHERE repo_full_name IN (" + strings.Join(qs, ",") + ")"
    var err = meddler.QueryAll(db, &repos, database.Rebind(stmt), in...)
    return repos, err
}

According to the documentation, SQLite allows at most 999 bound parameters per statement, while MySQL and Postgres [1] can handle tens of thousands. I remember doing this about ten years ago to query an Oracle database with a few million rows, and it worked quite well. So I'm fairly confident this approach is technically feasible; however, I still want to understand the performance implications on large datasets for each database vendor we support.
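
If the parameter cap becomes a problem, one mitigation would be to chunk the name list and issue multiple queries. A minimal sketch, assuming the GetRepoListOf stub above (the helper name and the 999 constant are mine, not from the codebase):

// getRepoListChunked is a hypothetical wrapper around GetRepoListOf that
// splits the name list into batches that fit under SQLite's default limit
// of 999 bound parameters per statement.
func getRepoListChunked(db meddler.DB, names []string) ([]*Repo, error) {
    const maxParams = 999 // SQLite's default SQLITE_MAX_VARIABLE_NUMBER
    var repos []*Repo
    for len(names) > 0 {
        n := len(names)
        if n > maxParams {
            n = maxParams
        }
        batch, err := GetRepoListOf(db, names[:n])
        if err != nil {
            return nil, err
        }
        repos = append(repos, batch...)
        names = names[n:]
    }
    return repos, nil
}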

Until I have benchmarks that support the above design, we are shipping the following workarounds:

  1. The /user/repos endpoint returns a list of all repositories to which the user has authored a commit, based on data in the builds table (roughly the query sketched below)
  2. The /user/feed endpoint returns a list of all builds the user has authored
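
Workaround 1 amounts to roughly the following query, expressed here with meddler for consistency with the stub above. The builds column names (build_repo_id, build_author) are illustrative assumptions, not necessarily the actual schema:

// getUserRepos sketches workaround 1: derive the user's repository list
// from commits recorded in the builds table rather than the GitHub API.
func getUserRepos(db meddler.DB, login string) ([]*Repo, error) {
    var repos = []*Repo{}
    var stmt = `
        SELECT DISTINCT repos.*
        FROM repos
        JOIN builds ON builds.build_repo_id = repos.repo_id
        WHERE builds.build_author = ?`
    var err = meddler.QueryAll(db, &repos, database.Rebind(stmt), login)
    return repos, err
}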

I am proactively logging this issue to explain why the behavior is currently different and might appear broken, to reassure everyone that this is just a temporary workaround, and to note that I hope to have the proposed changes in place before the final 0.4 release.


[1] See the Postgres optimization described in https://www.datadoghq.com/blog/100x-faster-postgres-performance-by-changing-1-line/

@bradrydzewski bradrydzewski added this to the v0.4.0 milestone Oct 13, 2015
bradrydzewski added a commit that referenced this issue Oct 16, 2015
bradrydzewski (Author) commented:

I benchmarked IN vs JOIN against a SQLite database on my Linux laptop. I populated the database with 5 million repositories, which represents the (unrealistic) scenario in which every public GitHub repository uses a single Drone instance.

In 0.3 we sync a user's repository list with GitHub and persist it in the database, using a JOIN to determine the intersection of GitHub repositories that are also activated in Drone. In 0.4 we store the user's list of GitHub repositories in an in-memory LRU cache and perform a giant IN statement to retrieve the list of repositories activated in Drone.
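
As an illustration of the 0.4 caching approach, a minimal sketch using hashicorp/golang-lru (chosen here only as an example library; the capacity and the fetch helper are assumptions, not the actual implementation):

import lru "github.com/hashicorp/golang-lru"

// repoCache keeps each user's GitHub repository list in memory so the
// dashboard does not hit the GitHub API on every request.
var repoCache, _ = lru.New(1024) // capacity chosen arbitrarily for the example

// cachedRepoNames returns the user's repository names, calling the GitHub
// API (via the supplied fetch function) only on a cache miss.
func cachedRepoNames(login string, fetch func(string) ([]string, error)) ([]string, error) {
    if v, ok := repoCache.Get(login); ok {
        return v.([]string), nil
    }
    names, err := fetch(login)
    if err != nil {
        return nil, err
    }
    repoCache.Add(login, names)
    return names, nil
}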

For reference, the IN query above drives the dashboard page only.

I simulated user accounts that had 250, 500 and 999 repositories with the following results:

250 repos

BenchmarkJoin-4             1000       1213005 ns/op
BenchmarkIN-4               1000       2011040 ns/op

500 repos

BenchmarkJoin-4              500       2793446 ns/op
BenchmarkIN-4                300       4619558 ns/op

999 repos

BenchmarkJoin-4              300       4717224 ns/op
BenchmarkIN-4                200       8358777 ns/op
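
These numbers come from Go benchmarks. A simplified sketch of what the IN variant might look like (openTestDB and sampleRepoNames are hypothetical test helpers, not functions from the codebase):

// BenchmarkIN measures the dashboard query built with a giant IN clause.
func BenchmarkIN(b *testing.B) {
    db := openTestDB(b)           // sqlite database seeded with 5M repos
    names := sampleRepoNames(999) // the simulated user's repository list
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := GetRepoListOf(db, names); err != nil {
            b.Fatal(err)
        }
    }
}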

The chart below displays the number of queries we can serve per second (Y-axis) against how many repositories the user has (X-axis). The red delta visualizes the overhead of the IN statement compared to a JOIN:

[Chart: queries per second vs. repository count, JOIN vs. IN, with the delta highlighted in red]

When the user has 250 repositories or fewer, there is little or no overhead from using an IN statement. When the user has 500 or more repositories, the overhead is 30-40%.

The tests demonstrate that, with either approach, we can still serve hundreds of requests per second to users with 999 repositories. This does not account for the database serving other queries at the same time, or being locked for a write. The likelihood that a private Drone installation would see that kind of volume on the dashboard page (where this query is executed) is, however, very low.

The increase in overhead for users with a large number of repositories may therefore be a justified price for removing the complexity of syncing. I still have the following concerns:

  1. I would like to see a Postgres or MySQL benchmark as well
  2. When the in-memory cache of a user's repositories is flushed (due to LRU eviction or TTL), re-fetching from GitHub is expensive and high-latency (3 seconds or more)
  3. What if a user is watching more than 999 GitHub repositories? Is this even realistic?

bradrydzewski (Author) commented:

cc @benschumacher, I would love to hear your thoughts. I'm guessing that syncing repositories complicated your integrations.
