
Improve Dashboard, Repo List, Feed #1234

Closed · bradrydzewski opened this issue Oct 13, 2015 · 2 comments

bradrydzewski commented:

One major change that comes with 0.4 is that we no longer sync your complete repository list in the database (for reasons described in #776). Instead we use the GitHub API to fetch your repository list and your repository permissions live.

One disadvantage of this approach is that we no longer have a simple way (such as a database join) to ask Drone for a list of all repositories I have access to, or a build feed for all repositories I have access to.

I believe the solution is to fetch the full repository list from GitHub and then build one large IN clause. The IN operator is passed the list of repository names fetched from GitHub, as in the following example:

select *
from repos
where repo_full_name IN (?, ?, ?, ..., ?)

I even have a function already stubbed out in the code:

// GetRepoListOf returns the subset of repositories in the database whose
// full names appear in listof (the list fetched from the GitHub API).
func GetRepoListOf(db meddler.DB, listof []string) ([]*Repo, error) {
    var repos = []*Repo{}
    var qs = make([]string, len(listof))      // one "?" placeholder per name
    var in = make([]interface{}, len(listof)) // the bound values themselves
    for i, repo := range listof {
        qs[i] = "?"
        in[i] = repo
    }
    var stmt = "SELECT * FROM repos WHERE repo_full_name IN (" + strings.Join(qs, ",") + ")"
    var err = meddler.QueryAll(db, &repos, database.Rebind(stmt), in...)
    return repos, err
}

According to the documentation, SQLite allows at most 999 bound parameters per statement, while MySQL and Postgres [1] can handle tens of thousands. I remember doing this about ten years ago to query an Oracle database with a few million rows, and it worked quite well. So I'm fairly confident this approach is technically feasible; however, I still want to understand the performance implications on large datasets for each database vendor we support.
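
If the parameter cap becomes a problem, one mitigation would be to chunk the name list and issue multiple queries. A minimal sketch, assuming the GetRepoListOf stub above (the helper name and the 999 constant are mine, not from the codebase):

// getRepoListChunked is a hypothetical wrapper around GetRepoListOf that
// splits the name list into batches that fit under SQLite's default limit
// of 999 bound parameters per statement.
func getRepoListChunked(db meddler.DB, names []string) ([]*Repo, error) {
    const maxParams = 999 // SQLite's default SQLITE_MAX_VARIABLE_NUMBER
    var repos []*Repo
    for len(names) > 0 {
        n := len(names)
        if n > maxParams {
            n = maxParams
        }
        batch, err := GetRepoListOf(db, names[:n])
        if err != nil {
            return nil, err
        }
        repos = append(repos, batch...)
        names = names[n:]
    }
    return repos, nil
}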

Until I have benchmarks that support the above design, we are shipping the following workarounds:

  1. The /user/repos endpoint returns a list of all repositories to which the user has authored a commit, based on data in the builds table (roughly the query sketched below)
  2. The /user/feed endpoint returns a list of all builds the user has authored
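
Workaround 1 amounts to roughly the following query, expressed here with meddler for consistency with the stub above. The builds column names (build_repo_id, build_author) are illustrative assumptions, not necessarily the actual schema:

// getUserRepos sketches workaround 1: derive the user's repository list
// from commits recorded in the builds table rather than the GitHub API.
func getUserRepos(db meddler.DB, login string) ([]*Repo, error) {
    var repos = []*Repo{}
    var stmt = `
        SELECT DISTINCT repos.*
        FROM repos
        JOIN builds ON builds.build_repo_id = repos.repo_id
        WHERE builds.build_author = ?`
    var err = meddler.QueryAll(db, &repos, database.Rebind(stmt), login)
    return repos, err
}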

I am proactively logging this issue to explain why the behavior is currently different and might appear broken, to reassure everyone that this is just a temporary workaround, and to note that I hope to have the proposed changes in place before the final 0.4 release.


[1] See the Postgres optimization described in https://www.datadoghq.com/blog/100x-faster-postgres-performance-by-changing-1-line/

@bradrydzewski bradrydzewski added this to the v0.4.0 milestone Oct 13, 2015
bradrydzewski added a commit that referenced this issue Oct 16, 2015
bradrydzewski (Author) commented:

I benchmarked IN vs JOIN against a SQLite database on my Linux laptop. I populated the database with 5 million repositories, which represents the (unrealistic) scenario in which every public GitHub repository uses a single Drone instance.

In 0.3 we sync a user's repository list with GitHub and persist it in the database, using a JOIN to determine the intersection of GitHub repositories that are also activated in Drone. In 0.4 we store the user's list of GitHub repositories in an in-memory LRU cache and perform a giant IN statement to retrieve the list of repositories activated in Drone.
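
As an illustration of the 0.4 caching approach, a minimal sketch using hashicorp/golang-lru (chosen here only as an example library; the capacity and the fetch helper are assumptions, not the actual implementation):

import lru "github.com/hashicorp/golang-lru"

// repoCache keeps each user's GitHub repository list in memory so the
// dashboard does not hit the GitHub API on every request.
var repoCache, _ = lru.New(1024) // capacity chosen arbitrarily for the example

// cachedRepoNames returns the user's repository names, calling the GitHub
// API (via the supplied fetch function) only on a cache miss.
func cachedRepoNames(login string, fetch func(string) ([]string, error)) ([]string, error) {
    if v, ok := repoCache.Get(login); ok {
        return v.([]string), nil
    }
    names, err := fetch(login)
    if err != nil {
        return nil, err
    }
    repoCache.Add(login, names)
    return names, nil
}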

For reference, the IN query above drives the dashboard page only.

I simulated user accounts that had 250, 500 and 999 repositories with the following results:

250 repos

BenchmarkJoin-4             1000       1213005 ns/op
BenchmarkIN-4               1000       2011040 ns/op

500 repos

BenchmarkJoin-4              500       2793446 ns/op
BenchmarkIN-4                300       4619558 ns/op

999 repos

BenchmarkJoin-4              300       4717224 ns/op
BenchmarkIN-4                200       8358777 ns/op
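
These numbers come from Go benchmarks. A simplified sketch of what the IN variant might look like (openTestDB and sampleRepoNames are hypothetical test helpers, not functions from the codebase):

// BenchmarkIN measures the dashboard query built with a giant IN clause.
func BenchmarkIN(b *testing.B) {
    db := openTestDB(b)           // sqlite database seeded with 5M repos
    names := sampleRepoNames(999) // the simulated user's repository list
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := GetRepoListOf(db, names); err != nil {
            b.Fatal(err)
        }
    }
}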

The chart below displays the number of queries we can serve per second (Y-axis) against how many repositories the user has (X-axis). The red delta visualizes the overhead of the IN statement compared to a JOIN:

[Chart: queries per second vs. repository count, JOIN vs. IN, with the delta highlighted in red]

When the user has 250 repositories or fewer, there is little or no overhead from using an IN statement. When the user has 500 or more repositories, the overhead is 30-40%.

The tests demonstrate that, with either approach, we can still serve hundreds of requests per second to users with 999 repositories. This does not account for the database serving other queries at the same time, or being locked for a write. The likelihood that a private Drone installation would see that kind of volume on the dashboard page (where this query is executed) is, however, very low.

The increase in overhead for users with a large number of repositories may therefore be a justified price for removing the complexity of syncing. I still have the following concerns:

  1. I would like to see a Postgres or MySQL benchmark as well
  2. When the in-memory cache of a user's repositories is flushed (due to LRU eviction or TTL), re-fetching from GitHub is expensive and high-latency (3 seconds or more)
  3. What if a user is watching more than 999 GitHub repositories? Is this even realistic?

bradrydzewski (Author) commented:

cc @benschumacher, I would love to hear your thoughts. I'm guessing that syncing repositories complicated your integrations.
