Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of labels handling in issue_search #26460

Merged
merged 6 commits into from
Jun 25, 2024

Conversation

xlii-chl
Copy link
Contributor

@xlii-chl xlii-chl commented Aug 11, 2023

This PR enhances the labels handling in issue_search by optimizing the SQL query and de-duplicating the ids when generating the query string.

Those patches are backportable to v1.20.2 as is.


Background

For some time now, BingBot and some other crawlers have been putting my Gitea instance on its knees with requests containing a lot of label ids, like this one :

[07/Aug/2023:11:28:37 +0200] "GET /Dolibarr/sendrecurringinvoicebymail/issues?q=&type=all&sort=&state=closed&labels=1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c2%2c10%2c2%2c1%2c1%2c10%2c10%2c7%2c6%2c10%2c10%2c3%2c2%2c1%2c5%2c10%2c1%2c6%2c2%2c7%2c3%2c7%2c6%2c10%2c1%2c10%2c1%2c1%2c7%2c7%2c1%2c1%2c1%2c1%2c10%2c10%2c1%2c2%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c2%2c1%2c12%2c6%2c6%2c10&milestone=0&project=-1&poster=0 HTTP/1.1" 499 0 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http:https://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"

Since each of the label ids implies a join, it grows exponentially expensive for the database engine (at least on PostgreSQL but SQLite suffers a little too).

Thus this PR proposes two enhancements:

  • rewrite the database query to use only one squashed condition,
  • deduplicate the label ids when generating the URL.

This is my first try at Golang so don't hesitate to point mistakes or inelegant code.

As a side note, I unfortunately don't have the start of this snowball in my webserver logs anymore, and couldn't really reproduce it (the labels in the above project don't have scopes, and the label id n°1 that we can see in the query string is linked to a deleted repository (but its labels still exist in database apparently)).

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 11, 2023
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 11, 2023
@techknowlogick techknowlogick added type/enhancement An improvement of existing functionality performance/speed performance issues with slow downs backport/v1.20 This PR should be backported to Gitea 1.20 labels Aug 11, 2023
@xlii-chl xlii-chl closed this Nov 17, 2023
@pull-request-size pull-request-size bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 17, 2023
@xlii-chl xlii-chl reopened this Nov 17, 2023
@pull-request-size pull-request-size bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 17, 2023
@xlii-chl
Copy link
Contributor Author

xlii-chl commented Nov 17, 2023

Rewrote using new Slices.Sort([]int64) & Compact([]int64) functions from Go 1.21.

Comment on lines +164 to +165
slices.Sort(labelQuerySlice)
labelQuerySlice = slices.Compact(labelQuerySlice)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use code.gitea.io/gitea/modules/container.Set. Then you don't need to sort and compact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review ! However, I used container.Set in my previous iteration and the code was less readable and the output's ordering was random (container.Set uses map).
I still use it in models/issues/issue_search.go where its Add() method and the on-the-fly de-duplication are quite handy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see that the output is not just used in the backend. Then you could just use the set with slices.Sort at the end but that is only better if labelQuerySlice is very large.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think KN4CK3R 's suggestion is right. Managed to do some rewriting: #31497

@xlii-chl
Copy link
Contributor Author

This pull-request doesn't seem to draw interest so I guess I'll be closing it in a few weeks.

However, while not a security issue strictly speaking, I believe it can become a problem on loosely configured instances, and even help for a DoS. For example, on the quite-strictly-configured try.gitea.io instance with a limit at ~60 joins (sqlite ?) :

~$ time curl -s -o /dev/null 'https://try.gitea.io/czygsx/Gitea_2/issues?q=&type=all&sort=&state=open&labels=15516%2c15515&milestone=0&project=0&assignee=0&poster=0'

real	0m0,444s
user	0m0,105s
sys	0m0,021s


~$ time curl -s -o /dev/null 'https://try.gitea.io/czygsx/Gitea_2/issues?q=&type=all&sort=&state=open&labels=15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516%2c15515%2c15516&milestone=0&project=0&assignee=0&poster=0'
real	0m1,431s
user	0m0,116s
sys	0m0,013s

I hope you'll review it once more, and even if not, thanks for this awesome forge.

@lunny
Copy link
Member

lunny commented Jan 20, 2024

Sorry missed this one. It seems interesting. I will take a look at next week.

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Jan 27, 2024
@lunny lunny added backport/v1.21 This PR should be backported to Gitea 1.21 and removed backport/v1.20 This PR should be backported to Gitea 1.20 labels Jan 27, 2024
@lunny lunny added this to the 1.22.0 milestone Jan 27, 2024
@lunny lunny removed the backport/v1.21 This PR should be backported to Gitea 1.21 label Mar 15, 2024
@lunny lunny modified the milestones: 1.22.0, 1.23.0 Mar 29, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Jun 25, 2024
@techknowlogick
Copy link
Member

Sorry, this PR got lost in the backlog. I'm currently going through the list of PRs still open and looking at getting them in.

@techknowlogick techknowlogick added backport/v1.22 This PR should be backported to Gitea 1.22 labels Jun 25, 2024
@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Jun 25, 2024
@techknowlogick techknowlogick merged commit b5326a4 into go-gitea:main Jun 25, 2024
26 checks passed
GiteaBot pushed a commit to GiteaBot/gitea that referenced this pull request Jun 26, 2024
This PR enhances the labels handling in issue_search by optimizing the
SQL query and de-duplicate the IDs when generating the query string.

---------

Co-authored-by: techknowlogick <[email protected]>
@GiteaBot GiteaBot added the backport/done All backports for this PR have been created label Jun 26, 2024
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Jun 26, 2024

Sorry, this PR got lost in the backlog. I'm currently going through the list of PRs still open and looking at getting them in.

But it seems that there are still some nits unresolved, including @KN4CK3R 's concern. #26460 (review)

@wxiaoguang
Copy link
Contributor

-> Refactor issue label selection #31497

wxiaoguang added a commit that referenced this pull request Jun 26, 2024
zjjhot added a commit to zjjhot/gitea that referenced this pull request Jun 27, 2024
* giteaofficial/main:
  Refactor issue label selection (go-gitea#31497)
  Refactor dropzone (go-gitea#31482)
  [skip ci] Updated translations via Crowdin
  Optimization of labels handling in issue_search (go-gitea#26460)
  use correct l10n string (go-gitea#31487)
  Fix overflow menu flickering on mobile (go-gitea#31484)
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Jun 28, 2024

@xlii-chl I saw that you said "weird ? The PR was sleeping for a year".

Since the hard-fork community is quite hostility to me, I could only answer your question here.

Why this PR becomes stale?

Because the review suggestion is not resolved and you didn't respond for long time, and the code style doesn't quite match Golang standard. Indeed it needs more work.

If you could answer the review suggestion (either apply it or explain why not doing that), it could move on easily.

What happened to the hard-fork?

Well, that's a long story, in short:

  • They keeps criticizing Gitea community/maintainers and made a lot of lies, and they copied a lot of Gitea's feature and publicize that "they are their work" (for example: Gitea Actions).
  • And be careful about their cherry-picks and PRs, I am sure there are not only one security problems (include XSS, permission bypass) in their codebase but not in Gitea.

And here are why many of theirs PRs to Gitea were not merged mainly because there are bugs and they do not respond.

Let's see why some of them are not merged (I only picked some my reviews):

Really suggest everyone to take a deep look ........

DennisRasey pushed a commit to DennisRasey/forgejo that referenced this pull request Jun 28, 2024
This PR optimizes the SQL query and de-duplicate the labels' ids when generating the query string, on the issue page.

<hr/>

### Background

Some time ago, BingBot and some other crawlers have been putting my instance on its knees with requests containing a lot of label ids, like this one :

```
[07/Aug/2023:11:28:37 +0200] "GET /Dolibarr/sendrecurringinvoicebymail/issues?q=&type=all&sort=&state=closed&labels=1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c2%2c10%2c2%2c1%2c1%2c10%2c10%2c7%2c6%2c10%2c10%2c3%2c2%2c1%2c5%2c10%2c1%2c6%2c2%2c7%2c3%2c7%2c6%2c10%2c1%2c10%2c1%2c1%2c7%2c7%2c1%2c1%2c1%2c1%2c10%2c10%2c1%2c2%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c1%2c2%2c1%2c12%2c6%2c6%2c10&milestone=0&project=-1&poster=0 HTTP/1.1" 499 0 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http:https://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
```

Since each of the label ids implies a join, it grows exponentially expensive for the database engine (at least on PostgreSQL but SQLite suffers a little too).

Thus, this PR proposes two enhancements:

* rewrite the database query to use only one squashed condition,
* deduplicate the label ids when generating the URL.

### Performance comparison

Here are some timings on Postgresql-backed, Forgejo 7.0.4 instances :
```sh
$ time curl -s -o /dev/null "http:https://localhost:3000/toto/tata/issues?q=&type=all&sort=&labels=19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25&state=open&milestone=0&project=0&assignee=0&poster=0"

real    0m10,491s
user    0m0,017s
sys     0m0,008s
```
...and with the patch:
```sh
$ time curl -s -o /dev/null "http:https://localhost:3000/toto/tata/issues?q=&type=all&sort=&labels=19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25%2c19%2c25&state=open&milestone=0&project=0&assignee=0&poster=0"

real    0m0,094s
user    0m0,012s
sys     0m0,013s
```

### Annex

This issue was originally proposed to [Gitea](go-gitea/gitea#26460) but didn't get much attention, and I switched to Forgejo in the meantime :)

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/4228
Reviewed-by: Earl Warren <[email protected]>
Co-authored-by: Chl <[email protected]>
Co-committed-by: Chl <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/done All backports for this PR have been created backport/v1.22 This PR should be backported to Gitea 1.22 lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/go Pull requests that update Go code performance/speed performance issues with slow downs size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants