Add Vegeta rates / targets to SLA in performance tests #14429

xiangpingjiang · 2023-09-25T06:58:22Z

Proposed Changes

Release Note

Add Vegeta rates / targets to SLA in performance tests

knative-prow · 2023-09-25T06:58:31Z

Hi @xiangpingjiang. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dprotaso · 2023-09-25T14:40:13Z

/ok-to-test

dprotaso · 2023-09-25T14:40:35Z

/test performance-tests

codecov · 2023-09-25T14:57:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (81149da) 86.05% compared to head (59049d2) 86.02%.
Report is 23 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14429      +/-   ##
==========================================
- Coverage   86.05%   86.02%   -0.04%     
==========================================
  Files         197      197              
  Lines       14937    14945       +8     
==========================================
+ Hits        12854    12856       +2     
- Misses       1774     1778       +4     
- Partials      309      311       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/performance/benchmarks/load-test/main.go

test/performance/benchmarks/reconciliation-delay/main.go

skonto · 2023-09-25T17:33:00Z

The SLAs do not seem to be respected (including previous SLAs that are unrelated to this PR, see below).
We need to control the variance somehow, e.g. by distributing pods evenly, and make sure these tests pass.

2023/09/25 15:00:02 Cleaning up all created services
2023/09/25 15:00:07 Shutting down InfluxReporter
2023/09/25 15:00:07 SLA 1 failed. Errors occurred: 1
job.batch "real-traffic-test" deleted

503 Service Unavailable
2023/09/25 16:05:48 Shutting down InfluxReporter
2023/09/25 16:05:48 SLA 1 failed. P95 latency is not in 100-110ms time range: 177.512182ms
job.batch "rollout-probe-queue-direct" deleted

2023/09/25 15:05:18 SLA 1 passed. P95 latency is in 100000000-105000000ms time range
2023/09/25 15:05:18 Shutting down InfluxReporter
2023/09/25 15:05:18 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-deployment" deleted

2023/09/25 15:08:28 SLA 1 passed. P95 latency is in 100000000-110000000ms time range
2023/09/25 15:08:28 Shutting down InfluxReporter
2023/09/25 15:08:28 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-activator" deleted
service.serving.knative.dev "activator" deleted

2023/09/25 15:11:37 SLA 1 passed. P95 latency is in 100000000-110000000ms time range
2023/09/25 15:11:37 Shutting down InfluxReporter
Status Codes  [code:count]                      200:180000  
Error Set:
2023/09/25 15:11:37 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-queue" deleted

Error Set:
2023/09/25 15:26:47 SLA 1 passed. Amount of ready services is within the expected range. Is: 179.000000, expected: 174.000000-180.000000
2023/09/25 15:26:47 SLA 2 passed. P95 latency is in 0-25s time range
2023/09/25 15:26:50 SLA 3 failed. vegeta rate is 1253

Error Set:
2023/09/25 15:27:07 SLA 1 failed. P95 latency is not in 0-15000000ms time range: 34.532628ms
job.batch "scale-from-zero-1" deleted
service.serving.knative.dev "perftest-scalefromzero-00-bxwarrca" deleted

Error Set:
2023/09/25 15:27:26 Shutting down InfluxReporter
2023/09/25 15:27:26 SLA 1 failed. P95 latency is not in 0-15000000ms time range: 40.358438ms
job.batch "scale-from-zero-5" deleted

2023/09/25 15:40:46 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:40:46 SLA 2 passed. Max latency is below 10s
2023/09/25 15:40:46 SLA 3 passed. No errors occurred
2023/09/25 15:40:46 Shutting down InfluxReporter
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:719995  
Error Set:
2023/09/25 15:40:46 SLA 4 failed. total requests is 719995
job.batch "load-test-zero" deleted
service.serving.knative.dev "load-test-zero" deleted


Status Codes  [code:count]                      200:719997  
Error Set:
2023/09/25 15:48:00 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:48:00 SLA 2 passed. Max latency is below 10s
2023/09/25 15:48:00 SLA 3 passed. No errors occurred
2023/09/25 15:48:00 Shutting down InfluxReporter
2023/09/25 15:48:00 SLA 4 failed. total requests is 719997
job.batch "load-test-always" deleted

2023/09/25 15:55:13 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:55:13 SLA 2 passed. Max latency is below 10s
2023/09/25 15:55:13 SLA 3 passed. No errors occurred
2023/09/25 15:55:13 Shutting down InfluxReporter
2023/09/25 15:55:13 SLA 4 failed. total requests is 719998
job.batch "load-test-200" deleted

(Client.Timeout exceeded while awaiting headers)
Get "https://activator-with-cc.default.svc.cluster.local?sleep=100": dial tcp 0.0.0.0:0->10.88.7.84:80: connect: connection refused (Client.Timeout exceeded while awaiting headers)
2023/09/25 15:59:03 SLA 1 failed. P95 latency is not in 100-110ms time range: 1m6.797546265s
job.batch "rollout-probe-activator-direct" deleted
service.serving.knative.dev "activator-with-cc" deleted
=============================================

2023/09/25 16:05:48 Shutting down InfluxReporter
2023/09/25 16:05:48 SLA 1 failed. P95 latency is not in 100-110ms time range: 177.512182ms
job.batch "rollout-probe-queue-direct" deleted
service.serving.knative.dev "queue-proxy-with-cc" deleted

dprotaso · 2023-10-17T02:11:20Z

/ok-to-test

dprotaso · 2023-10-17T02:11:38Z

/test performance-tests

cc @ReToCode

test/performance/benchmarks/real-traffic-test/main.go

test/performance/benchmarks/rollout-probe/main.go

ReToCode

Just minor things, other than that it looks good.

test/performance/benchmarks/real-traffic-test/main.go

test/performance/benchmarks/scale-from-zero/main.go

knative-prow · 2023-12-13T13:10:09Z

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: xiangpingjiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

test/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xiangpingjiang · 2023-12-13T13:10:27Z

> @xiangpingjiang gentle ping for rebasing.

@skonto done

skonto · 2023-12-13T13:21:26Z

/test performance-tests

ReToCode · 2023-12-14T06:54:48Z

Maybe we are a bit too restrictive in some tests:

2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

2023/12/13 14:21:48 SLA 4 failed. total requests is 719998, expected total requests is 720000

2023/12/13 14:46:40 SLA 2 failed. vegeta rate is 1000.004559, expected Rate is 1000.000000

2023/12/13 14:40:03 SLA 2 failed. vegeta rate is 1000.001162, expected Rate is 1000.000000

skonto · 2023-12-14T10:02:14Z

Looking at the errors I see:

"scale-from-zero-100"

2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

This one should fail, because I suspect that 3 out of 100 services were not ready, given that we manually add the data point. I think this one needs further debugging, as it seems we got stuck here:

serving/test/performance/benchmarks/scale-from-zero/main.go, line 303 in 3dbeba0:

_, err := pkgTest.WaitForEndpointStateWithTimeout(

as no errors are returned:

2023/12/13 14:11:45 Shutting down InfluxReporter
Requests [total, rate, throughput] 97, 97.00, 0.00
Duration [total, attack, wait] 20.097s, 0s, 20.097s
Latencies [min, mean, 50, 90, 95, 99, max] 95.181ms, 7.725s, 6.981s, 17.895s, 19.132s, 20.04s, 20.097s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 0.00%
Status Codes [code:count] 0:97
Error Set:
2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

"load-test-zero"

2023/12/13 14:21:48 SLA 4 failed. total requests is 719998, expected total requests is 720000

Here I suspect that instead of:

	for i := 0; i < len(pacers); i++ {
		expectedRequests = expectedRequests + uint64(pacers[i].Rate(time.Second)*durations[i].Seconds())
	}

it may help to do:

	var expectedSum float64
	for i := 0; i < len(pacers); i++ {
		expectedSum = expectedSum + pacers[i].Rate(time.Second)*durations[i].Seconds()
	}
	expectedRequests = uint64(expectedSum)

rollout-probe-queue-direct

2023/12/13 14:46:40 SLA 2 failed. vegeta rate is 1000.004559, expected Rate is 1000.000000

rollout-probe-activator-direct

2023/12/13 14:40:03 SLA 2 failed. vegeta rate is 1000.001162, expected Rate is 1000.000000

We have more like the above:

rollout-probe-activator-direct-lin

2023/12/13 14:43:18 SLA 2 failed. vegeta rate is 1000.000960, expected Rate is 1000.000000

Regarding the rate comparison, for the constant pacers we have:

// Rate returns a ConstantPacer's instantaneous hit rate (i.e. requests per second)
// at the given elapsed duration of an attack. Since it's constant, the return
// value is independent of the given elapsed duration.
func (cp ConstantPacer) Rate(elapsed time.Duration) float64 {
	return cp.hitsPerNs() * 1e9
}

// hitsPerNs returns the attack rate this ConstantPacer represents, in
// fractional hits per nanosecond.
func (cp ConstantPacer) hitsPerNs() float64 {
	return float64(cp.Freq) / float64(cp.Per)
}

I think just rounding the observed rate is enough here, e.g. 1000.001162 -> 1000.
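
A minimal sketch of the rounding idea, using only the standard library. The helper name `rateMatches` is hypothetical and not part of the PR; the observed rates are the ones reported in the logs above, plus one assumed value (998.4) to show a mismatch.

```go
package main

import (
	"fmt"
	"math"
)

// rateMatches reports whether the observed rate equals the expected rate
// after rounding both to the nearest whole request per second.
func rateMatches(observed, expected float64) bool {
	return math.Round(observed) == math.Round(expected)
}

func main() {
	const expected = 1000.0
	// The first three values come from the failing SLA log lines;
	// 998.4 is an assumed value that should still be rejected.
	for _, observed := range []float64{1000.004559, 1000.001162, 1000.000960, 998.4} {
		fmt.Printf("observed %.6f matches %.0f: %v\n", observed, expected, rateMatches(observed, expected))
	}
}
```

Rounding to the nearest integer tolerates drift of up to half a request per second, which may or may not be acceptable for low-rate tests.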

Signed-off-by: pingjiang <[email protected]>

ReToCode · 2024-01-03T07:17:29Z

/test performance-tests

skonto · 2024-01-09T09:00:50Z

It seems we are still getting the rounding errors. Re-running to make sure the run used the latest changes:
/test performance-tests

dprotaso · 2024-01-09T17:50:22Z

/test performance-tests

ReToCode · 2024-01-10T08:35:31Z

Yep, we still have issues:

job.batch/load-test-zero created
pod/load-test-zero-b99mf condition met
{"level":"info","ts":1704825797.3743527,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2024/01/09 18:43:17 Starting the load test.
2024/01/09 18:44:49 All pods are done (scaled to zero) or terminating after 1m32.001094628s
Requests      [total, rate, throughput]         719998, 2000.00, 1999.42
Duration      [total, attack, wait]             6m0s, 6m0s, 103.497ms
Latencies     [min, mean, 50, 90, 95, 99, max]  101.447ms, 104.7ms, 102.927ms, 103.859ms, 104.299ms, 110.409ms, 1.68s
Bytes In      [total, mean]                     17200124, 23.89
Bytes Out     [total, mean]                     0, 0.00
2024/01/09 18:50:49 SLA 1 passed. P95 latency is in 100-115ms time range
2024/01/09 18:50:49 SLA 2 passed. Max latency is below 10s
2024/01/09 18:50:49 SLA 3 passed. No errors occurred
2024/01/09 18:50:49 Shutting down InfluxReporter
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:719998  
Error Set:
2024/01/09 18:50:49 SLA 4 failed. total requests is 719998, expected total requests is 720000

@xiangpingjiang can you add a threshold in that test?

xiangpingjiang · 2024-01-12T03:14:41Z

Hello @ReToCode,
Do you mean adding a range like [expectedRequests-5, expectedRequests+5]?

ReToCode · 2024-01-12T06:43:07Z

Yeah, or maybe as a percentage, e.g. we accept a deviation of 0.1% or something like that.
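
A percentage-based check along these lines could look like the sketch below. The helper `withinDeviation` and the 0.1% value are illustrative assumptions; the threshold actually merged in the PR is not shown in this thread. The request counts are taken from the logs quoted earlier.

```go
package main

import (
	"fmt"
	"math"
)

// withinDeviation reports whether actual lies within maxDeviation of expected,
// where maxDeviation is a fraction (0.001 means 0.1%).
func withinDeviation(actual, expected uint64, maxDeviation float64) bool {
	diff := math.Abs(float64(actual) - float64(expected))
	return diff <= float64(expected)*maxDeviation
}

func main() {
	// load-test-zero: 2 requests short of 720000 is well inside 0.1% (720 requests).
	fmt.Println(withinDeviation(719998, 720000, 0.001)) // true
	// scale-from-zero-100: 3 missing out of 100 is a 3% deviation.
	fmt.Println(withinDeviation(97, 100, 0.001)) // false
}
```

A relative threshold scales with the size of the test, so small tests such as scale-from-zero still fail when even a single service does not become ready.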

Signed-off-by: pingjiang <[email protected]>

xiangpingjiang · 2024-01-12T07:57:36Z

/test performance-tests

test/performance/benchmarks/load-test/main.go

Signed-off-by: pingjiang <[email protected]>

xiangpingjiang · 2024-01-12T15:13:15Z

/test performance-tests

dprotaso · 2024-01-14T14:57:10Z

@ReToCode I still see SLA failures in the performance test, but fixing them seems out of scope for this PR (unless the SLA is being computed incorrectly).

Another thing that would be useful is failing the performance test to surface that SLA failures have happened.

ReToCode · 2024-01-15T08:06:06Z

> @ReToCode I still see SLA failures in the performance test, but fixing them seems out of scope for this PR (unless the SLA is being computed incorrectly).

The SLAs were never consistently stable to begin with, but this is a separate topic that we need to look into. So let's get this in, as it's better than before.

> Another thing that would be useful is failing the performance test to surface that SLA failures have happened.

Yeah, but probably after we make them stable; otherwise we only have red builds and/or partial test results in InfluxDB.

/lgtm

Thanks @xiangpingjiang for doing this!

skonto · 2024-01-15T11:29:14Z

@ReToCode @dprotaso Shouldn't rounding errors or inaccurate conditions be fixed in this PR?
We are adding more failure points, and I am not sure how that helps. I am also not sure how we would distinguish a rate failure caused by something else from the one here, which is due to inaccuracy.

For example for scale-from-zero-25:

2024/01/12 16:02:10 SLA 3 failed. total requests is 24, expected total requests is 25

ReToCode · 2024-01-15T12:33:39Z

> @ReToCode @dprotaso Shouldn't rounding errors or inaccurate conditions be fixed in this PR?

I partially agree. The SLAs were never really stable to begin with (not even before my refactoring PR). We should look into that topic separately, aside from this specific PR (or phrased differently, I would not revert it for that). I created this issue: #14793 to follow up on this.

knative-prow bot requested review from evankanderson and mgencur September 25, 2023 06:58

knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Sep 25, 2023

skonto added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 25, 2023

knative-prow bot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 25, 2023

skonto reviewed Sep 25, 2023

test/performance/benchmarks/load-test/main.go (outdated)

skonto reviewed Sep 25, 2023

test/performance/benchmarks/reconciliation-delay/main.go (outdated)

skonto reviewed Sep 25, 2023

test/performance/benchmarks/reconciliation-delay/main.go (outdated)

skonto reviewed Sep 25, 2023

test/performance/benchmarks/reconciliation-delay/main.go (outdated)

xiangpingjiang marked this pull request as draft September 27, 2023 13:22

knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2023

xiangpingjiang changed the title ~~Add Vegeta rates / targets to SLA in performance tests~~ [WIP] Add Vegeta rates / targets to SLA in performance tests Oct 9, 2023

dprotaso mentioned this pull request Oct 18, 2023

Fix sec context and resources for performance jobs #14529

Merged

ReToCode reviewed Oct 19, 2023

test/performance/benchmarks/real-traffic-test/main.go (outdated)

ReToCode reviewed Oct 19, 2023

test/performance/benchmarks/real-traffic-test/main.go (outdated)

ReToCode reviewed Oct 19, 2023

test/performance/benchmarks/rollout-probe/main.go (outdated)

ReToCode reviewed Oct 19, 2023

skonto reviewed Oct 19, 2023

test/performance/benchmarks/real-traffic-test/main.go (outdated)

skonto reviewed Oct 19, 2023

test/performance/benchmarks/scale-from-zero/main.go (outdated)

skonto reviewed Oct 19, 2023

test/performance/benchmarks/scale-from-zero/main.go

Merge branch 'main' into performance

3dbeba0

knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2023

xiangpingjiang requested review from skonto and dprotaso December 13, 2023 13:10

xiangpingjiang and others added 2 commits December 27, 2023 00:32

Merge branch 'knative:main' into performance

2e095c5

fix after review

3b4289e

Signed-off-by: pingjiang <[email protected]>

add a deviation to vegeta total requests test

b8aba8f

Signed-off-by: pingjiang <[email protected]>

ReToCode reviewed Jan 12, 2024

test/performance/benchmarks/load-test/main.go (outdated)

add threshold in vegeta total requests check

59049d2

Signed-off-by: pingjiang <[email protected]>

knative-prow bot assigned ReToCode Jan 15, 2024

knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2024

knative-prow bot merged commit 8162fe2 into knative:main Jan 15, 2024
56 checks passed

ReToCode mentioned this pull request Jan 15, 2024

Investigate unstable performance-test SLAs #14793

Open
