CDC resolver may cause TiKV OOM #15412

Closed
Tracked by #16375
overvenus opened this issue Aug 23, 2023 · 0 comments · Fixed by #15486
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 affects-6.0 affects-6.1 affects-6.2 affects-6.3 affects-6.4 affects-6.5 affects-6.6 affects-7.0 affects-7.1 affects-7.2 affects-7.3 severity/major type/bug The issue is confirmed as a bug.

overvenus commented Aug 23, 2023

Bug Report

Similar to #14864: CDC maintains its own resolver, which can also cause TiKV to run out of memory (OOM) when it tracks the locks of a large transaction.

What version of TiKV are you using?

> 4.0.0

Steps to reproduce

Run a large transaction whose size is much greater than the memory available to TiKV.

What did you expect?

No OOM.

What happened?

OOM.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Sep 5, 2023
ti-chi-bot bot added a commit that referenced this issue Sep 7, 2023
ref #15412

MemoryQuota alloc API returns result, make it more ergonomic.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Sep 7, 2023
overvenus added a commit to ti-chi-bot/tikv that referenced this issue Nov 27, 2023
close tikv#15412

Similar to resolved-ts endpoint, cdc endpoint maintains resolvers for
subscribed regions. These resolvers also need memory quota, otherwise
they may cause OOM.
This commit lets cdc endpoint deregister regions if they exceed
memory quota.

Signed-off-by: Neil Shen <[email protected]>
overvenus added a commit to ti-chi-bot/tikv that referenced this issue Nov 27, 2023
ref tikv#15412

MemoryQuota alloc API returns result, make it more ergonomic.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
overvenus added a commit to ti-chi-bot/tikv that referenced this issue Dec 8, 2023
close tikv#15412

Similar to resolved-ts endpoint, cdc endpoint maintains resolvers for
subscribed regions. These resolvers also need memory quota, otherwise
they may cause OOM.
This commit lets cdc endpoint deregister regions if they exceed
memory quota.

Signed-off-by: Neil Shen <[email protected]>
overvenus added a commit to ti-chi-bot/tikv that referenced this issue Dec 8, 2023
ref tikv#15412

MemoryQuota alloc API returns result, make it more ergonomic.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Dec 10, 2023
 #15523 #15554 (#15465)

close #14864, ref #14864, ref #15412, close #15412, close #15553

This commit rolls up following patches:

*: add memory quota to resolved_ts::Resolver (#15400)

ref #14864

This is the first PR to fix OOM caused by Resolver tracking large txns.
Resolver checks memory quota before tracking a lock, and returns false
if it exceeds memory quota.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: Neil Shen <[email protected]>
---
resolved_ts: re-register region if memory quota exceeded  (#15411)

close #14864

Fix resolved ts OOM caused by Resolver tracking large txns. `ObserveRegion` is
deregistered if it exceeds memory quota. It may cause higher CPU usage because
of scanning locks, but it's better than OOM.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
---
resolved_ts: track pending lock memory usage (#15452)

ref #14864

* Fix resolved ts OOM caused by adding large txns locks to `ResolverStatus`.
* Add initial scan backoff duration metrics.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: Connor <[email protected]>
---
cdc: deregister delegate if memory quota exceeded (#15486)

close #15412

Similar to resolved-ts endpoint, cdc endpoint maintains resolvers for
subscribed regions. These resolvers also need memory quota, otherwise
they may cause OOM.
This commit lets cdc endpoint deregister regions if they exceed
memory quota.
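
As a rough illustration of the behavior described above, the sketch below drops a region's state and returns its memory to the quota once a new lock would not fit. The `Endpoint`, `Delegate`, and `Outcome` types are hypothetical and much simpler than the real cdc endpoint; what the subscriber does after deregistration is outside the scope of the sketch.

```rust
use std::collections::HashMap;

/// Result of feeding one lock to the endpoint (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Tracked,
    Deregistered { released_bytes: usize },
    UnknownRegion,
}

/// Per-region state; stands in for a delegate and its resolver.
#[derive(Default)]
struct Delegate {
    tracked_bytes: usize,
}

/// Hypothetical endpoint with one byte budget shared by all regions.
struct Endpoint {
    quota_capacity: usize,
    quota_in_use: usize,
    delegates: HashMap<u64, Delegate>,
}

impl Endpoint {
    fn new(quota_capacity: usize) -> Self {
        Endpoint { quota_capacity, quota_in_use: 0, delegates: HashMap::new() }
    }

    fn subscribe(&mut self, region_id: u64) {
        self.delegates.entry(region_id).or_default();
    }

    /// Tracks a lock for `region_id`, or deregisters the region when the
    /// lock does not fit into the remaining quota.
    fn on_lock(&mut self, region_id: u64, key: &[u8]) -> Outcome {
        if !self.delegates.contains_key(&region_id) {
            return Outcome::UnknownRegion;
        }
        let bytes = key.len();
        if self.quota_in_use + bytes > self.quota_capacity {
            // Over quota: drop the region and free everything it held.
            let dropped = self.delegates.remove(&region_id).expect("checked above");
            self.quota_in_use -= dropped.tracked_bytes;
            return Outcome::Deregistered { released_bytes: dropped.tracked_bytes };
        }
        let delegate = self.delegates.get_mut(&region_id).expect("checked above");
        delegate.tracked_bytes += bytes;
        self.quota_in_use += bytes;
        Outcome::Tracked
    }
}
```

Dropping the whole region bounds memory at the cost of redoing work for that region later, which matches the trade-off the resolved-ts fix accepts ("better than OOM").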

Signed-off-by: Neil Shen <[email protected]>
---
*: let alloc API return result (#15529)

ref #15412

MemoryQuota alloc API returns result, make it more ergonomic.
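
To illustrate why a `Result`-returning allocation call is more ergonomic than a boolean one, here is a small sketch. The `Quota` type, the `QuotaExceeded` error, and the `track_pair` helper are assumptions for the example and are not taken from the TiKV source.

```rust
use std::fmt;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical error returned when an allocation does not fit.
#[derive(Debug, PartialEq)]
struct QuotaExceeded {
    requested: usize,
    capacity: usize,
}

impl fmt::Display for QuotaExceeded {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "quota exceeded: requested {} of {}", self.requested, self.capacity)
    }
}

impl std::error::Error for QuotaExceeded {}

/// Hypothetical byte budget; a stand-in, not TiKV's `MemoryQuota`.
struct Quota {
    capacity: usize,
    in_use: AtomicUsize,
}

impl Quota {
    fn new(capacity: usize) -> Self {
        Quota { capacity, in_use: AtomicUsize::new(0) }
    }

    /// Returning `Result` lets callers propagate the failure with `?`.
    fn alloc(&self, bytes: usize) -> Result<(), QuotaExceeded> {
        self.in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                cur.checked_add(bytes).filter(|next| *next <= self.capacity)
            })
            .map(|_| ())
            .map_err(|_| QuotaExceeded { requested: bytes, capacity: self.capacity })
    }

    fn free(&self, bytes: usize) {
        self.in_use.fetch_sub(bytes, Ordering::AcqRel);
    }

    fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }
}

/// Reserves memory for a key and a value, rolling back the key's
/// reservation if the value does not fit.
fn track_pair(quota: &Quota, key: &[u8], value: &[u8]) -> Result<usize, QuotaExceeded> {
    quota.alloc(key.len())?;
    if let Err(e) = quota.alloc(value.len()) {
        quota.free(key.len());
        return Err(e);
    }
    Ok(key.len() + value.len())
}
```

With a boolean return, each call site needs its own `if !ok { ... }` branch and has to build the error itself; with `Result`, the first `alloc` in `track_pair` is a single `?`.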

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
---
resolved_ts: limit scanner memory usage (#15523)

ref #14864

* Break resolved ts scan entry into multiple tasks.
* Limit concurrent resolved ts scan tasks.
* Remove resolved ts dead code.
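
The first two points can be pictured with the sketch below: one large scan is split into fixed-size chunks, and a semaphore caps how many chunk tasks run at once, so peak memory is bounded by roughly `chunk_size * max_tasks` entries. The `Semaphore` type and the `scan_in_chunks` function are hypothetical and use only the standard library; the real scanner is structured differently.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built on std primitives (illustrative only).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Sums `entries` in chunks of `chunk_size`, running at most `max_tasks`
/// chunk tasks at once. Returns (sum, peak number of concurrent tasks).
fn scan_in_chunks(entries: &[u64], chunk_size: usize, max_tasks: usize) -> (u64, usize) {
    assert!(chunk_size > 0 && max_tasks > 0);
    let sem = Semaphore::new(max_tasks);
    let running = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for chunk in entries.chunks(chunk_size) {
            // Block the spawner until a permit is free, so no more than
            // `max_tasks` chunk tasks are ever in flight.
            sem.acquire();
            let (sem, running, peak, total) = (&sem, &running, &peak, &total);
            s.spawn(move || {
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                total.fetch_add(chunk.iter().sum::<u64>(), Ordering::SeqCst);
                running.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            });
        }
    });
    (total.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}
```

Summing integers stands in for the real per-chunk work; the point is only that the permit is taken before a task is spawned and returned when it finishes.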

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
---
resolved_ts: remove hash set to save memory (#15554)

close #15553

The Resolver uses a hash set to keep track of locks associated with
the same timestamp. When the length of the hash set reaches zero,
it indicates that the transaction has been fully committed. To save
memory, we can replace the hash set with an integer.
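
A minimal sketch of the idea, assuming a map keyed by start timestamp: instead of storing a set of keys per timestamp, store only how many locks are still outstanding. The `LockCounts` type and its method names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical tracker: start timestamp -> number of outstanding locks.
#[derive(Default)]
struct LockCounts {
    by_ts: BTreeMap<u64, usize>,
}

impl LockCounts {
    /// Records one more lock for the transaction starting at `start_ts`.
    fn track(&mut self, start_ts: u64) {
        *self.by_ts.entry(start_ts).or_insert(0) += 1;
    }

    /// Removes one lock; returns true when the count for `start_ts`
    /// reaches zero and the entry is dropped.
    fn untrack(&mut self, start_ts: u64) -> bool {
        let remaining = match self.by_ts.get_mut(&start_ts) {
            Some(count) => {
                // Entries are removed at zero, so `count` is at least 1 here.
                *count -= 1;
                *count
            }
            None => return false,
        };
        if remaining == 0 {
            self.by_ts.remove(&start_ts);
            true
        } else {
            false
        }
    }

    /// Smallest start timestamp that still has outstanding locks.
    fn min_ts(&self) -> Option<u64> {
        self.by_ts.keys().next().copied()
    }
}
```

The counter uses constant memory per timestamp, but unlike a set it cannot tell which keys are locked or detect the same key being tracked twice, so in this sketch the caller must pair every `track` with exactly one `untrack`.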

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: Neil Shen <[email protected]>

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: Neil Shen <[email protected]>
ti-chi-bot bot pushed a commit that referenced this issue Dec 27, 2023
 #15523 and #15554 (#15464)

close #14864, ref #14864, close #15412, ref #15412, close #15553

This commit rolls up the same patches as #15465 above: #15400, #15411, #15452, #15486, #15529, #15523 and #15554.