
Long running erlang map/reduce can block view compaction from completion, leaking erlang procs #4725

Open
KangTheTerrible opened this issue Aug 10, 2023 · 7 comments

Comments

@KangTheTerrible
Contributor

KangTheTerrible commented Aug 10, 2023

Description

A long-running/slow Erlang map/reduce, triggered by a new shard deployment, appears to block that shard's view compaction from completing. It also appears to leak Erlang procs at a steady rate of 5k-10k per hour.

Steps to Reproduce

Start view compaction
Start a long-running Erlang map/reduce
View compaction tries to complete but cannot until the indexer completes (suspected; waiting to observe this outcome)
Observe a steady increase in Erlang procs (may require continued insertion into/interaction with the shard)
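The first repro step goes through CouchDB's documented POST /{db}/_compact/{design-doc} endpoint. A minimal Python sketch (the server URL, database, and helper name are illustrative, not from the thread; the HTTP call is injectable so it can be exercised offline):

```python
import json
from urllib.request import Request, urlopen

def start_view_compaction(db_url, ddoc, post=None):
    """Kick off view compaction for one design doc via
    POST {db_url}/_compact/{ddoc}.

    `post` is injectable for offline testing; by default it performs a
    real HTTP POST and decodes the JSON response.
    """
    if post is None:
        def post(url):
            # CouchDB requires a Content-Type header on this POST.
            req = Request(url, data=b"", method="POST",
                          headers={"Content-Type": "application/json"})
            return json.load(urlopen(req))
    return post(f"{db_url}/_compact/{ddoc}")
```

On success CouchDB answers `{"ok": true}` and compaction proceeds in the background.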

Expected Behaviour

View compaction should not be blocked
Erlang procs should not keep increasing until the proc limit is hit and the node crashes

Your Environment

AWS C6i.x32large 5 nodes q=3 n=5

  • CouchDB version used: 3.2.2
  • Operating system and version: Debian Buster

Additional Context

We resharded, which resulted in the Erlang map/reduce taking a lot longer than it should have (it was not incremental).

@KangTheTerrible KangTheTerrible changed the title Long running erlang map/reduce can block compaction from completion, leaking erlang procs Long running erlang map/reduce can block view compaction from completion, leaking erlang procs Aug 11, 2023
@KangTheTerrible
Contributor Author

KangTheTerrible commented Aug 11, 2023

An additional piece of useful info: while the index was being built for the first time, I got the following from the Erlang view's metadata. The leaking Erlang procs appear to be the "clients waiting for the index".

_design/erlangstatsstats Metadata

Index Information
Language: Erlang
Currently being updated? Yes
Currently running compaction? Yes
Waiting for a commit? Yes
Clients waiting for the index: 719422
Update sequence on DB: 257926611
Processed purge sequence: 0
Actual data size (bytes): 602,563,809,246
Data size on disk (bytes): 1,187,591,035,418
MD5 Signature:

@KangTheTerrible
Contributor Author

This does eventually resolve gracefully, given enough Erlang procs and storage. An additional change needed to keep on top of storage was increasing the view-ratio smoosh concurrency values, since stuck compactions prevented other compactions from running.

@nickva
Contributor

nickva commented Aug 22, 2023

One strategy could be to periodically poll the https://docs.couchdb.org/en/stable/api/ddoc/common.html#db-design-design-doc-info endpoint and wait until the index has finished building before querying it, to avoid piling up too many client requests if the index is large.
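That polling strategy could look like the following minimal Python sketch against the design-doc info endpoint. The function and parameter names are illustrative, and the HTTP fetch is injectable so the loop can be tested without a live server:

```python
import json
import time
from urllib.request import urlopen

def wait_for_index(db_url, ddoc, fetch=None, interval=10, timeout=3600):
    """Poll GET {db_url}/_design/{ddoc}/_info until the view updater is idle.

    `fetch` is injectable for offline testing; by default it performs a
    real HTTP GET and decodes the JSON body.
    """
    if fetch is None:
        fetch = lambda url: json.load(urlopen(url))
    url = f"{db_url}/_design/{ddoc}/_info"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # The response carries a "view_index" object with fields such as
        # updater_running, compact_running and waiting_clients.
        info = fetch(url)["view_index"]
        if not info.get("updater_running"):
            return info
        time.sleep(interval)
    raise TimeoutError(f"index {ddoc} still building after {timeout}s")
```

Only once this returns would clients start issuing real view queries, keeping waiting_clients from ballooning the way it did in this issue.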

Using a larger Q (resharding) could also help parallelize index building if you have the compute and disk-throughput resources.

@KangTheTerrible
Contributor Author

Yeah Nick, in our case unfortunately this was a live production server, so we had no trivial means to block users from attempting to access the view. Worth noting: none of these clients were actually waiting; all view requests to this view use stable=false&update=lazy.

@fr2lancer

Hi, actually I can't see any outstanding lines in debug mode in the log.
There are just no logs since yesterday, the process can't be identified, and the load in top is 5.0.
It is not consuming too much memory.

Do you know how to flush the debug log from Erlang?

@nickva
Contributor

nickva commented Dec 6, 2023

  • I'll second @rnewson's proposal to try an old-ddoc/new-ddoc strategy to deploy new views.

  • Clients could use stable=false&update=false and let ken (the index auto-builder) build the indices for you in the background. Monitor with _active_tasks.

  • There is an undocumented [smoosh.ignore] $shard = true setting that allows the auto-compactor to ignore specific shards. For example:

[smoosh.ignore]
shards/e0000000-ffffffff/dbname.1660859921 = true
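The _active_tasks monitoring mentioned above can be sketched as a small filter over the server-level endpoint. Names are illustrative and the fetch is injectable for offline testing:

```python
import json
from urllib.request import urlopen

def indexer_tasks(server_url, fetch=None):
    """Return the view-indexer entries from GET {server_url}/_active_tasks.

    `fetch` is injectable for offline testing; by default it performs a
    real HTTP GET and decodes the JSON body.
    """
    if fetch is None:
        fetch = lambda url: json.load(urlopen(url))
    tasks = fetch(f"{server_url}/_active_tasks")
    # Indexer entries carry fields such as database, design_document,
    # changes_done, total_changes and progress.
    return [t for t in tasks if t.get("type") == "indexer"]
```

Polling this until no indexer entry remains for the design doc is another way to tell the background build (driven by ken) has finished.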
