
Flow.chunked operator with size limit #1290

Closed
azulkarnyaev opened this issue Jun 24, 2019 · 21 comments

@azulkarnyaev commented Jun 24, 2019

It would be useful to have an optional transformation in the Flow.buffer method to aggregate the buffered items, like kotlin.sequences.Sequence.chunked does.
I mean something like:

fun <T, R> Flow<T>.buffer(capacity: Int = BUFFERED, transform: suspend (List<T>) -> R): Flow<R>

Then we can write

runBlocking {
    (1..100).asFlow().buffer(capacity = 10) { it.sum() }.collect { println(it) }
}

with result 55, 155, 255, ... , 955
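Until such an operator exists, the behavior can be approximated by hand. Below is a minimal sketch; the name chunkedMap and the trailing partial-chunk emission are my assumptions, not part of any proposed API:

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Sketch of the requested behavior: collect `size` elements, apply
// `transform`, emit the result. The name `chunkedMap` is illustrative.
fun <T, R> Flow<T>.chunkedMap(size: Int, transform: suspend (List<T>) -> R): Flow<R> = flow {
    val chunk = ArrayList<T>(size)
    collect { element ->
        chunk += element
        if (chunk.size == size) {
            emit(transform(chunk.toList()))
            chunk.clear()
        }
    }
    // Emit a trailing partial chunk, mirroring Sequence.chunked.
    if (chunk.isNotEmpty()) emit(transform(chunk.toList()))
}

fun main() = runBlocking {
    (1..100).asFlow().chunkedMap(10) { it.sum() }.collect { println(it) }
    // prints 55, 155, 255, ..., 955
}
```

Note that, unlike the proposed buffer overload, this sketch is fully sequential and provides no concurrency between upstream and downstream.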

@elizarov (Contributor) commented Jun 24, 2019

That would be a separate operator that we'll call chunked or something like that (as we generally follow stdlib naming convention). This operator will be totally unrelated to the buffer operator. Unlike buffer, this operator will be fully sequential.

P.S. Rx has a whole set of bufferXxx operators that actually correspond to chunked/windowed in Kotlin. On the other hand, buffer/conflate operators in Kotlin flows somewhat correspond to Rx onBackpressureBuffer operators.

@elizarov elizarov added the flow label Jun 24, 2019
@elizarov elizarov changed the title Flow.buffer with transformer Flow.chunked operator Jun 24, 2019
@elizarov (Contributor)

Also, I forgot to ask: what would be your use case for such an operator?

@azulkarnyaev (Author)

Thank you very much for the response!
About the use case: I need to handle a stream of vectors (CSV rows) from a TCP socket and write aggregated statistics for every chunk of n messages to files.

@elizarov (Contributor)

Are you sure you need Flow for this? Wouldn't a Sequence from the Kotlin standard library work for you?

@azulkarnyaev (Author)

Well, I'm using Ktor as a server for socket connection and using coroutines to write data to a file. Pseudo code for my task:

launch {
    val socket = server.accept()
    val input = socket.openReadChannel()
    flow {
        while (true) {
            val line = input.readUTF8Line()
            emit(line)
        }
    }.map {
        convertToDomainObject(it)
    }.chunked(1000) {
        aggregateToDomainObjects(it)
    }.collect {
        writeToFile(it)
    }
}

Yes, I could use a plain synchronous Sequence. But then I would need to provide a back-pressure mechanism manually: what if I receive messages faster than I can store them? Hopefully a chunked() method would work like buffer() and provide back pressure out of the box.

@elizarov (Contributor)

@azulkarnyaev Thanks for the explanation. It does make sense.

@circusmagnus commented Jun 30, 2019

I would second this issue. I have a use case where I'm receiving subsequent snapshots of a database and I need to produce classes representing diffs between those snapshots.

So I need to cache two subsequent emissions, emit them as Pair(first, second), remove the first emission, and wait for the third emission to emit another Pair(second, third). With collections, I get this with the windowed function.

I do not control the frequency of emissions, and they must happen on a background thread, hence Flow is needed.

I think it could be generalized into something like this:

fun <T> Flow<T>.windowed(size: Int, step: Int): Flow<List<T>> = flow {
    require(size > 0 && step > 0) { "size and step must be positive" }
    val queue = ArrayDeque<T>(size)
    // If somebody would like to skip some elements before getting another
    // window, by passing a step greater than size, then why not?
    val toSkip = max(step - size, 0)
    val toRemove = min(step, size)
    var skipped = toSkip // the first window starts at the first element
    collect { element ->
        if (queue.size < size && skipped == toSkip) {
            queue.add(element)
        } else if (queue.size < size && skipped < toSkip) {
            skipped++
        }

        if (queue.size == size) {
            emit(queue.toList())
            repeat(toRemove) { queue.removeFirst() }
            skipped = 0
        }
    }
}

Intended use:

flow.windowed(size = 2, step = 1)
    .map { listOfTwoNeighboringEmissions ->
        computeDiff(listOfTwoNeighboringEmissions)
    }

@elizarov elizarov changed the title Flow.chunked operator Flow.chunked operator with size limit Jul 1, 2019
@zach-klippenstein (Contributor)

@circusmagnus That sounds like a use case for scan more than windowing.

@circusmagnus

Almost. scan requires me either to provide an initial value, which I do not have (an empty diff is pointless; I just need to swallow the first DB emission and wait for the second one to produce a diff), or to emit the same type as I am receiving (scanReduce), which is also a no-go, as I receive a list of entities but want to emit the changes between them.
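For this particular shape of the problem (pairs of neighboring emissions, no initial value), a small dedicated operator also works. A minimal sketch, where the name pairwise is mine:

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Illustrative helper: pair each element with its predecessor, swallowing
// the first emission instead of requiring an initial value like `scan`.
fun <T : Any> Flow<T>.pairwise(): Flow<Pair<T, T>> = flow {
    var previous: T? = null
    collect { element ->
        previous?.let { emit(it to element) }
        previous = element
    }
}

fun main() = runBlocking {
    flowOf("a", "b", "c").pairwise().collect { println(it) }
    // prints (a, b), then (b, c)
}
```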

@tunjid commented Aug 6, 2020

Would this work?

flow
    .scan(listOf<Item>()) { oldItems, newItem ->
        if (oldItems.size >= BUFFER_COUNT) listOf(newItem)
        else oldItems + newItem
    }
    .filter { it.size == BUFFER_COUNT }
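A runnable sketch of this approach, with BUFFER_COUNT assumed to be a constant; note that a trailing partial chunk is silently dropped by the filter:

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

const val BUFFER_COUNT = 3

fun main() = runBlocking {
    val chunks = (1..7).asFlow()
        // Accumulate into a list; start a fresh list once a full chunk was seen.
        .scan(listOf<Int>()) { oldItems, newItem ->
            if (oldItems.size >= BUFFER_COUNT) listOf(newItem)
            else oldItems + newItem
        }
        // Pass through only completed chunks (partial ones are dropped).
        .filter { it.size == BUFFER_COUNT }
        .toList()
    println(chunks) // [[1, 2, 3], [4, 5, 6]] -- the trailing 7 is lost
}
```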

@circusmagnus

At first glance it should work. I would, however, recommend the more efficient and streamlined operator outlined in this PR: #1558
flow.chunked(2) { twoEmissions -> combine(twoEmissions) }

It is not going into the coroutines library, as it does not deal with time-based chunking/windowing. But for now it is the best solution for size-based chunking.

@AWinterman

Hi all, I'm curious if there's been any work on this. I reach for something like this about once every two weeks, and keep coming back to this thread.

@circusmagnus

I have proposed a design for unified time- and size-based chunking in #1302. You are welcome to comment or just give a thumbs up (or down).
No idea what the coroutines team's plans are regarding this issue, though.

@AWinterman commented Feb 9, 2021

I'm realizing that I actually want a slightly different behavior from what I've seen discussed thus far, because really all I want is to be able to convert a stream of values to a batch operation when appropriate.

something like the following:

/**
 * [chunked] buffers at most [maxSize] elements, preferring to emit early rather than wait
 * if fewer than [maxSize] have arrived.
 *
 * If [checkIntervalMillis] is specified, [chunked] suspends for [checkIntervalMillis] to allow
 * the buffer to fill.
 *
 * TODO: move to kotlin common
 */
fun <T> Flow<T>.chunked(maxSize: Int, checkIntervalMillis: Long = 0): Flow<List<T>>

This is optimizing for a database that performs better with batch operations than with many small ones, and for which it's safest to restrict writes to once a second.

My implementation is here. I'm sure I'm using coroutines incorrectly somehow:
https://gist.github.com/AWinterman/8516d4869f491176ebb270dafbb23199
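I have not studied the gist's internals, but the described "drain whatever has accumulated, at most maxSize, once per interval" behavior could be sketched roughly as follows; chunkedEvery is an illustrative name, not the gist's API:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Sketch: buffer upstream in a channel, then once per interval drain
// whatever is available (up to maxSize) into a chunk and emit it.
@OptIn(ExperimentalCoroutinesApi::class)
fun <T : Any> Flow<T>.chunkedEvery(maxSize: Int, checkIntervalMillis: Long): Flow<List<T>> =
    channelFlow {
        val upstream = buffer(maxSize).produceIn(this) // backpressure via channel capacity
        val chunk = ArrayList<T>(maxSize)
        var done = false
        while (!done) {
            delay(checkIntervalMillis)          // give the buffer a chance to fill
            while (chunk.size < maxSize) {      // drain what is currently available
                val result = upstream.tryReceive()
                val value = result.getOrNull()
                if (value == null) {
                    if (result.isClosed) done = true
                    break
                }
                chunk += value
            }
            if (chunk.isNotEmpty()) {
                send(chunk.toList())            // emit early even if not full
                chunk.clear()
            }
        }
    }

fun main() = runBlocking {
    (1..10).asFlow()
        .chunkedEvery(maxSize = 4, checkIntervalMillis = 10)
        .collect { println(it) }
}
```

Like the signature above, this emits partial chunks rather than waiting for maxSize, and the channel capacity exerts backpressure on the upstream flow.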

@circusmagnus

It seems that your chunked operator will suspend after filling up the buffer (max size reached), but it will not emit until checkIntervalMillis has elapsed. checkIntervalMillis is a must-have condition for it to emit.

Is that intentional?

@AWinterman commented Feb 9, 2021

@circusmagnus I'm not sure I follow.

  1. suspending after filling up the buffer is intentional. If the buffer is full, we need to exert backpressure on the upstream flow.
  2. the delay(checkIntervalMillis) ensures that we do not busy wait, and that the buffer has a chance to fill up before we collect a chunk and emit.

I don't know what you mean by "checkIntervalMillis is a must-have condition for it to emit."

@AWinterman commented Feb 9, 2021

Ah, I just realized that the delay interval can be accomplished downstream:

.transform {
    emit(it)
    delay(100)
}

which makes my use case wholly subsumed by a Flow.chunked operator with a size limit.

@circusmagnus commented Feb 9, 2021

> @circusmagnus I'm not sure I follow.
>
> 1. suspending after filling up the buffer is intentional. If the buffer is full, we need to exert backpressure on the upstream flow.
> 2. the delay(checkIntervalMillis) ensures that we do not busy wait, and that the buffer has a chance to fill up before we collect a chunk and emit.
>
> I don't know what you mean by "checkIntervalMillis is a must-have condition for it to emit."

1. If the buffer is full, we could try to emit rather than suspend upstream until checkIntervalMillis is reached. Perhaps downstream is idle and can accept a new chunk before checkIntervalMillis elapses. In your implementation, downstream cannot emit more often than checkIntervalMillis specifies; there is an unavoidable delay() there:

   while (!buffer.isClosedForReceive) {
       val chunk = getChunk(buffer, maxSize)
       send(chunk)
       delay(checkIntervalMillis) // <- we cannot emit more often than that
   }

2. Sure, checkIntervalMillis is a must-have condition to emit, but maxSize is not: we can emit before reaching max size, but we cannot emit more often than checkIntervalMillis says. Was it intentional? Do you need to limit the frequency of emissions in your use case?

@AWinterman

> Sure, checkIntervalMillis is a must-have condition to emit, but maxSize is not: we can emit before reaching max size, but we cannot emit more often than checkIntervalMillis says. Was it intentional? Do you need to limit the frequency of emissions in your use case?

Ah, yes, it was intentional: emit no more frequently than X. But as I stated above, because flows are composable, this can be accomplished with a downstream flow operation, so my use case is entirely satisfied by a "Flow.chunked operator with size limit".

@sskrla commented Nov 9, 2022

Another possible implementation that we currently use:

/**
 * Chunks based on a time or size threshold.
 *
 * Borrowed from this [Stack Overflow question](https://stackoverflow.com/questions/51022533/kotlin-chunk-sequence-based-on-size-and-time).
 */
@OptIn(ObsoleteCoroutinesApi::class) 
fun <T> ReceiveChannel<T>.chunked(scope: CoroutineScope, size: Int, time: Duration) =
    scope.produce<List<T>> {
        while (true) { // this loop goes over each chunk
            val chunk = ConcurrentLinkedQueue<T>() // current chunk
            val ticker = ticker(time.toMillis()) // time-limit for this chunk
            try {
                whileSelect {
                    ticker.onReceive {
                        false  // done with chunk when timer ticks, takes priority over received elements
                    }
                    this@chunked.onReceive {
                        chunk += it
                        chunk.size < size // continue whileSelect if chunk is not full
                    }
                }

            } catch (e: ClosedReceiveChannelException) {
                return@produce

            } finally {
                ticker.cancel()
                if (chunk.isNotEmpty())
                    send(chunk.toList())
            }
        }
    }

fun <T> Flow<T>.chunked(size: Int, time: Duration) =
    channelFlow {
        coroutineScope {
            val channel = asChannel(this@chunked).chunked(this, size, time)
            try {
                while (!channel.isClosedForReceive) {
                    send(channel.receive())
                }

            } catch(e: ClosedReceiveChannelException) {
                // Channel was closed by the flow completing, nothing to do

            } catch(e: CancellationException) {
                channel.cancel(e)
                throw e

            } catch (e: Exception) {
                channel.cancel(CancellationException("Closing channel due to flow exception", e))
                throw e
            }
        }
    }

@ExperimentalCoroutinesApi 
fun <T> CoroutineScope.asChannel(flow: Flow<T>): ReceiveChannel<T> = produce {
    flow.collect { value ->
        channel.send(value)
    }
}

I am not certain this is entirely correct; specifically, launching a scope within the channelFlow seems like it may not be very "flowy".

Our specific use case is batching with a max linger, so that we can make efficient external service calls without introducing too much delay when we can't fill the batch.

@iseki0 commented Apr 18, 2024

Time has passed quickly, and it’s already been five years since this issue was first raised. Could you please provide an update on when we can expect a resolution?

8 participants