Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(executor): simply drop the futures to cancel a query #720

Merged
merged 3 commits into from
Nov 2, 2022

Conversation

wangrunji0408
Copy link
Member

This PR proposes a new way to do query cancellation: simply drop the futures.

In this solution, each executor task should be aware that they may be dropped at any time, which means the query is being canceled. To fit this pattern, some behavior related to drop has to be changed:

  • DataChunkBuilders are allowed to drop with remaining rows.
  • Transactions are allowed to drop without explicit commit or abort, and drop will implicitly abort the transaction.
  • All spawned futures have to be canceled by dropping their JoinHandle, or exit actively by closing a channel.

Compared with the current solution #467, it doesn't use a cancellation token to inform the future and wait for it to exit, which significantly simplifies the code. But due to the lack of support for async-drop, it's not trivial to perform async operations in the cancellation. A possible solution is to spawn the async function or block the thread in the drop function.
This PR almost reverts #467, but retains its functionality. @arkbriar Please take a look and see if this looks good to you. 👀

@arkbriar
Copy link
Contributor

arkbriar commented Nov 1, 2022

@wangrunji0408 I think the PR looks good as long as the transaction can be aborted without async operations. I see it aims to provide the abort-safety. But I'm not sure if that is always true🤔.

The cancellation token might be a feasible solution in all cases where we don't have to assume that no async operations would happen when cancellation happens. But I agree without async-drop, the code would be a mess and both hard to read and write.

Besides, I have some questions about the pattern:

All spawned futures have to be canceled by dropping their JoinHandle.

Does it work? I don't think dropping a JoinHandle would cancel or trigger any kind of cancellation to the spawned future. It just drops the handle that tells the task status.

A possible solution is to spawn the async function or block the thread in the drop function.

IMO, spawning another future is useful only when there's no extra requirement for cancellation progress, e.g., the order of processing. And blocking the thread should never be considered because it can block all of the working threads and result in unexpected halts or performance degradations which are really hard to reproduce and debug.

@wangrunji0408
Copy link
Member Author

I think there should be a way to abort transactions without async operations, or to leave async cleanups running in the background and don't wait for them to finish. Because thinking of the worst case, an instant crash is also considered abort to the ongoing transaction. So a simple drop should also be safe for the abort.

All spawned futures have to be canceled by dropping their JoinHandle.

Does it work? I don't think dropping a JoinHandle would cancel or trigger any kind of cancellation to the spawned future. It just drops the handle that tells the task status.

You're right. The tokio JoinHandle doesn't cancel the task on drop, but another runtime smol does. In fact, the latter design is more reasonable from the perspective of structured concurrency. Anyway, we can wrap the JoinHandle with a new type and make it cancel on drop.

IMO, spawning another future is useful only when there's no requirement for cancellation progress. And blocking the thread should never be considered because it can block all of the working threads and result in unexpected halts or performance degradations which are really hard to reproduce and debug.

If there's a requirement to wait for the async cancellation, we can maintain a context for each task, let the task spawn async operations into it, and finally join them all, as you did before. But I believe the context can be made as an implicit global state, which is less intrusive than the current implementation. This can be seen as a workaround for the lack of async-drop. And I agree with you that blocking is a bad idea. 🥵

@arkbriar
Copy link
Contributor

arkbriar commented Nov 1, 2022

@wangrunji0408 Cool! Thanks for letting me know that there are runtimes support cancellations on drop.

I think there should be a way to abort transactions without async operations or to leave async cleanups running in the background and don't wait for them to finish. Because thinking of the worst case, an instant crash is also considered abort to the ongoing transaction. So a simple drop should also be safe for the abort.

Restarting after a crash often requires an recovery progress to check and repair the data for consistency, which I don't think we can also have when interrupting a query. But I agree that we can avoid that with a careful design and by having something like the background vacuum thread.

Anyways, it generally sounds good to me! Let me go through the code in detail.

Copy link
Contributor

@arkbriar arkbriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@arkbriar
Copy link
Contributor

arkbriar commented Nov 1, 2022

Sorry, I just remembered one of the reasons that I chose to use the cancellation token. One possible disadvantage would be that the cancellation is propagated with the channel and can only be detected when sending to / receiving from channels, which may cause a waste of computation:

for {
    event = process().await?;           // This is not cancellable,
    if tx.send(event).await.is_err() {  // so abort here.
        return;
    }
}

It isn't a problem now. But would be a problem when the process contains some really CPU-intensive and non-interruptible workloads (such as HashJoins which pull all the data up and do the computation before emitting any result). In those cases, I would prefer to propagate the cancellation to the leaf operators and stop them immediately.

                            - TableScan
                           /
Filter -- HashJoin(spawned) 
                           \
                            - TableScan

WDYT? @wangrunji0408 @skyzh

@wangrunji0408
Copy link
Member Author

I've inserted some yield points into the join executors, so that they won't take the CPU for a long time. With careful implementation, the cancellation can be propagated top-down by cascade drops. I've tested canceling a join query in this PR and it seems to work well.

@arkbriar
Copy link
Contributor

arkbriar commented Nov 1, 2022

I've inserted some yield points into the join executors, so that they won't take the CPU for a long time. With careful implementation, the cancellation can be propagated top-down by cascade drops. I've tested canceling a join query in this PR and it seems to work well.

No, I mean the spawned tasks aren't aware of the cancellation until they operate with the channel, so yielding points doesn't help. They just make the tasks yield control to the runtime. Currently, there's no intra-parallelism inside a physical plan, so it is expected to work. But the CPU usage of it is bounded by 1 core. I believe in the future we will need to support such intra-parallelism and at that time we will not be able to tell which executor is spawned and which is not.

@wangrunji0408
Copy link
Member Author

Oh, the spawned task will be canceled by dropping its JoinHandle from the parent task, though it still requires the spawned task to yield (equivalent to checking the cancellation token), otherwise the task can not stop running. As long as these 2 conditions are met, the cancellation can be propagated, no matter whether it's on single-thread or multi-thread.

@arkbriar
Copy link
Contributor

arkbriar commented Nov 1, 2022

Oh, the spawned task will be canceled by dropping its JoinHandle from the parent task, though it still requires the spawned task to yield (equivalent to checking the cancellation token), otherwise the task can not stop running. As long as these 2 conditions are met, the cancellation can be propagated, no matter whether it's on single-thread or multi-thread.

I see. Interesting 🤔

@wangrunji0408 wangrunji0408 enabled auto-merge (squash) November 2, 2022 02:46
@wangrunji0408 wangrunji0408 merged commit c440ee4 into main Nov 2, 2022
@wangrunji0408 wangrunji0408 deleted the wrj/cancel branch November 2, 2022 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants