
CoroutineScheduler rework #1652

Merged: 8 commits merged into develop on Dec 12, 2019

Conversation

@qwwdfsad (Collaborator) commented Nov 8, 2019

Fixes #840
Fixes #1046
Fixes #1286

@qwwdfsad qwwdfsad force-pushed the scheduler-changes branch 2 times, most recently from dd7ab91 to dd30af0, on November 26, 2019 17:21
@elizarov (Contributor) commented Nov 27, 2019

It does not seem that we need @Volatile on state: it is only read by other threads after Thread.join and in toString().
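
For context, Thread.join establishes a happens-before edge with everything the joined thread did, so a plain field written only by that thread is safely visible afterwards. A minimal, self-contained sketch of that pattern (hypothetical names, not the scheduler's actual field):

    // Minimal sketch: the write to `state` is a plain (non-volatile) write, but it is
    // guaranteed to be visible after join() by the Java memory model's happens-before rule.
    class Worker : Thread() {
        var state: Int = 0 // no @Volatile needed

        override fun run() {
            state = 42 // written only by the worker thread itself
        }
    }

    fun main() {
        val worker = Worker()
        worker.start()
        worker.join() // all writes made by the worker happen-before join() returns
        check(worker.state == 42) // plain read, yet guaranteed to observe 42
    }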

@qwwdfsad qwwdfsad marked this pull request as ready for review November 27, 2019 11:47
@qwwdfsad qwwdfsad changed the title from "[DRAFT] Scheduler changes" to "Scheduler changes" Nov 27, 2019
@elizarov (Contributor) left a comment

Looks good. All the feedback I can give is mostly cosmetic.

@qwwdfsad qwwdfsad changed the title from "Scheduler changes" to "CoroutineScheduler rework" Nov 28, 2019
@elizarov (Contributor) commented:
Note, a bug was introduced somewhere in the shutdown sequence. It can now hang on shutdown: https://teamcity.jetbrains.com/buildConfiguration/KotlinTools_KotlinxCoroutines_NightlyStressWindows/2652563

org.junit.runners.model.TestTimedOutException: test timed out after 60000 milliseconds
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1260)
	at kotlinx.coroutines.scheduling.CoroutineScheduler.shutdown(CoroutineScheduler.kt:346)
	at kotlinx.coroutines.scheduling.ExperimentalCoroutineDispatcher.shutdown$kotlinx_coroutines_core(Dispatcher.kt:125)
	at kotlinx.coroutines.TestBase.shutdownPoolsAfterTest(TestBase.kt:173)
	at kotlinx.coroutines.TestBase.onCompletion(TestBase.kt:150)
	at kotlinx.coroutines.debug.DebugTestBase.tearDown(DebugTestBase.kt:29)

@qwwdfsad (Collaborator, Author) commented Dec 2, 2019

I was avoiding adding long-scanning (dd30af0) because it could have masked all the bugs in the parking/unparking rendezvous. Ironically, it has exposed existing problems :) It seems impossible to ensure strict liveness in workloads like BlockingCoroutineDispatcherMixedStealingStressTest. I can't say this is a "bug", but it is rather something we should be aware of before the merge.
I will elaborate on the details later; for now I will just fix the following execution that leads to hanging (a minimal analogy of the dependency shape is sketched after the list):

  • 1 core, 2 threads (one holding the CPU token); both finish scanning the queues before pushing themselves onto the parking stack.
  • An external submitter submits a CPU task to the global queue, can't find anyone to unpark, can't create a new thread, and bails out.
  • The thread without the CPU token pushes itself onto the stack.
  • The thread with the CPU token pushes itself onto the stack.
  • The thread with the CPU token finds the external CPU task and starts executing it.
  • The thread without the token parks.
  • An external submitter submits a blocking task, tries to compensate with a new worker but can't, pops the stack, finds the thread with the CPU token (which is executing the CPU task right now), successfully "unparks" it, and bails out.
  • The CPU task never completes (due to the nature of the test, it can't complete before the blocking task), the blocking task stays in the queue, and the blocking worker is parked forever.
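
A minimal analogy of this dependency shape, built on plain java.util.concurrent rather than the scheduler internals (all names below are illustrative): the "CPU" task occupies the only worker while waiting for the "blocking" task, which is stuck behind it.

    import java.util.concurrent.CountDownLatch
    import java.util.concurrent.Executors
    import java.util.concurrent.TimeUnit

    fun main() {
        val pool = Executors.newFixedThreadPool(1) // stands in for the single CPU-token worker
        val latch = CountDownLatch(1)

        // "CPU" task: occupies the only worker and waits for the blocking task's side effect.
        pool.execute { latch.await() }

        // "Blocking" task: would release the CPU task, but is queued behind it and never runs.
        pool.execute { latch.countDown() }

        pool.shutdown()
        // Prints "terminated: false": the two tasks deadlock on the single worker
        // (the JVM then stays alive because the worker thread is still blocked).
        println("terminated: " + pool.awaitTermination(2, TimeUnit.SECONDS))
    }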

@elizarov (Contributor) commented Dec 10, 2019

@qwwdfsad
....

  • An external submitter submits a blocking task, tries to compensate with a new worker but can't, pops the stack, finds the thread with the CPU token (which is executing the CPU task right now), successfully "unparks" it, and bails out.

Hm. If the thread is executing a CPU task right now, then it should not be unparked by tryUnpark, because worker.tryForbidTermination() should return false if the thread was claimed before: the terminationState.value = TERMINATION_ALLOWED assignment is only done in the actual worker.park() method, after the CPU task has finished executing.

Combining terminationState and parkingState into a single state machine that is reset after parking should fix this problem.
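
A rough sketch of that combined state machine, assuming hypothetical names (this is not the actual CoroutineScheduler code): one atomic field answers both "is the worker parked?" and "may it be claimed or terminated?", and it is reset only right before the worker actually parks.

    import java.util.concurrent.atomic.AtomicInteger

    private const val CLAIMED = 0    // worker is running, or was already claimed by an unparker
    private const val PARKED = 1     // worker is parked and may be unparked or terminated
    private const val TERMINATED = 2 // worker has been terminated

    class WorkerState {
        private val state = AtomicInteger(CLAIMED)

        // Called by the worker itself right before parking; only here does it become claimable.
        fun beforePark() = state.set(PARKED)

        // Called by an external submitter: succeeds only if the worker is really parked,
        // so a worker that is busy executing a CPU task cannot be "unparked" by mistake.
        fun tryClaimForUnpark(): Boolean = state.compareAndSet(PARKED, CLAIMED)

        // Called by the termination sequence: likewise succeeds only from the PARKED state.
        fun tryTerminate(): Boolean = state.compareAndSet(PARKED, TERMINATED)
    }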

@elizarov (Contributor) left a comment

Looks way better now.

    * WorkQueue.trySteal reports not only whether the steal was successful, but also the waiting time until the task becomes stealable
    * CoroutineScheduler.trySteal attempts to steal from all the workers (starting from a random position) per iteration to make stealing deterministic
    * Parking mechanics rework: after an unsuccessful findTask, the worker immediately adds itself to the parking stack, then rescans all the queues to avoid missing tryUnparks, and only then parks itself (the parking duration depends on the WorkQueue.trySteal result), terminating later
    * Excessive spinning and parking are completely eliminated, significantly (x3) reducing CPU consumption and making CoroutineScheduler on par with FJP and FTP on Ktor-like workloads
    * The downside of aggressive parking is the cost of the slow-path unpark paid by external submitters, visible in the degraded DispatchersContextSwitchBenchmark; follow-up commits will fix that problem
    * Retry on tryStealLastScheduled failures to avoid potential starvation
    * Merge available CPU permits into controlState to simplify reasoning about pool state and make all state transitions atomic (a sketch of this packing follows these lists)
    * Get rid of synthetic accessors
    * Work stealing: get rid of the global queue for offloading during stealing, because in practice it never happens
    * Guard all critical work-stealing invariants with asserts
    * New work-signalling strategy that guarantees complete liveness in the face of "accidentally-blocking" CPU tasks
    * Advanced double-phase unparking mechanism that mitigates the most expensive part of signalling additional work
    * Current limitation: blocking tasks are not yet properly signalled
Invariants:

    * Steal only one task per attempt to avoid missed steals that could potentially block progress (check-park-check may miss tasks that are being stolen)
    * New WorkQueue.add invariant: bufferSize < capacity => add is always successful
    * Revisited tests that exposed a lot of problems
    * Ability to steal from the middle of the work queue in order to steal blocking tasks, with ABA prevention (see the sketch right after this list)
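
A hedged sketch of the "steal from the middle" idea above (not the real WorkQueue; names and structure are illustrative): a stealer looking for blocking tasks scans the whole buffer and claims a slot with a single CAS from the task reference to null. Assuming a task object is enqueued at most once, a slot never returns to the same reference, so a stale CAS cannot succeed.

    import java.util.concurrent.atomic.AtomicReferenceArray

    class Task(val isBlocking: Boolean)

    class SketchQueue(capacity: Int) {
        private val buffer = AtomicReferenceArray<Task?>(capacity)

        // Producer: publish a task into an empty slot.
        fun tryAdd(index: Int, task: Task): Boolean = buffer.compareAndSet(index, null, task)

        // Owner's poll: getAndSet also tolerates slots already emptied by stealers.
        fun pollAt(index: Int): Task? = buffer.getAndSet(index, null)

        // Stealer: take the first blocking task found anywhere in the buffer.
        fun tryStealBlocking(): Task? {
            for (i in 0 until buffer.length()) {
                val task = buffer.get(i) ?: continue
                if (task.isBlocking && buffer.compareAndSet(i, task, null)) return task
            }
            return null
        }
    }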

Changes:

    * Instead of "blocking workers", use a "blocking tasks" counter that is incremented on each blocking submission and decremented only when the task completes
    * On each work signalling, try to compensate for blocking tasks by enforcing the invariant "created threads == blocking tasks + up to core-size workers"
    * Now, if a worker was not spuriously woken up, there is a task dedicated to it that should be found. For that reason, attempt to steal blocking tasks
      (which may be in the middle of the work queue). Additionally, instead of scanning the whole global queue, just split it in two (one for blocking tasks, one for regular tasks)
    * Get rid of conditional remove from the global queue
    * Avoid excessive unparks for threads that are not yet able to steal the task due to work-stealing resolution: do not add such workers to the stack
    * Do not push a worker to the stack during the second pass of "min duration scanning"
    * Locally cache whether the local queue has any work to execute, to save an atomic getAndSet and a bunch of atomic loads
    * More precise rendezvous for parking
    * Long-scanning stealing to emulate spinning
    * Properly handle interference between the termination sequence and protection against spurious wakeups
    * Documentation improvements, naming, a proper spurious-wakeup check
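
The "merge available CPU permits with controlState" item above amounts to packing several counters into one atomic word so that every pool-state transition is a single CAS. A hedged sketch of that packing, with an illustrative field layout that is not necessarily the actual CoroutineScheduler layout:

    import java.util.concurrent.atomic.AtomicLong

    // Illustrative layout: created workers in bits 0..20, blocking tasks in bits 21..41,
    // available CPU permits in the remaining high bits.
    private const val CREATED_MASK = (1L shl 21) - 1
    private const val BLOCKING_SHIFT = 21
    private const val BLOCKING_MASK = CREATED_MASK shl BLOCKING_SHIFT
    private const val CPU_SHIFT = 42

    class PackedPoolState(corePoolSize: Int) {
        private val controlState = AtomicLong(corePoolSize.toLong() shl CPU_SHIFT)

        fun createdWorkers(state: Long): Int = (state and CREATED_MASK).toInt()
        fun blockingTasks(state: Long): Int = ((state and BLOCKING_MASK) shr BLOCKING_SHIFT).toInt()
        fun availableCpuPermits(state: Long): Int = (state shr CPU_SHIFT).toInt()

        // Acquire a CPU permit atomically with observing the rest of the pool state.
        fun tryAcquireCpuPermit(): Boolean {
            while (true) {
                val state = controlState.get()
                if (availableCpuPermits(state) <= 0) return false
                if (controlState.compareAndSet(state, state - (1L shl CPU_SHIFT))) return true
            }
        }

        // "Blocking tasks" counter from the Changes list: incremented on each blocking
        // submission, decremented only when the task completes.
        fun incrementBlockingTasks(): Long = controlState.addAndGet(1L shl BLOCKING_SHIFT)
        fun decrementBlockingTasks(): Long = controlState.addAndGet(-(1L shl BLOCKING_SHIFT))
    }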
@qwwdfsad qwwdfsad merged commit cbd5b1c into develop Dec 12, 2019
@qwwdfsad qwwdfsad deleted the scheduler-changes branch February 13, 2020 12:41