
added moodycamel/concurrentqueue as the default task queue provider f… #192

Merged: 1 commit, Jan 14, 2016

Conversation

glglwty
Collaborator

@glglwty glglwty commented Jan 13, 2016

…or fastrun. Fixed unnecessary seq_cst sync in task::task_state. Added a benchmark for task_queue. Set the default sim_net delay to 0.

about the queue

moodycamel/concurrentqueue is a fast MPMC (multi-producer, multi-consumer) queue, released under the BSD license. It is much faster than naive implementations, including the boost and TBB implementations. Its implementation is intricate, and I don't think there is a way to hand-craft a queue that is comparable to it.

performance change

I tested the performance of hpc_priority_task_queue and the newly added hpc_concurrent_task_queue using core.task_queue_perf_test. Here's their performance:

test                       hpc_concurrent_task_queue   hpc_priority_task_queue
inter-thread flooding      throughput = 6672467        throughput = 5152068
self-flooding              throughput = 7226603        throughput = 7956709
inter-thread blocking      throughput = 1046366        throughput = 100560
self-iterating             throughput = 4041782        throughput = 6062134
tick-tock                  throughput = 2700396        throughput = 266755

We can see that hpc_concurrent_task_queue is much faster under contention or partially-idle
workloads.

discussion about performance

The performance of our task_queues is still not ideal. I profiled the benchmarks, and currently the most prominent overhead lies in the procedure that frees tasks. A memory pool for frequently used objects may be needed.
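As a sketch of that memory-pool idea (hypothetical names, not this project's actual allocator; a production version serving a task queue would also need locking or per-thread caches), a simple free-list pool could look like:

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Illustrative free-list pool for frequently allocated objects such as
// tasks: released memory is cached and reused instead of going back to
// the general-purpose heap allocator. Not thread-safe as written.
template <typename T>
class object_pool
{
public:
    ~object_pool()
    {
        for (void *p : _free)
            ::operator delete(p);
    }

    template <typename... Args>
    T *allocate(Args &&... args)
    {
        void *mem;
        if (_free.empty()) {
            mem = ::operator new(sizeof(T));   // slow path: real allocation
        } else {
            mem = _free.back();                // fast path: reuse a cached block
            _free.pop_back();
        }
        return new (mem) T(std::forward<Args>(args)...);
    }

    void release(T *obj)
    {
        obj->~T();                 // run the destructor, keep the memory
        _free.push_back(obj);
    }

private:
    std::vector<void *> _free;     // blocks awaiting reuse
};
```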

discussion about atomic variables

It seems that many of us aren't aware of the proper usage of std::atomic and memory_order. Here are some rules of thumb:

  1. The default memory order of atomic operations is std::memory_order_seq_cst. It has high overhead (~100 cycles on x86) and is rarely necessary; even a well-behaved spinlock does not need seq_cst synchronization.
  2. Acquire-release pairs are (and should be!) widely used to ensure safety.
  3. Consume-release pairs can be a faster alternative to acquire-release when you can rely on the data dependencies around the atomic operations.

I recommend this article for reference: https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

broken assumptions

The default sim_net delay is now set to 0 to avoid confusion.
The default task_queue factory is now hpc_concurrent_task_queue.

misc

@imzhenyu you can update imzhenyu/concurrentqueue from upstream

@imzhenyu
Owner

Very nice summary. Thanks, Tianyi. Now that we are working on performance tuning, it is strongly recommended that everyone read the article Tianyi suggests above. Meanwhile, let's write down all the items we have optimized. Later on we can come up with a list to avoid making the same mistakes in the future (I myself made certain mistakes). Furthermore, we may discuss setting up some performance regression tests next.

@imzhenyu
Owner

imzhenyu/concurrentqueue updated as suggested.

    {
        succ = true;
        finish = true;
    }
    else
    {
        task_state old_state = _state.load();
@imzhenyu
Owner

cannot imagine this bug exists here for so long and we did not find it:)

@qinzuoyan
Collaborator

It's really great work~

@imzhenyu
Owner

There seems to be some overhead when enqueue and dequeue happen on the same thread (a 33% slowdown in the self-iterating test). Possibly we could use a local queue for local enqueue operations when the queue is private (i.e., when the thread pool is partitioned).

@glglwty glglwty force-pushed the master branch 6 times, most recently from 0c31cb1 to 581ea4d on January 14, 2016 03:32
…or fastrun. Fixed unnecessary seq_cst sync in task::task_state. added benchmark for task_queue.
imzhenyu added a commit that referenced this pull request Jan 14, 2016
added moodycamel/concurrentqueue as the default task queue provider f…
@imzhenyu imzhenyu merged commit 7b657bf into imzhenyu:master Jan 14, 2016
imzhenyu added a commit that referenced this pull request Dec 2, 2016
some improvements from Xiaomi team