
added moodycamel/concurrentqueue as the default task queue provider f… #192

Merged: 1 commit, Jan 14, 2016

Conversation

glglwty
Collaborator

@glglwty glglwty commented Jan 13, 2016

…or fastrun. Fixed unnecessary seq_cst sync in task::task_state. Added a benchmark for task_queue. Set the default sim_net delay to 0.

about the queue

moodycamel/concurrentqueue is a fast MPMC (multi-producer, multi-consumer) queue, released under the BSD license. It is much faster than naive implementations, including the boost and TBB implementations. Its implementation is intricate, and I don't think there is a way to hand-craft a queue that is comparable to it.

performance change

I tested the performance of hpc_priority_task_queue and the newly added hpc_concurrent_task_queue using core.task_queue_perf_test. Here's their performance:

test                       hpc_concurrent_task_queue   hpc_priority_task_queue
inter-thread flooding      throughput = 6672467        throughput = 5152068
self-flooding              throughput = 7226603        throughput = 7956709
inter-thread blocking      throughput = 1046366        throughput = 100560
self-iterating             throughput = 4041782        throughput = 6062134
tick-tock                  throughput = 2700396        throughput = 266755

We can see that hpc_concurrent_task_queue is much faster under contention or partially-idle
workloads.

discussion about performance

The performance of our task_queues is still not ideal. I profiled the benchmarks, and currently the most prominent overhead lies in the procedure that frees tasks. A memory pool for frequently used objects may be needed.
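As a sketch of that memory-pool idea (hypothetical names, not this project's actual allocator; a production version serving a task queue would also need locking or per-thread caches), a simple free-list pool could look like:

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Illustrative free-list pool for frequently allocated objects such as
// tasks: released memory is cached and reused instead of going back to
// the general-purpose heap allocator. Not thread-safe as written.
template <typename T>
class object_pool
{
public:
    ~object_pool()
    {
        for (void *p : _free)
            ::operator delete(p);
    }

    template <typename... Args>
    T *allocate(Args &&... args)
    {
        void *mem;
        if (_free.empty()) {
            mem = ::operator new(sizeof(T));   // slow path: real allocation
        } else {
            mem = _free.back();                // fast path: reuse a cached block
            _free.pop_back();
        }
        return new (mem) T(std::forward<Args>(args)...);
    }

    void release(T *obj)
    {
        obj->~T();                 // run the destructor, keep the memory
        _free.push_back(obj);
    }

private:
    std::vector<void *> _free;     // blocks awaiting reuse
};
```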

discussion about atomic variables

It seems that many of us aren't aware of the proper usage of std::atomic and memory_order. Here are some rules of thumb:

  1. The default memory order of atomic operations is std::memory_order_seq_cst. It has high overhead (~100 cycles on x86) and is rarely necessary; even a well-behaved spinlock does not need seq_cst synchronization.
  2. Acquire-release pairs are (and should be!) widely used to ensure safety.
  3. Consume-release pairs can be a faster alternative to acquire-release when you can rely on the data dependencies around the atomic operations.

I recommend this article for reference: https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

broken assumptions

The default sim_net delay is now set to 0 to avoid confusion.
The default task_queue factory is now hpc_concurrent_task_queue.

misc

@imzhenyu you can update imzhenyu/concurrentqueue from upstream

@imzhenyu
Owner

Very nice summary. Thanks, Tianyi. Now that we are working on performance tuning, it is strongly recommended that everyone read the article Tianyi suggests above. Meanwhile, let's write down all the items we have optimized. Later on we can come up with a list to avoid making the same mistakes in the future (I myself made certain mistakes). Furthermore, we may discuss setting up some performance regression tests next.

@imzhenyu
Owner

imzhenyu/concurrentqueue updated as suggested.

    {
        succ = true;
        finish = true;
    }
    else
    {
        task_state old_state = _state.load();
@imzhenyu
Owner

cannot imagine this bug exists here for so long and we did not find it:)

@qinzuoyan
Collaborator

It's really great work~

@imzhenyu
Owner

There seems to be some overhead when enqueue and dequeue happen on the same thread (a 33% slowdown in the self-iterating test). Possibly we could use a local queue for local enqueue operations when the queue is private (i.e., when the thread pool is partitioned).

@glglwty glglwty force-pushed the master branch 6 times, most recently from 0c31cb1 to 581ea4d on January 14, 2016 03:32
…or fastrun. Fixed unnecessary seq_cst sync in task::task_state. added benchmark for task_queue.
imzhenyu added a commit that referenced this pull request Jan 14, 2016
added moodycamel/concurrentqueue as the default task queue provider f…
@imzhenyu imzhenyu merged commit 7b657bf into imzhenyu:master Jan 14, 2016
imzhenyu added a commit that referenced this pull request Dec 2, 2016
some improvements from Xiaomi team