Integrate libtuv thread pool to eliminate thread creation overhead #93
Conversation
The commit messages shall contain the relevant experiment results for reference purposes.
For performance measurement, you should take C++11 threads, affinity, and hyperthreading into consideration. Thread affinity might make a dramatic difference.
How about C and OpenCL implementations?
Force-pushed from 90e6dc3 to 97613e3 (compare)
The experiment results have been added/updated in the git commit and in the first comment of this issue.
I think this can be opened as another issue?
The C implementation does include the thread pool of libtuv.
Yes, please do. The skeleton implementation looks like the following:

#define _GNU_SOURCE /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>
#include <sys/resource.h>
#include <unistd.h>

static int num_processors;

int main(int argc, char *argv[]) {
    ...
    num_processors = sysconf(_SC_NPROCESSORS_CONF);
    ...
}

/* Drop the scheduling policy to SCHED_IDLE and, if that fails, to SCHED_BATCH. */
static inline void drop_policy(void) {
    struct sched_param param = { .sched_priority = 0 };
    if (sched_setscheduler(0, SCHED_IDLE, &param) == -1)
        sched_setscheduler(0, SCHED_BATCH, &param);
}

/* Pin the calling thread to the given CPU. */
static inline void affine_to_cpu(int id, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

static void *worker_thread(void *userdata) {
    int thread_id = ((thread_info *) userdata)->id;
    ...
    /* Set worker threads to nice 19 and then preferentially to SCHED_IDLE
     * and, if that fails, to SCHED_BATCH. No need for this to be an
     * error if it fails.
     */
    setpriority(PRIO_PROCESS, 0, 19);
    drop_policy();
    /* CPU affinity only makes sense if the number of threads is a multiple
     * of the number of CPUs.
     */
    affine_to_cpu(thread_id, thread_id % num_processors);
    ...
}
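The rationale: nice 19 plus SCHED_IDLE keeps the PoW workers from starving interactive tasks, and pinning each worker to thread_id % num_processors spreads the workers evenly across the logical CPUs. As a minimal sketch (assuming the same _GNU_SOURCE and <sched.h> setup as the skeleton above), the pinning can be verified with sched_getaffinity():

#include <stdio.h>

/* Print the CPUs the calling thread is currently allowed to run on. */
static void print_affinity(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf("allowed CPU: %d\n", cpu);
    }
}

Calling print_affinity() right after affine_to_cpu() should report exactly one CPU, namely thread_id % num_processors.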
(off-topic) Is …
Experiment results shall come with a listing of the hardware configuration for reference purposes.
Since this pull request dramatically changes the flow of execution, there should be a dedicated note briefing the fundamental design in the directory …
Rebasing is required due to recent document re-organization. |
To reduce the overhead of repeatedly creating and destroying threads, we integrate the thread pool of libtuv as a git submodule. The pthread-related functions and data types are replaced with the corresponding ones from libtuv. The build rules for the libtuv library are written in the file mk/submodule.mk.

Experiment: call clock_gettime() right before and after the functions that obtain a thread. The functions are pthread_create() (without thread pool) and uv_queue_work() (with thread pool). test-multi-pow.py is used as the test case since it initializes and destroys dcurl only once and performs the PoW multiple times, like IRI does. The experiment result shows the time to obtain each thread; a PoW execution uses 7 threads.

Hardware information:
architecture: x86_64
CPU: AMD Ryzen 5 2400G (4 cores / 8 threads)

Experiment result (unit: second):

Without thread pool
thread0: 0.000028384
thread1: 0.000025127
thread2: 0.000024748
thread3: 0.000023925
thread4: 0.000024126
thread5: 0.000025328
thread6: 0.000052900
thread0: 0.000049344
thread1: 0.000039575
thread2: 0.000036720
thread3: 0.000036249
thread4: 0.000034606
thread5: 0.000034676
thread6: 0.000033444

With thread pool
thread0: 0.000124327
thread1: 0.000002084
thread2: 0.000001052
thread3: 0.000000150
thread4: 0.000000121
thread5: 0.000000080
thread6: 0.000000090
thread0: 0.000000291
thread1: 0.000000080
thread2: 0.000000050
thread3: 0.000000050
thread4: 0.000000050
thread5: 0.000000060
thread6: 0.000000050

The first thread acquisition from the thread pool takes longer since it preallocates and initializes the threads.

Close DLTcollab#58.
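For reference, a minimal sketch of this measurement, not the actual dcurl test harness: pow_thread, pow_work, and after_pow_work are hypothetical placeholders for the real PoW entry points, and the libuv-compatible API that libtuv mirrors is assumed.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <uv.h>

#define N_THREADS 7 /* thread count of one PoW execution */

static void *pow_thread(void *arg) { (void) arg; return NULL; } /* placeholder */
static void pow_work(uv_work_t *req) { (void) req; } /* placeholder */
static void after_pow_work(uv_work_t *req, int status) { (void) req; (void) status; }

static double elapsed(const struct timespec *t0, const struct timespec *t1)
{
    return (double) (t1->tv_sec - t0->tv_sec) + (t1->tv_nsec - t0->tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t0, t1;
    pthread_t tids[N_THREADS];
    uv_work_t reqs[N_THREADS];

    /* Without thread pool: time each pthread_create() call. */
    for (int i = 0; i < N_THREADS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&tids[i], NULL, pow_thread, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("thread%d: %.9f\n", i, elapsed(&t0, &t1));
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tids[i], NULL);

    /* With thread pool: time each uv_queue_work() submission. */
    for (int i = 0; i < N_THREADS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uv_queue_work(uv_default_loop(), &reqs[i], pow_work, after_pow_work);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("thread%d: %.9f\n", i, elapsed(&t0, &t1));
    }
    uv_run(uv_default_loop(), UV_RUN_DEFAULT); /* drain the after-work callbacks */
    return 0;
}

Note that uv_queue_work() only submits the work item to the already-running pool, which is why every acquisition after the first is orders of magnitude cheaper than a pthread_create() call.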
Force-pushed from 97613e3 to 728aa2a (compare)
@@ -0,0 +1,21 @@
# Copy from the Makefile of libtuv to support different platforms
UNAME_M := $(shell uname -m)
UNAME_S := $(shell uname -s)
Add a FIXME to mention the limitation of the supported operating system listing.
Umm... I'm not sure what you expect to see. Like listing the operating systems that dcurl supports but libtuv does not, or vice versa?

I have added the architecture and CPU information in the commit message. I will record it in the TO-DO list.
This has been done without any problem.
It is necessary.