07 Introduction To Multicore Programming PDF
CS 4435 - CS 9624
1 Multi-core Architecture
Multi-core processor
CPU Cache
CPU Coherence
2 Concurrency Platforms
PThreads
TBB
OpenMP
Cilk++
Race Conditions and Cilkscreen
MMM in Cilk++
Multi-core processor

Figure: A multi-core chip: each processor core P has its own cache ($), and the cores communicate through an on-chip network with the shared memory and I/O.
When the CPU needs to read or write a location, it checks the cache:
if it finds it there, we have a cache hit
if not, we have a cache miss and (in most cases) the processor needs to
create a new entry in the cache.
Making room for a new entry requires a replacement policy: the Least
Recently Used (LRU) policy discards the least recently used entry first;
implementing it requires age bits.
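The LRU policy with age bits can be sketched as follows. This is an illustrative software model of a tiny fully associative cache, not how real hardware is organized; the class name and sizes are invented for the example.

```cpp
#include <array>

// Toy 4-entry fully associative cache illustrating LRU replacement
// with age counters (the "age bits" mentioned above).
struct LruCache {
    static const int N = 4;
    std::array<long, N> tag;   // which memory block each entry holds (-1 = empty)
    std::array<int, N>  age;   // higher age = less recently used
    LruCache() { tag.fill(-1); age.fill(0); }

    // Returns true on a cache hit; on a miss, evicts the LRU entry.
    bool access(long block) {
        int victim = 0;
        for (int i = 0; i < N; ++i) {
            if (tag[i] == block) {                 // cache hit
                touch(i);
                return true;
            }
            if (age[i] > age[victim]) victim = i;  // track oldest entry
        }
        tag[victim] = block;                       // cache miss: replace LRU entry
        touch(victim);
        return false;
    }
private:
    void touch(int i) {                            // mark entry i most recently used
        for (int j = 0; j < N; ++j) ++age[j];
        age[i] = 0;
    }
};
```

For instance, after filling the cache with blocks 0, 1, 2, 3 and re-touching block 0, a miss on block 4 evicts block 1, the least recently used entry.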
(Moreno Maza) Introduction to Multicore Programming CS 433 - CS 9624 11 / 60
Multi-core Architecture CPU Cache
Read latency (the time to read a datum from main memory) is hidden by
keeping the CPU busy with something else:
out-of-order execution: attempt to execute independent instructions that
come after the instruction waiting on the cache miss
hyper-threading (HT): allow an alternate thread to use the CPU meanwhile
Modifying data in the cache requires a write policy for updating the
main memory
- write-through cache: writes are immediately mirrored to main
memory
- write-back cache: the main memory is mirrored when that data is
evicted from the cache
The cache copy may become out-of-date or stale, if other processors
modify the original entry in the main memory.
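The two write policies differ in how often main memory is touched. The following toy model (invented for illustration; real caches track dirty bits per line) counts memory writes when the same cached value is updated repeatedly:

```cpp
// Toy model contrasting the two write policies above: count how many
// times "main memory" is written when a cached value is updated k times.
struct WriteThrough {
    int mem_writes = 0;
    int cached = 0;
    void write(int v) { cached = v; ++mem_writes; }  // mirror immediately
};
struct WriteBack {
    int mem_writes = 0;
    int cached = 0;
    bool dirty = false;
    void write(int v) { cached = v; dirty = true; }  // update cache only
    void evict() { if (dirty) { ++mem_writes; dirty = false; } }
};
```

Ten successive writes cost ten memory writes under write-through, but only one (at eviction) under write-back, which is why write-back saves bandwidth at the price of possibly stale main memory.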
Cache Performance for SPEC CPU2000 by J.F. Cantin and M.D. Hill.
The SPEC CPU2000 suite is a collection of 26 compute-intensive, non-trivial
programs used to evaluate the performance of a computer’s CPU, memory
system, and compilers (https://www.spec.org/osg/cpu2000 ).
Multi-core Architecture CPU Coherence
Figure: Cache coherence in action. (1) Processor P1 reads x=3 first from the backing store (higher-level memory) into its cache. (2) P2 loads x=3 as well, so two caches hold copies. (3) P2 stores x=5: its cached copy now reads x=5 while P1's copy and main memory still read x=3, so P1's copy is stale.
MSI Protocol
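In MSI, each cache line is in one of three states: M(odified), S(hared), or I(nvalid). A minimal sketch of the state transitions, ignoring the bus and actual data movement (the function names and simplifications are this note's, not the protocol specification's):

```cpp
// Hedged sketch of the MSI invalidation protocol for one cache line
// replicated across n caches. Only the per-cache state is modeled.
enum State { M, S, I };

struct Line { State st = I; };

// A read by cache `who`: any remote Modified copy is written back and
// downgraded to Shared; the reader obtains a Shared copy.
void read(Line lines[], int n, int who) {
    for (int i = 0; i < n; ++i)
        if (i != who && lines[i].st == M) lines[i].st = S;
    lines[who].st = (lines[who].st == M) ? M : S;
}

// A write by cache `who`: all other copies are invalidated and the
// writer's copy becomes Modified.
void write(Line lines[], int n, int who) {
    for (int i = 0; i < n; ++i)
        if (i != who) lines[i].st = I;
    lines[who].st = M;
}
```

Replaying the figure above: P1 reads (P1: S), P2 reads (P1, P2: S), P2 writes x=5 (P2: M, P1: I), and a later read by P1 forces P2's copy back to S.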
Advantages:
Cache-coherency circuitry can operate at a much higher rate than off-chip communication.
Reduced power consumption for a dual core vs two coupled single-core
processors (better quality communication signals, cache can be shared)
Challenges:
Adjustments to existing software (including OS) are required to
maximize performance
Production yields go down (an Intel quad-core is in fact a double
dual-core)
Two processing cores sharing the same bus and memory bandwidth
may limit performance
High levels of false or true sharing and synchronization can easily
overwhelm the advantage of parallelism
Concurrency Platforms
Fibonacci Execution

Figure: The call tree of fib(4): fib(4) calls fib(3) and fib(2); fib(3) calls fib(2) and fib(1); and so on down to fib(1) and fib(0).

int fib(int n)
{
  if (n < 2) return n;
  else {
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
  }
}

Key idea for parallelization: the calculations of fib(n-1) and fib(n-2)
can be executed simultaneously without mutual interference.
PThreads
int pthread_create(
pthread_t *thread,
//returned identifier for the new thread
const pthread_attr_t *attr,
//object to set thread attributes (NULL for default)
void *(*func)(void *),
//routine executed after creation
void *arg
//a single argument passed to func
) //returns error status
int pthread_join (
pthread_t thread,
//identifier of thread to wait for
void **status
//terminating thread’s status (NULL to ignore)
) //returns error status
*WinAPI threads provide similar functionality.
PThreads
Scalability: the Fibonacci code gets only about a 1.5x speedup on 2 cores
when computing fib(40).
Indeed, the thread-creation overhead is so large that only
one extra thread is created, see below.
Consequently, one needs to rewrite the code to exploit more
than 2 cores.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
int fib(int n)
{
if (n < 2) return n;
else {
int x = fib(n-1);
int y = fib(n-2);
return x + y;
}
}
typedef struct {
int input;
int output;
} thread_args;
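The slide's listing stops after the thread_args struct. A plausible completion, using exactly one extra thread as described above (the names thread_func and fib_parallel are this note's, not necessarily the original slide's):

```cpp
#include <pthread.h>
#include <cstddef>

int fib(int n)
{
    if (n < 2) return n;
    int x = fib(n - 1);
    int y = fib(n - 2);
    return x + y;
}

typedef struct {
    int input;
    int output;
} thread_args;

// Thread routine: unpack the argument, compute fib, store the result.
void *thread_func(void *ptr)
{
    thread_args *args = (thread_args *) ptr;
    args->output = fib(args->input);
    return NULL;
}

// Compute fib(n) with one extra thread: the new thread computes
// fib(n-1) while the calling thread computes fib(n-2).
int fib_parallel(int n)
{
    if (n < 2) return n;
    pthread_t thread;
    thread_args args;
    args.input = n - 1;
    int status = pthread_create(&thread, NULL, thread_func, (void *) &args);
    if (status != 0) return fib(n);   // fall back to serial on error
    int y = fib(n - 2);
    pthread_join(thread, NULL);
    return args.output + y;
}
```

Note how the single pthread_create/pthread_join pair is hard-wired: adding a third or fourth core would require restructuring the code, which is the scalability problem mentioned above.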
TBB (1/2)
TBB (2/2)
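The code from the two TBB slides is not reproduced in this text. TBB expresses the same fork-join pattern with a task group (spawn children, then wait for them). As a stand-in that needs no external library, the same shape can be sketched with C++11 std::async; the cutoff value 20 is an arbitrary choice for this example:

```cpp
#include <future>

// Fork-join fib in the spirit of a TBB task group: a child task
// computes fib(n-1) while the parent computes fib(n-2).
int fib(int n)
{
    if (n < 2) return n;
    // Run small subproblems lazily in the caller to limit thread creation.
    std::future<int> x = std::async(n > 20 ? std::launch::async
                                           : std::launch::deferred,
                                    fib, n - 1);
    int y = fib(n - 2);
    return x.get() + y;   // join: wait for the child's result
}
```

Unlike the raw PThreads version, the task abstraction lets the runtime decide how work maps onto cores, which is the main point of TBB.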
Open MP
int fib(int n)
{
if (n < 2) return n;
int x, y;
#pragma omp task shared(x)
x = fib(n - 1);
#pragma omp task shared(y)
y = fib(n - 2);
#pragma omp taskwait
return x+y;
}
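The task directives above only take effect inside an enclosing parallel region, entered through a single construct so that one thread creates the top-level tasks. A minimal driver might look as follows (fib_root is a name invented here; if the compiler is not given OpenMP support, e.g. -fopenmp, the pragmas are ignored and the code runs serially with the same result):

```cpp
// fib with OpenMP tasks, as on the slide.
int fib(int n)
{
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

// Driver: one thread enters the parallel region via `single` and
// creates the top-level tasks; the other threads help execute them.
int fib_root(int n)
{
    int r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(n);
    return r;
}
```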
Cilk++
int fib(int n)
{
if (n < 2) return n;
int x, y;
x = cilk_spawn fib(n-1);
y = fib(n-2);
cilk_sync;
return x+y;
}
In-place matrix transposition A → A^T:

// indices run from 0, not 1
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}
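The serial elision of this loop (cilk_for replaced by a plain for) can be checked directly; n = 4 below is an arbitrary test size:

```cpp
const int n = 4;

// Serial version of the in-place transposition above. Iteration i
// swaps row i's entries below the diagonal with column i's entries
// above it, so each pair (i,j), j < i, is swapped exactly once —
// which is also why the parallel cilk_for iterations do not interfere.
void transpose(double A[n][n])
{
    for (int i = 1; i < n; ++i) {   // indices run from 0, not 1
        for (int j = 0; j < i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }
}
```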
Scheduling (1/3)
int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x+y);
  }
}
Figure: Each processor P has its own cache ($); the processors communicate through a network with the shared memory and I/O.
Scheduling (2/3)
The Cilk/Cilk++ randomized work-stealing scheduler load-balances the
computation at run-time. Each processor maintains a ready deque:
A ready deque is a double ended queue, where each entry is a
procedure instance that is ready to execute.
Adding a procedure instance to the bottom of the deque represents a
procedure call being spawned.
A procedure instance being deleted from the bottom of the deque
represents the processor beginning/resuming execution on that
procedure.
Deletion from the top of the deque corresponds to that procedure
instance being stolen.
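The deque discipline above can be sketched as a toy model (not the real Cilk runtime, which uses a lock-free protocol; strings stand in for procedure instances):

```cpp
#include <deque>
#include <string>

// Toy model of one worker's ready deque: the owner pushes and pops
// at the bottom, thieves steal from the top.
struct ReadyDeque {
    std::deque<std::string> d;   // entries stand for ready procedure instances

    // A spawned procedure call is pushed onto the bottom.
    void spawn(const std::string &proc) { d.push_back(proc); }

    // The owner resumes the most recently spawned procedure (pop bottom).
    bool pop_bottom(std::string &proc) {
        if (d.empty()) return false;
        proc = d.back(); d.pop_back(); return true;
    }

    // A thief steals the oldest ready procedure (pop top).
    bool steal(std::string &proc) {
        if (d.empty()) return false;
        proc = d.front(); d.pop_front(); return true;
    }
};
```

Owner and thieves thus work on opposite ends: the owner keeps the freshest (cache-warm) work, while thieves take the oldest entries, which tend to be the largest remaining subcomputations.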
A mathematical proof guarantees near-perfect linear speed-up on
applications with sufficient parallelism, as long as the architecture has
sufficient memory bandwidth.
A spawn/return in Cilk is over 100 times faster than a Pthread
create/exit and less than 3 times slower than an ordinary C
function call on a modern Intel processor.
Concurrency Platforms Cilk++
Scheduling (3/3)
Figure: The Cilk++ runtime system is exercised both by conventional (serial) regression tests and by parallel regression tests.
Example

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: strand A executes int x = 0; strands B and C each execute x++ in parallel; strand D executes assert(x == 2).
Dependency Graph
The increment x++ is not atomic: it compiles to a load, an increment, and a store. At the instruction level the two parallel strands B and C are:

1: x = 0;
2: r1 = x;        4: r2 = x;
3: r1++;          5: r2++;
7: x = r1;        6: x = r2;
8: assert(x == 2);
Figure: In the interleaving 1, 2, 4, 3, 5, 7, 6, 8, both registers load x = 0, both are incremented to 1, and both stores write 1, so the final value is x = 1 and the assertion fails.
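One hypothetical fix for this race (shown with PThreads for illustration; Cilk++ itself provides hyperobjects for this purpose) is to make the load/increment/store sequence atomic with a mutex:

```cpp
#include <pthread.h>
#include <cstddef>

// Guarding x++ with a mutex makes the two increments atomic, so the
// lost-update interleaving shown above can no longer occur.
int x = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *)
{
    pthread_mutex_lock(&lock);
    x++;                          // load, increment, store — now atomic
    pthread_mutex_unlock(&lock);
    return NULL;
}
```

Running increment in two concurrent threads now always leaves x == 2, at the cost of serializing the two strands on the lock.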
This does not scale up well, due to poor locality and uncontrolled granularity.