Why Multicore?
Cores and Threads
Data Parallelism
In data parallelism, the specified operation is applied to each element of a given data set, and these per-element operations can be performed in parallel.
Task Parallelism
In task parallelism, the same data set is used by several tasks that are independent of each other.
Pipeline Parallelism
Pipelined execution combines task and data parallelism: a stream of data passes through several stages, each applying an independent task, and different data items can occupy different stages at the same time.
Scalability is a measure of a parallel system's capacity to keep the speedup proportional to the total number of processors as both are increased.
Amdahl's Law
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the serial fraction of the problem.
If a problem of size W has a serial component Ws, the speedup on p processors is
Speedup = W / (Ws + (W - Ws)/p), which approaches W/Ws as p grows.
Assume that Ws = 20% of W and W - Ws = 80% of W; then Speedup <= 1/0.2 = 5.
Thus Amdahl's Law implies that, for a fixed problem size, adding processors gives diminishing returns: with a 20% serial fraction, the speedup can never exceed 5 no matter how many processors are used.
Gustafson's Law
Gustafson's Law helps to overcome this limitation: if the problem size grows with the number of processors, the parallelizable work scales up and an efficient speedup is still obtained. The scaled speedup is
Speedup = p - α(p - 1),
where p represents the total number of processors and α represents the serial portion of the problem. Larger problem sizes therefore yield improved speedup.
Derivation: let a be the time spent in the serial part and b the time spent in the parallel part, so the execution time of the program on a parallel computer is (a + b). The parallel work is assumed to scale with the number of processors, so b is fixed as p is varied. The same run on a single processor would take (a + p*b) time, and therefore
Scaled speedup = (a + p*b) / (a + b).
A scalable parallel system can always be made cost-optimal by adjusting the number of processors and the problem size.
In a sequential program the order of operations is fixed, so this kind of incorrectness is not a big issue. In parallel programming, however, the precise order of operations can differ from run to run, which leads to non-deterministic behavior such as varying round-off errors, deadlocks, and race conditions.
Synchronization
A mechanism used to control concurrent access to shared resources.
Race conditions
A race condition occurs when many tasks read from and write to the same memory address concurrently. The main cause of race conditions is improper synchronization.
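A minimal sketch of a race condition, assuming OpenMP and an illustrative shared counter named sum: several threads perform the unsynchronized read-modify-write below, so the final value can differ from run to run.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0;                      /* shared by all threads     */
    #pragma omp parallel for          /* no synchronization on sum */
    for (int i = 0; i < 100000; i++)
        sum = sum + 1;                /* racy read-modify-write    */
    printf("sum = %d (expected 100000)\n", sum);
    return 0;
}
Adding reduction(+:sum) to the pragma, or protecting the update with a critical section, removes the race.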
MPI
MPI stands for the Message Passing Interface, which is mainly used for parallel computing on distributed-memory systems. MPI is a combination of a protocol and semantic specifications.
Broadcast
Data held by one process can be sent to all processes in a communicator; this is a broadcast.
Eg:
void Get_input(
        int     my_rank,
        int     comm_sz,
        double* a_p,
        double* b_p,
        int*    n_p) {
    if (my_rank == 0) {   /* Process 0 reads the user input and broadcasts it to all other processes */
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
Tags and Wildcards
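Every MPI message carries an integer tag chosen by the sender. A receive can match a specific source and tag, or use the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG to accept a message from any sender with any tag; a minimal hedged sketch (the buffer name value is illustrative):
int value;
MPI_Status status;
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE and status.MPI_TAG report the actual sender and tag */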
MPI_Reduce
A lengthy sequence of point-to-point sends and receives can be replaced by a single invocation, for example:
MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
The prototype is:
int MPI_Reduce(
    void*        input_data_p   /* in  */,
    void*        output_data_p  /* out */,
    int          count          /* in  */,
    MPI_Datatype datatype       /* in  */,
    MPI_Op       operator       /* in  */,
    int          dest_process   /* in  */,
    MPI_Comm     comm           /* in  */
);
MPI_Scatter
This function takes an entire vector stored on one process and delivers to each process exactly the components it requires. The prototype is given below.
int MPI_Scatter(
    void*        send_buf_p,  /* pointer to the data to divide       */
    int          send_count,  /* number of elements sent to each one */
    MPI_Datatype send_type,
    void*        recv_buf_p,  /* pointer to a local vector           */
    int          recv_count,  /* local_n (size of the local vector)  */
    MPI_Datatype recv_type,
    int          src_proc,    /* source of the data                  */
    MPI_Comm     comm );
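A hedged usage sketch, assuming a vector of n doubles held on process 0 is divided into equal blocks of local_n = n / comm_sz elements (the names Scatter_vector, a, and local_a are illustrative, not part of MPI):
#include <stdlib.h>
#include <mpi.h>

/* Distribute an n-element vector from process 0 in equal blocks. */
double* Scatter_vector(int n, int local_n, int my_rank, MPI_Comm comm) {
    double* a = NULL;
    double* local_a = malloc(local_n * sizeof(double));
    if (my_rank == 0) {
        a = malloc(n * sizeof(double));          /* full vector on root only */
        for (int i = 0; i < n; i++) a[i] = i;    /* illustrative contents    */
    }
    MPI_Scatter(a,       local_n, MPI_DOUBLE,    /* send buffer (root)       */
                local_a, local_n, MPI_DOUBLE,    /* each process's block     */
                0, comm);                        /* root process             */
    if (my_rank == 0) free(a);
    return local_a;
}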
MPI_Gather
Scattering the vector is not sufficient for MPI communication; gathering the results is also necessary, and it can be achieved by using MPI_Gather. The prototype is given below.
int MPI_Gather(
    void*        send_buf_p,  /* data to send           */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,  /* data to receive/gather */
    int          recv_count,
    MPI_Datatype recv_type,
    int          dest_proc,
    MPI_Comm     comm );
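A matching hedged sketch that collects the local blocks back onto process 0 (the names Gather_vector, local_a, and a are again illustrative):
#include <stdlib.h>
#include <mpi.h>

/* Collect local_n-element blocks from all processes into an
   n-element vector on process 0. */
void Gather_vector(double* local_a, int local_n, int n,
                   int my_rank, MPI_Comm comm) {
    double* a = NULL;
    if (my_rank == 0)
        a = malloc(n * sizeof(double));          /* full vector on root only */
    MPI_Gather(local_a, local_n, MPI_DOUBLE,     /* each process's block     */
               a,       local_n, MPI_DOUBLE,     /* receive buffer (root)    */
               0, comm);                         /* root process             */
    if (my_rank == 0) {
        /* ... use the gathered vector a ... */
        free(a);
    }
}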
Data consolidation
Data consolidation means integrating data from multiple sources and putting it together in a single source.
MPI_Barrier
A barrier is a synchronization point in the program. Execution is suspended until all processes (or, in OpenMP, all threads in a parallel region) have reached the barrier; once the last one arrives, all of them resume. Its main purpose is to improve the correctness of the program by separating phases that would otherwise race on shared data.
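A minimal sketch of a common use, timing a parallel section so that all processes start the clock together (the variable names are illustrative):
double start, finish;
MPI_Barrier(MPI_COMM_WORLD);      /* wait until every process reaches this point */
start = MPI_Wtime();
/* ... the parallel work being timed ... */
finish = MPI_Wtime();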
OpenMP
Main directives
In Fortran, directives are used as a directive/end-directive pair. The syntax is given below.
!$OMP directive
[ structured block of code ]
!$OMP end directive
The format of C/C++ directives is given below.
Example: #pragma omp parallel
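A minimal C sketch of the parallel directive, assuming an illustrative team of four threads:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)    /* structured block runs on every thread */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}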
Task based programming
When a thread encounters a task construct, a new task is generated from the code inside the structured block, and a data environment for the task is created according to the data-sharing attribute clauses. The encountering thread may execute the task immediately, or any thread in the team may execute it later. The syntax is given below.
#pragma omp task [clause, clause, …]
structured-block
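A hedged sketch of the task construct: one thread walks an illustrative linked list and creates a task for each node, and the tasks are executed by the threads of the team (the node type and the process function are assumptions, not part of OpenMP):
#include <omp.h>

struct node { int value; struct node* next; };

void process(struct node* p);              /* illustrative work function */

void traverse(struct node* head) {
    #pragma omp parallel
    #pragma omp single                     /* one thread generates the tasks */
    for (struct node* p = head; p != NULL; p = p->next) {
        #pragma omp task firstprivate(p)   /* each node becomes a task       */
        process(p);
    }
}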
Mutual Exclusion
Mutual exclusion is a mechanism that prevents concurrent threads or processes from accessing the same resource at the same time.
Example (Peterson's algorithm for two processes P0 and P1):
P0: flag[0] = true; | P1: flag[1] = true;
turn = 1; | turn = 0;
while (flag[1] == true | while (flag[0] == true
&& turn == 1) | && turn==0)
{ // busy wait } | {//busy wait}
// critical section … | //critical section
// end of critical section | //end of critical section
flag[0] = false; | flag[1] = false;
P0 and P1 cannot be in the critical section at the same time!
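In OpenMP the same guarantee is usually obtained with a critical construct instead of hand-written busy waiting; a minimal sketch (the counter name is illustrative):
int counter = 0;
#pragma omp parallel num_threads(2)
{
    #pragma omp critical        /* only one thread at a time executes this */
    counter = counter + 1;
}
/* counter is now reliably 2 */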
Locks
To control access by multiple threads, locking is used. A lock is required to obtain exclusive access to a variable or data structure. Locks are necessary for ensuring the correct behavior of multi-threaded programs.
Eg:
omp_lock_t writelock;
omp_init_lock(&writelock);
#pragma omp parallel for
for (i = 0; i < x; i++)
{
    // some stuff
    omp_set_lock(&writelock);
    // one thread at a time stuff
    omp_unset_lock(&writelock);
    // some stuff
}
omp_destroy_lock(&writelock);
Memory Consistency
To obtain better efficiency, the processor can rearrange memory reads and writes. This rearrangement is called weak memory consistency. The main issue is that algorithms and programs which are correct under a strict ordering can become incorrect when they run on weak memory consistency models. The standard example discussed in this context is Peterson's Algorithm (shown above under Mutual Exclusion): consider P0 and P1 as processes executing concurrently; the algorithm relies on the order of the writes to flag and turn, which a weakly consistent memory may reorder.
Parallel for and data dependencies
OpenMP can parallelize for loops, but not while or do-while loops. The syntax is given below.
#pragma omp parallel for [clauses]
for_statement
// execute for_statement in parallel
When the computation in one iteration depends on results of previous iterations, the loop has data dependencies, also known as loop-carried dependencies, and it cannot safely be parallelized this way. The Fibonacci loop below is an example.
Example:
fibo[0] = fibo[1] = 1;
# pragma omp parallel for num_threads(thread_count)
for (i = 2; i < n; i++)
fibo[i] = fibo[i-1] + fibo[i-2];
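The Fibonacci loop above has loop-carried dependencies, so this parallelization produces wrong results. A loop whose iterations are independent, such as the element-wise vector addition below, can be parallelized safely (x, y, z, n, and thread_count are illustrative):
#pragma omp parallel for num_threads(thread_count)
for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];    /* each iteration touches only its own elements */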
Parallel sorting
Sorting is one of the classical tasks in computing: it rearranges numerical values into sequential order. Several sorting algorithms can be parallelized, as sketched below.
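As one hedged example of a sorting method that parallelizes well, an OpenMP sketch of odd-even transposition sort (the function names are illustrative; this is a commonly taught parallel sort, not one specified by these notes):
#include <omp.h>

static void Swap(double* x, double* y) { double t = *x; *x = *y; *y = t; }

/* Odd-even transposition sort: n phases; even phases compare pairs
   (0,1), (2,3), ..., odd phases compare pairs (1,2), (3,4), ...
   The comparisons within one phase are independent, so the inner
   loop can be executed in parallel. */
void OddEvenSort(double a[], int n) {
    for (int phase = 0; phase < n; phase++) {
        int first = (phase % 2 == 0) ? 0 : 1;
        #pragma omp parallel for
        for (int i = first; i < n - 1; i += 2)
            if (a[i] > a[i + 1])
                Swap(&a[i], &a[i + 1]);
    }
}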
Hybrid Programming: MPI+OpenMP
With MPI alone on distributed-memory systems, resources are not shared in an optimal way and communication is expensive. To overcome these issues we use hybrid MPI+OpenMP programming, which improves performance in aspects such as grain size and communication cost.
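A minimal hedged sketch of the hybrid pattern, typically one MPI process per node with OpenMP threads inside each process (the requested thread-support level is an illustrative choice):
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int provided, rank;
    /* ask MPI for thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel      /* OpenMP threads inside each MPI process */
    printf("process %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}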
Techniques for performance improvement
To improve the performance of parallel programs, the following aspects need to be considered.
Grain size: generally, the grain size is the amount of work given to a thread, process, or processor. A large data set is split into many small chunks, and the size of these chunks is the grain size. Because the performance of a parallel system depends on the parallel program, the chunk size directly affects parallel performance.
Two different types of locality are available, as given below.
Temporal locality: when a processor accesses a memory location, it can revisit the same location again quickly, without extra cost.
Spatial locality: when a processor accesses a memory location, it can access nearby locations quickly immediately afterwards.
Locality helps in the execution of loops indexing through arrays, and it is what makes cache memory so useful.
Parallel overhead
Parallel overhead is a problem that occurs in parallel execution, for example because of poor locality and interfering non-local reads and writes. It can be reduced in the three ways given below.
Use different loop scheduling for different test cases to improve performance (see the sketch after this list).
Reduce parallelism for small argument values, so the overhead is not paid when there is little work to share.
Let threads replicate work where that makes thread interaction faster than barrier synchronization.
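A hedged OpenMP sketch of the first two points: the schedule clause changes how iterations are assigned to threads, and an if clause is one way to skip parallel execution when n is too small to repay the overhead (the threshold, array names, and function name are illustrative):
#include <omp.h>

void scale(const double* x, double* y, int n) {
    #pragma omp parallel for schedule(dynamic, 64) if(n > 10000)
    for (int i = 0; i < n; i++)
        y[i] = 2.0 * x[i];    /* runs serially when n <= 10000 */
}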
Loop transformations
Loop constructs are central targets for parallelizing transformations: to get better performance on multicore architectures, loops often need to be transformed. One such transformation, loop interchange, is sketched below.
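A hedged sketch of loop interchange, assuming a square matrix stored row-major in a one-dimensional array of n*n doubles (the function names are illustrative): swapping the loops makes the inner loop walk consecutive addresses, which improves spatial locality.
/* Before: the inner loop steps through memory with stride n. */
void zero_by_columns(double* a, int n) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            a[i * n + j] = 0.0;
}

/* After interchange: the inner loop touches consecutive addresses. */
void zero_by_rows(double* a, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = 0.0;
}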
Debugging parallel programs
For debugging MPI programs, serial debuggers are used: a serial debugger such as gdb is attached to each separate process. For fully parallel debugging, dedicated debuggers such as TotalView are used.
gdb allows the programmer to see what is going on inside another program during its execution. GDB runs on Microsoft Windows and on Unix variants. Basic gdb commands include break, run, next, step, print, continue, and backtrace.
VTune for profiling parallel programs
The performance of parallel programs can depend on various factors, and a profiler such as Intel VTune helps to identify them and to locate where the time is spent.
Advantages and limitations
A GPU has many processing cores and thousands of parallel execution units on a single card, with fast memory interfaces; its main limitation is that it is suited chiefly to data parallelism, because the same instructions are applied across the elements of a data set.
In the GPU execution model, the GPU executes a kernel function: an array of threads runs the same code, possibly along different paths. Each thread has a unique ID that is used for control decisions. Threads are grouped into blocks, and blocks are grouped together into a grid.
GPUs are programmed using specialized library functions, compiler directives such as OpenACC, and specialized languages and language extensions such as CUDA and OpenCL.
OpenACC
OpenACC (Open Accelerators) is a standard for using accelerators, including GPUs. OpenACC uses directives in C and C++ code to help the compiler offload the chosen computations to the accelerator. OpenACC programming is similar to OpenMP programming.
Example:
void saxpy(int n, float a, float *x, float *y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
CUDA
CUDA is a parallel computing platform and programming model proposed by the NVIDIA company. The NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications: a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimization.
The CUDA programming model uses both the CPU and the GPU: the host is the CPU and its memory, and the device is the GPU and its memory.
Code that runs on the host can manage memory on both the host and the device. The host code launches kernels, which are functions executed on the device; these kernels are executed by many GPU threads in parallel.
Example:
for (int i = 0; i < N; i++) {    // initialize the host arrays
    x[i] = 1.0f;
    y[i] = 2.0f;
}
// copy the contents of the host arrays to the device arrays
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
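The fragment above only initializes the host arrays and copies them to the device. A hedged sketch of the surrounding pieces, a SAXPY kernel, the device allocations, the kernel launch, and the copy back, assuming x, y, d_x, d_y are float pointers and N is the array length as in the fragment (the block size of 256 is an illustrative choice):
__global__ void saxpy(int n, float a, float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread ID */
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side, around the copies shown above: */
cudaMalloc(&d_x, N * sizeof(float));
cudaMalloc(&d_y, N * sizeof(float));
/* ... the two cudaMemcpy calls above ... */
saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);  /* launch the kernel */
cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);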
Parallelization of prefix operations
Assume that we have a sequence of integers x0, x1, x2, … (in an array); we need to compute the prefix sums y0 = x0, y1 = x0 + x1, y2 = x0 + x1 + x2, and so on.
Here we use a parallel algorithm for the prefix sum, combining elements pairwise in rounds:
z0 = x0 + x1, z1 = x2 + x3, etc.
w0 = z0 = x0 + x1
w1 = z0 + z1 = x0 + x1 + x2 + x3
y0 = x0, y1 = w0, y2 = w0 + x2, y3 = w1, …
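The scheme above combines pairs in rounds. As a hedged implementation sketch, the OpenMP version below uses a simpler two-pass blocked variant of the same prefix-sum idea: each thread first computes prefix sums inside its own block, then adds the total of all earlier blocks (the function name and the 256-thread limit are illustrative).
#include <omp.h>

void prefix_sum(const int* x, int* y, int n) {
    int block_sum[256] = {0};                 /* assumes at most 256 threads */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int p  = omp_get_num_threads();
        int lo = id * n / p, hi = (id + 1) * n / p;
        int s = 0;
        for (int i = lo; i < hi; i++) {       /* pass 1: local prefix sums */
            s += x[i];
            y[i] = s;
        }
        block_sum[id] = s;
        #pragma omp barrier                   /* wait for every block sum  */
        int offset = 0;
        for (int t = 0; t < id; t++)          /* total of earlier blocks   */
            offset += block_sum[t];
        for (int i = lo; i < hi; i++)         /* pass 2: add the offset    */
            y[i] += offset;
    }
}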
Relaxed Sequential Execution (RSE)
Relaxed sequential execution is the ability to run a parallel program sequentially; because the same program can also run in parallel, it is called relaxed sequential. When we want a correct reference result without the complications of parallel execution, we use the RSE model, in which debugging and verification can be done easily.