Why Multicore?
Cores and Threads
Data Parallelism
In data parallelism, the specified operation is applied to each element of a given data set, and these per-element operations can be performed in parallel.
Task Parallelism
In task parallelism, the same data set is used by several tasks that are independent of each other.
Pipeline Parallelism
Pipelined execution combines task and data parallelism: a stream of data passes through several stages, each applying an independent task, and different data items can occupy different stages at the same time.
Scalability is a measure of a parallel system's capacity to keep the speedup proportional to the total number of processors as both are increased.
Amdahl's Law
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the serial fraction of the problem.
If a problem of size W has a serial component Ws, the speedup on p processors is
Speedup = W / (Ws + (W - Ws)/p), which approaches W/Ws as p grows.
Assume that Ws = 20% of W and W - Ws = 80% of W; then Speedup <= 1/0.2 = 5.
Thus Amdahl's Law implies that, for a fixed problem size, adding processors gives diminishing returns: with a 20% serial fraction, the speedup can never exceed 5 no matter how many processors are used.
Gustafson's Law
Gustafson's Law helps to overcome this limitation: if the problem size grows with the number of processors, the parallelizable work scales up and an efficient speedup is still obtained. The scaled speedup is
Speedup = p - α(p - 1),
where p represents the total number of processors and α represents the serial portion of the problem. Larger problem sizes therefore yield improved speedup.
Derivation: let a be the time spent in the serial part and b the time spent in the parallel part, so the execution time of the program on a parallel computer is (a + b). The parallel work is assumed to scale with the number of processors, so b is fixed as p is varied. The same run on a single processor would take (a + p*b) time, and therefore
Scaled speedup = (a + p*b) / (a + b).
A scalable parallel system can always be made cost-optimal by adjusting the number of processors and the problem size.
In a sequential program the order of operations is fixed, so this kind of incorrectness is not a big issue. In parallel programming, however, the precise order of operations can differ from run to run, which leads to non-deterministic behavior such as varying round-off errors, deadlocks, and race conditions.
Synchronization
A mechanism used to control concurrent access to shared resources.
Race conditions
A race condition occurs when many tasks read from and write to the same memory address concurrently. The main cause of race conditions is improper synchronization.
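A minimal sketch of a race condition, assuming OpenMP and an illustrative shared counter named sum: several threads perform the unsynchronized read-modify-write below, so the final value can differ from run to run.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0;                      /* shared by all threads     */
    #pragma omp parallel for          /* no synchronization on sum */
    for (int i = 0; i < 100000; i++)
        sum = sum + 1;                /* racy read-modify-write    */
    printf("sum = %d (expected 100000)\n", sum);
    return 0;
}
Adding reduction(+:sum) to the pragma, or protecting the update with a critical section, removes the race.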
MPI
MPI stands for the Message Passing Interface, which is mainly used for parallel computing on distributed-memory systems. MPI is a combination of a protocol and semantic specifications.
Broadcast
Data held by one process can be sent to all processes in a communicator; this is a broadcast.
Eg:
void Get_input(
        int     my_rank,
        int     comm_sz,
        double* a_p,
        double* b_p,
        int*    n_p) {
    if (my_rank == 0) {   /* Process 0 reads the user input and broadcasts it to all other processes */
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
Tags and Wildcards
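Every MPI message carries an integer tag chosen by the sender. A receive can match a specific source and tag, or use the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG to accept a message from any sender with any tag; a minimal hedged sketch (the buffer name value is illustrative):
int value;
MPI_Status status;
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE and status.MPI_TAG report the actual sender and tag */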
MPI_Reduce
A lengthy sequence of point-to-point sends and receives can be replaced by a single invocation, for example:
MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
The prototype is:
int MPI_Reduce(
    void*        input_data_p   /* in  */,
    void*        output_data_p  /* out */,
    int          count          /* in  */,
    MPI_Datatype datatype       /* in  */,
    MPI_Op       operator       /* in  */,
    int          dest_process   /* in  */,
    MPI_Comm     comm           /* in  */
);
MPI_Scatter
This function takes an entire vector stored on one process and delivers to each process exactly the components it requires. The prototype is given below.
int MPI_Scatter(
    void*        send_buf_p,  /* pointer to the data to divide       */
    int          send_count,  /* number of elements sent to each one */
    MPI_Datatype send_type,
    void*        recv_buf_p,  /* pointer to a local vector           */
    int          recv_count,  /* local_n (size of the local vector)  */
    MPI_Datatype recv_type,
    int          src_proc,    /* source of the data                  */
    MPI_Comm     comm );
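A hedged usage sketch, assuming a vector of n doubles held on process 0 is divided into equal blocks of local_n = n / comm_sz elements (the names Scatter_vector, a, and local_a are illustrative, not part of MPI):
#include <stdlib.h>
#include <mpi.h>

/* Distribute an n-element vector from process 0 in equal blocks. */
double* Scatter_vector(int n, int local_n, int my_rank, MPI_Comm comm) {
    double* a = NULL;
    double* local_a = malloc(local_n * sizeof(double));
    if (my_rank == 0) {
        a = malloc(n * sizeof(double));          /* full vector on root only */
        for (int i = 0; i < n; i++) a[i] = i;    /* illustrative contents    */
    }
    MPI_Scatter(a,       local_n, MPI_DOUBLE,    /* send buffer (root)       */
                local_a, local_n, MPI_DOUBLE,    /* each process's block     */
                0, comm);                        /* root process             */
    if (my_rank == 0) free(a);
    return local_a;
}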
MPI_Gather
Scattering the vector is not sufficient for MPI communication; gathering the results is also necessary, and it can be achieved by using MPI_Gather. The prototype is given below.
int MPI_Gather(
    void*        send_buf_p,  /* data to send           */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,  /* data to receive/gather */
    int          recv_count,
    MPI_Datatype recv_type,
    int          dest_proc,
    MPI_Comm     comm );
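A matching hedged sketch that collects the local blocks back onto process 0 (the names Gather_vector, local_a, and a are again illustrative):
#include <stdlib.h>
#include <mpi.h>

/* Collect local_n-element blocks from all processes into an
   n-element vector on process 0. */
void Gather_vector(double* local_a, int local_n, int n,
                   int my_rank, MPI_Comm comm) {
    double* a = NULL;
    if (my_rank == 0)
        a = malloc(n * sizeof(double));          /* full vector on root only */
    MPI_Gather(local_a, local_n, MPI_DOUBLE,     /* each process's block     */
               a,       local_n, MPI_DOUBLE,     /* receive buffer (root)    */
               0, comm);                         /* root process             */
    if (my_rank == 0) {
        /* ... use the gathered vector a ... */
        free(a);
    }
}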
Data consolidation
Data consolidation means integrating data from multiple sources and putting it together in a single source.
MPI_Barrier
A barrier is a synchronization point in the program. Execution is suspended until all processes (or, in OpenMP, all threads in a parallel region) have reached the barrier; once the last one arrives, all of them resume. Its main purpose is to improve the correctness of the program by separating phases that would otherwise race on shared data.
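A minimal sketch of a common use, timing a parallel section so that all processes start the clock together (the variable names are illustrative):
double start, finish;
MPI_Barrier(MPI_COMM_WORLD);      /* wait until every process reaches this point */
start = MPI_Wtime();
/* ... the parallel work being timed ... */
finish = MPI_Wtime();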
OpenMP
Main directives
In Fortran, directives are used as a directive/end-directive pair. The syntax is given below.
!$OMP directive
[ structured block of code ]
!$OMP end directive
The format of C/C++ directives is given below.
Example: #pragma omp parallel
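A minimal C sketch of the parallel directive, assuming an illustrative team of four threads:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)    /* structured block runs on every thread */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}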
Task based programming
When a thread encounters a task construct, a new task is generated from the code inside the structured block, and a data environment for the task is created according to the data-sharing attribute clauses. The encountering thread may execute the task immediately, or any thread in the team may execute it later. The syntax is given below.
#pragma omp task [clause, clause, …]
structured-block
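A hedged sketch of the task construct: one thread walks an illustrative linked list and creates a task for each node, and the tasks are executed by the threads of the team (the node type and the process function are assumptions, not part of OpenMP):
#include <omp.h>

struct node { int value; struct node* next; };

void process(struct node* p);              /* illustrative work function */

void traverse(struct node* head) {
    #pragma omp parallel
    #pragma omp single                     /* one thread generates the tasks */
    for (struct node* p = head; p != NULL; p = p->next) {
        #pragma omp task firstprivate(p)   /* each node becomes a task       */
        process(p);
    }
}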
Mutual Exclusion
Mutual exclusion is a mechanism that prevents concurrent threads or processes from accessing the same resource at the same time.
Example (Peterson's algorithm for two processes P0 and P1):
P0: flag[0] = true; | P1: flag[1] = true;
turn = 1; | turn = 0;
while (flag[1] == true | while (flag[0] == true
&& turn == 1) | && turn==0)
{ // busy wait } | {//busy wait}
// critical section … | //critical section
// end of critical section | //end of critical section
flag[0] = false; | flag[1] = false;
P0 and P1 cannot be in the critical section at the same time!
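In OpenMP the same guarantee is usually obtained with a critical construct instead of hand-written busy waiting; a minimal sketch (the counter name is illustrative):
int counter = 0;
#pragma omp parallel num_threads(2)
{
    #pragma omp critical        /* only one thread at a time executes this */
    counter = counter + 1;
}
/* counter is now reliably 2 */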
Locks
To control access by multiple threads, locking is used. A lock is required to obtain exclusive access to a variable or data structure. Locks are necessary for ensuring the correct behavior of multi-threaded programs.
Eg:
omp_lock_t writelock;
omp_init_lock(&writelock);
#pragma omp parallel for
for (i = 0; i < x; i++)
{
    // some stuff
    omp_set_lock(&writelock);
    // one thread at a time stuff
    omp_unset_lock(&writelock);
    // some stuff
}
omp_destroy_lock(&writelock);
Memory Consistency
To obtain better efficiency, the processor can rearrange memory reads and writes. This rearrangement is called weak memory consistency. The main issue is that algorithms and programs which are correct under a strict ordering can become incorrect when they run on weak memory consistency models. The standard example discussed in this context is Peterson's Algorithm (shown above under Mutual Exclusion): consider P0 and P1 as processes executing concurrently; the algorithm relies on the order of the writes to flag and turn, which a weakly consistent memory may reorder.
Parallel for and data dependencies
OpenMP can parallelize for loops, but not while or do-while loops. The syntax is given below.
#pragma omp parallel for [clauses]
for_statement
// execute for_statement in parallel
When the computation in one iteration depends on results of previous iterations, the loop has data dependencies, also known as loop-carried dependencies, and it cannot safely be parallelized this way. The Fibonacci loop below is an example.
Example:
fibo[0] = fibo[1] = 1;
# pragma omp parallel for num_threads(thread_count)
for (i = 2; i < n; i++)
fibo[i] = fibo[i-1] + fibo[i-2];
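The Fibonacci loop above has loop-carried dependencies, so this parallelization produces wrong results. A loop whose iterations are independent, such as the element-wise vector addition below, can be parallelized safely (x, y, z, n, and thread_count are illustrative):
#pragma omp parallel for num_threads(thread_count)
for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];    /* each iteration touches only its own elements */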
Parallel sorting
Sorting is one of the classical tasks in computing: it rearranges numerical values into sequential order. Several sorting algorithms can be parallelized, as sketched below.
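As one hedged example of a sorting method that parallelizes well, an OpenMP sketch of odd-even transposition sort (the function names are illustrative; this is a commonly taught parallel sort, not one specified by these notes):
#include <omp.h>

static void Swap(double* x, double* y) { double t = *x; *x = *y; *y = t; }

/* Odd-even transposition sort: n phases; even phases compare pairs
   (0,1), (2,3), ..., odd phases compare pairs (1,2), (3,4), ...
   The comparisons within one phase are independent, so the inner
   loop can be executed in parallel. */
void OddEvenSort(double a[], int n) {
    for (int phase = 0; phase < n; phase++) {
        int first = (phase % 2 == 0) ? 0 : 1;
        #pragma omp parallel for
        for (int i = first; i < n - 1; i += 2)
            if (a[i] > a[i + 1])
                Swap(&a[i], &a[i + 1]);
    }
}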
Hybrid Programming: MPI+OpenMP
With MPI alone on distributed-memory systems, resources are not shared in an optimal way and communication is expensive. To overcome these issues we use hybrid MPI+OpenMP programming, which improves performance in aspects such as grain size and communication cost.
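A minimal hedged sketch of the hybrid pattern, typically one MPI process per node with OpenMP threads inside each process (the requested thread-support level is an illustrative choice):
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int provided, rank;
    /* ask MPI for thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel      /* OpenMP threads inside each MPI process */
    printf("process %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}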
Techniques for performance improvement
To improve the performance of parallel programs, the following aspects need to be considered.
Grain size: generally, the grain size is the amount of work given to a thread, process, or processor. A large data set is split into many small chunks, and the size of these chunks is the grain size. Because the performance of a parallel system depends on the parallel program, the chunk size directly affects parallel performance.
Two different types of locality are available, as given below.
Temporal locality: when a processor accesses a memory location, it can revisit the same location again quickly, without extra cost.
Spatial locality: when a processor accesses a memory location, it can access nearby locations quickly immediately afterwards.
Locality helps in the execution of loops indexing through arrays, and it is what makes cache memory so useful.
Parallel overhead
Parallel overhead is a problem that occurs in parallel execution, for example because of poor locality and interfering non-local reads and writes. It can be reduced in the three ways given below.
Use different loop scheduling for different test cases to improve performance (see the sketch after this list).
Reduce parallelism for small argument values, so the overhead is not paid when there is little work to share.
Let threads replicate work where that makes thread interaction faster than barrier synchronization.
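A hedged OpenMP sketch of the first two points: the schedule clause changes how iterations are assigned to threads, and an if clause is one way to skip parallel execution when n is too small to repay the overhead (the threshold, array names, and function name are illustrative):
#include <omp.h>

void scale(const double* x, double* y, int n) {
    #pragma omp parallel for schedule(dynamic, 64) if(n > 10000)
    for (int i = 0; i < n; i++)
        y[i] = 2.0 * x[i];    /* runs serially when n <= 10000 */
}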
Loop transformations
Loop constructs are central targets for parallelizing transformations: to get better performance on multicore architectures, loops often need to be transformed. One such transformation, loop interchange, is sketched below.
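A hedged sketch of loop interchange, assuming a square matrix stored row-major in a one-dimensional array of n*n doubles (the function names are illustrative): swapping the loops makes the inner loop walk consecutive addresses, which improves spatial locality.
/* Before: the inner loop steps through memory with stride n. */
void zero_by_columns(double* a, int n) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            a[i * n + j] = 0.0;
}

/* After interchange: the inner loop touches consecutive addresses. */
void zero_by_rows(double* a, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = 0.0;
}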
Debugging parallel programs
For debugging MPI programs, serial debuggers are used: a serial debugger such as gdb is attached to each separate process. For fully parallel debugging, dedicated debuggers such as TotalView are used.
gdb allows the programmer to see what is going on inside another program during its execution. GDB runs on Microsoft Windows and on Unix variants. Basic gdb commands include break, run, next, step, print, continue, and backtrace.
VTune for profiling parallel programs
The performance of parallel programs can depend on various factors, and a profiler such as Intel VTune helps to identify them and to locate where the time is spent.
Advantages and limitations
A GPU has many processing cores and thousands of parallel execution units on a single card, with fast memory interfaces; its main limitation is that it is suited chiefly to data parallelism, because the same instructions are applied across the elements of a data set.
In the GPU execution model, the GPU executes a kernel function: an array of threads runs the same code, possibly along different paths. Each thread has a unique ID that is used for control decisions. Threads are grouped into blocks, and blocks are grouped together into a grid.
GPUs are programmed using specialized library functions, compiler directives such as OpenACC, and specialized languages and language extensions such as CUDA and OpenCL.
OpenACC
OpenACC (Open Accelerators) is a standard for using accelerators, including GPUs. OpenACC uses directives in C and C++ code to help the compiler offload the chosen computations to the accelerator. OpenACC programming is similar to OpenMP programming.
Example:
void saxpy(int n, float a, float *x, float *y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
CUDA
CUDA is a parallel computing platform and programming model proposed by the NVIDIA company. The NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications: a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimization.
The CUDA programming model uses both the CPU and the GPU: the host is the CPU and its memory, and the device is the GPU and its memory.
Code that runs on the host can manage memory on both the host and the device. The host code launches kernels, which are functions executed on the device; these kernels are executed by many GPU threads in parallel.
Example:
for (int i = 0; i < N; i++) {    // initialize the host arrays
    x[i] = 1.0f;
    y[i] = 2.0f;
}
// copy the contents of the host arrays to the device arrays
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
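The fragment above only initializes the host arrays and copies them to the device. A hedged sketch of the surrounding pieces, a SAXPY kernel, the device allocations, the kernel launch, and the copy back, assuming x, y, d_x, d_y are float pointers and N is the array length as in the fragment (the block size of 256 is an illustrative choice):
__global__ void saxpy(int n, float a, float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread ID */
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side, around the copies shown above: */
cudaMalloc(&d_x, N * sizeof(float));
cudaMalloc(&d_y, N * sizeof(float));
/* ... the two cudaMemcpy calls above ... */
saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);  /* launch the kernel */
cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);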
Parallelization of prefix operations
Assume that we have a sequence of integers x0, x1, x2, … (in an array); we need to compute the prefix sums y0 = x0, y1 = x0 + x1, y2 = x0 + x1 + x2, and so on.
Here we use a parallel algorithm for the prefix sum, combining elements pairwise in rounds:
z0 = x0 + x1, z1 = x2 + x3, etc.
w0 = z0 = x0 + x1
w1 = z0 + z1 = x0 + x1 + x2 + x3
y0 = x0, y1 = w0, y2 = w0 + x2, y3 = w1, …
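The scheme above combines pairs in rounds. As a hedged implementation sketch, the OpenMP version below uses a simpler two-pass blocked variant of the same prefix-sum idea: each thread first computes prefix sums inside its own block, then adds the total of all earlier blocks (the function name and the 256-thread limit are illustrative).
#include <omp.h>

void prefix_sum(const int* x, int* y, int n) {
    int block_sum[256] = {0};                 /* assumes at most 256 threads */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int p  = omp_get_num_threads();
        int lo = id * n / p, hi = (id + 1) * n / p;
        int s = 0;
        for (int i = lo; i < hi; i++) {       /* pass 1: local prefix sums */
            s += x[i];
            y[i] = s;
        }
        block_sum[id] = s;
        #pragma omp barrier                   /* wait for every block sum  */
        int offset = 0;
        for (int t = 0; t < id; t++)          /* total of earlier blocks   */
            offset += block_sum[t];
        for (int i = lo; i < hi; i++)         /* pass 2: add the offset    */
            y[i] += offset;
    }
}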
Relaxed Sequential Execution (RSE)
Relaxed sequential execution is the ability to run a parallel program sequentially; because the same program can also run in parallel, it is called relaxed sequential. When we want a correct reference result without the complications of parallel execution, we use the RSE model, in which debugging and verification can be done easily.