There were times when programmers did not have to worry about parallelisation, as most computers only had a single CPU. Sure, you could get dual-CPU mainboards, but their use was mostly limited to servers. The need for parallelisation was confined to the world of high performance computing, and that’s where I first got in touch with it while working on my master’s thesis, writing software for gravity field computations on the 16-CPU NEC Itanium and NEC SX-6 vector processor systems at Stuttgart’s High Performance Computing Center.
This changed with the introduction of the Pentium D in 2005. With the gigahertz race coming to an end due to physical constraints, AMD and Intel needed other ways to get more speed out of their CPUs. Combining multiple CPUs (cores) into one physical chip made this possible, provided software was modified to actually make use of them. Today every PC CPU has multiple cores, with four being pretty much the minimum, and high-end server CPUs offering well over a hundred.
So, how to make use of this processing power? In order for a single task to make use of multiple cores, it needs to be parallelized. The easiest way to achieve this in C/C++ is by using OpenMP's parallel for pragma. For this to work, OpenMP needs to be supported by the compiler and enabled with the proper compiler switch (-fopenmp for GCC and Clang, /openmp for Microsoft's compiler).
Parallel for
Using parallel for requires a for loop that iterates over your data set, as is common in data processing. The pragma instructs the compiler to generate code whose loop iterations can be executed in parallel. The iterations are run as separate threads, not processes.
#pragma omp parallel for
for(int i=0; i<n; i++) {
    result[i] = do_something_with(data[i]);
}
By default the loop iterations are then distributed over all available cores. This includes the virtual cores that result from hyper-threading, i.e. one physical core being presented to the operating system as two logical cores. You can query the number of threads actually in use with the omp_get_num_threads() function (called inside a parallel region), which requires including omp.h. There are also functions for setting the number of threads (omp_set_num_threads()), getting the thread id (omp_get_thread_num(), which can be useful e.g. for writing to a separate file per thread), and timing (omp_get_wtime()).
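A minimal sketch of these query functions (it needs to be compiled with OpenMP enabled; the output order of the per-thread lines is not deterministic):

#include <cstdio>
#include <omp.h>

int main() {
    // logical processors visible to the OpenMP runtime
    printf("Processors available: %d\n", omp_get_num_procs());
    // upper limit on the number of threads a parallel region may use
    printf("Max threads: %d\n", omp_get_max_threads());

    omp_set_num_threads(4);   // request 4 threads for subsequent parallel regions

    #pragma omp parallel
    {
        // each thread reports its own id and the team size
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}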
What to look out for
Sounds easy, right? It is most of the time, but there are some things to watch out for to get correct results. I recommend always verifying that your program works correctly before parallelizing it, and comparing against the results of the non-parallelized version to make sure that the parallelization has not broken anything.
Thread-local variables
Variables defined outside the for loop are shared between all threads. This causes issues when multiple threads access such a variable at the same time:
double temp;
#pragma omp parallel for
for(int i=0; i<n; i++) {
    temp = func1(data[i]);
    result[i] = temp + func2(data[i]);
}
Since temp is shared between all threads, it is quite possible that its value gets overwritten by another thread before it can be used in the sum. The solution is to define temp inside the loop, which automatically makes it a thread-local variable, so each thread gets its own copy. This does of course mean that the memory requirement is the size of the variable times the number of threads, so watch out with large arrays.
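The fixed version of the snippet above (func1, func2, data and result carried over from the example) then looks like this:

#pragma omp parallel for
for(int i=0; i<n; i++) {
    double temp = func1(data[i]);   // temp is now local to each iteration, and thus to each thread
    result[i] = temp + func2(data[i]);
}

Alternatively, adding OpenMP's private(temp) clause to the pragma gives each thread its own copy while keeping the declaration outside the loop.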
Race conditions
A typical issue with parallel code is a race condition, where the result depends on the order in which the loop iterations are executed, as in this case:
double result[n];
result[0] = 0.0;
#pragma omp parallel for
for(int i=1; i<n; i++) {
    result[i] = result[i-1] + func(data[i-1], data[i]);
}
Such a for loop works fine when it is executed in order from 1 to n-1 – but the whole point of parallelization is that it no longer gets executed in order; instead the workload is spread over multiple threads. If you want to parallelize something like this, you have to limit the parallel part to the computation that does not depend on the previous loop iteration, and then do the summation (in this case) in a single-threaded loop, as sketched below.
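One way to restructure the example above (reusing func, data and result from the snippet): compute the independent per-iteration values in parallel, then accumulate them serially.

double increment[n];   // per-iteration values, independent of each other

#pragma omp parallel for
for(int i=1; i<n; i++) {
    increment[i] = func(data[i-1], data[i]);   // no dependency on earlier iterations
}

// the order-dependent accumulation stays single-threaded
result[0] = 0.0;
for(int i=1; i<n; i++) {
    result[i] = result[i-1] + increment[i];
}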
Locking
There might be cases where it is necessary for all threads to write results to the same variable. To prevent race conditions due to simultaneous access to the same variable by multiple threads, you can use the #pragma omp atomic statement to tell the compiler that this variable may only be accessed by one thread at a time, i.e. locking the variable. The downside of locking is that it can slow down the program considerably. Where possible, all computations should be confined to variables accessed by a single thread only. For things like computing a sum, this can be achieved with the reduction clause:
// slow:
double sum = 0.0;
#pragma omp parallel for
for(int i=0; i<n; i++) {
    #pragma omp atomic
    sum += func(data[i]);
}

// better:
double sum = 0.0;
#pragma omp parallel for reduction(+: sum)
for(int i=0; i<n; i++) {
    sum += func(data[i]);
}
The reduction internally does the summation with a thread-local variable per thread, then combines these partial sums once the loop iterations are completed.
Output
Output is sloooooow, especially if you try to write to a single file from all threads and thus need to employ locking. Where possible you should store results in variables, then do output in a serial portion of the program.
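A sketch of this pattern (the file name is made up, and do_something_with, data and n are the placeholders from the first example; std::vector and std::ofstream need the <vector> and <fstream> headers):

std::vector<double> results(n);   // one slot per loop iteration, no locking needed

#pragma omp parallel for
for(int i=0; i<n; i++) {
    results[i] = do_something_with(data[i]);   // each thread writes only to its own indices
}

// serial output once the parallel section is done
std::ofstream out("results.txt");
for(int i=0; i<n; i++) {
    out << results[i] << "\n";
}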
Unnecessary parallelization
Parallelization does not come for free, as there is overhead in starting and finishing threads. This means that you can actually slow down a program by parallelizing for loops that have little workload. You should always time your program before and after parallelization to see if there is an actual speed-up. Timing is also a necessity to find the parts of a program that actually take up processing time – which are not always where you expect them to be. In the best case, the speed-up scales linearly with the number of threads employed.
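OpenMP's omp_get_wtime() is a convenient way to do such timing; a sketch around the first example (printf needs cstdio, omp_get_wtime() needs omp.h):

double t_start = omp_get_wtime();   // wall-clock time in seconds

#pragma omp parallel for
for(int i=0; i<n; i++) {
    result[i] = do_something_with(data[i]);
}

double t_end = omp_get_wtime();
printf("Parallel loop took %f s\n", t_end - t_start);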
Scheduling
By default (static scheduling), the loop iterations are distributed evenly over all available threads. This has the lowest overhead, as the scheduler does not have to interfere during the loop execution. There are however cases where this will not lead to an optimal work distribution and hence speed-up:
- Not all loop iterations have the same workload. This can e.g. happen when you are working with irregularly distributed data. A loop iteration dealing with an area of high data density will take longer than one dealing with sparse data.
- Not all threads have the same processing power. This wasn't an issue in the past, when all cores were identical and thus processed data at the same speed. In recent years (in the case of Intel, starting with the 12th generation Core CPUs), it has become common for CPUs to contain different core types. Intel calls these P(erformance) and E(fficiency) cores. This may lead to cases where the faster P-cores are already done with their assigned workload while the E-cores are still busy. This can be seen in the task manager – the parallel section is not completed, yet the processor load is below 100% because some cores are idling, waiting for the others to finish.
The solution is smarter scheduling. Instead of simply spreading out the number of loop iterations evenly over all available cores, you can tell the scheduler to only give one loop iteration to each thread at a time, then assign the next iteration once a thread becomes available:
#pragma omp parallel for schedule(dynamic)
for(int i=0; i<n; i++) {
    result[i] = do_something_with(data[i]);
}
The downside? More overhead, so there are cases where this will be slower than static scheduling. There's also schedule(guided), which starts with large chunks of iterations and shrinks the chunk size as the loop progresses, but apparently comes with more overhead. My recommendation would be to simply try all three options and see which performs best. I've had cases where dynamic scheduling led to a considerable acceleration compared to static scheduling.
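Dynamic scheduling also accepts an optional chunk size, which reduces the scheduling overhead at the cost of coarser load balancing; a sketch of both variants, again with the placeholder loop from above:

// hand out chunks of 16 iterations at a time: a compromise between
// load balancing and scheduling overhead
#pragma omp parallel for schedule(dynamic, 16)
for(int i=0; i<n; i++) {
    result[i] = do_something_with(data[i]);
}

// guided: chunk sizes start large and shrink towards the end of the loop
#pragma omp parallel for schedule(guided)
for(int i=0; i<n; i++) {
    result[i] = do_something_with(data[i]);
}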
More information
This Guide into OpenMP is a great overview of the functionality of OpenMP. Keep in mind though that not all compilers support everything. This is especially true for Microsoft’s C++ compiler, which appears to be stuck somewhere between OpenMP 2.0 and 3.1.