forceway


OpenMP tutorial


A five-minute OpenMP tutorial...

by John McInnes


I've been hearing more and more about OpenMP over the last few years, so I decided to try it out. What is it? OpenMP is an attempt to simplify and standardize parallel programming in C, C++, and Fortran. It's cross-platform. It can automatically take advantage of as many CPUs as are available, or run on a single CPU. And in theory, your code can even compile on compilers that don't support OpenMP. This is because it is primarily implemented in the form of #pragmas, although there is a small library of function calls you can optionally use. Currently most of the major compilers, including gcc and MSVC, support some version of OpenMP. I will only cover the most basic usage here: the parallel for loop. I think it's a good place to start. Most applications can benefit immediately and will require only minor restructuring, if any at all.
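To make the "pragmas plus a small optional library" point concrete, here is a minimal sketch. The helper names max_worker_threads() and fill_squares() are made up for illustration. The _OPENMP macro and omp_get_max_threads() come from the OpenMP spec; the #ifdef guard is there so the same file should still build when OpenMP is switched off.

```c
#ifdef _OPENMP
#include <omp.h>   /* the optional function library; only present with OpenMP enabled */
#endif

/* Hypothetical helper: how many threads a parallel region may use. */
int max_worker_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;   /* no OpenMP: everything runs on the calling thread */
#endif
}

/* Hypothetical helper: writes i*i into out[i] for i in 0..n-1.
   A compiler without OpenMP ignores the pragma and runs a plain loop. */
void fill_squares(int *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = i * i;
    }
}
```

Either way the results are the same; only the number of threads doing the work changes.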

I have an old height-field ray-casting engine from the 90s. It's a good candidate for parallelization. I also have a new dual-core processor. So my goal was to multi-thread the code and increase performance. The code, taken from the application's main loop, looks something like this:

for( int y = 0; y < screen_height; y++ )
{
    for( int x = 0; x < screen_width; x++ )
    {
        write_pixel( x, y, process_ray( x, y ) );
    }
}

process_ray() is a function that sends a ray out into the world and determines what color the resulting pixel should be. write_pixel() is a function that simply writes a pixel to the screen. This particular code is running at about 100 frames per second on a 2.4GHz Intel Core 2 chip. Windows Task Manager shows CPU usage at 50%, which on a dual-core chip means one core is doing all the work while the other sits idle. Now with OpenMP, all I have to do is add this one line:

#pragma omp parallel for
for( int y = 0; y < screen_height; y++ )
{
    for( int x = 0; x < screen_width; x++ )
    {
        write_pixel( x, y, process_ray( x, y ) );
    }
}

Bam! OpenMP creates a thread for each processor and splits the work among them. I get 183 frames per second and 99% CPU usage! It's that easy. Just remember to enable OpenMP for your compiler (for example, -fopenmp with gcc or /openmp with MSVC). (On MSVC you should also include omp.h. This will pull in the vcomp.lib which you need to link with.) This is the omp parallel for loop.

There ARE some things to watch out for:

By default, any variables declared BEFORE the parallel section are shared among all the threads created. But variables declared WITHIN the parallel section are private, meaning each thread will have its own copy. This includes the loop variable (y in my example). This is good. You don't want a bunch of threads modifying each other's data... but you need to be sure the operations you are performing and the functions you are calling are thread safe. Or maybe it would be better to say OpenMP safe.

In my example there are a lot of shared reads. The heightfield, the environment, the camera position, the lighting: all shared data. The process_ray() function reads all this data and uses it to calculate the final pixel color that should go on screen. If you are familiar with multi-threaded programming you know it is OK if different threads read the same data. It's when you start writing that you need to think carefully. In this case the only thing that I am writing is a pixel to a screen buffer. Yes, the screen buffer IS shared, but the individual rows will be divided up among the threads. This is because I used y as my loop variable. Each thread will write its own pixels, and not touch any others.
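Here is a small sketch of that row-per-iteration idea, with stand-ins for my real code. The names WIDTH, HEIGHT, shade() and render() are invented for the example, and shade() is just a placeholder for process_ray() that reads nothing but its arguments.

```c
#define WIDTH  64
#define HEIGHT 48

/* Placeholder for process_ray(): touches no shared state, so it is thread safe. */
static unsigned shade(int x, int y)
{
    return (unsigned)(x ^ y) & 0xFFu;
}

/* pixels is declared outside the loop, so every thread sees the same buffer.
   y, x and row are declared inside it, so each thread gets its own copies. */
void render(unsigned *pixels)
{
    #pragma omp parallel for
    for (int y = 0; y < HEIGHT; y++) {
        unsigned *row = pixels + y * WIDTH;   /* this iteration's row only */
        for (int x = 0; x < WIDTH; x++) {
            row[x] = shade(x, y);
        }
    }
}
```

No two iterations ever write the same row, so the shared buffer needs no locking. Note that x is private here only because it is declared inside the parallel loop; as far as I understand the rules for C, declaring it before the pragma would make it shared, and the threads would trample each other's counter.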

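Another caveat is that the loop needs to take a certain form such that the OpenMP implementation can determine how many iterations there will be and split the load up. Roughly, it must be a counted for loop whose start, end, and step are fixed before the loop begins. See your compiler's OpenMP documentation for the exact rules.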
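As a rough illustration of what "a certain form" means, compare the two loops below. The function names scale() and count_until_zero() are invented for this sketch, and the spec has the precise rules; the point is only that the iteration count has to be known up front.

```c
/* Fine: start, end, and step are all known before the loop runs. */
void scale(float *v, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        v[i] *= k;
    }
}

/* Not suitable for "parallel for": the loop stops when it finds a zero,
   so nobody knows the iteration count until the loop has already run.

       for (int i = 0; v[i] != 0.0f; i++) { ... }

   One workaround is to count first, serially, then run a counted loop. */
int count_until_zero(const float *v, int limit)
{
    int n = 0;
    while (n < limit && v[n] != 0.0f) {
        n++;
    }
    return n;
}
```

So a data-dependent loop becomes scale(v, count_until_zero(v, limit), k): a cheap serial pass to find the bound, followed by a loop OpenMP can split up.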

And a final caveat: you cannot have any 'loop carried dependencies'. That means no calculating a value that depends on a previous iteration. Another way to put it is that you can't have a loop that must run in sequence to work properly. Here is an example of a loop that can't be parallelized as written:

int num = 1;
for ( int i = 0; i < 10; i++ )
{
    num += 22;
    do_something( num );
}

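And here is how you could fix it. Each iteration computes num directly from i, so no iteration depends on another. (The original loop adds 22 before the first call, so iteration i sees 1 + (i + 1) * 22.) This assumes do_something() is thread safe and doesn't care what order it is called in.

#pragma omp parallel for
for ( int i = 0; i < 10; i++ )
{
    int num = 1 + (i + 1) * 22;
    do_something( num );
}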
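A rewrite like this is easy to get wrong by one step, so it is worth checking against the serial loop. Below is a small sketch of such a check. The names STEPS, serial_values() and parallel_values() are made up, and writing to out[i] stands in for do_something( num ). The serial loop adds 22 before the first use, so the closed form that matches it is 1 + (i + 1) * 22.

```c
#define STEPS 10

/* Serial version: each iteration needs the num left behind by the previous one. */
void serial_values(int *out)
{
    int num = 1;
    for (int i = 0; i < STEPS; i++) {
        num += 22;
        out[i] = num;   /* stands in for do_something( num ) */
    }
}

/* Parallel version: num depends only on i, so iterations can run in any order. */
void parallel_values(int *out)
{
    #pragma omp parallel for
    for (int i = 0; i < STEPS; i++) {
        int num = 1 + (i + 1) * 22;
        out[i] = num;
    }
}
```

If the two arrays match element for element, the dependency has been removed without changing what each iteration sees.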


That's it. Pretty cool. In conclusion, I do think that if you want to squeeze out maximum performance you should use your OS's native thread API and design a custom solution. But if you are an app developer, and you want a simple way to put those extra CPU cores to work, then OpenMP is it. Your users will be happy to see their expensive processors chugging away at full speed. As a bonus, OpenMP is cross-platform, your code will run just fine on a single CPU or on multiple CPUs, AND it will compile even if your compiler doesn't support OpenMP, because unsupported #pragmas are ignored. If you want to go further, OpenMP provides other kinds of parallel sections, functions for locking and synchronization, for controlling the number of threads, and more. Check your compiler docs to see which version of the OpenMP spec it supports. I'm looking forward to trying my now-parallel ray-casting engine on a quad-core AMD chip. Just as soon as I can get my hands on one...