Recently I was attempting to port some scientific code to run on the CPU and GPU via Thrust. Thrust is a really powerful algorithms library in C++ similar to the Standard Template Libarary (STL), but it mainly operates on linear vectors of values. For the code I was parallelizing it didn’t seem to match this well at first. Fortunately Thrust provides some real power-tools, and here’s the recipe I came up with.

## Problem

for (double theta=thetaLow; theta<thetaHigh; theta+=dtheta) { const double dphi = (phiHigh - phiLow)/100; const double outer_value = compute_outer(dphi, theta); for (int j=0; j<numJ; j++) { for (int i=0; i<numI; i++) { for (double phi=phiLow; phi<phiHigh; phi+=dphi) { double value = compute_value(theta, phi, i, j, outer_value); result_buckets[j][i] += value; } } } }

- Multiple loop nests with non-cannoncal forms
- Data structures that are multi-dimensional
- The reduction is segmented
- Outer loop values are shared across inner loop computations
- Traversal of reduction segments is sparse, not dense

This problem kind-of seems like a weird combination of outer-product then reduction, similar to a form of MapReduce. It generates four dimensions of values and then integrates them down into two dimensions. Fortunately each computation and the reduction is commutative, so we can extract all the parallelism in these loops.

## Recipe

**Overview**

The original loop nest from outer to inner was `[theta][i][j][phi]`

, however the functionality is to do a summation over all `[theta][phi]`

values for each `[i][j]`

pair. To parallelize this we’ll be turning that serial summation into a parallel reduction, and because each `[i][j]`

pair is independent of each other we can do all `i x j`

of these reductions in parallel, too.

To achieve this we want to create a reduction operation across all `[i][j][theta][phi]`

values but have the reduction operations routed to specific `[i][j]`

buckets. By generating a key value for each `[i][j]`

bucket we can use `thrust::reduce_by_key`

to achieve the desired segmenting of the reduction.

Since the values we’re generating as part of the reduction are based on indices and reading stateless data in `compute_value`

we can implement the body of the computation as a transform operation on the linearized index of all `[i][j][theta][phi]`

.

**Ingredients:**

- custom thrust::unary_function‘s to both create the key segmentation for the reduction and to generate the actual computed values
- explicit computation of the index space for the non-canonical loops
- thrust::reduce_by_key to perform the reduction
- thrust::counting_iterator to generate an implicit computed input sequence for iterating over (our flattened “loop indices”)
- thrust::transform_iterator to get those unary_function’s to act on the reduction as well as perform kernel fusion (a good thing!)
- thrust::discard_iterator so we don’t waste time or space on output that isn’t needed

Let’s do it!

**Unary Functors!**

We use two unary functions (functions that transform a single input into a single output) in this recipe. While they look like verbose C++, the functionality is straight-forward.

The first unary function is responsible for transforming a sequence value from `[i][j][theta][phi]`

into a key value representing a `[i][j]`

bucket.

struct key_functor : public thrust::unary_function<int,int> { const int Nphi, Ntheta; key_functor(int _Nphi, int _Ntheta) : Nphi(_Nphi), Ntheta(_Ntheta) {} __host__ __device__ int operator()(int x) { return x / (Nphi * Ntheta); } };

The second unary function actually computes the pre-reduction value for every point in our iteration space. It’s what implements the body of the loops we’re parallelizing. The operation is on a single element of our `[i][j][theta][phi]`

space and the function object computes a single output value for each element in this multi-dimensional space as input to the final reduction.

struct value_functor : public thrust::unary_function<double,int> { const int Nphi, Ntheta, Nj, Ni; const double dtheta, thetaLow; const double dphi, phiLow; value_functor(int _Nphi, int _Ntheta, int _Nj, int _Ni double _dtheta, double _thetaLow, double _dphi, double _phiLow) : Nphi(_Nphi), Ntheta(_Ntheta), Nj(_Nj), Ni(_Ni), dtheta(_dtheta), thetaLow(_thetaLow), dphi(_dphi), phiLow(_phiLow) {} __host__ __device__ double operator()(int index) const { // calculate original loop indices // "loop" order is now [i][j][theta][phi] const int phiIdx = index % Nphi; const int thetaIdx = (index / Nphi) % Ntheta; const int jIdx = (index / (Nphi * Ntheta)) % Nj; const int iIdx = (index / (Nphi * Ntheta * Nj)) % Ni; // Generate our original loop values from the base value and stride const double theta = thetaLow + thetaIdx * dtheta; const double phi = phiLow + phiIdx * dphi; // Recompute our outer-loop value (replicate, don't communicate!) const double outer_value = compute_outer(dphi, theta); // Compute the value for this [i][j][phi][theta] element return compute_value(theta, phi, iIdx, jIdx, outer_value); } };

**Implicit iterators for efficient indicies**

Now that we have our unary functions which implement the behavior we want to parallelize the next challenge is to invoke them properly. To do this we’ll want to create a linear sequence of indices covering all the `[i][j][theta][phi]`

elements.

First we’ll need to figure out the range of `theta`

and `phi`

as those loops were not originally index based. Then, rather than store all of the index data in memory we’ll use an implicit iterator which can compute the needed value on-the-fly.

const int numPhi = (phiHigh - phiLow)/dphi; const int numTheta = (thetaHigh - thetaLow)/dthetha; // Create a linearized index of [i][j][theta][phi] thrust::counting_iterator<int> index_begin(0); thrust::counting_iterator<int> index_end(numI * numJ * numPhi * numTheta);

**Invoking it all**

The final step once we have our input index sequence defined is to invoke the whole thing. Since our unary functions take in this index and then generate the needed bucket key or computational value we can use thrust’s transform_iterator to fuse these computations at compile-time with the implementation of the ultra-optimized thrust reduction operation. This is really important, because it means we won’t have to save off all the intermediate values generated for each element. Since we’ve got a lot of parallelism here, that’s a lot of data-shuffling to save.

using thrust; // Storage for the of our reduction output consisting of a linear vector of [i][j] buckets host_vector<double> output(numI * numJ); // Capture all the input values needed to generate each element value_functor value_generator(numPhi, numTheta, numJ, numI, dtheta, thetaLow, dphi, phiLow); // Do a reduction for each [i][j] bucket across the values for each [theta][phi] reduce_by_key( // first [i][j] bucket key make_transform_iterator(index_begin, key_functor(numPhi, numTheta)), // last [i][j] bucket key make_transform_iterator(index_end, key_functor(numPhi, numTheta)), // generator of values make_transform_iterator(index_end, value_generator), // the compacted key list output is not needed make_discard_iterator(), // The output of [i][j] buckets with summed values output.begin() );

Here we’ve allocated a vector for the output, it could just as well be a `device_vector`

, too. Then we created the `value_functor`

objecte for value generation, capturing all the input values necessary.

Now we’re on to the invocation of the reduction operation. The first two arguments are transformations of the indices into our reduction bucket keys with the `key_functor`

objects handling the duties. The third argument uses the same input index sequence with our value generating functor to create the inputs to the reduction operation. The fourth argument discards an unneeded output. Finally we provide an iterator argument for our output of the reduction.

**Getting the data back out**

If, as I in my case, this code is sitting in the middle of a bunch of existing code that’s using the multi-dimensional arrays elsewhere you’ll want to pull this data back out from Thrust’s containers.

thrust::host_vector<double>::iterator iter = output.begin(); for (int i=0; j<numI; i++, thrust::advance(iter, numI)) { thrust::copy_n(iter, numJ, &result_buckets[i][0]); }

And you’re done!

## Conclusion

If you’re a JavaScript or other dyanmic language programmer you’ll not be used to crafting your continuations and closures by hand like this, the good news is that the Lambda language feature in C++11 will slice right through this verbosity soon enough.

However you’ll still have to do the hard work of parallelizing: searching for data or loop dependencies, re-architecting the loop structure, and picking the right primitives for your parallel operation.

Doing this hard work with Thrust has some great payoffs, though. For parallel code that maps onto Thrust is that this algorithm will be both portable and will leverage expert-optimized primitive routines on each platform. The code can be run in parallel on both CPUs and GPUs via Thrust’s TBB, OpenMP, and CUDA back-ends, or just serially on the CPU if you prefer to debug the code there (or just run valgrind on it!).

Joseph SmarrMarch 6, 2012 / 12:55 pmNice example, and easy to follow even though I didn’t know anything about thrust before. Kinda cool that, as you say, you’re essentially programming a distributed MapReduce task but running it all on a single device. Would have been cool to end with some graphs/stats of how much more efficient this is vs. your original code though–some concrete way to show that the payoff was worth the work. ðŸ™‚

sipherhexMarch 16, 2012 / 12:49 amI’ll have a follow-up post with the payoff.