Parallel Monte Carlo using python and NumPy

I occassionally read Darren Wilkinson's blog, and you should too. He talks about some pretty interesting stuff in the scientific and statistical computing fields.

I've taken a keen interest in some of the parallel implementations of Monte Carlo algorithms he's written about, most recently he wrote an implementation using Scala. You can check out the post here.

I decided to try my hand at writing an implementation of Darren's toy problem using python and NumPy. It also gave me the opportunity to do some parallel programming in python since most of the parallel programming I do is either in C or C++ using MPI. For a reminder of his toy problem, see the post linked above.

I'll cut straight to the chase and just post the code. I've commented it so it should be fairly easy to read.

import sys  
import numpy as np  
from multiprocessing import Pool

def integrate(its):  
    # I totally cheated and tweaked the number of chunks
    # to get the fastest result
    chunks = 10000
    chunk_size = its / chunks

    np.random.seed()  # Each process needs a different seed!

    sum = 0.0
    for i in xrange(chunks):  # For each chunk...
        # ...do a vectorised Monte Carlo calculation
        u = np.random.uniform(size=its/chunks)
        sum += np.sum(np.exp(-u * u))  # Do the Monte Carlo

    # We did 'its' total iterations in this process, so
    # normalise the result and return it
    return sum / float(its)

if __name__ == '__main__':  
    num_procs = int(sys.argv[1])

    iters = 1000000000
    its = iters / num_procs  # Each process gets a share of the iterations

    pool = Pool(processes=num_procs)

    # Each process calls 'integrate' with 'its' iterations
    result = pool.map(integrate, [its] * num_procs)

    # pool.map returns a list of length 'num_procs', with
    # element 'i' being the return value of 'integrate' from
    # process 'i'

    # Renormalise by the number of processors
    print sum(result) / float(num_procs)

And here are the timings. I just used the linux time command, so take these with a pinch of salt.

num_procs  time  
1          0m23.578s  
2          0m12.111s  
3          0m8.490s  
4          0m7.898s  
5          0m7.492s  
6          0m6.989s  
7          0m6.497s  
8          0m6.353s  
9          0m6.402s  
10         0m6.459s  
11         0m6.460s  
12         0m6.525s

16         0m6.680s  
32         0m7.333s  
64         0m8.692s  

Can you guess how many cores my laptop has?

Comparing with the Scala code yields some interesting things. The python + NumPy code above will run anywhere from two (num_proc = 8) to four times (num_proc = 1) faster than the Scala code. However, note that I did spend some time tweaking the chunksize. The use of vectorising some of the Monte Carlo calculations in NumPy also helps a lot.

Note: If you plan on running this, be careful passing in a number of processes that doesn't divide the number of iterations, since all of those divisions are integer divisions in python 2.x.