I occassionally read Darren Wilkinson's blog, and you should too. He talks about some pretty interesting stuff in the scientific and statistical computing fields.
I've taken a keen interest in some of the parallel implementations of Monte Carlo algorithms he's written about, most recently he wrote an implementation using Scala. You can check out the post here.
I decided to try my hand at writing an implementation of Darren's toy problem using python and NumPy. It also gave me the opportunity to do some parallel programming in python since most of the parallel programming I do is either in C or C++ using MPI. For a reminder of his toy problem, see the post linked above.
I'll cut straight to the chase and just post the code. I've commented it so it should be fairly easy to read.
import sys import numpy as np from multiprocessing import Pool def integrate(its): # I totally cheated and tweaked the number of chunks # to get the fastest result chunks = 10000 chunk_size = its / chunks np.random.seed() # Each process needs a different seed! sum = 0.0 for i in xrange(chunks): # For each chunk... # ...do a vectorised Monte Carlo calculation u = np.random.uniform(size=its/chunks) sum += np.sum(np.exp(-u * u)) # Do the Monte Carlo # We did 'its' total iterations in this process, so # normalise the result and return it return sum / float(its) if __name__ == '__main__': num_procs = int(sys.argv) iters = 1000000000 its = iters / num_procs # Each process gets a share of the iterations pool = Pool(processes=num_procs) # Each process calls 'integrate' with 'its' iterations result = pool.map(integrate, [its] * num_procs) # pool.map returns a list of length 'num_procs', with # element 'i' being the return value of 'integrate' from # process 'i' # Renormalise by the number of processors print sum(result) / float(num_procs)
And here are the timings. I just used the linux
time command, so take these with a pinch of salt.
num_procs time 1 0m23.578s 2 0m12.111s 3 0m8.490s 4 0m7.898s 5 0m7.492s 6 0m6.989s 7 0m6.497s 8 0m6.353s 9 0m6.402s 10 0m6.459s 11 0m6.460s 12 0m6.525s 16 0m6.680s 32 0m7.333s 64 0m8.692s
Can you guess how many cores my laptop has?
Comparing with the Scala code yields some interesting things. The python + NumPy code above will run anywhere from two (
num_proc = 8) to four times (
num_proc = 1) faster than the Scala code. However, note that I did spend some time tweaking the chunksize. The use of vectorising some of the Monte Carlo calculations in NumPy also helps a lot.
Note: If you plan on running this, be careful passing in a number of processes that doesn't divide the number of iterations, since all of those divisions are integer divisions in python 2.x.