The Need for Speed

When time is of the essence, Python is not always the best choice. For a very long time this wasn’t a concern for me but lately I have had reasons to speed up my code, especially my Monte Carlo simulations. There’s a lot you can do in Python to speed things up: thinking carefully about your code to find bottlenecks, vectorisation with NumPy, JIT compilation with Numba, dropping to Cython or C++.

In this post I’ll walk through the Problem of Points, a small kernel from probability theory, and show how to take a baseline Python implementation and make it roughly 80× faster using Numba (a JIT compiler that turns a regular Python function into compiled native code with one decorator and no separate build step).

The Problem of Points

We actually won’t start with a simulation, though the code does contain some loops that will be sped up considerably when moving away from Python. Below is the analytical solution to the classic Problem of Points. I’ll let Wikipedia do the introduction before we move on to the code:

The problem of points, also called the problem of division of the stakes, is a classical problem in probability theory. One of the famous problems that motivated the beginnings of modern probability theory in the 17th century, it led Blaise Pascal to the first explicit reasoning about what today is known as an expected value.

The problem concerns a game of chance with two players who have equal chances of winning each round. The players contribute equally to a prize pot, and agree in advance that the first player to have won a certain number of rounds will collect the entire prize. Now suppose that the game is interrupted by external circumstances before either player has achieved victory. How does one then divide the pot fairly? It is tacitly understood that the division should depend somehow on the number of rounds won by each player, such that a player who is close to winning will get a larger part of the pot. But the problem is not merely one of calculation; it also involves deciding what a “fair” division actually is.

Python

Nowadays, a seasoned gambler would immediately realise that the fair division is that each player get a share of the pot equal to his chances of winning the game, given the current score. This wasn’t so obvious in the 15th century when the problem was first discussed but only realized far later by Pascal and Fermat. So how to compute each player’s chances of winning then? Given the nature of the game this is fairly trivial and below is a somewhat efficient Python solution using combinatorics. It could very likely be improved in terms of speed of execution but will serve its purpose as our baseline to compare other approaches with.

def binomial_coefficient(n, k):
    result = 1
    for i in range(k):
        result *= (n - i) / (i + 1)
    return result

def points_python(a, b, N):
    total_needed = N - a
    total_games = total_needed + (N - b) - 1
    probability = sum(
        binomial_coefficient(total_games, i)
        for i in range(total_needed, total_games + 1)
    )
    
    prob_a_wins = probability / (2 ** total_games)
    
    return prob_a_wins

So let’s try it out: given the score 10-8 in favor of player A, what is his chances of winning if they are set to play until someone reaches 16 points?

points_python(10,8,16)

0.70947265625

In this case, if the game were to be interrupted at this point, player A should receive roughly 71% of the pot. What if the game was played to 32 points instead?

points_python(10,8,32)

0.6170040878776035

This time player A only have about a 62% chance of winning if the game would have continued, which makes sense as he now needs 12 more points to win instead of 6.

import timeit

a, b, N = 10, 8, 32

def time_python():
    p = points_python(a, b, N)

python_time = timeit.timeit(time_python, number=100000)

print(f"Python time: {python_time:.2f}s")

Python time: 1.78s

1.78 seconds for 100,000 evaluations sounds quick. But in Monte Carlo work I might call this kernel millions of times per simulation, and it adds up. This is our baseline. The rest of the post is about making it faster.

What I tried first

My first instinct was to compile to C-style code. I learned just enough Cython to type-annotate the hot loop: a roughly 40× speedup over the Python baseline. Encouraged, I rewrote the kernel in C++, called from Python via a thin Cython wrapper. That bought another factor of two on top, for about 80× total on the machine I was using at the time.

Both worked, but the cost was real: a .pyx file, a .cpp file, compiler flags in the cell magic, the cognitive overhead of three languages talking to each other, and an extra build dance every time I touched the code. The next time I needed to revisit it, I’d left enough scaffolding behind that touching it felt expensive.

These days I reach for Numba instead.

Numba

Numba is a JIT compiler for a numerically-flavoured subset of Python. You decorate a function with @njit, and the first time it’s called Numba hands it to LLVM and gets compiled, type-specialised native code back. No separate files, no build step, no second language. With cache=True the compiled artefact persists across sessions, so subsequent runs skip the warm-up too.

Here is the same kernel as before, with one import and one decorator added:

from numba import njit


@njit(cache=True)
def binomial_coefficient_nb(n, k):
    result = 1.0
    for i in range(k):
        result *= (n - i) / (i + 1)
    return result


@njit(cache=True)
def points_numba(a, b, N):
    total_needed = N - a
    total_games = total_needed + (N - b) - 1
    probability = 0.0
    for i in range(total_needed, total_games + 1):
        probability += binomial_coefficient_nb(total_games, i)
    return probability / (2.0 ** total_games)

Same answer as the Python version, to the last digit:

points_numba(10,8,32)

0.6170040878776035

The first call to points_numba triggers the JIT compile, which we don’t want to time. We warm it up with one throwaway call, then run the same timeit loop as before:

points_numba(a, b, N)  # warm up the JIT

def time_numba():
    p = points_numba(a, b, N)

numba_time = timeit.timeit(time_numba, number=100000)

print(f"Python time: {python_time:.2f}s")
print(f"Numba time:  {numba_time:.2f}s")
print(f"\tSpeedup: {python_time / numba_time:.2f}x")

Python time: 1.78s
Numba time:  0.10s
    Speedup: 18.05x

Tightening the algorithm

A ~24× speedup for the cost of an import and a decorator is a good trade. But notice that binomial_coefficient_nb is called separately for each term in the sum, recomputing most of the multiplications every time. Because we never left Python, fixing this is a small Python rewrite. We can walk the binomial row iteratively, using the recurrence \(\binom{M}{i+1} = \binom{M}{i} \cdot \frac{M-i}{i+1}\), and replace the final 2.0 ** total_games division with math.ldexp, which scales a float by an exact power of two in one instruction:

import math


@njit(cache=True)
def points_numba_fast(a, b, N):
    total_needed = N - a
    total_games = total_needed + (N - b) - 1

    c = 1.0
    acc = 0.0
    for i in range(total_games + 1):
        if i >= total_needed:
            acc += c
        c = c * (total_games - i) / (i + 1)

    return acc * math.ldexp(1.0, -total_games)

Same answer, faster:

points_numba_fast(a, b, N)  # warm up

def time_numba_fast():
    p = points_numba_fast(a, b, N)

numba_fast_time = timeit.timeit(time_numba_fast, number=100000)

print(f"Python time:        {python_time:.2f}s")
print(f"Numba time:         {numba_time:.2f}s ({python_time / numba_time:.2f}x)")
print(f"Numba (recurrence): {numba_fast_time:.2f}s ({python_time / numba_fast_time:.2f}x)")

Python time:        1.78s
Numba time:         0.10s (18.05x)
Numba (recurrence): 0.02s (76.37x)

A reminder that the language compiler can only do so much: when you have a tight inner loop, what it’s looping over still matters. Numba plus a small algorithmic rewrite puts us at roughly 77× faster than pure Python, comfortably in the same ballpark as the old C++ version and without ever opening a .cpp file.

For these blog-friendly problem sizes the runtime floor is now dominated not by the kernel itself but by the cost of crossing the Python-to-native boundary on each call. If I were doing serious Monte Carlo with this kernel, the natural next step would be to batch many evaluations inside a single @njit function so that the dispatch overhead is paid once. But that’s a post for another day.