Perhaps everything in our mainstream programming languages is at least 50 years old. Loops, iterators, pointers and references to mutable memory locations appeared in Fortran or ALGOL and now they are part of all mainstream programming languages.
These constructs were designed for developing sequential programs. But today computers have many cores and processors and we want to do more computations in parallel. We bolted some parallel features on our sequential programming languages, but sequentiality remains built into our mindset and our infrastructure.
We need to change our mindset.
Many programming languages manage allocations of storage automatically. Instead of the programmer, the compiler or the run-time system claims and frees memory. The implementation of this automatism is language specific:
- Scope-limited automatic variables (C++)
- Tracing garbage collection (Java, Python, Go, Haskell)
- Ownership tracking (Rust)
Is it be possible to make parallelism, that is allocation of code to processors, automatic?
With automatic memory management we stopped using
automate parallelism we need restrict our programming style and give up
sequential programming language artifacts from the sixties.
Let's take the textbook example of computing \(n!\):
def factorial(n): result = 1 # ① for i in range(1, n + 1): # ② result = result * i # ③ return result
This is code is written today's most common programming style: sequential, imperative, mutable and allows uncontrolled side effects. Every line of this function states what happens during execution:
- Assign an initial value to
- Use the value of
idrawn from the specified range.
resultwith the current value of
Many consider this implementation "readable" and "simple" because we are used
to seeing such programs. But this code contains many accidental aspects: the
result accumulator, the intermediate value
i, and the allusion to
sequential execution. These are unrelated to the original problem statement.
This is a form of complexity caused by control.
Keeping the essential
Let's strip off everything but the essential from the previous implementation
from functools import reduce from operator import mul def factorial(n): return reduce(mul, range(1, n + 1))
This implementation reads almost as a pure specification without any view on the execution. This form is more unusual but simpler than before because all accidental complexity were removed.
As programmers, we give up the control of the execution order and we let the runtime environment choose the most efficient execution strategy. In case of Python the execution would be similar, or perhaps identical to that of the first implementation. Using this declarative style, however, we could imagine a programming system where the actual execution strategy would depend on multiple factors:
- For small
- For large
nsplit the range among multiple processors, then merge the results.
- Push some parts of the computation to the GPU.
- Send the computation to a massively parallel super-computer.
In general, operations cannot be parallelized arbitrarily. To allow for such flexible runtime behavior we must enrich our programs with hints to the compiler or to the runtime environment.
If we recognize and communicate our problem's algebraic properties to the compiler or to the run-time, it can exploit alternate representations and implementations.
Well-known algebraic properties translate to useful hints:
- Associative: grouping doesn't matter
- Commutative: order doesn't matter
- Idempotent: duplicates don't matter
- Identity: the current value doesn't matter
- Zero: other values don't matter
factorial the integer multiplication is associative, therefore performing
the multiplication in groups first, then merging the partial results is a
correct parallel implementation. It is also commutative, so we can do the
merging in any order.
Automatic parallelism in practice
Optimizing compilers generate efficient, often parallel code from a sequential, imperative code. But in general, a program organized according to linear problem decomposition principles is hard to parallelize. Fran Allen won the 2006 Turing-award for her work in program optimization and parallelization.
Dask is a flexible parallel computing library for Python. Its user interface mimics the programming experience of popular data processing libraries. For example, you write regular NumPy array manipulation code and the Dask scheduler distributes the computations across multiple threads or among the nodes of a cluster.
Facebook's Haxl library can automatically execute independent data fetching operations concurrently. Haxl, with the compiler's assistance, recognizes algebraic properties such as applicative functor and monad to generate efficient concurrent code. Simon Marlow's presentation is a great introduction of the ideas behind this tool.
When we code in the imperative style, accidental complexity of control creeps into our programs. The programmer, instead of the expressing problem's essence, is burdened with managing loops, states and the details of a sequential-looking runtime behavior.
Code written in declarative, functional style with no ties to a specific execution model may be automatically parallelized by the underlying system. Algebraic properties constrain which execution strategies are correct and efficient.
I also recommend reading Bartosz Milewski's post on Parallel Programming with Hints