Forms of Parallelism

· 3 min read
Bryan Flood

Moore's Law is dead

We've hit limitations like this before, each time requiring the industry to innovate.
This has given us out-of-order execution, multithreading, SIMD and many other innovations.

One side effect of the failure to maintain consistent die shrinks is that the industry has scaled out instead, bringing parallelism to the forefront:

  • AMD creating 64-core, 128-thread server processors
  • The latest MacBook Pro running an 8-core CPU with 16 threads
  • The latest iPad Pro now able to use all of its 8 ARM cores at once in a fanless design

Yet from the software side very little software makes use of this raw performance.

Even when new toolchains arrive, such as the lower-level graphics APIs Vulkan, Metal and DirectX 12 that allow for better use of CPU cores, adoption is slow.

The reason is usually that a programmer's time is worth more to companies than CPU time.

I'm going to go through all the different forms of parallelism, from parallelism on a single core to distributed compute across multiple machines.
Each topic will get a post as an explainer along with a follow up piece detailing the various ways it can be implemented in actual code.

Today we will be going over SIMD.

SIMD#

What if I told you a single core could work on multiple pieces of data at the same time?

Well it can, using SIMD (Single Instruction, Multiple Data).
Most, if not all, mainstream architectures support some form of SIMD:

  • Intel's AVX-512 on Skylake-X chips, allowing 16 single-precision (32-bit) float operations at once
  • MIPS MSA
  • RISC-V RVV
  • ARM Neon
  • PowerPC AltiVec

There's even support coming to WASM with the 128-bit SIMD proposal for WebAssembly.

SIMD instructions are implemented by special hardware, usually a combination of additional registers and floating point units. This is very similar to how GPUs operate, whereby SMs/Compute Units are made up of several ALUs.

This is why they are good for similar types of problems, like physics and machine learning, just at different scales.

There are many ways of using SIMD in your code.

1. Use inline assembly#

Issues:

  • Not portable
  • Error prone
  • Time consuming
  • Can prevent other optimisations

Advantages:

  • More fine grain control

2. Use Intrinsics#

  • Similar trade-offs to the above, though less error prone, since the compiler handles register allocation and instruction scheduling for you

3. Library Implementation#

  • As of Swift 5 there is a proper SIMD library implementation, based on Apple's simd/simd.h library for C and C++ (Swift had SIMD types before Swift 5 as well)

4. Use OpenMP#

  • Cross platform
  • Easy to understand
  • Decent support

5. Enable auto-vectorisation and cross your fingers and hope your compiler does it for you#

Advantages:

  • Works well for simple cases
  • Doesn't affect code quality / add dependencies

Disadvantages:

  • Nearly impossible to tell if a loop will be vectorised without the help of a tool like Godbolt's Compiler Explorer