In pursuit of peak computational performance

(Or: How fast should my model be running?)

Marshall Ward

NOAA-GFDL

March 1, 2022

Relative Scaling

image

Relative Comparisons

image

Goals

  • Can we talk about absolute model performance?
  • Can we estimate the idealized peak performance?
  • When do CPUs and GPUs provide that performance?

Hardware Peak

\[W_\text{FLOPS} = W_\text{core} \times N_\text{cores} \times f_\text{cycles}\]

Platform            Theory (GF/s)   Observed (GF/s)
x86 (Cascade Lake)  2150.4          2145.3
Nvidia (Volta)      7065.6          7007.3

But are these numbers achievable?

How we compute

Cascade Lake

24 core × 8 (AVX512) × 2 (FMA) × 2 port × 2.8 GHz

Volta

80 SM × 32 FP64 core × 2 (FMA) × 1.38 GHz
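
Multiplying these out reproduces the theoretical peaks above:

\[24 \times 8 \times 2 \times 2 \times 2.8\ \text{GHz} = 2150.4\ \text{GF/s}\]

\[80 \times 32 \times 2 \times 1.38\ \text{GHz} = 7065.6\ \text{GF/s}\]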

Cost of Business

Platform      GF/s (observed)   Year   Price   TDP (W)
Cascade Lake  2145.3            2019   $6432   240
Volta (GPU)   7007.3            2018   $8999   250

Models

Cascade Lake

Intel(R) Xeon(R) Platinum 8274 CPU @ 3.20GHz

Volta

Nvidia Volta GV100 (1.38GHz)

Peak Performance

image

Arithmetic Intensity

Operation                                               FLOPs/byte                                        AI
\(C_{ij} = \sum_{k} A_{ik} B_{kj}\)                     \(\frac{2n^3}{3n^2 \times \texttt{sizeof}(A)}\)   \(\frac{1}{12}n\)
\(\phi_{i} = \phi_{i} + \Delta t\, \mathcal{F}_{i}\)    \(\frac{2n}{3n \times \texttt{sizeof}(\phi)}\)    \(\frac{1}{12}\)

Array computation is proportional to bandwidth
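
With \(\texttt{sizeof} = 8\) bytes in double precision, the table's intensities reduce to:

\[I_\text{matmul} = \frac{2n^3}{3n^2 \times 8} = \frac{n}{12}, \qquad I_\text{update} = \frac{2n}{3n \times 8} = \frac{1}{12}\]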

Traditional CFD

Most solvers combine hyperbolic and elliptic parts, e.g.:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[\nabla^2 p = -\nabla \cdot \left( \mathbf{u} \cdot \nabla \mathbf{u} \right) + \nabla \cdot \mathcal{F}\]

The pressure (Montgomery potential, PV, ...) solve is a (sparse) matrix operation.

Ocean CFD

Hydrostatic balance eliminates the elliptic step:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[\frac{\partial p}{\partial z} = -g \rho\]

\[w \sim \frac{\partial \rho}{\partial t} = ...\]

Many (but not all) climate ocean models are hyperbolic

Performance Bounds

image

\(W_\text{FLOPS} \leq I \times B_\text{mem}\)

How these are produced

void kernel(int i, double a, double b, double *x, double *y);
void dummy(double a, double b, double *x, double *y);

int main() {
    /* ... set n, r_max; allocate and initialize a, b, x, y ... */
    start(timer);
    for (int r = 0; r < r_max; r++) {
        for (int i = 0; i < n; i++)
            kernel(i, a, b, x, y);

        // An impossible branch to block loop interchange
        if (y[0] < 0.) dummy(a, b, x, y);
    }
    stop(timer);
}

void kernel(int i, double a, double b, double *x, double *y) {
    y[i] = a * x[i] + y[i];
}

Apply an artificially high iteration count, but let the compiler construct the kernel.

The Roofline Model

\[W = \text{min}(W_\text{max}, I \times B_\text{mem})\]

  • \(W\): Performance (FLOP/s)
  • \(W_\text{max}\): Theoretical peak (FLOP/s)
  • \(I\): Arithmetic Intensity (FLOP/byte)
  • \(B_\text{mem}\): Memory bandwidth (byte/s) (cache, RAM, ...)

For our codes, often \(W = I \times B_\text{mem}\)
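
For example, taking the array-update intensity \(I = \frac{1}{12}\) and the x86 RAM bandwidth computed below (141 GB/s):

\[W = \min\left(2150.4,\ \tfrac{1}{12} \times 141\right)\ \text{GF/s} \approx 11.7\ \text{GF/s}\]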

Unaccounted Factors

  • Pipelining
  • Instruction latency
  • Instruction decoder
  • Variable clock frequency
  • Depletion of registers

Peak Bandwidth

image

Platform      BW (GB/s)
x86 (cache)   8600
x86 (RAM)     140
Volta (RAM)   900

Cache BW is across all cores!

Computing Bandwidth

Platform    Speed (GHz)   Width (B)   BW (GB/s)
x86 cache   2.8*          128*        8602
x86 RAM     2.933         8 × 6       141
Volta RAM   0.877         1024        898

* AVX512 clock speed; 128 B = L1 load + store width per core
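
Spelling out the rows (the cache figure sums over all 24 cores):

\[24 \times 128\ \text{B} \times 2.8\ \text{GHz} = 8602\ \text{GB/s}\]

\[6 \times 8\ \text{B} \times 2.933\ \text{GHz} = 141\ \text{GB/s}\]

\[1024\ \text{B} \times 0.877\ \text{GHz} = 898\ \text{GB/s}\]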

x86 Cache Bandwidth

image

Three 64 B ports with AGUs, but only two AGUs are usable per cycle

Euler Step

\[\phi^{n+1}_i = \phi^n_i + \Delta t \left( \frac{\partial \phi}{\partial t} \right)_i\]

y[i] = y[i] + a * x[i]

2 FLOPs per 3×8 bytes: \(I = \frac{2}{3 \times 8} = \frac{1}{12}\)

Peak Euler

Platform     Observed (GF/s)   Theory (GF/s)   %
x86 (Cache)  715.1             716.8           99.7
x86 (RAM)    9.5               11.7            81.2
GPU          67.3              74.8            90.0

RAM efficiencies of 80-90% are common

Finite Difference

\[\phi_{i+\tfrac{1}{2}} = \frac{1}{2} \left( \phi_i + \phi_{i+1} \right) \qquad I = \frac{2}{2\times 8} = \frac{1}{8}\]

y[i] = a * (x[i] + x[i+1])

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.3          781.4             72.7
x86 (RAM)    17.6            9.4               53.4
GPU          112.3           72.4              64.5

Why do observations fall so far short of the estimates?

Write-Allocate Effect

Unlike the Euler step, the target array does not appear on the RHS:

y[i] = a * x[i] + a * x[i+1]

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.3          781.4             72.7
x86 (RAM)    11.7            9.4               80.1
GPU          74.8            72.4              96.8

For RAM writes, one implicit read: \(I_\text{RAM} = \frac{1}{12}\)
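
That is, the store of y[i] first fills its cache line from RAM, so each point moves three words for two FLOPs:

\[I_\text{RAM} = \frac{2}{3 \times 8} = \frac{1}{12}\]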

W-A Roofline

image

Diffusion Solver

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

y[i] = a * x[i-1] + b * x[i] + a * x[i+1]

x[i] = y[i]

4 FLOPs (amortized) per 4×8 bytes: \(I = \frac{1}{8}\)

RAM-bound: \(I_\text{RAM} = \frac{1}{10}\)

How many FLOPs?

Is it four or five operations?

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

y[i] = x[i] + a * (x[i-1] - 2 * x[i] + x[i+1])

\[\phi^{n+1}_i = \left( 1 - 2 \frac{\Delta t}{\Delta x^2} \right) \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} + \phi^n_{i-1} \right)\]

y[i] = a * x[i] + b * (x[i-1] + x[i+1])

Do we follow parentheses? Ask the compiler!
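
A sketch of the two forms as the compiler sees them (hypothetical helper functions; in step5, a = Δt/Δx², while step4 precomputes a = 1 - 2Δt/Δx² and b = Δt/Δx²):

void step5(int n, double a, const double *x, double *y) {
    // 2 mul + 3 add per point if parentheses are honored
    for (int i = 1; i < n - 1; i++)
        y[i] = x[i] + a * (x[i-1] - 2. * x[i] + x[i+1]);
}

void step4(int n, double a, double b, const double *x, double *y) {
    // 2 mul + 2 add per point
    for (int i = 1; i < n - 1; i++)
        y[i] = a * x[i] + b * (x[i-1] + x[i+1]);
}

Whether these contract further into FMAs is again the compiler's choice.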

Aligned update

Platform     +1 (GF/s)   +8 (GF/s)   Theory (GF/s)
x86 (Cache)  499.9       690.7       1075.2
x86 (RAM)    10.8        10.8        17.6
GPU          81.1        81.1        112.3

Single-element displacement incurs a performance penalty (sketched below)
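
A hypothetical reconstruction of the two columns, reading "+1" and "+8" as the stencil displacement d below (d = 8 spans one full AVX512 vector of doubles, so vector loads stay aligned with the store):

void update(int n, int d, double a, double b, const double *x, double *y) {
    // d = 1: single-element displacement; d = 8: one-vector displacement
    for (int i = d; i < n - d; i++)
        y[i] = a * x[i] + b * (x[i-d] + x[i+d]);
}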

Diffusion Performance

Platform     Theory (GF/s)   Observed (GF/s)   %
x86 (Cache)  1075.2          690.7             64.2
x86 (RAM)    14.1            10.8              76.7
GPU          89.9            81.1              90.2

Significant drop in cache performance

Staging Penalty

Stage                                        Observed     Theory        %
S1: \(y \leftarrow x + \alpha \nabla^2 x\)   1085 GF/s    1434 GF/s     75.7
S2: \(x \leftarrow y\)                       9145 GB/s    9216* GB/s    99.2
S1 + S2                                      691 GF/s     1075 GF/s     64.2

Restarting on a new array incurs a performance penalty

NOTE: L1⇋RAM transfer obscures this cost!

Theoretical values

Why 9216 GB/s?

  • AVX512 copy was clocked at 3.0 GHz, slightly above AVX512 arithmetic at 2.8 GHz.
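
That is, rerunning the cache-bandwidth computation at the copy clock:

\[24 \times 128\ \text{B} \times 3.0\ \text{GHz} = 9216\ \text{GB/s}\]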

MOM6 Benchmark

image

  • 192 × 128 grid, 75 levels
    • 32 × 32 per MPI rank
    • ~1.8M points / field
  • 288 steps (3 days, \(\Delta t = 900\) s)
  • "Benchmark" configuration:
    • Split barotropic
    • Thermodynamic EOS
    • Parameterization suite
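
As a check on the bullet arithmetic: \(192 \times 128 \times 75 \approx 1.8\text{M}\) points per field, and \(288 \times 900\ \text{s} = 3\) days.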

MOM6 Performance

image

High FLOPs don't necessarily indicate efficiency

Observed AI

Can we measure the subroutine intensity?

  • FLOP count is reasonably accurate
  • We cannot count bytes, but can count transfers

Estimate the bytes moved from instruction counts and the vectorization ratio:

\[I \approx \frac{N_\text{FLOPs}} {8 \times \left( N_\text{iLoad} + N_\text{iStore} \right) \times \alpha_\text{vec}}\]

where \(\alpha_\text{vec} = N_\text{FLOPs} / N_\text{iFLOPs}\)
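
A minimal sketch of this estimate in code, assuming the four counter totals have already been gathered (e.g. from hardware performance counters; the function and argument names here are illustrative):

double arithmetic_intensity(double n_flops,   /* total FLOPs */
                            double n_iflops,  /* FP instructions retired */
                            double n_iload,   /* load instructions retired */
                            double n_istore)  /* store instructions retired */
{
    /* Mean FLOPs per FP instruction, i.e. the vectorization factor */
    double alpha_vec = n_flops / n_iflops;

    /* Assume each memory instruction moves 8 * alpha_vec bytes */
    double bytes = 8.0 * (n_iload + n_istore) * alpha_vec;

    return n_flops / bytes;
}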

MOM6 Roofline

image

MOM6: All subroutines

image

MOM6: 16x Domain

image

Recent Speedups

Subroutine   Runtime           Speedup
hor_visc     1.69 s → 1.01 s   1.67x
corAdCalc    1.11 s → 0.64 s   1.74x

Highest points correspond to recent optimization efforts

(NOTE: Timings are from a different config)

Procedure

  • Compute \(I\) for operation
  • If targeting CPUs:
    • Seek optimal length, distribute across cores
  • If targeting GPUs:
    • Push for very large domains
  • Evaluate performance, assess efficiency
    • Seek improvements ("move up") if possible

Summary

  • Arithmetic intensity is a strong indicator of bandwidth-limited performance
  • CPUs benefit from modest array sizes (~1M points)
  • GPUs favor very large arrays and implicit solvers
  • Roofline analysis enables performance prediction and targeted optimization

Further Work

  • Build up a "calculus" of \(W_\text{FLOPS}\) estimation
  • Develop confidence in estimating model bandwidth

Future Hardware

  • Nvidia Hopper/Ampere have even greater bandwidth (>2-3 TB/s?)
  • Fujitsu's A64FX: combine a cache-based CPU with HBM memory?