In pursuit of peak computational performance
(Or: How fast should my model be running?)
Marshall Ward
NOAA-GFDL
March 1, 2022
Relative Scaling
![image]()
Relative Comparisons
![image]()
Goals
- Can we talk about absolute model performance?
- Can we estimate the idealized peak performance?
- When do CPUs and GPUs provide that performance?
Hardware Peak
\[W_\text{FLOPS} = W_\text{core} \times N_\text{cores} \times f_\text{cycles}\]
| Platform | Theory (GF/s) | Observed (GF/s) |
|---|---|---|
| x86 (Cascade Lake) | 2150.4 | 2145.3 |
| Nvidia (Volta) | 7065.6 | 7007.3 |
But are these numbers achievable?
How we compute
- Cascade Lake
  24 cores × 8 doubles (AVX512) × 2 (FMA) × 2 ports × 2.8 GHz
- Volta
  80 SMs × 32 FP64 cores × 2 (FMA) × 1.38 GHz
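A minimal sketch of the same arithmetic, as a sanity check (the factors are the ones quoted above):

#include <stdio.h>

int main(void) {
    /* Cascade Lake: 24 cores x 8 doubles (AVX512) x 2 (FMA) x 2 ports x 2.8 GHz */
    double xeon_gflops = 24 * 8 * 2 * 2 * 2.8;
    /* Volta: 80 SMs x 32 FP64 cores x 2 (FMA) x 1.38 GHz */
    double volta_gflops = 80 * 32 * 2 * 1.38;

    printf("Cascade Lake: %7.1f GFLOP/s\n", xeon_gflops);   /* 2150.4 */
    printf("Volta:        %7.1f GFLOP/s\n", volta_gflops);  /* 7065.6 */
    return 0;
}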
Cost of Business
| Platform | GF/s | Released | Price (USD) | TDP (W) |
|---|---|---|---|---|
| Cascade Lake | 2145.3 | 2019 | $6432 | 240 |
| Volta (GPU) | 7007.3 | 2018 | $8999 | 250 |
Models
- Cascade Lake
Intel(R) Xeon(R) Platinum 8274 CPU @ 3.20GHz
- Volta
Nvidia Volta GV100 (1.38GHz)
Arithmetic Intensity
| Operation | FLOPs/byte | AI |
|---|---|---|
| \(\text{C}_{ij} = \sum_{k} A_{ik} B_{kj}\) | \(\frac{2n^3}{3n^2 \times \texttt{sizeof}(A)}\) | \(\frac{1}{12}n\) |
| \(\phi_{i} = \phi_{i} + \Delta t \ \mathcal{F}_{i}\) | \(\frac{2n}{3n \times \texttt{sizeof}(\phi)}\) | \(\frac{1}{12}\) |
Array computation rates are proportional to memory bandwidth
Traditional CFD
Most solvers are hyperbolic and elliptic, e.g.:
\[\frac{\partial \mathbf{u}}{\partial t} =
- \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]
\[\nabla^2 p = -\nabla \cdot \left( \mathbf{u} \cdot \nabla \mathbf{u} \right)
- \nabla \cdot \mathcal{F}\]
Pressure (Montgomery potential, PV, ...) solver is a (sparse) matrix operation.
Ocean CFD
Hydrostatic balance eliminates elliptic step:
\[\frac{\partial \mathbf{u}}{\partial t} =
- \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]
\[\frac{\partial p}{\partial z} = -g \rho\]
\[w \sim \frac{\partial \rho}{\partial t} = ...\]
Many (but not all) climate ocean models are hyperbolic
How these are produced
int main() {
  /* ... */
  start(timer);
  for (int r = 0; r < r_max; r++) {
    for (int i = 0; i < n; i++)
      kernel(i, a, b, x, y);

    // An unreachable branch to prevent loop interchange
    if (y[0] < 0.) dummy(a, b, x, y);
  }
  stop(timer);
}

void kernel(int i, double a, double b, double *x, double *y) {
  y[i] = a * x[i] + y[i];
}
Apply an artificially high iteration count, but let the compiler construct the kernel.
The Roofline Model
\[W = \text{min}(W_\text{max}, I \times B_\text{mem})\]
- \(W\): Performance (FLOP/s)
- \(W_\text{max}\): Theoretical peak (FLOP/s)
- \(I\): Arithmetic Intensity (FLOP/byte)
- \(B_\text{mem}\): Memory bandwidth (byte/s) (cache, RAM, ...)
For our codes, often \(W = I \times B_\text{mem}\)
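As a one-line helper (a sketch; the function name is illustrative):

/* Attainable rate under the roofline model: min(W_max, I * B_mem) */
double roofline(double w_max, double intensity, double bw_mem) {
    double w_bw = intensity * bw_mem;
    return (w_bw < w_max) ? w_bw : w_max;
}

For example, roofline(2145.3, 1./12, 140.) gives roughly 11.7 GF/s, the RAM-bound Euler-step estimate that appears later.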
Unaccounted Factors
- Pipelining
- Instruction latency
- Instruction decoder
- Variable clock frequency
- Depletion of registers
Peak Bandwidth
![image]()
| Memory | BW (GB/s) |
|---|---|
| x86 (cache) | 8600 |
| x86 (RAM) | 140 |
| Volta (RAM) | 900 |
Cache BW is across all cores!
Computing Bandwidth
| Memory | Clock | Width (B/cycle) | BW (GB/s) |
|---|---|---|---|
| x86 cache | 2.8 GHz* | 128* | 8602 |
| x86 RAM | 2.933 GT/s | 8 × 6 | 141 |
| Volta RAM | 877 MHz | 1024 | 898 |

\* AVX512 clock speed; 128 B/cycle of L1 load+store per core, summed over all 24 cores
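The same products, written out as a short sketch (values as quoted above):

#include <stdio.h>

int main(void) {
    double cache_bw = 24 * 128 * 2.8;  /* 24 cores x 128 B/cycle x 2.8 GHz = 8601.6 GB/s */
    double ram_bw = 2.933 * 8 * 6;     /* 2.933 GT/s x 8 B x 6 channels    =  140.8 GB/s */
    double hbm_bw = 0.877 * 1024;      /* 877 MHz x 1024 B/cycle (HBM2)    =  898.0 GB/s */

    printf("x86 cache: %.0f GB/s, x86 RAM: %.0f GB/s, Volta RAM: %.0f GB/s\n",
           cache_bw, ram_bw, hbm_bw);
    return 0;
}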
x86 Cache Bandwidth
![image]()
Three 64 B load/store ports, but only two AGUs are usable per cycle
Euler Step
\[\phi^{n+1}_i = \phi^n_i
+ \Delta t \left( \frac{\partial \phi}{\partial t} \right)_i
\qquad I = \frac{2}{3\times 8} = \frac{1}{12}\]
y[i] = y[i] + a * x[i]
| Platform | Theory (GF/s) | Observed (GF/s) | Efficiency (%) |
|---|---|---|---|
| x86 (Cache) | 716.8 | 715.1 | 99.7 |
| x86 (RAM) | 11.7 | 9.5 | 80.8 |
| GPU | 74.8 | 67.3 | 90.0 |
RAM efficiencies of 80-90% are common
Finite Difference
\[\phi_{i+\tfrac{1}{2}} = \frac{1}{2} \left( \phi_i + \phi_{i+1} \right)
\qquad I = \frac{2}{2\times 8} = \frac{1}{8}\]
y[i] = a * (x[i] + x[i+1])
| Platform | Theory (GF/s) | Observed (GF/s) | Efficiency (%) |
|---|---|---|---|
| x86 (Cache) | 1075.3 | 781.4 | 72.7 |
| x86 (RAM) | 17.6 | 9.4 | 53.4 |
| GPU | 112.3 | 72.4 | 64.5 |
Why is observed performance so much lower than these estimates?
Write-Allocate Effect
Unlike the Euler step, the target \(y\) does not appear on the RHS:
y[i] = a * x[i] + a * x[i+1]
| Platform | Theory (GF/s) | Observed (GF/s) | Efficiency (%) |
|---|---|---|---|
| x86 (Cache) | 1075.3 | 781.4 | 72.7 |
| x86 (RAM) | 11.7 | 9.4 | 80.1 |
| GPU | 74.8 | 72.4 | 96.8 |
Each RAM write incurs one implicit read, so \(I_\text{RAM} = \frac{1}{12}\)
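Counting RAM traffic per point for the finite-difference average (2 FLOPs; one amortized load of \(x\), one store of \(y\), plus the implicit write-allocate load of \(y\)):
\[I_\text{RAM} = \frac{2}{\left(1 + 1 + 1\right) \times 8} = \frac{1}{12}\]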
W-A Roofline
![image]()
Diffusion Solver
\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2}
\left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]
y[i] = a * x[i-1] + b * x[i] + a * x[i+1]
x[i] = y[i]
4 FLOPs (amortized) per 4×8 bytes: \(I = \frac{1}{8}\)
RAM-bound: \(I_\text{RAM} = \frac{1}{10}\)
How many FLOPs?
Is it four or five operations?
\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2}
\left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]
y[i] = x[i] + a * (x[i-1] - 2 * x[i] + x[i+1])
\[\phi^{n+1}_i = \left( 1 - 2 \frac{\Delta t}{\Delta x^2} \right) \phi^n_i
+ \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} + \phi^n_{i-1} \right)\]
y[i] = a * x[i] + b * (x[i-1] + x[i+1])
Do we follow parentheses? Ask the compiler!
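One way to ask: write the two forms as separate kernels and inspect the generated code (e.g. compile with -O3 -S and count the FMAs). This is only a sketch; the function names and restrict qualifiers are illustrative.

void diffuse_paren(int n, double a, const double *restrict x, double *restrict y) {
    /* y[i] = x[i] + a*(x[i-1] - 2*x[i] + x[i+1]): nominally five operations */
    for (int i = 1; i < n - 1; i++)
        y[i] = x[i] + a * (x[i-1] - 2. * x[i] + x[i+1]);
}

void diffuse_factored(int n, double a, double b, const double *restrict x, double *restrict y) {
    /* y[i] = a*x[i] + b*(x[i-1] + x[i+1]): nominally four operations */
    for (int i = 1; i < n - 1; i++)
        y[i] = a * x[i] + b * (x[i-1] + x[i+1]);
}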
Aligned update
| x86 (Cache) | 499.9 | 690.7 | 1075.2 |
| x86 (RAM) | 10.8 | 10.8 | 17.6 |
| GPU | 81.1 | 81.1 | 112.3 |

(GF/s; the final column is the theoretical peak for \(I = \tfrac{1}{8}\))
Single-displacement has a performance penalty
Staging Penalty
| Stage | Observed | Theory | Efficiency (%) |
|---|---|---|---|
| S1: \(y \leftarrow x + \alpha \nabla^2 x\) | 1085 GF/s | 1434 GF/s | 75.7 |
| S2: \(x \leftarrow y\) | 9145 GB/s | 9216* GB/s | 99.2 |
| S1 + S2 | 691 GF/s | 1075 GF/s | 64.2 |
Restarting on a new array incurs a performance penalty
NOTE: L1⇋RAM transfer obscures this cost!
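A sketch of the two stages (S1: stencil update, S2: copy-back); the function name and restrict qualifiers are illustrative.

void diffusion_step(int n, double a, double b, double *restrict x, double *restrict y) {
    /* S1: y <- x + alpha * laplacian(x) */
    for (int i = 1; i < n - 1; i++)
        y[i] = a * x[i-1] + b * x[i] + a * x[i+1];

    /* S2: x <- y */
    for (int i = 1; i < n - 1; i++)
        x[i] = y[i];
}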
Theoretical values
Why 9216 GB/s?
- AVX512 copy was clocked at 3.0 GHz, slightly above AVX512 arithmetic at 2.8 GHz.
MOM6 Benchmark
![image]()
- 192 × 128 grid, 75 levels
- 32 × 32 points per MPI rank
- ~1.8M points / field
- 288 steps (3 days, \(\Delta t = 900\) s)
- "Benchmark" configuration:
  - Split barotropic
  - Thermodynamic EOS
  - Parameterization suite
Observed AI
Can we measure the subroutine intensity?
- FLOP count is reasonably accurate
- We cannot count bytes, but can count transfers
Estimate the bytes moved from the load and store instruction counts:
\[I \approx \frac{N_\text{FLOPs}}
{8 \times \left( N_\text{iLoad} + N_\text{iStore} \right)
\times \alpha_\text{vec}}\]
where \(\alpha_\text{vec} = N_\text{FLOPs} / N_\text{iFLOPs}\) is the mean number of FLOPs per instruction
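As a helper (a sketch; the counter totals would come from hardware performance counters, and the names here are assumptions):

/* I ~= N_FLOPs / (8 * (N_iLoad + N_iStore) * alpha_vec),
 * with alpha_vec = N_FLOPs / N_iFLOPs (mean FLOPs per instruction). */
double observed_intensity(double n_flops, double n_iflops,
                          double n_iload, double n_istore) {
    double alpha_vec = n_flops / n_iflops;
    return n_flops / (8. * (n_iload + n_istore) * alpha_vec);
}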
MOM6 Roofline
![image]()
MOM6: All subroutines
![image]()
MOM6: 16x Domain
![image]()
Recent Speedups
| Routine | Time | Speedup |
|---|---|---|
| hor_visc | 1.69s → 1.01s | 1.67x |
| corAdCalc | 1.11s → 0.64s | 1.74x |
Highest points correspond to recent optimization efforts
(NOTE: Timings are from a different config)
Procedure
- Compute \(I\) for the operation
- If targeting CPUs:
  - Seek the optimal array length, then distribute across cores
- If targeting GPUs:
  - Push for very large domains
- Evaluate performance, assess efficiency
- Seek improvements ("move up") if possible
Summary
- Arithmetic intensity is a strong indicator of bandwidth-limited performance
- CPUs benefit from modest array sizes (~1M points)
- GPUs favor very large arrays and implicit solvers
- Roofline analysis enables performance prediction and targeted optimization
Further Work
- Build up a "calculus" of \(W_\text{FLOP}\) estimation
- Develop confidence in estimating model bandwidth
Future Hardware
- Nvidia Hopper/Ampere offer even greater bandwidth (2-3 TB/s or more?)
- Fujitsu's A64FX: a cache-based CPU combined with HBM memory?