(Or: How fast *should* my model be running?)

Marshall Ward

NOAA-GFDL

March 1, 2022

- Can we talk about *absolute* model performance?
- Can we estimate the idealized *peak* performance?
- When do CPUs and GPUs provide that performance?

\[W_\text{FLOPS} = W_\text{core} \times N_\text{cores} \times f_\text{cycles}\]

Platform | Theory (GF/s) | Observed |
---|---|---|
x86 (Cascade Lake) | 2150.4 | 2145.3 |
Nvidia (Volta) | 7065.6 | 7007.3 |

But are these numbers achievable?

- Cascade Lake
24 core × 8 (AVX512) × 2 (FMA) × 2 port × 2.8 GHz

- Volta
80 SM × 32 FP64 core × 2 (FMA) × 1.38 GHz
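
Multiplying out these factors reproduces the theoretical peaks quoted above:

\[W_\text{x86} = 24 \times 8 \times 2 \times 2 \times 2.8\ \text{GHz} = 2150.4\ \text{GF/s}\]

\[W_\text{Volta} = 80 \times 32 \times 2 \times 1.38\ \text{GHz} = 7065.6\ \text{GF/s}\]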

Platform | GF/s | Year | Price | TDP (W) |
---|---|---|---|---|
Cascade Lake | 2145.3 | 2019 | $6432 | 240 |
Volta (GPU) | 7007.3 | 2018 | $8999 | 250 |

- Cascade Lake
Intel(R) Xeon(R) Platinum 8274 CPU @ 3.20GHz

- Volta
Nvidia Volta GV100 (1.38GHz)

Operation | FLOPs/byte | AI |
---|---|---|
\(\text{C}_{ij} = \sum_{k} A_{ik} B_{kj}\) | \(\frac{2n^3}{3n^2 \times \texttt{sizeof}(A)}\) | \(\frac{1}{12}n\) |
\(\phi_{i} = \phi_{i} + \Delta t \ \mathcal{F}_{i}\) | \(\frac{2n}{3n \times \texttt{sizeof} (\phi)}\) | \(\frac{1}{12}\) |

Array computation is proportional to bandwidth
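
As a minimal sketch of the distinction (assuming 8-byte doubles and an illustrative problem size `n`): matrix multiplication gains intensity with problem size, while the pointwise update does not.

```
#include <stdio.h>

int main(void) {
    double n = 1000.;    /* illustrative problem size */

    /* Matrix multiply: 2n^3 FLOPs over 3n^2 x 8 bytes -> n/12, grows with n */
    double I_matmul = 2. * n * n * n / (3. * n * n * 8.);

    /* Euler update: 2n FLOPs over 3n x 8 bytes -> 1/12, independent of n */
    double I_euler = 2. * n / (3. * n * 8.);

    printf("matmul: %g FLOP/byte, update: %g FLOP/byte\n", I_matmul, I_euler);
    return 0;
}
```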

Most solvers are hyperbolic and elliptic, e.g.:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[\nabla^2 p = -\nabla \cdot \left( \mathbf{u} \cdot \nabla \mathbf{u} \right) - \nabla \cdot \mathcal{F}\]

Pressure (Montgomery potential, PV, ...) solver is a (sparse) matrix operation.

Hydrostatic balance eliminates elliptic step:

\[\frac{\partial \mathbf{u}}{\partial t} = - \mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \mathcal{F}\]

\[p = -g \rho\]

\[w \sim \frac{\partial \rho}{\partial t} = ...\]

Many (but not all) climate ocean models are hyperbolic

\(W_\text{FLOPS} \leq I \times B_\text{mem}\)

```
int main() {
    /* ... set up a, b, x, y, n, r_max and the timer ... */
    start(timer);
    for (int r = 0; r < r_max; r++) {
        for (int i = 0; i < n; i++)
            kernel(i, a, b, x, y);
        // An impossible branch to block loop interchange
        if (y[0] < 0.) dummy(a, b, x, y);
    }
    stop(timer);
}

void kernel(int i, double a, double b, double *x, double *y) {
    y[i] = a * x[i] + y[i];
}
```

Apply an artificially high iteration count, but let the compiler construct the kernel.
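
One way to reduce the timing to a rate (a sketch; the function name and arguments are illustrative, not part of the benchmark above). The kernel performs one multiply and one add per element:

```
/* Sketch: 2 FLOPs (multiply + add) per element, repeated r_max times. */
double flop_rate(int n, int r_max, double seconds) {
    return 2.0 * (double) n * (double) r_max / seconds;   /* FLOP/s */
}
```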

\[W = \text{min}(W_\text{max}, I \times B_\text{mem})\]

- \(W\): Performance (FLOP/s)
- \(W_\text{max}\): Theoretical peak (FLOP/s)
- \(I\): Arithmetic Intensity (FLOP/byte)
- \(B_\text{mem}\): Memory bandwidth (byte/s) (cache, RAM, ...)

For our codes, often \(W = I \times B_\text{mem}\)
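
As a one-line sketch in C (the function name is illustrative):

```
#include <math.h>

/* Roofline bound: capped by either the peak FLOP rate or the
 * required memory traffic. */
double roofline(double w_max, double intensity, double bandwidth) {
    return fmin(w_max, intensity * bandwidth);   /* FLOP/s */
}
```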

- Pipelining
- Instruction latency
- Instruction decoder
- Variable clock frequency
- Depletion of registers

Platform | BW (GB/s) |
---|---|
x86 (cache) | 8600 |
x86 (RAM) | 140 |
Volta (RAM) | 900 |

Cache BW is across *all* cores!

Platform | Speed (GHz) | Width (B) | BW (GB/s) |
---|---|---|---|
x86 cache | 2.8* | 128* | 8602 |
x86 RAM | 2.933 | 8 × 6 | 141 |
Volta RAM | 0.877 | 1024 | 898 |

- \* AVX512 clock speed, L1 load+store width, etc.

Three 64 B ports with AGUs, but only two AGUs are usable.
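
Multiplying these out recovers the bandwidths above (24 cores for the cache figure):

\[B_\text{cache} = 24 \times 128\ \text{B} \times 2.8\ \text{GHz} \approx 8602\ \text{GB/s}\]

\[B_\text{RAM} = 2.933\ \text{GHz} \times 8\ \text{B} \times 6 \approx 141\ \text{GB/s}\]

\[B_\text{Volta} = 0.877\ \text{GHz} \times 1024\ \text{B} \approx 898\ \text{GB/s}\]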

\[\phi^{n+1}_i = \phi^n_i + \Delta t \left( \frac{\partial \phi}{\partial t} \right)_i \qquad I = \frac{2}{3\times 8} = \frac{1}{12}\]

`y[i] = y[i] + a * x[i]`
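
With \(I = \frac{1}{12}\), the bandwidths above give the theoretical peaks in the table below:

\[W_\text{cache} = \tfrac{1}{12} \times 8602 \approx 716.8\ \text{GF/s} \qquad W_\text{RAM} = \tfrac{1}{12} \times 141 \approx 11.7\ \text{GF/s} \qquad W_\text{GPU} = \tfrac{1}{12} \times 898 \approx 74.8\ \text{GF/s}\]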

Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 716.8 | 715.1 | 99.7 |
x86 (RAM) | 11.7 | 9.5 | 81.2 |
GPU | 74.8 | 67.3 | 90.0 |

RAM efficiencies of 80-90% are common


\[\phi_{i+\tfrac{1}{2}} = \frac{1}{2} \left( \phi_i + \phi_{i+1} \right) \qquad I = \frac{2}{2\times 8} = \frac{1}{8}\]

`y[i] = a * (x[i] + x[i+1])`
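
With \(I = \frac{1}{8}\), the same bandwidths predict:

\[W_\text{cache} = \tfrac{1}{8} \times 8602 \approx 1075.3\ \text{GF/s} \qquad W_\text{RAM} = \tfrac{1}{8} \times 141 \approx 17.6\ \text{GF/s} \qquad W_\text{GPU} = \tfrac{1}{8} \times 898 \approx 112.3\ \text{GF/s}\]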

Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.3 | 781.4 | 72.7 |
x86 (RAM) | 17.6 | 9.4 | 53.4 |
GPU | 112.3 | 72.4 | 64.5 |

Why are estimates so much lower?

Unlike Euler step, target is not on RHS:

`y[i] = a * x[i] + a * x[i+1]`

Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.3 | 781.4 | 72.7 |
x86 (RAM) | 11.7 | 9.4 | 80.1 |
GPU | 74.8 | 72.4 | 96.8 |

For RAM writes, one implicit read: \(I_\text{RAM} = \frac{1}{12}\)
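
That is, the store to `y` first reads its cache line from RAM (write-allocate), so the RAM traffic is three 8-byte transfers per point rather than two:

\[I_\text{RAM} = \frac{2}{3 \times 8} = \frac{1}{12} \quad\Rightarrow\quad W_\text{RAM} \approx \tfrac{1}{12} \times 141 \approx 11.7\ \text{GF/s}\]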

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

```
y[i] = a * x[i-1] + b * x[i] + a * x[i+1]
x[i] = y[i]
```

4 FLOPs (amortized) per 4×8 bytes: \(I = \frac{1}{8}\)

RAM-bound: \(I_\text{RAM} = \frac{1}{10}\)

Is it four or five operations?

\[\phi^{n+1}_i = \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} - 2 \phi^n_i + \phi^n_{i-1} \right)\]

`y[i] = x[i] + a * (x[i-1] - 2 * x[i] + x[i+1])`

\[\phi^{n+1}_i = \left( 1 - 2 \frac{\Delta t}{\Delta x^2} \right) \phi^n_i + \frac{\Delta t}{\Delta x^2} \left(\phi^n_{i+1} + \phi^n_{i-1} \right)\]

`y[i] = a * x[i] + b * (x[i-1] + x[i+1])`

Do we follow parentheses? Ask the compiler!
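
A sketch of the two forms as standalone kernels (function names and loop bounds are illustrative); compilers generally will not rewrite one into the other unless floating-point re-association is allowed (e.g. `-ffast-math`), so inspecting the generated code is the reliable check:

```
/* a = dt/dx^2: 5 FLOPs per point if the parentheses are honored */
void diffuse_paren(int n, double a, const double *x, double *y) {
    for (int i = 1; i < n - 1; i++)
        y[i] = x[i] + a * (x[i-1] - 2. * x[i] + x[i+1]);
}

/* a = 1 - 2 dt/dx^2, b = dt/dx^2: 4 FLOPs per point */
void diffuse_factored(int n, double a, double b, const double *x, double *y) {
    for (int i = 1; i < n - 1; i++)
        y[i] = a * x[i] + b * (x[i-1] + x[i+1]);
}
```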

Platform | +1 (GF/s) | +8 (GF/s) | Theory (GF/s) |
---|---|---|---|
x86 (Cache) | 499.9 | 690.7 | 1075.2 |
x86 (RAM) | 10.8 | 10.8 | 17.6 |
GPU | 81.1 | 81.1 | 112.3 |

Single-displacement has a performance penalty

Platform | Theory (GF/s) | Observed | % |
---|---|---|---|
x86 (Cache) | 1075.2 | 690.7 | 64.2 |
x86 (RAM) | 14.1 | 10.8 | 76.7 |
GPU | 89.9 | 81.1 | 90.2 |

Significant drop in cache performance

Stage | Observed | Theory | % |
---|---|---|---|
\(y \leftarrow x + \alpha \nabla^2 x\) | 1085 GF/s | 1434 GF/s | 75.7 |
\(x \leftarrow y\) | 9145 GB/s | 9216* GB/s | 99.2 |
S1 + S2 | 691 GF/s | 1075 GF/s | 64.2 |

Restarting on a new array incurs a performance penalty

**NOTE**: L1⇋RAM transfer obscures this cost!

Why 9216 GB/s?

- AVX512 copy was clocked at 3.0 GHz, slightly above AVX512 arithmetic at 2.8 GHz.
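
That is, the copy sustains the full L1 load+store width on all 24 cores at the higher clock:

\[B = 24 \times 128\ \text{B} \times 3.0\ \text{GHz} = 9216\ \text{GB/s}\]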

- 192 × 128 grid, 75 levels
- 32 × 32 points per MPI rank
- ~1.8M points per field

- 288 steps (3 days, \(\Delta t = 900\) s)
- "Benchmark" configuration:
  - Split barotropic
  - Thermodynamic EOS
  - Parameterization suite

High FLOPs don't necessarily indicate efficiency

Can we measure the subroutine intensity?

- FLOP count is reasonably accurate
- We *cannot* count bytes, but we can count *transfers*

Estimate bandwidth by ratio of instructions.

\[I \approx \frac{N_\text{FLOPs}} {8 \times \left( N_\text{iLoad} + N_\text{iStore} \right) \times \alpha_\text{vec}}\]

where \(\alpha_\text{vec} = N_\text{FLOPs} / N_\text{iFLOPs}\)
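
A minimal sketch of this estimate as a function of raw counter values (the names are illustrative, not specific hardware events); each load/store instruction is assumed to move 8 bytes per vector lane, with the lane count inferred from \(\alpha_\text{vec}\):

```
/* Sketch: estimate arithmetic intensity from instruction counts. */
double intensity(double n_flops, double n_iflops,
                 double n_iload, double n_istore) {
    double alpha_vec = n_flops / n_iflops;   /* FLOPs per FP instruction */
    return n_flops / (8. * (n_iload + n_istore) * alpha_vec);
}
```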

Subroutine | Runtime | Speedup |
---|---|---|
`hor_visc` | 1.69s → 1.01s | 1.67x |
`corAdCalc` | 1.11s → 0.64s | 1.74x |

Highest points correspond to recent optimization efforts

(NOTE: Timings are from a different config)

- Compute \(I\) for the operation
- If targeting CPUs:
  - Seek optimal length, distribute across cores
- If targeting GPUs:
  - Push for very large domains
- Evaluate performance, assess efficiency
- Seek improvements ("move up") if possible

- Arithmetic intensity is a strong indicator of bandwidth-limited performance
- CPUs benefit from modest array sizes (~1M points)
- GPUs favor very large arrays and implicit solvers
- Roofline analysis enables performance prediction and targeted optimization

Further Work

- Build up a "calculus" of \(W_\text{FLOPS}\) estimation
- Develop confidence in estimating model bandwidth

- Nvidia Hopper/Ampere have even greater bandwidth (2-3 TB/s?)
- Fujitsu's A64FX: combine a cache-based CPU with HBM memory?