# Basics of Data-Parallelism and Scaling

## The Message Passing Interface

The Message Passing Interface (MPI) is the backbone of most data-parallel high performance and scientific computing applications. MPI enables scientific software developers to write applications in C/C++, Fortran, and Python that can coordinate any number of processes across any number of servers.

In data-parallel applications with MPI, programs are written assuming that individual processes, called MPI Ranks, execute the same set of instructions, but operate on a different, often unique, subset of data. In this approach it is often the case that each process needs to be made occasionally aware of another process's data.
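To make the "same instructions, different data" idea concrete, here is a minimal sketch in plain Python (no MPI required): every rank runs identical code, but uses its rank number to pick out its own slice of the data. The `partition` helper is hypothetical, invented for this illustration; real MPI programs would obtain the rank and process count from the MPI library.

```python
def partition(n_items, n_ranks, rank):
    """Return the (start, stop) slice of n_items owned by this rank.

    Remainder items are spread over the lowest-numbered ranks, so the
    slice sizes differ by at most one item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# Every "rank" executes the same instructions on its own subset of data.
data = list(range(10))
for rank in range(3):
    start, stop = partition(len(data), 3, rank)
    print(rank, data[start:stop])
```

Together, the slices cover the whole data set exactly once, which is the basic contract of a data-parallel decomposition.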

## Data-Parallelism Example

For example, in smooth particle hydrodynamics (SPH), a fluid is described as a large number of individual particles that interact according to the laws of physics. In each iteration of an SPH model, the program calculates, for every particle, how its velocity, temperature, density, and position will change. Each particle's tendency for change depends on the state of nearby particles. Over-simplified pseudo-code for the serial (non-parallel) implementation might look like the following:

```
DO iter = 1, n_iterates
  DO i = 1, n_particles

    # Calculate the particle tendency using the current particle states
    particle_tendency = Calculate_Particle_Tendency(particle_states)

    # Update the particle state with the new particle tendency
    particle_states(i) = Update_Particle_States(particle_tendency)

  END DO
END DO
```

In this pseudo-code, `Calculate_Particle_Tendency` uses all of the current particle states to calculate the tendency for each particle. `Update_Particle_States` then uses the `particle_tendency` for particle `i` to update that particle's state.

In a data-parallel approach, the loop over `n_particles` can be shared across many independent processes. For the sake of illustration, let's say there are 2 processes. Each process then updates `n_particles/2` fluid particles. Ideally, if each process ran independently, we could shorten the amount of time needed to run this simulation by a factor of 2. However, since each rank is no longer aware of all of the particle states, we need to add a step in which each rank shares its particle states with the other. The data-parallel algorithm now looks like this:

```
DO iter = 1, n_iterates

  # This rank shares its particle states with the other MPI ranks,
  # and receives the particle states from the other MPI ranks
  other_particle_states = Exchange_Particle_States(particle_states)

  DO i = 1, n_particles/2

    # Calculate the particle tendency using the current particle states
    particle_tendency = Calculate_Particle_Tendency(particle_states, other_particle_states)

    # Update the particle state with the new particle tendency
    particle_states(i) = Update_Particle_States(particle_tendency)

  END DO
END DO
```

This extra step, `Exchange_Particle_States`, is where all of the MPI calls would live. The exchange adds some overhead to the application: we now expect the application run-time to be half of the serial run-time, plus a little extra for the processes to exchange information.
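To make the exchange step concrete, here is a toy, MPI-free Python sketch with the same structure: two simulated ranks each own half of the particles, swap states, and then update only their own half. The function names mirror the pseudo-code, but the "physics" (each particle drifting toward the global mean) is invented purely for illustration.

```python
def exchange_particle_states(all_local_states, rank):
    """Stand-in for the MPI exchange: collect every other rank's states."""
    return [s for r, states in enumerate(all_local_states) if r != rank
            for s in states]

def calculate_particle_tendency(state, local_states, other_states):
    # Invented rule: each particle drifts toward the mean of all particles.
    all_states = local_states + other_states
    return sum(all_states) / len(all_states) - state

def update_particle_state(state, tendency, dt=0.5):
    return state + dt * tendency

# Rank 0 and rank 1 each own two of the four particles.
local = [[1.0, 2.0], [3.0, 4.0]]
for _ in range(10):
    new_local = []
    for rank in range(2):
        other = exchange_particle_states(local, rank)
        updated = [update_particle_state(s, calculate_particle_tendency(s, local[rank], other))
                   for s in local[rank]]
        new_local.append(updated)
    local = new_local
print(local)
```

With this invented update rule, all particles converge toward the global mean, and each iteration requires the exchange step before either rank can compute its tendencies.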

## Strong and Weak Scaling

Now, what happens when we increase the number of MPI ranks to 3, 4, ..., 1000? And what can we expect the run-time to be when we increase the problem size (by increasing `n_particles`) along with the number of ranks? These questions are alternative phrasings of "How does my application behave under strong scaling?" and "How does my application behave under weak scaling?".

### Strong Scaling

In strong scaling, we are trying to understand how the program's run-time changes under a fixed problem size, while increasing the number of MPI ranks. As we saw in the SPH example, a section of the code is executed in parallel, but we have to exchange data between MPI ranks. As we add more MPI ranks, for a fixed problem size, eventually the parallel section of the code (the particle tendency calculation and updates) will become less expensive than the communication overhead.

One way to think about this scaling question is to consider breaking the total run-time into two components

1. The time spent in MPI communication operations (T_mpi)
2. The time spent executing instructions in parallel (T_exe)

T = T_mpi + T_exe

I'll make the assumption that the execution time in parallel with N MPI ranks is related to the serial run-time (T_serial) as

T_exe = T_serial / N

#### Runtime

To illustrate how this works, I've included a figure that shows the total runtime and the contributions from T_exe and T_mpi. Here, I've assumed that T_mpi increases linearly with the number of MPI ranks.

In this example, when the number of ranks is small, the total runtime decreases fairly rapidly and behaves like T_exe. At some point, however, the MPI communication overhead becomes the dominant cost, the total runtime begins to grow again, and we lose the benefits of parallelization.
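This runtime model is easy to play with numerically. The sketch below encodes the assumptions above (T_exe = T_serial/N and a T_mpi that grows linearly with N); the constants are made up purely for illustration.

```python
def total_runtime(n_ranks, t_serial=100.0, mpi_cost_per_rank=0.5):
    """Model: T = T_mpi + T_exe, with T_exe = T_serial/N and T_mpi = c*N.

    t_serial and mpi_cost_per_rank are invented constants for illustration.
    """
    t_exe = t_serial / n_ranks
    t_mpi = mpi_cost_per_rank * n_ranks
    return t_mpi + t_exe

# Runtime drops at first, then MPI overhead dominates and it grows again.
for n in (1, 2, 8, 14, 64):
    print(n, round(total_runtime(n), 1))
```

With these constants, the minimum falls near N = sqrt(T_serial / c); past that point, adding ranks makes the application slower.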

#### Speedup

The speedup is a measure of how much faster the parallel application runs relative to the serial program, and it is calculated as the ratio of the serial runtime to the parallel runtime:

S = T_serial / T

In a strong scaling study, the speedup maximum tells us, for a given problem size, how many MPI ranks we would want to use to obtain the best possible time-to-solution.

In the most ideal scenario, putting N MPI ranks to work on a problem would result in a speedup of N. In practice, T_mpi increases with N while T_exe decreases with N. Because of these factors, the speedup eventually "tops out" and we reach a point of diminishing returns.
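Under the same assumed model (T = c*N + T_serial/N, with invented constants), we can sweep the rank count to see where the speedup tops out:

```python
def speedup(n_ranks, t_serial=100.0, mpi_cost_per_rank=0.5):
    """S = T_serial / T, with T = c*N + T_serial/N (constants invented)."""
    t_parallel = mpi_cost_per_rank * n_ranks + t_serial / n_ranks
    return t_serial / t_parallel

# Sweep rank counts to find where the speedup is largest.
best_n = max(range(1, 101), key=speedup)
print(best_n, round(speedup(best_n), 2))
```

For these constants the best speedup occurs at 14 ranks; far short of the ideal speedup of N, which is exactly the "topping out" behavior described above.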

### Weak Scaling

In weak scaling, we are trying to understand how the program's runtime changes as the problem size grows, while commensurately increasing the number of MPI ranks. In the SPH example, you can think of increasing `n_particles` by a factor of 2 while also increasing the number of MPI ranks by a factor of 2.

As before, we'll break the total runtime into two components

1. The time spent in MPI communication operations (T_mpi)
2. The time spent executing instructions in parallel (T_exe)

T = T_mpi + T_exe

This time, I'll assume that T_exe is constant with increasing N, while T_mpi increases linearly with N. It's reasonable to assume that T_exe is constant in weak scaling because the problem size and the number of MPI ranks increase together; this implies that the amount of work per MPI rank stays fixed throughout the weak scaling study. The assumption that T_mpi increases linearly with N is purely illustrative; the actual communication cost depends on the application's communication pattern.

#### Scaling efficiency

In an ideal scenario, T_mpi would be constant as the problem size and number of MPI ranks increased. Applications with localized "nearest neighbor" communication patterns tend to exhibit this behavior. Applications with global communication patterns tend to incur more communication overhead with larger problem sizes and more MPI ranks.

In weak scaling studies, the scaling efficiency is the ratio of the serial runtime to the parallel runtime at the larger problem size.

E = T_serial/T

Although this is identical in appearance to the speedup in strong scaling, this metric is interpreted as the scaling efficiency under the weak scaling premise. Qualitatively, if T = T_serial as N increases, we achieve an efficiency of 1 (100% efficiency). As T increases, the efficiency decreases.
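Here is a quick numerical sketch of this efficiency metric under the assumptions above: constant T_exe, a T_mpi that grows linearly with the rank count, and (as an additional assumption) zero MPI cost on a single rank so that E(1) = 1. All constants are invented for illustration.

```python
def weak_scaling_efficiency(n_ranks, t_exe=100.0, mpi_cost_per_rank=0.5):
    """E = T_serial / T, with constant T_exe (fixed work per rank) and
    T_mpi growing linearly with the rank count.

    Constants are invented; T_mpi is taken to be zero for a single rank
    so that the efficiency of the serial run is exactly 1.
    """
    t_serial = t_exe
    t_total = t_exe + mpi_cost_per_rank * (n_ranks - 1)
    return t_serial / t_total

for n in (1, 16, 256):
    print(n, round(weak_scaling_efficiency(n), 3))
```

The efficiency starts at 1 and decays steadily as the communication overhead grows with the rank count.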

## Measuring Speedup and Scaling Efficiency in Practice

These notes are really meant to provide very simple mental models that can help you interpret parallel application run-time measurements. When engaging in a weak or strong scaling study, teams will usually pick a prototype problem (called a benchmark problem) for their application that runs in an hour or less. They will also develop and document a strategy for increasing the number of MPI ranks and/or the problem size. In each simulation, the total run-time will be recorded so that the speedup and scaling efficiency can be estimated.
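As a small example of the bookkeeping involved, the sketch below computes speedup from a table of hypothetical (rank count, runtime) measurements. The numbers are invented for illustration, not taken from any real benchmark.

```python
# Hypothetical strong-scaling study: recorded total runtimes (in seconds)
# for a fixed benchmark problem at several MPI rank counts.
measurements = {1: 3600.0, 2: 1900.0, 4: 1050.0, 8: 640.0}

t_serial = measurements[1]
for n_ranks, runtime in sorted(measurements.items()):
    s = t_serial / runtime  # speedup relative to the serial run
    print(f"N={n_ranks}  speedup={s:.2f}")
```

Plotting speedup against rank count from a table like this is usually the first step in deciding how many ranks to use for production runs.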

This information is very often combined with application profiling and hotspot analysis. Profiling is meant to give an accounting of the subroutines/functions/instructions that contribute to the total runtime. Hotspot analysis is focused on reporting the most expensive routines and determining what about those routines is causing the application to spend the most time in those locations. When poor scaling is observed, profiling and hotspot analysis can help uncover inefficient communication patterns and other performance limiters in HPC applications.