Going through DDP with S2S
Finishing the preliminaries for D2L
Well, I’m finally done with the “catching up” for D2L, with a good refresher on probability. Since I studied stats and probability at university I didn’t learn anything new here, but it’s always good to refresh that knowledge and re-derive the formulas, so they stay fresh in mind and I actually grok them rather than just accept them as true. Take the classic Bayes’ theorem: I’d always accepted the formula as a way to flip conditional probabilities, but rarely thought of it as a way of updating a prior.
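Writing the formula out makes that second reading explicit: the posterior is just the prior reweighted by the likelihood, with the marginal $P(D)$ acting only as a normalizing constant.

$$
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}}\;\underbrace{P(\theta)}_{\text{prior}}
$$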
Evening study session on DDP
We discussed the different types of parallelism (for ML distributed training):
- (Distributed) Data Parallelism (DDP): Each rank loads a copy (a replica) of the model; after each optimizer step they all hold exactly the same parameters, true replicas. Each rank then trains on a different mini-batch (hence the importance of data sharding). We average the gradients across ranks, perform a step of gradient descent, rinse and repeat. If we can use this, we should: it has the least overhead, but it requires that the model plus the optimizer states fit in a single device’s VRAM. (See the sketch below.)
- Pipeline Parallelism (PP): We split the model across different ranks without splitting individual layers, i.e. we cut between layers; that is inter-layer parallelism. An exaggerated example would be a model with 2 hidden layers and 1 output layer split across 3 GPUs, one layer per device.
- Tensor Parallelism (TP): We split the layers themselves across different ranks; that’s intra-layer parallelism. This can be useful when some layers are so large they don’t even fit on a single device. To reduce overhead, it’s advisable that even if a layer is split across different ranks, all of its shards stay on the same node. (See terminology)
There’s also
- Expert Parallelism (EP): For Mixture-of-Experts (MoE) models, we can place the experts on different devices.
We didn’t discuss this last one much, but I already knew about it and did a bit of extra reading to learn more.
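To make the DDP case concrete, here’s a minimal sketch of what such a training loop could look like with PyTorch’s `DistributedDataParallel` launched via `torchrun`; the toy model, batch size and learning rate are just placeholders, not anything we actually ran:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Every rank builds the same model; DDP broadcasts rank 0's parameters
    # at construction so all replicas start out identical.
    model = torch.nn.Linear(128, 10).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(100):
        # Placeholder data; in a real job each rank reads its own shard.
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces (averages) the gradients here
        optimizer.step()  # every rank applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The part that’s easy to miss is that the gradient averaging is hidden inside `loss.backward()`: DDP hooks into autograd and all-reduces the gradients, so every replica applies the same update and the parameters stay in sync. You’d launch this with something like `torchrun --nproc_per_node=4 train.py`.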
Last but not least, we discussed the issue of DataLoaders / Samplers and the importance of data sharding. They are a means of efficiently “splitting” our dataset so that each rank sees a different mini-batch of data. Also, each process only loads its own shard, which makes it more memory- and communication-efficient.
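Roughly, the sharding looks like this with PyTorch’s `DistributedSampler` (the toy tensors and `num_epochs` are placeholders, and the process group is assumed to already be initialized as in the DDP sketch above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real training set.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

# DistributedSampler reads the rank and world size from the initialized
# process group and hands each rank a disjoint slice of the indices.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

num_epochs = 10  # placeholder
for epoch in range(num_epochs):
    # Re-seed the shuffle so the partition changes from epoch to epoch.
    sampler.set_epoch(epoch)
    for x, y in loader:
        ...  # forward / backward / optimizer step as usual
```

The `set_epoch` call is easy to forget: without it the shuffle uses the same seed every epoch, so each rank keeps seeing the exact same shard order.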