Going through DDP with S2S
Finishing the preliminaries for D2L
Well, I’m finally done with the “catching up” for D2L, with a good refresher on probability. Since I studied stats and probability at university I didn’t learn anything new here, but it’s always good to refresh that knowledge and re-derive the formulas, so they stay fresh in mind and I actually grok them rather than just accept them as true. Take the classic Bayes’ theorem: I’d always accepted the formula as a way to flip conditional probabilities, but rarely thought of it as a way of updating a prior.
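Writing the formula out makes that second reading explicit: the posterior is just the prior reweighted by the likelihood, with the marginal $P(D)$ acting only as a normalizing constant.

$$
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}}\;\underbrace{P(\theta)}_{\text{prior}}
$$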
Evening study session on DDP
We discussed the different types of parallelism (for ML distributed training):
- (Distributed) Data Parallelism (DDP): Each rank loads a copy (a replica) of the model; after each optimizer step they all hold exactly the same parameters, true replicas. Each rank then trains on a different mini-batch (hence the importance of data sharding). We average the gradients across ranks, perform a step of gradient descent, rinse and repeat. If we can use this, we should: it has the least overhead, but it requires that the model plus the optimizer states fit in a single device’s VRAM. (See the sketch below.)
- Pipeline Parallelism (PP): We split the model across different ranks without splitting individual layers, i.e. we cut between layers; that is inter-layer parallelism. An exaggerated example would be a model with 2 hidden layers and 1 output layer split across 3 GPUs, one layer per device.
- Tensor Parallelism (TP): We split the layers themselves across different ranks; that’s intra-layer parallelism. This can be useful when some layers are so large they don’t even fit on a single device. To reduce overhead, it’s advisable that even if a layer is split across different ranks, all of its shards stay on the same node. (See terminology)
There’s also
- Expert Parallelism (EP): For Mixture-of-Experts (MoE) models, we can place the experts on different devices.
We didn’t discuss this last one much, but I already knew about it and did a bit of extra reading to learn more.
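To make the DDP case concrete, here’s a minimal sketch of what such a training loop could look like with PyTorch’s `DistributedDataParallel` launched via `torchrun`; the toy model, batch size and learning rate are just placeholders, not anything we actually ran:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Every rank builds the same model; DDP broadcasts rank 0's parameters
    # at construction so all replicas start out identical.
    model = torch.nn.Linear(128, 10).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(100):
        # Placeholder data; in a real job each rank reads its own shard.
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces (averages) the gradients here
        optimizer.step()  # every rank applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The part that’s easy to miss is that the gradient averaging is hidden inside `loss.backward()`: DDP hooks into autograd and all-reduces the gradients, so every replica applies the same update and the parameters stay in sync. You’d launch this with something like `torchrun --nproc_per_node=4 train.py`.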
Last but not least, we discussed the issue of DataLoaders / Samplers and the importance of data sharding. They are a means of efficiently “splitting” our dataset so that each rank sees a different mini-batch of data. Also, each process only loads its own shard, which makes it more memory- and communication-efficient.
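Roughly, the sharding looks like this with PyTorch’s `DistributedSampler` (the toy tensors and `num_epochs` are placeholders, and the process group is assumed to already be initialized as in the DDP sketch above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real training set.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

# DistributedSampler reads the rank and world size from the initialized
# process group and hands each rank a disjoint slice of the indices.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

num_epochs = 10  # placeholder
for epoch in range(num_epochs):
    # Re-seed the shuffle so the partition changes from epoch to epoch.
    sampler.set_epoch(epoch)
    for x, y in loader:
        ...  # forward / backward / optimizer step as usual
```

The `set_epoch` call is easy to forget: without it the shuffle uses the same seed every epoch, so each rank keeps seeing the exact same shard order.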