ZeRO-2 + Arctic Long Sequence Training presentation by Tunji Ruwase from Snowflake
ZeRO-1 -> ZeRO-2
Well, today I realized my toy implementation of ZeRO-1 was not very scalable: I flattened the optimizer states and split them so that each rank owned complete state tensors. This is no-bueno because the partitioning ends up very unbalanced. With 2 GPUs, rank 0 owned 99% of the total size and rank 1 owned only 1%, even though both of them held 3 optimizer state tensors! Furthermore, it is a poor abstraction to build my ZeRO-2 implementation on top of. Now I’m flattening the weights into a single contiguous buffer and sharding that buffer itself, so the split happens at the element level instead of at tensor boundaries.
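To convince myself the fix actually balances things, here is a minimal sketch of the flatten-then-shard idea (my own illustration, not the real ZeRO code; `shard_flat_params`, the zero-padding scheme, and the toy sizes are all made up): every tensor is flattened into one buffer, the buffer is padded so it divides evenly, and each rank takes an equal slice, so ownership no longer depends on how the individual tensors are sized.

```python
import torch

def shard_flat_params(params, rank, world_size):
    """Flatten every tensor into one contiguous buffer, pad it so it splits
    evenly, and return the equal-sized slice owned by `rank`."""
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    pad = (-flat.numel()) % world_size          # padding needed for an even split
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    shard_size = flat.numel() // world_size
    return flat[rank * shard_size : (rank + 1) * shard_size]

# Toy check: wildly skewed tensor sizes (99 vs 1 elements) still shard evenly.
params = [torch.randn(99), torch.randn(1)]
shards = [shard_flat_params(params, r, world_size=2) for r in range(2)]
print([s.numel() for s in shards])   # [50, 50]
```

With this, each of the 2 ranks owns exactly half of the elements, no matter how skewed the individual tensors are.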
ALST by Tunji Ruwase
Once again, as part of Scratch To Scale, we’ve had another wonderful lecture by yet another industry expert: Tunji Ruwase from Snowflake, previously of Microsoft DeepSpeed! The technique tiles the sequence dimension so that training can scale to much longer sequences that would otherwise OOM. I’m not gonna lie, the level of these lectures is slightly beyond my current knowledge frontier, but that’s great because it forces me to pull in knowledge on demand and learn very fast.
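My mental model of the tiling trick, in code (a toy sketch under my own assumptions, not the actual ALST implementation; `tiled_lm_loss`, `tile_len`, and the per-tile checkpointing choice are mine): instead of materializing logits for the whole sequence at once, compute the loss one sequence tile at a time and recompute each tile’s logits during backward, so peak activation memory is bounded by the tile size rather than the full sequence length.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def tiled_lm_loss(hidden, lm_head, labels, tile_len=1024):
    """Cross-entropy over a long sequence, computed one tile at a time.

    hidden: (seq_len, hidden_dim) activations; labels: (seq_len,) targets.
    The full (seq_len, vocab_size) logits tensor is never materialized; each
    tile's logits are recomputed in backward thanks to checkpointing.
    """
    def tile_loss(h_tile, y_tile):
        return F.cross_entropy(lm_head(h_tile), y_tile, reduction="sum")

    seq_len = hidden.size(0)
    total = hidden.new_zeros(())
    for start in range(0, seq_len, tile_len):
        end = min(start + tile_len, seq_len)
        total = total + checkpoint(tile_loss, hidden[start:end],
                                   labels[start:end], use_reentrant=False)
    return total / seq_len
```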