Regular job + slight progress on ZeRO-2/3

mle
python
Fixing prod issues at Kivala, working on a design overhaul of the product + progress on ZeRO-2 and 3
Published

September 22, 2025

Regular Job

We’ve had some issues with our SIM provider that cause intermittent loss of network on our edge devices, so I’m working on mitigating them and making our infrastructure more resilient.

As a result of developing a new product from scratch in-house, we’ve worked on a design overhaul of the entire intercom system, which I’ve already fully implemented. But we’ve decided to ship this update to our existing system, which runs on a completely different tech stack, so I have to duplicate the work. That leaves little time for ML.

ZeRO-2/3

Not much to add. I’ve started writing my toy FSDP implementation, but there isn’t much done yet. I did hit a wall trying to run the same script on both my MacBook Pro and Modal: torchrun would fail with cryptic IPv6-related errors. Here’s the command that finally got the script running:

GLOO_SOCKET_FAMILY=inet PYTORCH_ENABLE_MPS_FALLBACK=1 uv run -m torch.distributed.run --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 --nproc_per_node=2 <script.py>
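To avoid retyping that line on each machine, the same launch could be wrapped in a small Python helper. This is only a sketch: `build_launch` and the `launch.py` filename are hypothetical names, the two environment variables are copied as-is from the command above, and it assumes the helper is started from the environment where torch is installed (for example via `uv run`), since it calls the current interpreter rather than `uv` itself.

```python
import os
import subprocess
import sys


def build_launch(script, nproc=2, master_addr="127.0.0.1", master_port=29500):
    """Return (env, argv) mirroring the single-node command above.

    Hypothetical helper: defaults are the values used in the original command.
    """
    env = dict(os.environ)
    # Both settings are taken verbatim from the working command line.
    env["GLOO_SOCKET_FAMILY"] = "inet"
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    argv = [
        sys.executable, "-m", "torch.distributed.run",
        "--nnodes=1",
        "--node_rank=0",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        f"--nproc_per_node={nproc}",
        script,
    ]
    return env, argv


if __name__ == "__main__" and len(sys.argv) > 1:
    # Launch the training script passed as the first argument.
    env, argv = build_launch(sys.argv[1])
    raise SystemExit(subprocess.run(argv, env=env).returncode)
```

Usage would look like `uv run launch.py <script.py>`. Keeping the environment and arguments in one function makes it easy to change the process count or port without touching the rest of the line.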