Training a frontier AI model means keeping thousands of GPUs synchronized for weeks on end. When a single network link fails, ...