Welcome to the TorchSnapshot documentation!
===========================================

TorchSnapshot is a PyTorch library for adding fault tolerance to large-scale PyTorch distributed training workloads.

`Installation instructions `_

TorchSnapshot API
-----------------

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   getting_started.rst
   api_reference.rst

Examples
--------

* `Simple example `_
* `Using TorchSnapshot with DistributedDataParallel (DDP) `_
* `Using TorchSnapshot with TorchRec `_