:github_url: https://github.com/pytorch-labs/torchft

torchft
========

This repository implements primitives and end-to-end solutions for per-step fault tolerance, so you can keep training when errors occur without interrupting the entire training job.

**GETTING STARTED?** See Install and Usage in `the README `_.

.. toctree::
   :maxdepth: 1
   :caption: Design

   protocol
   assumptions_and_recommendations

.. toctree::
   :maxdepth: 2
   :caption: API Reference

   api

License
---------

torchft is BSD 3-Clause licensed. See `LICENSE `_ for more details.

Copyright © Meta Platforms, Inc

* `Terms of Use `_
* `Privacy Policy `_