LocalSGD¶
This module implements a fault tolerant version of LocalSGD and related methods.
- class torchft.local_sgd.DiLoCo(manager: Manager, model: Module, inner_optimizer: Optimizer, outer_optimizer: Optimizer, sync_every: int, backup_device: Optional[device] = None, pin_memory: bool = True, use_bucketization: bool = False, bucket_cap_mb: Optional[int] = None)[source]¶
Bases: object
DiLoCo is a subclass of LocalSGD that overrides the synchronization mechanism to average and synchronize the pseudogradients (the delta between the previous global weights and the current local weights).
This algorithm requires a backup copy of the weights. By default these are stored in CPU memory. If any error occurs during the DiLoCo step, the step will be discarded and the model parameters will be reset to their values from the last successful DiLoCo synchronization.
DiLoCo paper: https://arxiv.org/pdf/2311.08105
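To make the synchronization rule concrete, here is a minimal sketch of a DiLoCo-style outer step in plain Python (scalar weights instead of tensors, a plain SGD outer update instead of the Nesterov momentum used in the paper). The function name and `outer_lr` parameter are illustrative, not part of the torchft API:

```python
def diloco_outer_step(global_weights, local_weights_per_worker, outer_lr=0.7):
    """Average pseudogradients across workers and apply one outer step.

    pseudogradient = previous global weights - current local weights,
    so `global - outer_lr * avg_pseudogradient` moves the global weights
    toward the workers' average (reaching it exactly when outer_lr=1.0).
    """
    n_workers = len(local_weights_per_worker)
    n_params = len(global_weights)
    # Pseudogradient for each parameter, averaged over all workers
    # (in torchft this average is the fault tolerant allreduce).
    avg_pseudograd = [
        sum(global_weights[i] - w[i] for w in local_weights_per_worker) / n_workers
        for i in range(n_params)
    ]
    # Outer optimizer step; plain SGD here for clarity.
    return [g - outer_lr * pg for g, pg in zip(global_weights, avg_pseudograd)]
```

With `outer_lr=1.0` this reduces to averaging the local weights, which is why DiLoCo can be viewed as a generalization of LocalSGD's periodic averaging.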
- bucket_cap_mb: int = 33554432¶
- bucketize_and_allreduce(tensors: List[Tensor], bucket_size_bytes: int) → None[source]¶
Applies allreduce on a list of tensors using bucketization.
- Parameters:
tensors – List of torch tensors (e.g., gradients).
bucket_size_bytes – Max size of each bucket in bytes.
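To illustrate the bucketization idea, here is a hypothetical sketch in plain Python that greedily groups tensors (represented by their byte sizes) into buckets capped at `bucket_size_bytes`, so each bucket can be sent in a single allreduce. This is a simplified model, not the torchft implementation, which flattens tensors into a shared buffer:

```python
def bucketize(tensor_nbytes, bucket_size_bytes):
    """Greedily group tensor indices into buckets of at most
    bucket_size_bytes total. A tensor larger than the cap gets its
    own bucket, since it cannot be split in this simple model.
    Returns a list of buckets, each a list of tensor indices.
    """
    buckets, current, current_bytes = [], [], 0
    for idx, nbytes in enumerate(tensor_nbytes):
        # Flush the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + nbytes > bucket_size_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

Fewer, larger allreduce calls amortize per-call communication latency, which is the motivation for bucketing many small gradient tensors.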
- use_bucketization: bool = False¶
- class torchft.local_sgd.LocalSGD(manager: Manager, model: Module, optimizer: Optimizer, sync_every: int)[source]¶
Bases: object
LocalSGD is a context manager that implements the algorithm described in https://arxiv.org/pdf/1805.09767
This will synchronize the model parameters periodically in a fault tolerant way using a torchft Manager. The allreduce on the parameters will happen every sync_every steps after the optimizer.step call.
The torchft quorum is computed at the beginning of every sync_every steps. If any error occurs, or a worker fails between syncs, those sync_every steps will be discarded and a new quorum will be computed on the next step.
If running in async mode, the first sync_every steps on a joining worker will be discarded, as the model will be recovering during that period. When using sync mode, the checkpoint will be restored prior to the first step.
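The training loop described above can be sketched in plain Python (scalar weights per worker instead of model tensors, a simple mean in place of torchft's fault tolerant allreduce; the function name and arguments are illustrative, not the torchft API):

```python
def local_sgd_train(weights, grads_per_step, lr, sync_every):
    """Simulate LocalSGD on scalar weights.

    weights           -- one scalar weight per worker
    grads_per_step[t] -- list of per-worker gradients at step t
    Every worker takes a local SGD step each iteration; every
    sync_every steps the weights are averaged across workers,
    standing in for the fault tolerant allreduce after optimizer.step.
    """
    weights = list(weights)
    for t, grads in enumerate(grads_per_step):
        # Local optimizer.step on every worker.
        weights = [w - lr * g for w, g in zip(weights, grads)]
        # Periodic synchronization (allreduce average).
        if (t + 1) % sync_every == 0:
            avg = sum(weights) / len(weights)
            weights = [avg] * len(weights)
    return weights
```

Between syncs the workers' weights drift apart, which is what allows LocalSGD to trade a small amount of staleness for far less communication than per-step allreduce.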