SGLangCollectiveTransport
- class torchrl.weight_update.llm.SGLangCollectiveTransport(server_url: str, master_address: str, master_port: int, rank: int, world_size: int, device: device | str | int | None = None, timeout: float = 300.0)
Transport for SGLang using NCCL collective communication.
This transport coordinates with SGLang servers via HTTP and performs weight transfer via NCCL broadcast.
- Parameters:
server_url – URL of the SGLang server.
master_address – Address for NCCL initialization.
master_port – Port for NCCL initialization.
rank – Rank of this process (0 for trainer).
world_size – Total number of processes.
device – Device to use for communication.
timeout – HTTP request timeout in seconds.
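As a rough illustration of how these arguments fit together, the sketch below collects them into a keyword dictionary and sanity-checks them before the transport is built. The helper `make_transport_kwargs`, the example URL, address, and port, and the validation rules themselves are assumptions made for illustration; they are not part of torchrl.

```python
from typing import Any, Optional


def make_transport_kwargs(
    server_url: str,
    master_address: str,
    master_port: int,
    world_size: int,
    rank: int = 0,
    device: Optional[Any] = None,
    timeout: float = 300.0,
) -> dict:
    """Hypothetical helper: validate and bundle the constructor arguments."""
    # Assumed check: ranks are numbered 0 .. world_size - 1.
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    # Assumed check: the port must be a valid TCP port number.
    if not 0 < master_port < 65536:
        raise ValueError(f"invalid master_port: {master_port}")
    return {
        "server_url": server_url,
        "master_address": master_address,
        "master_port": master_port,
        "rank": rank,
        "world_size": world_size,
        "device": device,
        "timeout": timeout,
    }


# Example values below are placeholders, not defaults taken from torchrl.
kwargs = make_transport_kwargs(
    server_url="http://localhost:30000",
    master_address="127.0.0.1",
    master_port=29500,
    world_size=2,  # e.g. one trainer plus one SGLang worker
)

# With torchrl installed and an SGLang server running, the transport
# would then be created as:
# from torchrl.weight_update.llm import SGLangCollectiveTransport
# transport = SGLangCollectiveTransport(**kwargs)
```

Checking the arguments up front is only a convenience: it surfaces an out-of-range rank or port before any HTTP or NCCL call is attempted.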
- init_all_workers_group(model_metadata: dict[str, tuple[dtype, Size]]) → None
Initialize the NCCL communication group.
For the trainer (rank 0), this:
1. Signals the SGLang server via HTTP to join the NCCL group.
2. Initializes the trainer’s NCCL communicator.
- Parameters:
model_metadata – Dict mapping param names to (dtype, shape) tuples.
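One possible way to assemble `model_metadata` is sketched below. The helper `build_model_metadata` is hypothetical and not part of torchrl; it assumes only that each value exposes `dtype` and `shape` attributes, as the tensors in a PyTorch `state_dict()` do.

```python
from typing import Any, Mapping


def build_model_metadata(named_tensors: Mapping[str, Any]) -> dict:
    """Hypothetical helper: map each parameter name to (dtype, shape).

    Only the ``dtype`` and ``shape`` attributes of each value are read,
    so any tensor-like object works.
    """
    return {
        name: (tensor.dtype, tensor.shape)
        for name, tensor in named_tensors.items()
    }


# With a PyTorch model and an already constructed transport (both assumed
# here), usage might look like:
# metadata = build_model_metadata(model.state_dict())
# transport.init_all_workers_group(metadata)
```

Passing metadata rather than the tensors themselves matches the documented signature: the method receives only names, dtypes, and shapes.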