SGLangCollectiveTransport

class torchrl.weight_update.llm.SGLangCollectiveTransport(server_url: str, master_address: str, master_port: int, rank: int, world_size: int, device: device | str | int | None = None, timeout: float = 300.0)

Transport for SGLang using NCCL collective communication.

This transport coordinates with SGLang servers via HTTP and performs weight transfer via NCCL broadcast.

Parameters:

server_url – URL of the SGLang server.
master_address – Address for NCCL initialization.
master_port – Port for NCCL initialization.
rank – Rank of this process (0 for trainer).
world_size – Total number of processes.
device – Device to use for communication.
timeout – HTTP request timeout in seconds.

check_connection() → bool

Check if the communication group is initialized.

init_all_workers_group(model_metadata: dict[str, tuple[dtype, Size]]) → None

Initialize the NCCL communication group.

For the trainer (rank 0), this:

1. Signals the SGLang server via HTTP to join the NCCL group.
2. Initializes the trainer’s NCCL communicator.

Parameters:

model_metadata – Dict mapping parameter names to (dtype, shape) tuples.
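As an illustration of the expected model_metadata layout, the sketch below derives a name-to-(dtype, shape) mapping from a state-dict-like mapping. The helper build_model_metadata and the SimpleNamespace stand-ins are hypothetical and not part of torchrl; the stand-ins expose only .dtype and .shape so the snippet runs without torch or a GPU. With a real model, the values would presumably come from its state_dict() tensors.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Tuple


def build_model_metadata(state_dict: Mapping[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Hypothetical helper: map each parameter name to its (dtype, shape) pair."""
    return {name: (tensor.dtype, tensor.shape) for name, tensor in state_dict.items()}


# Stand-ins for tensors (assumption: only .dtype and .shape are read).
state_dict = {
    "linear.weight": SimpleNamespace(dtype="float16", shape=(4, 8)),
    "linear.bias": SimpleNamespace(dtype="float16", shape=(4,)),
}

metadata = build_model_metadata(state_dict)
for name, (dtype, shape) in metadata.items():
    print(name, dtype, shape)
```

A dict built this way has one entry per parameter, in the same order as the source mapping.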

send_weights(model_id: str, weights: dict[str, Tensor]) → None

Broadcast weights to SGLang server via NCCL.

Parameters:

model_id – Identifier for the model (for logging).
weights – Dict mapping parameter names to tensors.
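The three methods documented on this page suggest a natural call order: check the connection, initialize the group if needed, then broadcast. The sketch below is one possible way to wire that together, not the library's prescribed workflow. The function sync_weights and the RecordingTransport class are hypothetical; the latter only mimics the three documented method names so the example runs without an SGLang server or NCCL. The assumption that the group must be initialized before the first send is inferred from the method descriptions.

```python
from types import SimpleNamespace
from typing import Any, Dict


def sync_weights(transport: Any, model_id: str, weights: Dict[str, Any]) -> None:
    """Hypothetical helper: initialize the group once, then broadcast weights."""
    if not transport.check_connection():
        # Assumption: metadata can be derived from the weights being sent.
        metadata = {name: (w.dtype, w.shape) for name, w in weights.items()}
        transport.init_all_workers_group(metadata)
    transport.send_weights(model_id, weights)


class RecordingTransport:
    """Hypothetical stand-in that records calls instead of using HTTP or NCCL."""

    def __init__(self) -> None:
        self.calls = []
        self._initialized = False

    def check_connection(self) -> bool:
        return self._initialized

    def init_all_workers_group(self, model_metadata: Dict[str, Any]) -> None:
        self.calls.append("init")
        self._initialized = True

    def send_weights(self, model_id: str, weights: Dict[str, Any]) -> None:
        self.calls.append(("send", model_id))


transport = RecordingTransport()
weights = {"w": SimpleNamespace(dtype="float32", shape=(2, 2))}

sync_weights(transport, "policy", weights)  # first call initializes, then sends
sync_weights(transport, "policy", weights)  # later calls only send
print(transport.calls)
```

Swapping RecordingTransport for a real SGLangCollectiveTransport instance would leave sync_weights unchanged, since it relies only on the method names listed above.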
