make_async_vllm_engine

torchrl.modules.llm.make_async_vllm_engine(model_name: str, num_devices: int | None = None, num_replicas: int = 1, verbose: bool = True, compile: bool = True, **kwargs)[source]

Create an async vLLM engine service.

Parameters:
  • model_name (str) – The model name to pass to vLLM.

  • num_devices (int, optional) – Number of devices to use per replica.

  • num_replicas (int, optional) – Number of engine replicas to create. Defaults to 1.

  • verbose (bool, optional) – Whether to enable verbose logging with throughput statistics. Defaults to True.

  • compile (bool, optional) – Whether to enable model compilation for better performance. Defaults to True.

  • **kwargs – Additional keyword arguments passed through to vLLM's AsyncEngineArgs.

Returns:

The launched engine service.

Return type:

AsyncVLLM

Raises:
  • RuntimeError – If no CUDA devices are available.

  • ValueError – If an invalid device configuration is provided.

Example

>>> from vllm import SamplingParams
>>>
>>> # Create a single-GPU async engine
>>> service = make_async_vllm_engine("Qwen/Qwen2.5-3B")
>>>
>>> # Create a 2-GPU tensor-parallel async engine with 2 replicas
>>> service = make_async_vllm_engine("Qwen/Qwen2.5-3B", num_devices=2, num_replicas=2)
>>>
>>> # Generate text with explicit sampling parameters
>>> sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
>>> result = service.generate("Hello, world!", sampling_params)
