make_async_vllm_engine

torchrl.modules.llm.make_async_vllm_engine(*, model_name: str, num_devices: int | None = None, num_replicas: int = 1, verbose: bool = True, compile: bool = True, enable_fp32_output: bool = False, tensor_parallel_size: int | None = None, data_parallel_size: int | None = None, pipeline_parallel_size: int | None = None, **kwargs)

Create an async vLLM engine service.

Keyword Arguments:
  • model_name (str) – The model name to pass to vLLM.

  • num_devices (int, optional) – Number of devices to use per replica. Defaults to None.

  • num_replicas (int, optional) – Number of engine replicas to create. Defaults to 1.

  • verbose (bool, optional) – Whether to enable verbose logging with throughput statistics. Defaults to True.

  • compile (bool, optional) – Whether to enable model compilation for better performance. Defaults to True.

  • enable_fp32_output (bool, optional) – Whether to enable FP32 output for the final layer, which can help with numerical stability for certain models. Requires model-specific support in torchrl.modules.llm.backends._models. Defaults to False.

  • tensor_parallel_size (int, optional) – Number of devices to use for tensor parallelism, per replica (see the sketch after the example below). Defaults to None.

  • data_parallel_size (int, optional) – Number of data parallel groups to use. Defaults to None.

  • pipeline_parallel_size (int, optional) – Number of pipeline parallel groups to use. Defaults to None.

  • **kwargs – Additional arguments passed to AsyncEngineArgs.

Returns:

The launched engine service.

Return type:

AsyncVLLM

Raises:
  • RuntimeError – If no CUDA devices are available.

  • ValueError – If an invalid device configuration is provided.

Example

>>> from vllm import SamplingParams
>>> from torchrl.modules.llm import make_async_vllm_engine
>>>
>>> # Create a single-GPU async engine
>>> service = make_async_vllm_engine(model_name="Qwen/Qwen2.5-3B")
>>>
>>> # Create a 2-GPU tensor parallel async engine with 2 replicas
>>> service = make_async_vllm_engine(model_name="Qwen/Qwen2.5-3B", num_devices=2, num_replicas=2)
>>>
>>> # Generate text
>>> sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
>>> result = service.generate("Hello, world!", sampling_params)
>>>
>>> # Create with FP32 output enabled
>>> service = make_async_vllm_engine(model_name="Qwen/Qwen2.5-3B", enable_fp32_output=True)
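
As a further illustration, here is a minimal sketch of configuring the parallel layout explicitly through tensor_parallel_size instead of num_devices. The device counts are illustrative and assume at least 4 available CUDA GPUs; the sampling settings are placeholders.

>>> from vllm import SamplingParams
>>> from torchrl.modules.llm import make_async_vllm_engine
>>>
>>> # Two replicas, each sharding the model across 2 GPUs via tensor
>>> # parallelism (4 GPUs total in this assumed layout)
>>> service = make_async_vllm_engine(
...     model_name="Qwen/Qwen2.5-3B",
...     tensor_parallel_size=2,
...     num_replicas=2,
... )
>>> sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
>>> result = service.generate("Hello, world!", sampling_params)

Per the argument descriptions above, tensor_parallel_size and num_devices both specify the per-replica device count, so only one of them is passed here.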
