make_async_vllm_engine¶
- torchrl.modules.llm.make_async_vllm_engine(model_name: str, num_devices: int | None = None, num_replicas: int = 1, verbose: bool = True, compile: bool = True, **kwargs)[source]¶
Create an async vLLM engine service.
- Parameters:
model_name (str) – The model name to pass to vLLM.
num_devices (int, optional) – Number of devices to use per replica.
num_replicas (int, optional) – Number of engine replicas to create. Defaults to 1.
verbose (bool, optional) – Whether to enable verbose logging with throughput statistics. Defaults to True.
compile (bool, optional) – Whether to enable model compilation for better performance. Defaults to True.
**kwargs – Additional arguments passed to AsyncEngineArgs.
- Returns:
The launched engine service.
- Raises:
RuntimeError – If no CUDA devices are available.
ValueError – If invalid device configuration is provided.
Example
>>> # Create a single-GPU async engine
>>> service = make_async_vllm_engine("Qwen/Qwen2.5-3B")
>>>
>>> # Create a 2-GPU tensor-parallel async engine with 2 replicas
>>> service = make_async_vllm_engine("Qwen/Qwen2.5-3B", num_devices=2, num_replicas=2)
>>>
>>> # Generate text
>>> result = service.generate("Hello, world!", sampling_params)
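The example above references sampling_params without constructing it. A minimal sketch, assuming vLLM's SamplingParams for sampling configuration; the extra keyword arguments shown (max_model_len, gpu_memory_utilization) are illustrative fields accepted by AsyncEngineArgs, to which **kwargs are forwarded:

>>> from vllm import SamplingParams
>>>
>>> # Sampling configuration (assumed: vLLM's SamplingParams)
>>> sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
>>>
>>> # Extra kwargs are forwarded to AsyncEngineArgs
>>> service = make_async_vllm_engine(
...     "Qwen/Qwen2.5-3B",
...     num_devices=1,
...     max_model_len=4096,          # forwarded to AsyncEngineArgs
...     gpu_memory_utilization=0.9,  # forwarded to AsyncEngineArgs
... )
>>> result = service.generate("Hello, world!", sampling_params)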