RayLLMCollector¶
- class torchrl.collectors.llm.RayLLMCollector(env: EnvBase | Callable[[], EnvBase], *, policy: Callable[[TensorDictBase], TensorDictBase] | None = None, policy_factory: Callable[[], Callable[[TensorDictBase], TensorDictBase]] | None = None, dialog_turns_per_batch: int, total_dialog_turns: int = -1, yield_only_last_steps: bool | None = None, yield_completed_trajectories: bool | None = None, postproc: Callable[[TensorDictBase], TensorDictBase] | None = None, async_envs: bool | None = None, replay_buffer: ReplayBuffer | None = None, reset_at_each_iter: bool = False, flatten_data: bool | None = None, weight_updater: WeightUpdaterBase | Callable[[], WeightUpdaterBase] | None = None, ray_init_config: dict[str, Any] | None = None, remote_config: dict[str, Any] | None = None, track_policy_version: bool | PolicyVersion = False, sync_iter: bool = True, verbose: bool = False)[source]¶
A lightweight Ray implementation of the LLM Collector that can be extended and sampled remotely.
- Parameters:
env (EnvBase or EnvBase constructor) – the environment to be used for data collection.
- Keyword Arguments:
policy (Callable[[TensorDictBase], TensorDictBase]) – the policy to be used for data collection.
policy_factory (Callable[[], Callable], optional) – a callable that returns a policy instance. This is mutually exclusive with the policy argument.
dialog_turns_per_batch (int) – A keyword-only argument representing the total number of elements in a batch.
total_dialog_turns (int) – A keyword-only argument representing the total number of dialog turns returned by the collector during its lifespan.
yield_only_last_steps (bool, optional) – if True, only the last (done) step of each trajectory is yielded; otherwise, every step is yielded.
yield_completed_trajectories (bool, optional) – if True, the collector yields single, completed trajectories; otherwise, it yields batches of rollouts with a given number of steps.
postproc (Callable, optional) – A post-processing transform.
async_envs (bool, optional) – if True, the environment will be run asynchronously.
replay_buffer (ReplayBuffer, optional) – if provided, the collector will not yield tensordicts but populate the buffer instead.
reset_at_each_iter (bool, optional) – if True, the environment will be reset at each iteration.
flatten_data (bool, optional) – if True, the collector will flatten the collected data before returning it.
weight_updater (WeightUpdaterBase or constructor, optional) – An instance of WeightUpdaterBase or its subclass, responsible for updating the policy weights on remote inference workers.
ray_init_config (dict[str, Any], optional) – keyword arguments to pass to ray.init().
remote_config (dict[str, Any], optional) – keyword arguments to pass to cls.as_remote().
sync_iter (bool, optional) – if True, items yielded by the collector will be synced to the local process. If False, the collector will collect the next batch of data while the current one is being consumed. This has no effect when data is collected through the start() method. For example:
>>> collector = RayLLMCollector(..., sync_iter=True)
>>> for data in collector:  # blocking
...     ...  # expensive operation - collector is idle
>>> collector = RayLLMCollector(..., sync_iter=False)
>>> for data in collector:  # non-blocking
...     ...  # expensive operation - collector is collecting data
This is somewhat equivalent to using MultiSyncDataCollector (sync_iter=True) or MultiAsyncDataCollector (sync_iter=False). Defaults to True.
verbose (bool, optional) – if True, the collector will print progress information. Defaults to False.
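Example (a minimal sketch; env_maker and policy below are hypothetical stand-ins for your own chat-environment constructor and LLM policy wrapper):
>>> from torchrl.collectors.llm import RayLLMCollector
>>> collector = RayLLMCollector(
...     env_maker,  # hypothetical EnvBase constructor
...     policy=policy,  # hypothetical LLM policy wrapper
...     dialog_turns_per_batch=32,
...     total_dialog_turns=1024,
...     sync_iter=False,
... )
>>> for data in collector:
...     ...  # train on data while the collector gathers the next batch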
- classmethod as_remote(remote_config: dict[str, Any] | None = None)¶
Creates an instance of a remote ray class.
- Parameters:
cls (Python Class) – class to be remotely instantiated.
remote_config (dict) – the Ray resource configuration for this class, e.g., the number of CPU cores to reserve.
- Returns:
A function that creates ray remote class instances.
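A sketch, assuming the returned class is instantiated through Ray's usual .remote() call (the resource keys follow ray.remote() conventions):
>>> RemoteCollector = RayLLMCollector.as_remote(remote_config={"num_cpus": 4})
>>> handle = RemoteCollector.remote(env_maker, policy=policy, dialog_turns_per_batch=32)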
- property dialog_turns_per_batch: int¶
Number of dialog turns per batch.
- get_policy_model()¶
Get the policy model.
This method is used by RayLLMCollector to get the remote LLM instance for weight updates.
- Returns:
The policy model instance.
- get_policy_version() str | int | None ¶
Get the current policy version.
This method exists to support remote calls in Ray actors, since properties cannot be accessed directly through Ray’s RPC mechanism.
- Returns:
The current version number (int) or UUID (str), or None if version tracking is disabled.
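For example, a training loop might poll the version after pushing new weights (a sketch assuming the collector was built with track_policy_version=True):
>>> collector.update_policy_weights_(weights)
>>> collector.get_policy_version()  # e.g. an int such as 1, or a UUID string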
- init_updater(*args, **kwargs)[source]¶
Initialize the weight updater with custom arguments.
This method calls init_updater on the remote collector.
- Parameters:
*args – Positional arguments for weight updater initialization
**kwargs – Keyword arguments for weight updater initialization
- is_initialized() bool ¶
Check if the collector is initialized and ready.
- Returns:
True if the collector is initialized and ready to collect data.
- Return type:
bool
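A possible readiness check before consuming data (a sketch; the polling interval is arbitrary):
>>> import time
>>> while not collector.is_initialized():
...     time.sleep(1.0)  # wait for the remote actor to come up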
- iterator() Iterator[TensorDictBase] ¶
Iterates through the DataCollector.
Yields: TensorDictBase objects containing (chunks of) trajectories
- load_state_dict(state_dict: OrderedDict, **kwargs) None ¶
Loads a state_dict on the environment and policy.
- Parameters:
state_dict (OrderedDict) – ordered dictionary containing the fields "policy_state_dict" and "env_state_dict".
- next() None [source]¶
Get the next batch of data from the collector.
- Returns:
None as the data is written directly to the replay buffer.
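A sketch of buffer-driven collection, assuming the collector was constructed with a replay_buffer (rb below), so that next() writes to the buffer and returns None:
>>> collector.next()  # one batch is written to rb
>>> sample = rb.sample(16)  # consume the freshly collected dialog turns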
- pause()¶
Context manager that pauses the collector if it is running free.
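For example, a sketch pausing collection around a weight push (assuming the collector is running free via start()):
>>> with collector.pause():  # collection is suspended inside this block
...     collector.update_policy_weights_(new_weights)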
- property policy_version: str | int | None¶
The current version of the policy.
- Returns:
The current version number (int) or UUID (str), or None if version tracking is disabled.
- reset(index=None, **kwargs) None ¶
Resets the environments to a new initial state.
- property rollout: Callable[[], TensorDictBase]¶
Returns the rollout function.
- set_seed(seed: int, static_seed: bool = False) int ¶
Sets the seeds of the environments stored in the DataCollector.
- Parameters:
seed (int) – integer representing the seed to be used for the environment.
static_seed (bool, optional) – if True, the seed is not incremented. Defaults to False.
- Returns:
Output seed. This is useful when more than one environment is contained in the DataCollector, as the seed will be incremented for each of these. The resulting seed is the seed of the last environment.
Examples
>>> from torchrl.envs import ParallelEnv
>>> from torchrl.envs.libs.gym import GymEnv
>>> from tensordict.nn import TensorDictModule
>>> from torch import nn
>>> env_fn = lambda: GymEnv("Pendulum-v1")
>>> env_fn_parallel = ParallelEnv(6, env_fn)
>>> policy = TensorDictModule(nn.Linear(3, 1), in_keys=["observation"], out_keys=["action"])
>>> collector = SyncDataCollector(env_fn_parallel, policy, total_frames=300, frames_per_batch=100)
>>> out_seed = collector.set_seed(1)  # out_seed = 6
- state_dict() OrderedDict ¶
Returns the local state_dict of the data collector (environment and policy).
- Returns:
an ordered dictionary with fields "policy_state_dict" and "env_state_dict".
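A round-trip sketch combining state_dict() with load_state_dict():
>>> sd = collector.state_dict()  # {"policy_state_dict": ..., "env_state_dict": ...}
>>> collector.load_state_dict(sd)  # restore environment and policy state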
- property total_dialog_turns¶
Total number of dialog turns to collect.
- update_policy_weights_(policy_or_weights: TensorDictBase | TensorDictModuleBase | dict | None = None, *, worker_ids: torch.device | int | list[int] | list[torch.device] | None = None, **kwargs)[source]¶
Updates the policy weights on remote workers.
- Parameters:
policy_or_weights – The weights to update with. Can be:
- TensorDictModuleBase: a policy module whose weights will be extracted
- TensorDictBase: a TensorDict containing weights
- dict: a regular dict containing weights
- None: will try to get weights from the server using _get_server_weights()
worker_ids – The workers to update. If None, updates all workers.
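For example, a sketch pushing weights extracted from a local training module (train_policy is a hypothetical TensorDictModuleBase):
>>> from tensordict import TensorDict
>>> weights = TensorDict.from_module(train_policy)
>>> collector.update_policy_weights_(weights)  # update all workers
>>> collector.update_policy_weights_(weights, worker_ids=[0])  # update worker 0 only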
- property weight_updater: WeightUpdaterBase¶
The weight updater instance.
We can pass the weight updater because it’s stateless, hence serializable.