Environment API¶
TorchRL offers an API to handle environments of different backends, such as gym,
dm-control and dm-lab, as well as model-based and custom environments.
The goal is to be able to swap environments in an experiment with little or no effort,
even if these environments are simulated using different libraries.
TorchRL offers some out-of-the-box environment wrappers under torchrl.envs.libs,
which we hope can be easily imitated for other libraries.
The parent class EnvBase is a torch.nn.Module subclass that implements
some typical environment methods using tensordict.TensorDict as a data organiser. This allows this
class to be generic and to handle an arbitrary number of inputs and outputs, as well as
nested or batched data structures.
Each env will have the following attributes:
- env.batch_size: a torch.Size representing the number of envs batched together.
- env.device: the device where the input and output tensordicts are expected to live. The environment device does not mean that the actual step operations will be computed on device (this is the responsibility of the backend, with which TorchRL can do little). The device of an environment simply represents the device where the data is expected to live when input to the environment or retrieved from it. TorchRL takes care of mapping the data to the desired device. This is especially useful for transforms (see below). For parametric environments (e.g. model-based environments), the device does represent the hardware that will be used to compute the operations.
- env.observation_spec: a Composite object containing all the observation key-spec pairs.
- env.state_spec: a Composite object containing all the input key-spec pairs (except action). For most stateful environments, this container will be empty.
- env.action_spec: a TensorSpec object representing the action spec.
- env.reward_spec: a TensorSpec object representing the reward spec.
- env.done_spec: a TensorSpec object representing the done-flag spec. See the section on trajectory termination below.
- env.input_spec: a Composite object containing all the input keys ("full_action_spec" and "full_state_spec").
- env.output_spec: a Composite object containing all the output keys ("full_observation_spec", "full_reward_spec" and "full_done_spec").
If the environment carries non-tensor data, a NonTensor
instance can be used.
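For instance, these attributes can be inspected directly on a freshly built env. A quick sketch, assuming gymnasium and its Pendulum-v1 environment are installed:

>>> from torchrl.envs import GymEnv
>>> env = GymEnv("Pendulum-v1")
>>> env.batch_size  # a single, non-batched env
torch.Size([])
>>> env.action_spec.shape  # Pendulum has a 1d continuous action
torch.Size([1])
>>> list(env.observation_spec.keys())
['observation']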
Env specs: locks and batch size¶
Environment specs are locked by default (through a spec_locked argument passed to the env constructor).
Locking specs means that any modification of a spec (or of its children, if it is a Composite
instance) requires the env to be unlocked first. This can be done via set_spec_lock_().
The reason specs are locked by default is that it makes it easy to cache values such as the action or reset keys
and the like.
Unlocking an env should only be done if the specs are expected to be modified often (which, in principle, should
be avoided).
Modifications of the specs such as env.observation_spec = new_spec remain allowed: under the hood, TorchRL will erase
the cache, unlock the specs, make the modification and relock the specs if the env was previously locked.
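The following sketch shows both paths; it assumes a Pendulum-v1 GymEnv and that set_spec_lock_() accepts a boolean mode:

>>> from torchrl.envs import GymEnv
>>> env = GymEnv("Pendulum-v1")  # specs are locked at construction
>>> # direct assignment is fine: TorchRL unlocks, swaps and relocks internally
>>> env.observation_spec = env.observation_spec.clone()
>>> # in-place edits of a child spec require an explicit unlock
>>> env.set_spec_lock_(False)
>>> env.observation_spec["observation"] = env.observation_spec["observation"].clone()
>>> env.set_spec_lock_(True)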
Importantly, the environment spec shapes should contain the batch size, e.g.
an environment with env.batch_size == torch.Size([4]) should have
an env.action_spec with shape torch.Size([4, action_size]).
This is helpful when preallocating tensors, checking shape consistency, etc.
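To illustrate, here is a sketch with a batch of serially-run envs (assuming SerialEnv and gymnasium's Pendulum-v1 are available):

>>> import torch
>>> from torchrl.envs import GymEnv, SerialEnv
>>> env = SerialEnv(4, lambda: GymEnv("Pendulum-v1"))
>>> env.batch_size
torch.Size([4])
>>> env.action_spec.shape  # the leading dim matches the batch size
torch.Size([4, 1])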
Env methods¶
With these, the following methods are implemented:
- env.reset(): a reset method that may (but does not have to) take a tensordict.TensorDict input. It returns the first tensordict of a rollout, usually containing a "done" state and a set of observations. If not present, a "reward" key will be instantiated with 0s and the appropriate shape.
- env.step(): a step method that takes a tensordict.TensorDict input containing an input action as well as other inputs (for model-based or stateless environments, for instance).
- env.step_and_maybe_reset(): executes a step, and (partially) resets the environments if needed. It returns the updated input with a "next" key containing the data of the next step, as well as a tensordict containing the input data for the next step (i.e., a reset or the result of step_mdp()). This is done by reading the done_keys and assigning a "_reset" signal to each done state. This method makes it easy to code non-stopping rollout functions:

  >>> data_ = env.reset()
  >>> result = []
  >>> for i in range(N):
  ...     data, data_ = env.step_and_maybe_reset(data_)
  ...     result.append(data)
  ...
  >>> result = torch.stack(result)
- env.set_seed(): a seeding method that will return the next seed to be used in a multi-env setting. This next seed is deterministically computed from the preceding one, such that multiple environments can be seeded with different seeds without risking seed overlap in consecutive experiments, while still having reproducible results.
- env.rollout(): executes a rollout in the environment for a maximum number of steps (max_steps=N), using a policy (policy=model). The policy should be coded using a tensordict.nn.TensorDictModule (or any other tensordict.TensorDict-compatible module). The resulting tensordict.TensorDict instance will be marked with a trailing "time" named dimension that other modules can use to handle this batched dimension appropriately. A minimal example follows.
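A sketch of a rollout, assuming a Pendulum-v1 GymEnv; when no policy is passed, actions are sampled randomly from the action spec:

>>> from torchrl.envs import GymEnv
>>> env = GymEnv("Pendulum-v1")
>>> td = env.rollout(max_steps=10)
>>> td.shape
torch.Size([10])
>>> td.names  # the trailing dimension is named "time"
['time']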
The following figure summarizes how a rollout is executed in TorchRL.
Figure: TorchRL rollouts using TensorDict.¶
In brief, a TensorDict is created by the reset() method,
then populated with an action by the policy before being passed to the
step() method, which writes the observations, done flag(s) and
reward under the "next" entry. The result of this call is stored for
delivery and the "next" entry is gathered by the step_mdp()
function.
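The same cycle can be reproduced by hand. Here is a sketch, assuming a Pendulum-v1 GymEnv:

>>> from torchrl.envs import GymEnv
>>> from torchrl.envs.utils import step_mdp
>>> env = GymEnv("Pendulum-v1")
>>> data = env.reset()
>>> data["action"] = env.action_spec.rand()  # what a policy would normally write
>>> data = env.step(data)  # observations, reward and done land under "next"
>>> data = step_mdp(data)  # promote "next" to the root for the next iteration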
Note
In general, all TorchRL environments have a "done" and a "terminated"
entry in their output tensordict. If one of them is not present by design,
the EnvBase metaclass will ensure that every done or terminated flag
is flanked with its dual.
In TorchRL, "done" strictly refers to the union of all the end-of-trajectory
signals and should be interpreted as “the last step of a trajectory” or
equivalently “a signal indicating the need to reset”.
If the environment provides it (e.g., Gymnasium), the truncation entry is also
written in the EnvBase.step() output under a "truncated" entry.
If the environment carries a single end-of-trajectory value, it will be interpreted as a "terminated"
signal by default.
By default, TorchRL's collectors and rollout methods will look for the "done"
entry to assess whether the environment should be reset.
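For example, on a Gymnasium-backed env (a sketch assuming Pendulum-v1), all three flags are exposed:

>>> from torchrl.envs import GymEnv
>>> env = GymEnv("Pendulum-v1")
>>> sorted(env.done_keys)
['done', 'terminated', 'truncated']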
Note
The torchrl.collectors.utils.split_trajectories function can be used to
slice adjacent trajectories. It relies on a "traj_ids" entry in the
input tensordict, or falls back to the combination of the "done" and "truncated"
keys if "traj_ids" is missing. A usage sketch follows.
Note
In some contexts, it can be useful to mark the first step of a trajectory.
TorchRL provides such functionality through the InitTracker
transform.
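A sketch of its usage, assuming a Pendulum-v1 GymEnv:

>>> from torchrl.envs import GymEnv, TransformedEnv
>>> from torchrl.envs.transforms import InitTracker
>>> env = TransformedEnv(GymEnv("Pendulum-v1"), InitTracker())
>>> td = env.reset()
>>> td["is_init"].all()  # True right after a reset
tensor(True)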
Our environment tutorial provides more information on how to design a custom environment from scratch.
Base classes¶
- EnvBase: Abstract environment parent class.
- GymLikeEnv: A gym-like env is an environment.
- EnvMetaData: A class for environment meta-data storage and passing in multiprocessed settings.
Custom native TorchRL environments¶
TorchRL offers a series of custom built-in environments.
- ChessEnv: A chess environment that follows the TorchRL API.
- PendulumEnv: A stateless Pendulum environment.
- TicTacToeEnv: A Tic-Tac-Toe implementation.
- LLMHashingEnv: A text generation environment that uses a hashing module to identify unique observations.
Domain-specific¶
- ModelBasedEnvBase: Basic environment for Model Based RL sota-implementations.
- DreamerEnv: Dreamer simulation environment.
- DreamerDecoder: A transform to record the decoded observations in Dreamer.
Helpers¶
- RandomPolicy: A random policy for data collectors.
- check_env_specs: Tests an environment's specs against the results of a short rollout.
- exploration_type: Returns the current sampling type.
- get_available_libraries: Returns all the supported libraries.
- make_composite_from_td: Creates a Composite instance from a tensordict, assuming all values are unbounded.
- ExplorationType: alias of InteractionType.
- step_mdp: Creates a new tensordict that reflects a step in time of the input tensordict.
- terminated_or_truncated: Reads the done / terminated / truncated keys within a tensordict, and writes a new tensor where the values of both signals are aggregated.
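Among these, check_env_specs is particularly handy when writing a custom environment. A usage sketch, assuming a Pendulum-v1 GymEnv:

>>> from torchrl.envs import GymEnv
>>> from torchrl.envs.utils import check_env_specs
>>> env = GymEnv("Pendulum-v1")
>>> check_env_specs(env)  # raises if the specs and rollout data disagree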