.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/getting-started-1.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials_getting-started-1.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_getting-started-1.py:

Get started with TorchRL's modules
==================================

**Author**: `Vincent Moens `_

.. _gs_modules:

.. note:: To run this tutorial in a notebook, add an installation cell
    at the beginning containing:

    .. code-block::

        !pip install tensordict
        !pip install torchrl

.. GENERATED FROM PYTHON SOURCE LINES 19-46

Reinforcement Learning is designed to create policies that can effectively
tackle specific tasks. Policies can take various forms, from a differentiable
map from the observation space to the action space, to a more ad hoc method
such as an argmax over a list of values computed for each possible action.
Policies can be deterministic or stochastic, and may incorporate complex
elements such as Recurrent Neural Networks (RNNs) or transformers.

Accommodating all these scenarios can be quite intricate. In this succinct
tutorial, we will delve into the core functionality of TorchRL in terms of
policy construction. We will primarily focus on stochastic and Q-Value
policies in two common scenarios: using a Multi-Layer Perceptron (MLP) or a
Convolutional Neural Network (CNN) as the backbone.

TensorDictModules
-----------------

Just like environments, the modules used to represent policies and value
functions interact with instances of :class:`~tensordict.TensorDict`.
The core idea is simple: encapsulate a standard :class:`~torch.nn.Module`
(or any other function) within a class that knows which entries need to be
read and passed to the module, and that then writes the results to the
assigned entries.

To illustrate this, we will use the simplest policy possible: a deterministic
map from the observation space to the action space. For maximum generality,
we will use a :class:`~torch.nn.LazyLinear` module with the Pendulum
environment we instantiated in the previous tutorial.

.. GENERATED FROM PYTHON SOURCE LINES 46-60

.. code-block:: Python

    import torch

    from tensordict.nn import TensorDictModule
    from torchrl.envs import GymEnv

    env = GymEnv("Pendulum-v1")
    module = torch.nn.LazyLinear(out_features=env.action_spec.shape[-1])
    policy = TensorDictModule(
        module,
        in_keys=["observation"],
        out_keys=["action"],
    )

.. GENERATED FROM PYTHON SOURCE LINES 61-65

This is all that's required to execute our policy! The use of a lazy module
allows us to bypass the need to fetch the shape of the observation space, as
the module will automatically determine it. This policy is now ready to be
run in the environment:

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: Python

    rollout = env.rollout(max_steps=10, policy=policy)
    print(rollout)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TensorDict(
        fields={
            action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            next: TensorDict(
                fields={
                    done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                    reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                    terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
                batch_size=torch.Size([10]),
                device=None,
                is_shared=False),
            observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
            terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([10]),
        device=None,
        is_shared=False)
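Before moving on, note that the policy wrapper can also be called on a
tensordict directly, outside of a rollout. The snippet below is a minimal
sketch reusing the ``env`` and ``policy`` defined above: the module reads the
``"observation"`` entry and writes an ``"action"`` entry in place.

.. code-block:: Python

    # Minimal sketch: apply the policy to a freshly reset tensordict.
    # `env.reset()` returns a TensorDict containing an "observation" entry;
    # the TensorDictModule reads it and writes the computed "action" in place.
    td = env.reset()
    policy(td)
    print(td["action"].shape)  # a single action value for Pendulum-v1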
.. GENERATED FROM PYTHON SOURCE LINES 70-83

Specialized wrappers
--------------------

To simplify the incorporation of :class:`~torch.nn.Module`s into your
codebase, TorchRL offers a range of specialized wrappers designed to be used
as actors, including :class:`~torchrl.modules.tensordict_module.Actor`,
:class:`~torchrl.modules.tensordict_module.ProbabilisticActor`,
:class:`~torchrl.modules.tensordict_module.ActorValueOperator` and
:class:`~torchrl.modules.tensordict_module.ActorCriticOperator`.
For example, :class:`~torchrl.modules.tensordict_module.Actor` provides
default values for the ``in_keys`` and ``out_keys``, making integration with
many common environments straightforward:

.. GENERATED FROM PYTHON SOURCE LINES 83-90

.. code-block:: Python

    from torchrl.modules import Actor

    policy = Actor(module)
    rollout = env.rollout(max_steps=10, policy=policy)
    print(rollout)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TensorDict(
        fields={
            action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            next: TensorDict(
                fields={
                    done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                    reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                    terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
                batch_size=torch.Size([10]),
                device=None,
                is_shared=False),
            observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
            terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([10]),
        device=None,
        is_shared=False)
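:class:`~torchrl.modules.tensordict_module.Actor` remains fully configurable:
if your environment stores its observations under a different entry, you can
still pass the keys explicitly, just as with
:class:`~tensordict.nn.TensorDictModule`. The snippet below is a hedged
sketch using a hypothetical ``"pixels"`` entry name:

.. code-block:: Python

    # Hypothetical example: an environment that exposes its observations
    # under a "pixels" entry instead of the default "observation".
    custom_policy = Actor(module, in_keys=["pixels"], out_keys=["action"])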
.. GENERATED FROM PYTHON SOURCE LINES 91-102

The list of available specialized TensorDictModules is available in the
:ref:`API reference `.

Networks
--------

TorchRL also provides regular modules that can be used without resorting to
tensordict features. The two most common networks you will encounter are the
:class:`~torchrl.modules.MLP` and the :class:`~torchrl.modules.ConvNet` (CNN)
modules. We can substitute our policy module with one of these:

.. GENERATED FROM PYTHON SOURCE LINES 102-113

.. code-block:: Python

    from torchrl.modules import MLP

    module = MLP(
        out_features=env.action_spec.shape[-1],
        num_cells=[32, 64],
        activation_class=torch.nn.Tanh,
    )
    policy = Actor(module)
    rollout = env.rollout(max_steps=10, policy=policy)

.. GENERATED FROM PYTHON SOURCE LINES 114-139

TorchRL also supports RNN-based policies. Since this is a more technical
topic, it is treated in :ref:`a separate tutorial `.

Probabilistic policies
----------------------

Policy-optimization algorithms like `PPO `_ require the policy to be
stochastic: unlike in the examples above, the module now maps the observation
space to a parameter space encoding a distribution over the possible actions.
TorchRL facilitates the design of such modules by grouping the various
operations involved (building the distribution from the parameters, sampling
from that distribution, and retrieving the log-probability) under a single
class. Here, we'll be building an actor that relies on a regular normal
distribution using three components:

- An :class:`~torchrl.modules.MLP` backbone reading observations of size
  ``[3]`` and outputting a single tensor of size ``[2]``;
- A :class:`~tensordict.nn.distributions.NormalParamExtractor` module that
  will split this output into two chunks, a mean and a standard deviation,
  each of size ``[1]``;
- A :class:`~torchrl.modules.tensordict_module.ProbabilisticActor` that will
  read those parameters as ``in_keys``, create a distribution with them, and
  populate our tensordict with samples and log-probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 139-159

.. code-block:: Python

    from tensordict.nn.distributions import NormalParamExtractor
    from torch.distributions import Normal
    from torchrl.modules import ProbabilisticActor

    backbone = MLP(in_features=3, out_features=2)
    extractor = NormalParamExtractor()
    module = torch.nn.Sequential(backbone, extractor)
    td_module = TensorDictModule(module, in_keys=["observation"], out_keys=["loc", "scale"])
    policy = ProbabilisticActor(
        td_module,
        in_keys=["loc", "scale"],
        out_keys=["action"],
        distribution_class=Normal,
        return_log_prob=True,
    )

    rollout = env.rollout(max_steps=10, policy=policy)
    print(rollout)

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    TensorDict(
        fields={
            action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            loc: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            next: TensorDict(
                fields={
                    done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                    reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                    terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
                batch_size=torch.Size([10]),
                device=None,
                is_shared=False),
            observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
            sample_log_prob: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            scale: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([10]),
        device=None,
        is_shared=False)

.. GENERATED FROM PYTHON SOURCE LINES 160-172

There are a few things to note about this rollout:

- Since we asked for it when constructing the actor (``return_log_prob=True``),
  the log-probability of each action under the distribution at that time is
  also written, under the ``"sample_log_prob"`` entry. This is necessary for
  algorithms like PPO.
- The parameters of the distribution are returned within the output
  tensordict too, under the ``"loc"`` and ``"scale"`` entries.

If your application requires it, you can control the sampling of the action
to use the expected value or other properties of the distribution instead of
random samples. This is controlled via the
:func:`~torchrl.envs.utils.set_exploration_type` function:

.. GENERATED FROM PYTHON SOURCE LINES 172-182

.. code-block:: Python

    from torchrl.envs.utils import ExplorationType, set_exploration_type

    with set_exploration_type(ExplorationType.DETERMINISTIC):
        # takes the mean as action
        rollout = env.rollout(max_steps=10, policy=policy)
    with set_exploration_type(ExplorationType.RANDOM):
        # Samples actions according to the dist
        rollout = env.rollout(max_steps=10, policy=policy)

.. GENERATED FROM PYTHON SOURCE LINES 183-197

Check the ``default_interaction_type`` keyword argument in the docstrings to
know more.
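As a quick sanity check (a minimal sketch reusing the ``env`` and ``policy``
defined above), a rollout collected in deterministic mode should return
actions that coincide with the distribution's mean, which the rollout stores
under the ``"loc"`` entry:

.. code-block:: Python

    with set_exploration_type(ExplorationType.DETERMINISTIC):
        det_rollout = env.rollout(max_steps=10, policy=policy)

    # For a Normal distribution, the deterministic action is the mean ("loc"),
    # so the two entries should match.
    print(torch.allclose(det_rollout["action"], det_rollout["loc"]))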
Exploration
-----------

Stochastic policies like this one somewhat naturally trade off exploration
and exploitation, but deterministic policies won't. Fortunately, TorchRL can
also compensate for this with its exploration modules. We will take the
example of the :class:`~torchrl.modules.EGreedyModule` exploration module
(check also :class:`~torchrl.modules.AdditiveGaussianModule` and
:class:`~torchrl.modules.OrnsteinUhlenbeckProcessModule`).
To see this module in action, let's revert to a deterministic policy:

.. GENERATED FROM PYTHON SOURCE LINES 197-203

.. code-block:: Python

    from tensordict.nn import TensorDictSequential
    from torchrl.modules import EGreedyModule

    policy = Actor(MLP(3, 1, num_cells=[32, 64]))

.. GENERATED FROM PYTHON SOURCE LINES 204-212

Our :math:`\epsilon`-greedy exploration module will usually be customized
with a number of annealing frames and an initial value for the
:math:`\epsilon` parameter. A value of :math:`\epsilon = 1` means that every
action taken is random, while :math:`\epsilon = 0` means that there is no
exploration at all. To anneal (i.e., decrease) the exploration factor, a call
to :meth:`~torchrl.modules.EGreedyModule.step` is required (see the last
:ref:`tutorial ` for an example).

.. GENERATED FROM PYTHON SOURCE LINES 212-216

.. code-block:: Python

    exploration_module = EGreedyModule(
        spec=env.action_spec, annealing_num_steps=1000, eps_init=0.5
    )

.. GENERATED FROM PYTHON SOURCE LINES 217-221

To build our explorative policy, we only had to concatenate the deterministic
policy module with the exploration module within a
:class:`~tensordict.nn.TensorDictSequential` module (which is the analogue of
:class:`~torch.nn.Sequential` in the tensordict realm).

.. GENERATED FROM PYTHON SOURCE LINES 221-231

.. code-block:: Python

    exploration_policy = TensorDictSequential(policy, exploration_module)

    with set_exploration_type(ExplorationType.DETERMINISTIC):
        # Turns off exploration
        rollout = env.rollout(max_steps=10, policy=exploration_policy)
    with set_exploration_type(ExplorationType.RANDOM):
        # Turns on exploration
        rollout = env.rollout(max_steps=10, policy=exploration_policy)

.. GENERATED FROM PYTHON SOURCE LINES 232-251

Because it must be able to sample random actions in the action space, the
:class:`~torchrl.modules.EGreedyModule` must be equipped with the
environment's ``action_spec`` to know what strategy to use to sample actions
randomly.
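In a full training loop, you would typically anneal :math:`\epsilon` as data
is collected. The snippet below is a minimal sketch reusing the
``exploration_module`` and ``exploration_policy`` defined above; in practice,
:meth:`~torchrl.modules.EGreedyModule.step` is usually called once per
collected batch rather than once per rollout:

.. code-block:: Python

    for _ in range(5):
        with set_exploration_type(ExplorationType.RANDOM):
            env.rollout(max_steps=10, policy=exploration_policy)
        # One annealing step: epsilon decreases linearly from eps_init towards
        # its final value over `annealing_num_steps` calls.
        exploration_module.step()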
Q-Value actors
--------------

In some settings, the policy isn't a standalone module but is constructed on
top of another module. This is the case with **Q-Value actors**. In short,
these actors require an estimate of the action value (most of the time
discrete) and will greedily pick the action with the highest value. In some
settings (finite discrete action space and finite discrete state space), one
can just store a 2D table of state-action pairs and pick the action with the
highest value. The innovation brought by `DQN `_ was to scale this up to
continuous state spaces by using a neural network to encode the ``Q(s, a)``
value map. Let's consider another environment with a discrete action space
for a clearer understanding:

.. GENERATED FROM PYTHON SOURCE LINES 251-255

.. code-block:: Python

    env = GymEnv("CartPole-v1")
    print(env.action_spec)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    OneHot(
        shape=torch.Size([2]),
        space=CategoricalBox(n=2),
        device=cpu,
        dtype=torch.int64,
        domain=discrete)

.. GENERATED FROM PYTHON SOURCE LINES 256-258

We build a value network that produces one value per action when it reads a
state from the environment:

.. GENERATED FROM PYTHON SOURCE LINES 258-266

.. code-block:: Python

    num_actions = 2
    value_net = TensorDictModule(
        MLP(out_features=num_actions, num_cells=[32, 32]),
        in_keys=["observation"],
        out_keys=["action_value"],
    )

.. GENERATED FROM PYTHON SOURCE LINES 267-270

We can easily build our Q-Value actor by adding a
:class:`~torchrl.modules.tensordict_module.QValueModule` after our value
network:

.. GENERATED FROM PYTHON SOURCE LINES 270-278

.. code-block:: Python

    from torchrl.modules import QValueModule

    policy = TensorDictSequential(
        value_net,  # writes action values in our tensordict
        QValueModule(spec=env.action_spec),  # Reads the "action_value" entry by default
    )

.. GENERATED FROM PYTHON SOURCE LINES 279-283

Let's check it out! We run the policy for a couple of steps and look at the
output. We should find ``"action_value"`` as well as ``"chosen_action_value"``
entries in the rollout that we obtain:

.. GENERATED FROM PYTHON SOURCE LINES 283-287

.. code-block:: Python

    rollout = env.rollout(max_steps=3, policy=policy)
    print(rollout)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TensorDict(
        fields={
            action: Tensor(shape=torch.Size([3, 2]), device=cpu, dtype=torch.int64, is_shared=False),
            action_value: Tensor(shape=torch.Size([3, 2]), device=cpu, dtype=torch.float32, is_shared=False),
            chosen_action_value: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            next: TensorDict(
                fields={
                    done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    observation: Tensor(shape=torch.Size([3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
                    reward: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                    terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
                batch_size=torch.Size([3]),
                device=None,
                is_shared=False),
            observation: Tensor(shape=torch.Size([3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
            terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([3]),
        device=None,
        is_shared=False)

.. GENERATED FROM PYTHON SOURCE LINES 288-292

Because it relies on the ``argmax`` operator, this policy is deterministic.
During data collection, we will need to explore the environment. For that, we
are using the :class:`~torchrl.modules.tensordict_module.EGreedyModule` once
again:

.. GENERATED FROM PYTHON SOURCE LINES 292-298

.. code-block:: Python

    policy_explore = TensorDictSequential(policy, EGreedyModule(env.action_spec))

    with set_exploration_type(ExplorationType.RANDOM):
        rollout_explore = env.rollout(max_steps=3, policy=policy_explore)

.. GENERATED FROM PYTHON SOURCE LINES 299-315

This is it for our short tutorial on building a policy with TorchRL!

There are many more things you can do with the library. A good place to start
is to look at the :ref:`API reference for modules `.

Next steps:

- Check how to use compound distributions with
  :class:`~tensordict.nn.distributions.CompositeDistribution` when the action
  is composite (e.g., a discrete and a continuous action are required by the
  env);
- Have a look at how you can use an RNN within the policy (a
  :ref:`tutorial `);
- Compare this to the usage of transformers with the Decision Transformers
  examples (see the ``example`` directory on GitHub).

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 46.376 seconds)

**Estimated memory usage:** 320 MB

.. _sphx_glr_download_tutorials_getting-started-1.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: getting-started-1.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: getting-started-1.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: getting-started-1.zip `
.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_