KLRewardTransform¶
- class torchrl.envs.transforms.KLRewardTransform(actor: ProbabilisticTensorDictModule, coef=1.0, in_keys=None, out_keys=None, requires_grad=False, log_prob_key: NestedKey = 'sample_log_prob', action_key: NestedKey | None = None, functional: bool | None = None, device: torch.device | None = None)[source]¶
A transform to add a KL[pi_current||pi_0] correction term to the reward.

This transform is used to constrain the policy to remain close to its original configuration, which limits overfitting when fine-tuning using RLHF.

Parameters:
- actor (ProbabilisticTensorDictModule) – a probabilistic actor. It must have a set of input (in_keys) and output (out_keys) keys, and a get_dist method that outputs the distribution of the action.
- coef (float) – the coefficient of the KL term. Defaults to 1.0.
- in_keys (str or list of str/tuples of str) – the input key where the reward should be fetched. Defaults to "reward".
- out_keys (str or list of str/tuples of str) – the output key where the reward should be written. Defaults to "reward".
- requires_grad (bool, optional) – if True, the frozen parameters will consist of differentiable clones of the original params. Defaults to False.
 
Note

If the parameters are not differentiable (default), they will not follow the module when dtype or device casting operations are called (such as cuda(), to(), etc.). When requires_grad=True, casting operations will work as expected.

Examples

>>> import torch
>>> from torchrl.envs.libs.gym import GymEnv
>>> from torchrl.envs import TransformedEnv
>>> from torchrl.envs.transforms import KLRewardTransform
>>> from tensordict.nn import TensorDictModule as Mod, NormalParamExtractor
>>> from torchrl.modules import ProbabilisticActor
>>> from tensordict import TensorDict
>>> from torchrl.modules.distributions import TanhNormal
>>> from torch import nn
>>> base_env = GymEnv("Pendulum-v1")
>>> n_obs = base_env.observation_spec["observation"].shape[-1]
>>> n_act = base_env.action_spec.shape[-1]
>>> module = Mod(
...     nn.Sequential(nn.Linear(n_obs, n_act * 2), NormalParamExtractor()),
...     in_keys=["observation"],
...     out_keys=["loc", "scale"],
... )
>>> actor = ProbabilisticActor(
...     module,
...     in_keys=["loc", "scale"],
...     distribution_class=TanhNormal,
...     return_log_prob=True,
... )
>>> transform = KLRewardTransform(actor, out_keys="reward_kl")
>>> env = TransformedEnv(base_env, transform)
>>> with torch.no_grad():
...     # modify the actor parameters
...     _ = TensorDict(dict(actor.named_parameters()), []).apply_(lambda x: x.data.copy_(x.data + 1))
...     td = env.rollout(3, actor)
>>> # check that rewards have been modified
>>> assert (td.get(("next", "reward")) != td.get(("next", "reward_kl"))).all()

Note

Because the KL formula is not always available and the parameters of the original distribution may not have been recorded, we use a stochastic estimate of the KL divergence.
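To make that estimator concrete, here is a small, self-contained sketch (an illustration only, not the transform's internal code) of the single-sample estimate log pi(a) - log pi_0(a) with a ~ pi, averaged over many draws so that it approaches the analytical KL:

>>> import torch
>>> from torch.distributions import Normal, kl_divergence
>>> pi = Normal(0.5, 1.0)    # stands in for the current (fine-tuned) policy
>>> pi_0 = Normal(0.0, 1.0)  # stands in for the frozen reference policy
>>> a = pi.sample((10000,))  # actions sampled from the current policy
>>> kl_mc = (pi.log_prob(a) - pi_0.log_prob(a)).mean()  # Monte Carlo estimate of KL[pi || pi_0]
>>> assert torch.isclose(kl_mc, kl_divergence(pi, pi_0), atol=0.05)

Roughly speaking, the transform reads the per-action log-probability of the behaviour policy from log_prob_key (by default "sample_log_prob"), compares it with the log-probability of the same action under the frozen copy of the actor, scales the resulting estimate by coef and folds it into the reward written at out_keys.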
- forward(tensordict: TensorDictBase) → TensorDictBase[source]¶

Reads the input tensordict and, for the selected keys, applies the transform.

By default, this method:

- calls _apply_transform() directly.
- does not call _step() or _call().
This method is not called within env.step at any point. However, it is called within sample().

Note

forward also works with regular keyword arguments, using dispatch to cast the argument names to the keys.

Examples

>>> class TransformThatMeasuresBytes(Transform):
...     '''Measures the number of bytes in the tensordict, and writes it under `"bytes"`.'''
...     def __init__(self):
...         super().__init__(in_keys=[], out_keys=["bytes"])
...
...     def forward(self, tensordict: TensorDictBase) -> TensorDictBase:
...         bytes_in_td = tensordict.bytes()
...         tensordict["bytes"] = bytes_in_td
...         return tensordict
>>> t = TransformThatMeasuresBytes()
>>> env = env.append_transform(t)  # works within envs
>>> t(TensorDict(a=0))  # Works offline too.
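To illustrate both calling conventions, here is a hedged sketch with a made-up toy transform (RewardScaler is not part of the library) that only overrides _apply_transform(): it can be fed a tensordict directly, or called with keyword arguments that dispatch maps onto its in_keys/out_keys:

>>> import torch
>>> from tensordict import TensorDict
>>> from torchrl.envs.transforms import Transform
>>> class RewardScaler(Transform):
...     '''Toy transform: writes a scaled copy of "reward" under "reward_scaled".'''
...     def __init__(self, scale: float = 0.5):
...         super().__init__(in_keys=["reward"], out_keys=["reward_scaled"])
...         self.scale = scale
...     def _apply_transform(self, reward: torch.Tensor) -> torch.Tensor:
...         return reward * self.scale
>>> t = RewardScaler()
>>> td = t(TensorDict({"reward": torch.ones(3)}, [3]))  # tensordict call
>>> out = t(reward=torch.ones(3))                       # keyword call routed through dispatch
>>> assert (td["reward_scaled"] == out).all()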
- transform_output_spec(output_spec: Composite) → Composite[source]¶
Transforms the output spec such that the resulting spec matches the transform mapping.

This method should generally be left untouched. Changes should be implemented using transform_observation_spec(), transform_reward_spec() and transform_full_done_spec() (see the sketch below).

Parameters:

- output_spec (TensorSpec) – spec before the transform

Returns:
- expected spec after the transform
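As a hedged illustration of that guidance (AddBonusReward is a made-up example, and it assumes the reward spec passed in is a Composite that already holds a "reward" entry), a custom transform adding a new reward entry would override transform_reward_spec() and leave transform_output_spec() untouched:

>>> import torch
>>> from torchrl.data import Composite, Unbounded
>>> from torchrl.envs.transforms import Transform
>>> class AddBonusReward(Transform):
...     '''Toy transform declaring an extra "reward_bonus" entry in the reward spec.'''
...     def __init__(self):
...         super().__init__(in_keys=["reward"], out_keys=["reward_bonus"])
...     def transform_reward_spec(self, reward_spec: Composite) -> Composite:
...         # Register the new entry next to the existing reward;
...         # transform_output_spec() is left to the base class.
...         reward_spec["reward_bonus"] = Unbounded(
...             shape=reward_spec["reward"].shape,
...             dtype=torch.float32,
...             device=reward_spec["reward"].device,
...         )
...         return reward_spec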