MAPPOLoss#
- class torchrl.objectives.multiagent.MAPPOLoss(*args, **kwargs)[source]#
Multi-Agent PPO loss with a centralised critic (Yu et al. 2022).
MAPPO trains a decentralised actor (each agent’s policy conditions only on its local observation) together with a centralised critic (single value function that conditions on the full team state or concatenated observations). The decentralised actor lets policies run independently at execution time, while the centralised critic reduces variance during training by giving every agent the same value baseline derived from full state information.
This class is a thin specialisation of
ClipPPOLoss. The differences:The default value estimator is
MultiAgentGAE, which broadcasts team-shared rewards / done flags along the agent dimension before computing returns.normalize_advantage_exclude_dimsdefaults to(-2,)so the agent dim is excluded when standardising advantages.An optional
ValueNormcan be supplied viavalue_norm=PopArtValueNorm(shape=1)to stabilise the critic loss; the MAPPO paper reports this is load-bearing on SMAC (their Table 13).RunningValueNormis a no-decay alternative for stationary reward scales.
- Parameters:
actor_network (ProbabilisticTensorDictSequential) – per-agent policy operator. Conventionally built with
MultiAgentMLPusingcentralized=False, share_params=Truefor cooperative homogeneous teams.critic_network (TensorDictModule) – centralised value operator. Build this with
MultiAgentMLPandcentralized=True, share_params=True, or with any module that consumes a global"state"key and returns("agents", "state_value")of shape[*B, n_agents, 1].
- Keyword Arguments:
value_norm (ValueNorm, optional) – if supplied, the critic target and prediction are normalised by this running normaliser before the MSE / smooth-L1 distance. Composes correctly with
clip_value(the clip radius is applied in normalised space). Defaults toNone(no value norm).clip_epsilon (float) – PPO ratio clip. Defaults to
0.2.entropy_coeff (float) – entropy bonus weight. Defaults to
0.01(MAPPO default).critic_coef (float, optional) – critic loss weight. Defaults to
1.0.normalize_advantage (bool) – whether to standardise the advantage. Defaults to
True(MAPPO default; differs from baseClipPPOLosswhich defaults toFalse).normalize_advantage_exclude_dims (tuple of int) – dimensions to exclude from advantage standardisation. Defaults to
(-2,)(the agent dim).**kwargs – forwarded to
ClipPPOLoss.
The expected tensordict layout follows the torchrl multi-agent convention (see
VmasEnv,PettingZooEnv):("agents", "observation"):[*B, T, n_agents, obs_dim]("agents", "action"):[*B, T, n_agents, action_dim]Optional
"state"at the root for centralised criticsTeam-shared
("next", "reward"),("next", "done"),("next", "terminated")of shape[*B, T, 1](or per-agent under("next", "agents", "reward")for competitive settings).
Example
>>> import torch >>> from tensordict.nn import TensorDictModule >>> from torchrl.modules import ( ... MultiAgentMLP, PopArtValueNorm, ProbabilisticActor, ... ) >>> from torchrl.modules.distributions import NormalParamExtractor, TanhNormal >>> from torchrl.objectives.multiagent import MAPPOLoss >>> n_agents, obs_dim, action_dim = 3, 6, 2 >>> actor_net = torch.nn.Sequential( ... MultiAgentMLP( ... n_agent_inputs=obs_dim, n_agent_outputs=2 * action_dim, ... n_agents=n_agents, centralized=False, share_params=True, ... ), ... NormalParamExtractor(), ... ) >>> actor_module = TensorDictModule( ... actor_net, ... in_keys=[("agents", "observation")], ... out_keys=[("agents", "loc"), ("agents", "scale")], ... ) >>> actor = ProbabilisticActor( ... module=actor_module, ... in_keys=[("agents", "loc"), ("agents", "scale")], ... out_keys=[("agents", "action")], ... distribution_class=TanhNormal, ... ) >>> critic = TensorDictModule( ... MultiAgentMLP( ... n_agent_inputs=obs_dim, n_agent_outputs=1, ... n_agents=n_agents, centralized=True, share_params=True, ... ), ... in_keys=[("agents", "observation")], ... out_keys=[("agents", "state_value")], ... ) >>> loss = MAPPOLoss(actor, critic, value_norm=PopArtValueNorm(shape=1)) >>> loss.set_keys(value=("agents", "state_value"), action=("agents", "action"))
- forward(tensordict: TensorDictBase = None) TensorDictBase#
It is designed to read an input TensorDict and return another tensordict with loss keys named “loss*”.
Splitting the loss in its component can then be used by the trainer to log the various loss values throughout training. Other scalars present in the output tensordict will be logged too.
- Parameters:
tensordict – an input tensordict with the values required to compute the loss.
- Returns:
A new tensordict with no batch dimension containing various loss scalars which will be named “loss*”. It is essential that the losses are returned with this name as they will be read by the trainer before backpropagation.