VocabTailActionTokenizer#
- class torchrl.data.vla.VocabTailActionTokenizer(num_bins: int = 256, *, full_vocab_size: int | None = None, norm_low: Tensor | None = None, norm_high: Tensor | None = None, norm_mask: Tensor | None = None, gripper_binarize: bool = False, gripper_binarize_threshold: float = 0.0, gripper_invert: bool = False)[source]#
OpenVLA-style vocab-tail action tokenizer.
OpenVLA (arXiv:2406.09246) discretizes each normalized action dimension over the edges of
num_binsuniform bins spanning[-1, 1]and writes the result into the lastnum_binsids of the language-model vocabulary:full_token_id = vocab_size - digitize(action). Decoding maps a token back to the corresponding bin center (there arenum_bins - 1centers). This tokenizer reproduces that exact mapping, with two id conventions:window ids (default,
full_vocab_size=None): ids in[0, num_bins)– the offset of the token inside the vocab-tail window,window_id = num_bins - digitize(action). This is the convention of a token-head VLA policy emitting anum_bins-way categorical per action dimension (e.g.VLAWrapperBasewithvocab_size=num_bins).full ids: pass
full_vocab_size(e.g.32000for LLaMA-2) to use raw language-model token ids,full_id = full_vocab_size - digitize(action).
Optionally, dataset statistics (the
norm_statsshipped with OpenVLA checkpoints) un-normalize decoded actions to the environment’s action space – and normalize actions before encoding – via the affine q01/q99 mapa_env = 0.5 * (a + 1) * (q99 - q01) + q01applied to the dimensions selected bymask(the gripper dimension is typically excluded). Seefrom_norm_stats().- Parameters:
num_bins (int) – number of bin edges per action dimension (the OpenVLA convention; there are
num_bins - 1bin centers). Defaults to256.- Keyword Arguments:
full_vocab_size (int, optional) – if provided, tokens are raw language-model ids in
[full_vocab_size - num_bins, full_vocab_size)instead of window offsets. Defaults toNone.norm_low (torch.Tensor, optional) – per-dimension lower statistics (
q01) for un-normalization. Defaults toNone(no normalization; actions live in[-1, 1]).norm_high (torch.Tensor, optional) – per-dimension upper statistics (
q99).norm_mask (torch.Tensor, optional) – boolean mask of the dimensions to (un-)normalize; unmasked dimensions pass through. Defaults to all
Truewhen statistics are given.gripper_binarize (bool, optional) – if
True, binarize unmasked dimensions (usually gripper) to-1/+1after decoding. Defaults toFalse.gripper_binarize_threshold (float, optional) – threshold used for gripper binarization: values strictly above this threshold map to
+1, the rest to-1. Defaults to0.0.gripper_invert (bool, optional) – if
True, flip the sign of unmasked dimensions after optional binarization. Defaults toFalse.
Examples
>>> import torch >>> from torchrl.data.vla import VocabTailActionTokenizer >>> tok = VocabTailActionTokenizer(256) >>> tokens = tok.encode(torch.tensor([-1.0, 0.0, 1.0])) >>> tokens tensor([255, 128, 0]) >>> tok.decode(tokens) tensor([-0.9961, 0.0000, 0.9961]) >>> # full LM-vocabulary ids (LLaMA-2) >>> tok = VocabTailActionTokenizer(256, full_vocab_size=32000) >>> tok.encode(torch.tensor([-1.0, 0.0, 1.0])) tensor([31999, 31872, 31744]) >>> tok.vocab_size 32000
See also
UniformActionTokenizerfor the plain bin-index codec used by toy token policies.- encode(actions: Tensor) Tensor[source]#
Map continuous actions
[..., action_dim]to token ids (long).
- classmethod from_norm_stats(norm_stats: dict, unnorm_key: str, *, num_bins: int = 256, full_vocab_size: int | None = None, gripper_binarize: bool = False, gripper_binarize_threshold: float = 0.0, gripper_invert: bool = False) VocabTailActionTokenizer[source]#
Build from the
norm_statsdictionary of an OpenVLA checkpoint.- Parameters:
norm_stats (dict) – the checkpoint’s normalization statistics (
model.norm_stats), mapping dataset keys to{"action": {"q01": ..., "q99": ..., "mask": ...}}.unnorm_key (str) – the dataset key to use (e.g.
"libero_spatial_no_noops").num_bins (int, optional) – number of bin edges. Defaults to
256.full_vocab_size (int, optional) – raw language-model vocabulary size when using full token ids. Defaults to
None.gripper_binarize (bool, optional) – whether to binarize unmasked gripper dimensions after decoding. Defaults to
False.gripper_binarize_threshold (float, optional) – threshold used for gripper binarization. Defaults to
0.0.gripper_invert (bool, optional) – whether to invert unmasked gripper dimensions after optional binarization. Defaults to
False.
- property vocab_size: int#
Number of distinct token ids the tokenizer can emit per position.