VocabTailActionTokenizer

class torchrl.data.vla.VocabTailActionTokenizer(num_bins: int = 256, *, full_vocab_size: int | None = None, norm_low: Tensor | None = None, norm_high: Tensor | None = None, norm_mask: Tensor | None = None, gripper_binarize: bool = False, gripper_binarize_threshold: float = 0.0, gripper_invert: bool = False)

OpenVLA-style vocab-tail action tokenizer.

OpenVLA (arXiv:2406.09246) discretizes each normalized action dimension against num_bins uniformly spaced bin edges spanning [-1, 1] and writes the result into the last num_bins ids of the language-model vocabulary: full_token_id = vocab_size - digitize(action). Decoding maps a token back to the corresponding bin center (num_bins edges delimit num_bins - 1 bins, hence num_bins - 1 centers). This tokenizer reproduces that exact mapping, with two id conventions:

window ids (default, full_vocab_size=None): ids in [0, num_bins) – the offset of the token inside the vocab-tail window, window_id = num_bins - digitize(action). This is the convention of a token-head VLA policy emitting a num_bins-way categorical per action dimension (e.g. VLAWrapperBase with vocab_size=num_bins).
full ids: pass full_vocab_size (e.g. 32000 for LLaMA-2) to use raw language-model token ids, full_id = full_vocab_size - digitize(action).
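
The two id conventions above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction of the mapping as described here, not the TorchRL implementation: the helper names (make_edges, digitize, encode, decode) are hypothetical, bisect_right stands in for np.digitize on increasing edges, and clipping the action to [-1, 1] before digitizing is an assumption taken from the OpenVLA reference code.

```python
from bisect import bisect_right


def make_edges(num_bins=256):
    # num_bins uniformly spaced bin edges on [-1, 1] (hypothetical helper).
    return [-1.0 + 2.0 * k / (num_bins - 1) for k in range(num_bins)]


def digitize(action, edges):
    # Number of edges <= action, i.e. np.digitize for increasing edges.
    # Assumption: the action is clipped to [-1, 1] first.
    clipped = min(max(action, -1.0), 1.0)
    return bisect_right(edges, clipped)


def encode(action, num_bins=256, full_vocab_size=None):
    # Window ids count back from num_bins, full ids from full_vocab_size.
    base = num_bins if full_vocab_size is None else full_vocab_size
    return base - digitize(action, make_edges(num_bins))


def decode(token, num_bins=256, full_vocab_size=None):
    edges = make_edges(num_bins)
    base = num_bins if full_vocab_size is None else full_vocab_size
    # Recover the digitize index, shift to 0-based, and clamp it into
    # the num_bins - 1 available bin centers.
    idx = min(max(base - token - 1, 0), num_bins - 2)
    return 0.5 * (edges[idx] + edges[idx + 1])


actions = (-1.0, 0.0, 1.0)
window_ids = [encode(a) for a in actions]
full_ids = [encode(a, full_vocab_size=32000) for a in actions]
decoded = [decode(t) for t in window_ids]
print(window_ids)  # [255, 128, 0]
print(full_ids)    # [31999, 31872, 31744]
```

Under these assumptions the sketch yields the same ids as the Examples section of this page, and the decoded values are the bin centers nearest to -1, 0 and 1.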

Optionally, dataset statistics (the norm_stats shipped with OpenVLA checkpoints) are used to normalize actions before encoding and to un-normalize decoded actions back to the environment’s action space. Un-normalization is the affine q01/q99 map a_env = 0.5 * (a + 1) * (q99 - q01) + q01, applied only to the dimensions selected by the mask (norm_mask); the gripper dimension is typically excluded. See from_norm_stats().
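
As a worked illustration of the affine q01/q99 map, the following sketch applies it per dimension and also shows its plain algebraic inverse. The function names and the sample statistics are hypothetical, and the inverse is an assumption: the library may clip or add a small epsilon when normalizing, which is not shown here.

```python
def unnormalize(a, q01, q99, mask):
    # a_env = 0.5 * (a + 1) * (q99 - q01) + q01 on masked dimensions;
    # dimensions where the mask is False pass through unchanged.
    return [
        0.5 * (x + 1.0) * (hi - lo) + lo if m else x
        for x, lo, hi, m in zip(a, q01, q99, mask)
    ]


def normalize(a_env, q01, q99, mask):
    # Algebraic inverse of the map above (assumption: no clipping or epsilon).
    return [
        2.0 * (x - lo) / (hi - lo) - 1.0 if m else x
        for x, lo, hi, m in zip(a_env, q01, q99, mask)
    ]


# Made-up statistics for two arm dimensions plus a gripper dimension.
q01 = [-0.2, -0.1, 0.0]
q99 = [0.2, 0.3, 1.0]
mask = [True, True, False]  # gripper excluded from (un-)normalization

a = [-1.0, 0.0, 1.0]
a_env = unnormalize(a, q01, q99, mask)       # approximately [-0.2, 0.1, 1.0]
round_trip = normalize(a_env, q01, q99, mask)  # approximately a again
```

A normalized value of -1 lands on q01, +1 lands on q99, and 0 lands on their midpoint, which is why the tokenizer's [-1, 1] bin range covers the 1st-to-99th percentile range of the dataset actions.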

Parameters:

num_bins (int) – number of bin edges per action dimension (the OpenVLA convention; there are num_bins - 1 bin centers). Defaults to 256.

Keyword Arguments:

full_vocab_size (int, optional) – if provided, tokens are raw language-model ids in [full_vocab_size - num_bins, full_vocab_size) instead of window offsets. Defaults to None.
norm_low (torch.Tensor, optional) – per-dimension lower statistics (q01) for un-normalization. Defaults to None (no normalization; actions live in [-1, 1]).
norm_high (torch.Tensor, optional) – per-dimension upper statistics (q99) for un-normalization. Defaults to None.
norm_mask (torch.Tensor, optional) – boolean mask of the dimensions to (un-)normalize; unmasked dimensions (where the mask is False) pass through unchanged. Defaults to all True when statistics are given.
gripper_binarize (bool, optional) – if True, binarize unmasked dimensions (usually gripper) to -1 / +1 after decoding. Defaults to False.
gripper_binarize_threshold (float, optional) – threshold used for gripper binarization: values strictly above this threshold map to +1, the rest to -1. Defaults to 0.0.
gripper_invert (bool, optional) – if True, flip the sign of unmasked dimensions after optional binarization. Defaults to False.
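
The three gripper options can be read as a small post-processing step on decoded actions. The sketch below follows the parameter descriptions above (binarize first, then invert, both only on unmasked dimensions); the function name postprocess_gripper is hypothetical and the code is an illustration of the described behavior, not the library's implementation.

```python
def postprocess_gripper(action, norm_mask, binarize=False, threshold=0.0, invert=False):
    out = list(action)
    for i, masked in enumerate(norm_mask):
        if masked:
            continue  # normalized dimensions are left untouched
        if binarize:
            # Strictly above the threshold maps to +1, everything else to -1.
            out[i] = 1.0 if out[i] > threshold else -1.0
        if invert:
            # Sign flip happens after the optional binarization.
            out[i] = -out[i]
    return out


mask = [True, True, False]  # last dimension is the gripper
decoded = [0.3, -0.2, 0.4]

print(postprocess_gripper(decoded, mask, binarize=True))
# [0.3, -0.2, 1.0]
print(postprocess_gripper(decoded, mask, binarize=True, invert=True))
# [0.3, -0.2, -1.0]
print(postprocess_gripper([0.3, -0.2, 0.0], mask, binarize=True))
# [0.3, -0.2, -1.0]  (a value equal to the threshold maps to -1)
```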

Examples

>>> import torch
>>> from torchrl.data.vla import VocabTailActionTokenizer
>>> tok = VocabTailActionTokenizer(256)
>>> tokens = tok.encode(torch.tensor([-1.0, 0.0, 1.0]))
>>> tokens
tensor([255, 128,   0])
>>> tok.decode(tokens)
tensor([-0.9961,  0.0000,  0.9961])
>>> # full LM-vocabulary ids (LLaMA-2)
>>> tok = VocabTailActionTokenizer(256, full_vocab_size=32000)
>>> tok.encode(torch.tensor([-1.0, 0.0, 1.0]))
tensor([31999, 31872, 31744])
>>> tok.vocab_size
32000
