Llama4Transform¶
- class torchtune.models.llama4.Llama4Transform(path: str, *, tile_size: int, patch_size: int, max_num_tiles: int = 4, pixel_shuffle_scaling_factor: float = 0.5, special_tokens_path: Optional[str] = None, max_seq_len: Optional[int] = None, image_mean: Optional[list[float]] = None, image_std: Optional[list[float]] = None, dtype: dtype = torch.bfloat16, prompt_template: Optional[Union[str, dict[Literal['system', 'user', 'assistant', 'ipython', 'tool'], tuple[str, str]]]] = None)[source]¶
This transform combines the transforms for the different modalities of Llama 4. It is made up of the following transforms:
- torchtune.models.llama4.Llama4Tokenizer
- torchtune.models.clip.CLIPImageTransform
This transform can be used as a drop-in replacement for tokenizers in recipes and generation, but it handles additional transformations through the __call__ method.
- Parameters:
path (str) – Path to pretrained tiktoken tokenizer file.
tile_size (int) – Size of the tiles to divide the image into.
patch_size (int) – Size of the patches used in the CLIP vision transformer model. This is used to calculate the number of image embeddings per image.
max_num_tiles (int) – Only used if possible_resolutions is NOT given. Maximum number of tiles to break an image into. This will be used to generate possible_resolutions, e.g. [(224, 224), (224, 448), (448, 224)] if max_num_tiles = 2 and tile_size = 224 (see the sketch after this parameter list). Default 4.
pixel_shuffle_scaling_factor (float) – Scaling factor for pixel shuffle. Default is 0.5. You must ensure this matches the pixel shuffle scaling factor used in Llama4VisionProjectionHead if modified from the default.
special_tokens_path (Optional[str]) – Path to tokenizer.json from Hugging Face model files that contains all registered special tokens, or a local json file structured similarly. Default is None to use the canonical Llama3 special tokens.
max_seq_len (Optional[int]) – Maximum sequence length for tokenizing a single list of messages, after which the input will be truncated. Default is None.
image_mean (Optional[list[float]]) – Mean values of each channel, used for normalization.
image_std (Optional[list[float]]) – Standard deviations for each channel, used for normalization.
dtype (torch.dtype) – Data type of the transformed image. Default torch.bfloat16.
prompt_template (Optional[_TemplateType]) – Template used to format the messages based on their role. This is used to add structured text around the actual messages. The structured text is used in three scenarios:
- Task-specific templates to prime models for a particular task that they will expect after training
- Model-specific templates that are required whenever the model is prompted, such as the [INST] tags in Llama2 and Mistral
- Community standardized templates, such as ChatMLTemplate
The extra text will still get tokenized as normal text, not as special tokens. Default is None.
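The way max_num_tiles and tile_size combine into possible_resolutions can be illustrated with a minimal sketch. This is not the torchtune helper, just a hypothetical enumeration of tile grids whose tile count stays within max_num_tiles, each scaled by tile_size:

def possible_resolutions(max_num_tiles: int, tile_size: int) -> list[tuple[int, int]]:
    # Hypothetical sketch, not the torchtune implementation: enumerate every
    # (rows, cols) tile grid with at most max_num_tiles tiles, then scale each
    # grid by tile_size to get an image resolution in pixels.
    resolutions = []
    for n_tiles in range(1, max_num_tiles + 1):
        for n_rows in range(1, n_tiles + 1):
            if n_tiles % n_rows == 0:
                n_cols = n_tiles // n_rows
                resolutions.append((n_rows * tile_size, n_cols * tile_size))
    return resolutions

print(possible_resolutions(max_num_tiles=2, tile_size=224))
# [(224, 224), (224, 448), (448, 224)]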
Examples
>>> model_transform = Llama4Transform("/path/to/tokenizer.model", tile_size=224, patch_size=14)
>>> transformed_data = model_transform({"messages": user_message, "images": [img1, img2]})
>>> print(transformed_data["tokens"])
[1, 31587, 29644, 102, 2]
>>> print(transformed_data["images"][0].shape)
torch.Size([4, 3, 224, 224])
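As a rough back-of-the-envelope sketch of how tile_size, patch_size, and pixel_shuffle_scaling_factor interact (an assumption about the token accounting, not torchtune's exact count, which may add special or thumbnail tokens):
>>> tile_size, patch_size, pixel_shuffle_scaling_factor = 224, 14, 0.5
>>> patches_per_tile = (tile_size // patch_size) ** 2  # 16 x 16 patch grid per tile
>>> patches_per_tile
256
>>> int(patches_per_tile * pixel_shuffle_scaling_factor ** 2)  # image embeddings per tile after pixel shuffle
64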
- decode(token_ids: list[int], truncate_at_eos: bool = True, skip_special_tokens: bool = True) → str [source]¶
Decode a list of token ids into a string.
- Parameters:
token_ids (list[int]) – The input token ids to be decoded.
truncate_at_eos (bool) – Whether to truncate the string at the end-of-sequence token. Default is True.
skip_special_tokens (bool) – Whether to skip special tokens in the decoded string. Default is True.
- Returns:
The decoded string.
- Return type:
str
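A minimal usage sketch, reusing the token ids from the Examples above (the decoded text depends on the tokenizer file, so no output is shown):
>>> text = model_transform.decode([1, 31587, 29644, 102, 2], skip_special_tokens=True)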
- tokenize_message(message: Message, add_start_tokens: bool = True, add_end_tokens: bool = True) → list[int] [source]¶
Tokenize a message into a list of token ids.
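A brief sketch, assuming a Message built with torchtune.data.Message (the string-content form shown here is an assumption about the Message API):
>>> from torchtune.data import Message
>>> msg = Message(role="user", content="What is in this image?")
>>> token_ids = model_transform.tokenize_message(msg, add_start_tokens=True, add_end_tokens=True)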