
Llama4VisionEncoder

class torchtune.models.llama4.Llama4VisionEncoder(clip: Module, projection_head: Module)[source]

Vision encoder model for Llama 4. This combines a pretrained vision encoder with a learnable projection head. The projection head is converted to a fusion module and supports fusion utils.
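To illustrate the clip → projection_head composition described above, here is a minimal sketch using hypothetical toy modules (`ToyClip`, `ToyProjectionHead`, and the dimensions are stand-ins, not the real torchtune components):

```python
import torch
from torch import nn

class ToyClip(nn.Module):
    """Hypothetical stand-in for a CLIP-like vision encoder."""
    def __init__(self, embed_dim=16, embeds_per_tile=4):
        super().__init__()
        self.embed_dim = embed_dim
        self.embeds_per_tile = embeds_per_tile
        # Toy projection from flattened 3x8x8 images to per-tile embeddings
        self.proj = nn.Linear(3 * 8 * 8, embed_dim * embeds_per_tile)

    def forward(self, images):
        # [b, c, w, h] -> [b, embeds_per_tile, embed_dim]
        b = images.shape[0]
        flat = images.reshape(b, -1)
        return self.proj(flat).reshape(b, self.embeds_per_tile, self.embed_dim)

class ToyProjectionHead(nn.Module):
    """Hypothetical projection head: encoder dim -> decoder dim."""
    def __init__(self, encoder_dim=16, decoder_dim=32):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, decoder_dim)

    def forward(self, x):
        # [b, s, encoder_dim] -> [b, s, decoder_dim]
        return self.proj(x)

class ToyVisionEncoder(nn.Module):
    """Mimics the clip -> projection_head composition of Llama4VisionEncoder."""
    def __init__(self, clip, projection_head):
        super().__init__()
        self.clip = clip
        self.projection_head = projection_head

    def forward(self, images):
        return self.projection_head(self.clip(images))

encoder = ToyVisionEncoder(ToyClip(), ToyProjectionHead())
out = encoder(torch.randn(2, 3, 8, 8))  # b=2 is the flattened batch of tiles
print(out.shape)  # torch.Size([2, 4, 32]) -> [b x s x d]
```

The key design point is that the encoder itself is a plain composition: the projection head receives the CLIP embeddings unchanged, which is what allows it to be converted to a fusion module independently of the pretrained encoder.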

Parameters:
  • clip (nn.Module) – pretrained CLIP vision encoder model

  • projection_head (nn.Module) – projection head that maps the encoder's output embeddings to the decoder embedding dimension

forward(images: Tensor) → Tensor[source]
Parameters:

images (torch.Tensor) – Image tensor with shape [b x c x w x h]

Returns:

output tensor of a sequence of embeddings [b x s x d], where the sequence length (s) is (num_imgs*num_tiles)+num_embeds

Return type:

Tensor

Notation used for tensor shapes:
  • b: batch size, equal to flatten(batch x images x tiles)

  • c: number of image channels (e.g. rgb = 3)

  • w: image width

  • h: image height

  • s: sequence length, computed as i*t*clip_embeds_per_tile (i = number of images, t = number of tiles per image)

  • d: embed dim
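The shape arithmetic above can be checked with a short example (the values for num_imgs, num_tiles, and clip_embeds_per_tile are hypothetical, chosen only to illustrate the computation):

```python
# Hypothetical example values, just to illustrate the sequence-length arithmetic.
num_imgs = 2              # i: number of images
num_tiles = 4             # t: number of tiles per image
clip_embeds_per_tile = 144

# s = i * t * clip_embeds_per_tile
s = num_imgs * num_tiles * clip_embeds_per_tile
print(s)  # 1152
```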
