
llama4_decoder

torchtune.models.llama4.llama4_decoder(*, vocab_size: int, num_layers: int, num_heads: int, num_kv_heads: int, embed_dim: int, hidden_dim: int, max_seq_len: int, attn_dropout: float = 0.0, rope_base: int = 500000, norm_eps: float = 1e-05, num_experts: int = 16, experts_per_token: int = 1, use_shared_expert: bool = True, use_qk_norm: bool = True, moe_every_n_layers: Optional[int] = None, mlp_hidden_dim: Optional[int] = None, skip_rope_interval: Optional[int] = None, attention_chunk_size: Optional[int] = None, use_scaled_rope: bool = False, rope_scale_factor: Optional[float] = 16.0, rope_low_freq_factor: Optional[float] = 1.0, rope_high_freq_factor: Optional[float] = 1.0, old_context_len: Optional[int] = 8192) → TransformerDecoder

Build the decoder associated with the MoE model. This includes:
  • Token embeddings
  • num_layers number of TransformerSelfAttentionLayer blocks
  • RMS Norm layer applied to the output of the transformer
  • Final projection into token space
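
Example (illustrative): a minimal sketch of calling this builder with small placeholder sizes for a quick smoke test. These values are assumptions for demonstration only, not the official Scout or Maverick configurations; all arguments are keyword-only, as in the signature above.

    from torchtune.models.llama4 import llama4_decoder

    decoder = llama4_decoder(
        vocab_size=32_000,
        num_layers=4,
        num_heads=8,
        num_kv_heads=2,      # GQA: num_heads % num_kv_heads == 0
        embed_dim=512,
        hidden_dim=1024,     # hidden dim of each MoE layer
        max_seq_len=2048,
        num_experts=4,
        experts_per_token=1,
    )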

Parameters:
  • vocab_size (int) – number of tokens in vocabulary.

  • num_layers (int) – number of layers in the transformer decoder.

  • num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.

  • num_kv_heads (int) – number of key and value heads. User should ensure num_heads % num_kv_heads == 0. For standard MHA set num_kv_heads == num_heads, for GQA num_kv_heads < num_heads, and for MQA set num_kv_heads == 1.

  • embed_dim (int) – embedding dimension for self-attention.

  • hidden_dim (int) – hidden dimension for the MoeLayer.

  • max_seq_len (int) – maximum sequence length the model will be run with, as used by KVCache().

  • attn_dropout (float) – dropout value passed onto scaled_dot_product_attention. Default: 0.0

  • rope_base (int) – base for the rotary positional embeddings. Default: 500_000

  • norm_eps (float) – epsilon in RMS norms. Default: 1e-5

  • num_experts (int) – Number of experts in each MoE layer. Default: 16

  • experts_per_token (int) – Number of experts each token will choose in Token Choice. Default: 1

  • use_shared_expert (bool) – Whether to use a shared expert or not. Default: True

  • use_qk_norm (bool) – Whether to use qk norm in RoPE layers. Default: True

  • moe_every_n_layers (Optional[int]) – Frequency of inserting MoE layers in the decoder. If set, every nth layer will be an MoE layer and the remaining layers will be standard MLP layers (see the sketch after this parameter list). Default: every layer is an MoE layer.

  • mlp_hidden_dim (Optional[int]) – Hidden dim for any MLP (i.e. non-MoE) layers. Only applicable if moe_every_n_layers is not None.

  • skip_rope_interval (Optional[int]) – Frequency of inserting local attention layers in the decoder. If set, every nth layer will use local attention. Default is to always use vanilla attention.

  • attention_chunk_size (Optional[int]) – Size of chunks for local attention. Required if skip_rope_interval is set.

  • use_scaled_rope (bool) – Whether to use scaled RoPE or not. Scaled RoPE is used for the Llama4 Scout model, but not the Maverick model. Default: False

  • rope_scale_factor (Optional[float]) – scaling factor for RoPE. Only applicable if use_scaled_rope=True. Default: 16.0

  • rope_low_freq_factor (Optional[float]) – scaling factor for low frequency RoPE. Only applicable if use_scaled_rope=True. Default: 1.0

  • rope_high_freq_factor (Optional[float]) – scaling factor for high frequency RoPE. Only applicable if use_scaled_rope=True. Default: 1.0

  • old_context_len (Optional[int]) – old context length for scaling theta. Only applicable if use_scaled_rope=True. Default: 8192
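
The optional arguments above come in pairs: moe_every_n_layers is used together with mlp_hidden_dim (for the non-MoE layers), skip_rope_interval is used together with attention_chunk_size, and the rope_* / old_context_len arguments only take effect when use_scaled_rope=True. The sketch below wires them together with illustrative values only; it is not a released Llama4 configuration.

    decoder = llama4_decoder(
        vocab_size=32_000,
        num_layers=8,
        num_heads=8,
        num_kv_heads=2,
        embed_dim=512,
        hidden_dim=1024,            # hidden dim of the MoE layers
        max_seq_len=8192,
        moe_every_n_layers=2,       # every 2nd layer is an MoE layer ...
        mlp_hidden_dim=2048,        # ... so the remaining MLP layers need their own hidden dim
        skip_rope_interval=4,       # enables the local-attention pattern described above
        attention_chunk_size=2048,  # required whenever skip_rope_interval is set
        use_scaled_rope=True,       # Scout-style RoPE scaling
        rope_scale_factor=16.0,
        rope_low_freq_factor=1.0,
        rope_high_freq_factor=1.0,
        old_context_len=8192,
    )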

Returns:

Instantiation of the MoE model.

Return type:

TransformerDecoder
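
Usage note: the returned TransformerDecoder maps integer token ids to logits over the vocabulary. With the small illustrative model built in the first example, a forward pass would look roughly like this (shapes assume those placeholder sizes):

    import torch

    tokens = torch.randint(0, 32_000, (2, 128))  # [batch_size, seq_len]
    logits = decoder(tokens)                     # [batch_size, seq_len, vocab_size]
    print(logits.shape)                          # torch.Size([2, 128, 32000])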
