
llama4_decoder

torchtune.models.llama4.llama4_decoder(*, vocab_size: int, num_layers: int, num_heads: int, num_kv_heads: int, embed_dim: int, hidden_dim: int, max_seq_len: int, attn_dropout: float = 0.0, rope_base: int = 500000, norm_eps: float = 1e-05, num_experts: int = 16, experts_per_token: int = 1, use_shared_expert: bool = True, use_qk_norm: bool = True, moe_every_n_layers: Optional[int] = None, mlp_hidden_dim: Optional[int] = None, skip_rope_interval: Optional[int] = None, attention_chunk_size: Optional[int] = None, use_scaled_rope: bool = False, rope_scale_factor: Optional[float] = 16.0, rope_low_freq_factor: Optional[float] = 1.0, rope_high_freq_factor: Optional[float] = 1.0, old_context_len: Optional[int] = 8192) → TransformerDecoder

Build the decoder associated with the MoE model. This includes:
  • Token embeddings
  • num_layers number of TransformerSelfAttentionLayer blocks
  • RMS Norm layer applied to the output of the transformer
  • Final projection into token space
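
Example (illustrative): a minimal sketch of calling this builder with small placeholder sizes for a quick smoke test. These values are assumptions for demonstration only, not the official Scout or Maverick configurations; all arguments are keyword-only, as in the signature above.

    from torchtune.models.llama4 import llama4_decoder

    decoder = llama4_decoder(
        vocab_size=32_000,
        num_layers=4,
        num_heads=8,
        num_kv_heads=2,      # GQA: num_heads % num_kv_heads == 0
        embed_dim=512,
        hidden_dim=1024,     # hidden dim of each MoE layer
        max_seq_len=2048,
        num_experts=4,
        experts_per_token=1,
    )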

Parameters:
  • vocab_size (int) – number of tokens in vocabulary.

  • num_layers (int) – number of layers in the transformer decoder.

  • num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.

  • num_kv_heads (int) – number of key and value heads. User should ensure num_heads % num_kv_heads == 0. For standard MHA set num_kv_heads == num_heads, for GQA num_kv_heads < num_heads, and for MQA set num_kv_heads == 1.

  • embed_dim (int) – embedding dimension for self-attention.

  • hidden_dim (int) – hidden dimension for the MoeLayer.

  • max_seq_len (int) – maximum sequence length the model will be run with, as used by KVCache().

  • attn_dropout (float) – dropout value passed onto scaled_dot_product_attention. Default: 0.0

  • rope_base (int) – base for the rotary positional embeddings. Default: 500_000

  • norm_eps (float) – epsilon in RMS norms. Default: 1e-5

  • num_experts (int) – Number of experts in each MoE layer. Default: 16

  • experts_per_token (int) – Number of experts each token will choose in Token Choice. Default: 1

  • use_shared_expert (bool) – Whether to use a shared expert or not. Default: True

  • use_qk_norm (bool) – Whether to use qk norm in RoPE layers. Default: True

  • moe_every_n_layers (Optional[int]) – Frequency of inserting MoE layers in the decoder. If set, every nth layer will be an MoE layer and the remaining layers will be standard MLP layers (see the sketch after this parameter list). Default: every layer is an MoE layer.

  • mlp_hidden_dim (Optional[int]) – Hidden dim for any MLP (i.e. non-MoE) layers. Only applicable if moe_every_n_layers is not None.

  • skip_rope_interval (Optional[int]) – Frequency of inserting local attention layers in the decoder. If set, every nth layer will use local attention. Default is to always use vanilla attention.

  • attention_chunk_size (Optional[int]) – Size of chunks for local attention. Required if skip_rope_interval is set.

  • use_scaled_rope (bool) – Whether to use scaled RoPE or not. Scaled RoPE is used for the Llama4 Scout model, but not the Maverick model. Default: False

  • rope_scale_factor (Optional[float]) – scaling factor for RoPE. Only applicable if use_scaled_rope=True. Default: 16.0

  • rope_low_freq_factor (Optional[float]) – scaling factor for low frequency RoPE. Only applicable if use_scaled_rope=True. Default: 1.0

  • rope_high_freq_factor (Optional[float]) – scaling factor for high frequency RoPE. Only applicable if use_scaled_rope=True. Default: 1.0

  • old_context_len (Optional[int]) – old context length for scaling theta. Only applicable if use_scaled_rope=True. Default: 8192
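
The optional arguments above come in pairs: moe_every_n_layers is used together with mlp_hidden_dim (for the non-MoE layers), skip_rope_interval is used together with attention_chunk_size, and the rope_* / old_context_len arguments only take effect when use_scaled_rope=True. The sketch below wires them together with illustrative values only; it is not a released Llama4 configuration.

    decoder = llama4_decoder(
        vocab_size=32_000,
        num_layers=8,
        num_heads=8,
        num_kv_heads=2,
        embed_dim=512,
        hidden_dim=1024,            # hidden dim of the MoE layers
        max_seq_len=8192,
        moe_every_n_layers=2,       # every 2nd layer is an MoE layer ...
        mlp_hidden_dim=2048,        # ... so the remaining MLP layers need their own hidden dim
        skip_rope_interval=4,       # enables the local-attention pattern described above
        attention_chunk_size=2048,  # required whenever skip_rope_interval is set
        use_scaled_rope=True,       # Scout-style RoPE scaling
        rope_scale_factor=16.0,
        rope_low_freq_factor=1.0,
        rope_high_freq_factor=1.0,
        old_context_len=8192,
    )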

Returns:

Instantiation of the MoE model.

Return type:

TransformerDecoder
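
Usage note: the returned TransformerDecoder maps integer token ids to logits over the vocabulary. With the small illustrative model built in the first example, a forward pass would look roughly like this (shapes assume those placeholder sizes):

    import torch

    tokens = torch.randint(0, 32_000, (2, 128))  # [batch_size, seq_len]
    logits = decoder(tokens)                     # [batch_size, seq_len, vocab_size]
    print(logits.shape)                          # torch.Size([2, 128, 32000])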
