Llama4Tokenizer

class torchtune.models.llama4.Llama4Tokenizer(path: str, special_tokens: Optional[dict[str, int]] = None, max_seq_len: Optional[int] = None, prompt_template: Optional[PromptTemplateInterface] = None)

A tiktoken tokenizer configured with Llama4 Instruct's special tokens, as described in https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/.

Parameters:
  • path (str) – Path to pretrained tiktoken tokenizer file.

  • special_tokens (Optional[dict[str, int]]) – mapping containing special text tokens and their registered token IDs. If left as None, this will be set to the canonical Llama4 special tokens.

  • max_seq_len (Optional[int]) – maximum sequence length for tokenizing a single list of messages, after which the input will be truncated. Default is None.

  • prompt_template (Optional[PromptTemplateInterface]) –

Template used to format the messages based on their role. This adds structured text around the actual message content. The structured text is used in three scenarios:

• Task-specific templates that prime the model for a particular task it will expect after training

• Model-specific templates that are required whenever the model is prompted, such as the [INST] tags in Llama2 and Mistral

    • Community standardized templates, such as ChatMLTemplate

    The extra text will still get tokenized as normal text, not as special tokens. Default is None.

Examples

>>> tokenizer = Llama4Tokenizer("/path/to/tt_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]
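
The constructor arguments above can be combined; a minimal sketch (the checkpoint path is a placeholder, and ChatMLTemplate stands in for any PromptTemplateInterface implementation):

>>> from torchtune.data import ChatMLTemplate
>>> tokenizer = Llama4Tokenizer(
...     "/path/to/tt_model",
...     max_seq_len=2048,
...     prompt_template=ChatMLTemplate(),
... )

With max_seq_len set, tokenized message lists longer than 2048 tokens are truncated; with a prompt template set, the role-based structured text is applied before tokenization.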
decode(token_ids: list[int], truncate_at_eos: bool = True, skip_special_tokens: bool = True) → str

Decode a list of token ids into a string.

Parameters:
  • token_ids (list[int]) – The list of token ids.

  • truncate_at_eos (bool) – Whether to truncate the string at the end of sequence token. Default is True.

  • skip_special_tokens (bool) – Whether to skip special tokens in the decoded string. Default is True.

Returns:

The decoded string.

Return type:

str
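
A round-trip sketch using the tokenizer constructed above (token ids are illustrative, not real Llama4 ids):

>>> token_ids = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> tokenizer.decode(token_ids, skip_special_tokens=True)
'Hello world!'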

tokenize_message(message: Message, *, add_start_tokens: bool = True, add_end_tokens: bool = True) → list[int]

Tokenize a message into a list of token ids.

Parameters:
  • message (Message) – The message to tokenize.

  • add_start_tokens (bool) – Whether to prepend a tokenized header to the message. Default is True.

  • add_end_tokens (bool) – Whether to append the end-of-turn (eot) or end-of-message (eom) token id at the end of the message. Default is True.

Returns:

The list of token ids.

Return type:

list[int]
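
A sketch of single-message tokenization (output ids are illustrative):

>>> from torchtune.data import Message
>>> msg = Message(role="user", content="Hello world!", masked=True)
>>> tokenizer.tokenize_message(msg, add_start_tokens=True, add_end_tokens=True)
[1, 31587, 29644, 102, 2]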

tokenize_messages(messages: list[torchtune.data._messages.Message], *, add_end_tokens: bool = True) → tuple[list[int], list[bool]]

Tokenize a list of messages into a list of token ids and masks.

Parameters:
  • messages (list[Message]) – The list of messages to tokenize.

  • add_end_tokens (bool) – Whether to append end tokens ids (end-of-seq, end-of-turn, end-of-message) at the end of the last assistant message. This value should be set to False for generation. Default is True.

Examples

>>> # Tokenize a list of messages with default settings
>>> messages = [
...     Message(role="user", content="Hello world!", masked=True),
...     Message(role="assistant", content="How are you?", masked=False),
... ]
>>> tokenizer = Llama4Tokenizer("/path/to/tt_model")
>>> tokenizer.tokenize_messages(messages)
([1, 31587, 29644, 102, 1, 31587, 29644, 102, 2], [True, True, True, True, True, False, False, False, True])
>>> # Tokenize a list of messages with add_end_tokens set to False
>>> tokenizer.tokenize_messages(messages, add_end_tokens=False)
([1, 31587, 29644, 102, 1, 31587, 29644], [True, True, True, True, True, False, False])
Returns:

The list of token ids and the list of masks.

Return type:

tuple[list[int], list[bool]]
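
The returned mask marks tokens from masked messages (typically user and system turns) as True; a common use is converting it into loss labels. A sketch, assuming torchtune's cross-entropy ignore-index convention of -100:

>>> tokens, mask = tokenizer.tokenize_messages(messages)
>>> # Replace masked positions so they are ignored by the cross-entropy loss
>>> labels = [-100 if is_masked else tok for tok, is_masked in zip(tokens, mask)]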
