samsum_dataset

torchtune.datasets.samsum_dataset(tokenizer: Tokenizer, train_on_input: bool = False) -> InstructDataset

Support for the SAMSum summarization dataset and its variants from Hugging Face Datasets: https://huggingface.co/datasets/samsum

Data input format: https://huggingface.co/datasets/samsum#data-fields

The prompt template is derived from the llama_recipes codebase: https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/datasets/samsum_dataset.py#L13

Here, dialogue and summary are fields from the dataset.

Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default:
  • If train_on_input is True, the prompt is used during training and contributes to the loss.

  • If train_on_input is False, the prompt is masked out (tokens replaced with -100).
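The masking rule above can be illustrated with a minimal sketch in plain Python. This is not torchtune's implementation: build_labels and the token IDs are hypothetical, made up for illustration. The only outside fact assumed is that -100 is the default ignore_index of PyTorch's cross-entropy loss, so label positions set to -100 do not contribute to the loss.

```python
# Label value that the loss skips (default ignore_index in PyTorch's cross-entropy loss).
CROSS_ENTROPY_IGNORE_IDX = -100


def build_labels(prompt_tokens, response_tokens, train_on_input=False):
    """Hypothetical helper: return (tokens, labels) for one sample.

    tokens is always prompt + response. labels mirrors tokens, except that
    the prompt positions are replaced with -100 when train_on_input is False.
    """
    tokens = list(prompt_tokens) + list(response_tokens)
    if train_on_input:
        # Prompt contributes to the loss: labels are a copy of the tokens.
        labels = list(tokens)
    else:
        # Prompt is masked out: only the response positions carry real labels.
        labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + list(response_tokens)
    return tokens, labels


# Made-up token IDs for a prompt and its target summary.
prompt = [101, 7, 8, 9]
response = [42, 43, 2]

masked_tokens, masked_labels = build_labels(prompt, response)
unmasked_tokens, unmasked_labels = build_labels(prompt, response, train_on_input=True)

print(masked_labels)
print(unmasked_labels)
```

In both cases the model sees the same input tokens; only the labels differ, which is why the flag changes what is learned without changing what is fed to the model.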

Parameters:
  • tokenizer (Tokenizer) – Tokenizer used to encode data. The tokenizer must implement an encode and a decode method.

  • train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
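As a rough illustration of the encode/decode requirement on the tokenizer, the sketch below defines a toy byte-level tokenizer. TokenizerLike, ByteTokenizer, and round_trip are hypothetical names, and the method signatures are simplified assumptions; a real torchtune tokenizer may take additional arguments.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Hypothetical structural type: anything with encode and decode methods."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, token_ids: List[int]) -> str: ...


class ByteTokenizer:
    """Toy tokenizer for illustration only: one token per UTF-8 byte."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, token_ids: List[int]) -> str:
        return bytes(token_ids).decode("utf-8")


def round_trip(tokenizer: TokenizerLike, text: str) -> str:
    """Encode and then decode text with any object satisfying TokenizerLike."""
    return tokenizer.decode(tokenizer.encode(text))


toy = ByteTokenizer()
ids = toy.encode("Hi!")
restored = round_trip(toy, "Hi!")
print(ids, restored)
```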

Returns:

Dataset configured with the summarization source data and template.

Return type:

InstructDataset

Example

>>> from torch.utils.data import DataLoader
>>> from torchtune.datasets import samsum_dataset
>>> samsum_ds = samsum_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(samsum_ds, batch_size=8):
...     print(f"Batch size: {len(batch)}")
Batch size: 8
