alpaca_dataset
- torchtune.datasets.alpaca_dataset(tokenizer: Tokenizer, train_on_input: bool = True, max_seq_len: int = 512) → InstructDataset
Support for the Alpaca dataset from Hugging Face Datasets. https://huggingface.co/datasets/tatsu-lab/alpaca
Data input format: https://huggingface.co/datasets/tatsu-lab/alpaca#data-instances
The input is created using the prompt template from the original Alpaca codebase: https://github.com/tatsu-lab/stanford_alpaca/blob/761dc5bfbdeeffa89b8bff5d038781a4055f796a/train.py#L31
where instruction, input, and output are fields from the dataset.
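For reference, the template from that file formats a sample with a non-empty input field roughly as follows (paraphrased for illustration; see the linked train.py for the exact strings, including the variant used when input is empty):
>>> prompt_template = (
...     "Below is an instruction that describes a task, paired with an input that "
...     "provides further context. Write a response that appropriately completes "
...     "the request.\n\n"
...     "### Instruction:\n{instruction}\n\n"
...     "### Input:\n{input}\n\n"
...     "### Response:\n"
... )
>>> # The dataset's output field is appended after the template as the training target.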
Masking of the prompt during training is controlled by the train_on_input flag, which is set to True by default (ref: https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L49).
- If train_on_input is True, the prompt is used during training and contributes to the loss.
- If train_on_input is False, the prompt is masked out (tokens replaced with -100).
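As a minimal sketch of what this masking means in practice (hypothetical token ids; -100 is the default ignore index of PyTorch's cross-entropy loss):
>>> prompt_tokens = [1, 345, 678]      # hypothetical tokenized prompt
>>> response_tokens = [910, 1112, 2]   # hypothetical tokenized response
>>> input_ids = prompt_tokens + response_tokens
>>> # train_on_input=True: labels match input_ids, so the prompt contributes to the loss
>>> labels = input_ids
>>> # train_on_input=False: prompt positions are set to -100 and ignored by the loss
>>> labels = [-100] * len(prompt_tokens) + response_tokens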
- Parameters:
tokenizer (Tokenizer) – Tokenizer used to encode data. The tokenizer must implement encode and decode methods.
train_on_input (bool) – Whether the model is trained on the prompt or not. Default is True.
max_seq_len (int) – Maximum number of tokens in the returned input and label token id lists. Default is 512, as set by Stanford Alpaca (https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#fine-tuning), but we recommend setting this to the highest value that fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths of up to 4096.
- Returns:
dataset configured with Alpaca source data and template
- Return type:
InstructDataset
Example
>>> alpaca_ds = alpaca_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(alpaca_ds, batch_size=8):
>>>     print(f"Batch size: {len(batch)}")
>>> Batch size: 8
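Based on the parameters documented above, a dataset that masks the prompt and uses a longer context could be configured like this (the values are illustrative):
>>> alpaca_ds = alpaca_dataset(
...     tokenizer=tokenizer,
...     train_on_input=False,  # mask prompt tokens with -100
...     max_seq_len=4096,      # e.g. the maximum sequence length supported by llama2-7B
... )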