alpaca_dataset
- torchtune.datasets.alpaca_dataset(tokenizer: Tokenizer, train_on_input: bool = True, max_seq_len: int = 512) → InstructDataset
Support for the Alpaca dataset from Hugging Face Datasets. https://huggingface.co/datasets/tatsu-lab/alpaca
Data input format: https://huggingface.co/datasets/tatsu-lab/alpaca#data-instances
The input is created using the prompt template from the original Alpaca codebase: https://github.com/tatsu-lab/stanford_alpaca/blob/761dc5bfbdeeffa89b8bff5d038781a4055f796a/train.py#L31
where instruction, input, and output are fields from the dataset.
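For reference, the template from that file formats a sample with a non-empty input field roughly as follows (paraphrased for illustration; see the linked train.py for the exact strings, including the variant used when input is empty):
>>> prompt_template = (
...     "Below is an instruction that describes a task, paired with an input that "
...     "provides further context. Write a response that appropriately completes "
...     "the request.\n\n"
...     "### Instruction:\n{instruction}\n\n"
...     "### Input:\n{input}\n\n"
...     "### Response:\n"
... )
>>> # The dataset's output field is appended after the template as the training target.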
Masking of the prompt during training is controlled by the train_on_input flag, which is set to True by default (ref: https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L49).
- If train_on_input is True, the prompt is used during training and contributes to the loss.
- If train_on_input is False, the prompt is masked out (tokens replaced with -100).
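As a minimal sketch of what this masking means in practice (hypothetical token ids; -100 is the default ignore index of PyTorch's cross-entropy loss):
>>> prompt_tokens = [1, 345, 678]      # hypothetical tokenized prompt
>>> response_tokens = [910, 1112, 2]   # hypothetical tokenized response
>>> input_ids = prompt_tokens + response_tokens
>>> # train_on_input=True: labels match input_ids, so the prompt contributes to the loss
>>> labels = input_ids
>>> # train_on_input=False: prompt positions are set to -100 and ignored by the loss
>>> labels = [-100] * len(prompt_tokens) + response_tokens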
- Parameters:
tokenizer (Tokenizer) – Tokenizer used to encode data. The tokenizer must implement encode and decode methods.
train_on_input (bool) – Whether the model is trained on the prompt or not. Default is True.
max_seq_len (int) – Maximum number of tokens in the returned input and label token id lists. Default is 512, as set by Stanford Alpaca (https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#fine-tuning), but we recommend setting this to the highest value that fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths of up to 4096.
- Returns:
dataset configured with Alpaca source data and template
- Return type:
InstructDataset
Example
>>> alpaca_ds = alpaca_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(alpaca_ds, batch_size=8):
>>>     print(f"Batch size: {len(batch)}")
>>> Batch size: 8
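Based on the parameters documented above, a dataset that masks the prompt and uses a longer context could be configured like this (the values are illustrative):
>>> alpaca_ds = alpaca_dataset(
...     tokenizer=tokenizer,
...     train_on_input=False,  # mask prompt tokens with -100
...     max_seq_len=4096,      # e.g. the maximum sequence length supported by llama2-7B
... )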