Running LLMs on iOS

ExecuTorch’s LLM-specific runtime provides experimental Objective-C and Swift components that wrap the core C++ LLM runtime.

Prerequisites

Make sure you have model and tokenizer files ready, as described in the prerequisites section of the Running LLMs with C++ guide.

Runtime API

Once your app is linked against the executorch_llm framework, you can import the necessary components.

Importing

Objective-C:

#import <ExecuTorchLLM/ExecuTorchLLM.h>

Swift:

import ExecuTorchLLM

TextLLMRunner

The ExecuTorchTextLLMRunner class (bridged to Swift as TextLLMRunner) provides a simple Objective-C/Swift interface for loading a text-generation model, configuring its tokenizer with custom special tokens, generating token streams, and stopping execution. This API is experimental and subject to change.

Initialization

Create a runner by specifying paths to your serialized model (.pte) and tokenizer data, plus an array of special tokens to use during tokenization. Initialization itself is lightweight and doesn’t load the program data immediately.

Objective-C:

NSString *modelPath     = [[NSBundle mainBundle] pathForResource:@"llama-3.2-instruct" ofType:@"pte"];
NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"tokenizer" ofType:@"model"];
NSArray<NSString *> *specialTokens = @[ @"<|bos|>", @"<|eos|>" ];

ExecuTorchTextLLMRunner *runner = [[ExecuTorchTextLLMRunner alloc] initWithModelPath:modelPath
                                                                       tokenizerPath:tokenizerPath
                                                                       specialTokens:specialTokens];

Swift:

let modelPath     = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte")!
let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")!
let specialTokens = ["<|bos|>", "<|eos|>"]

let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: specialTokens
)
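
In a shipping app you may prefer not to force-unwrap the bundle lookups. A minimal sketch that fails with a diagnostic instead, using the same resource names and TextLLMRunner initializer as above:

guard
  let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte"),
  let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")
else {
  fatalError("Missing model or tokenizer resource in the app bundle")
}

let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: ["<|bos|>", "<|eos|>"]
)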

Loading

Explicitly load the model before generation to avoid paying the load cost during your first generate call.

Objective-C:

NSError *error = nil;
BOOL success = [runner loadWithError:&error];
if (!success) {
  NSLog(@"Failed to load: %@", error);
}

Swift:

do {
  try runner.load()
} catch {
  print("Failed to load: \(error)")
}

Generating

Generate up to a given number of tokens from an initial prompt. The callback block is invoked once per token as it’s produced.

Objective-C:

NSError *error = nil;
BOOL success = [runner generate:@"Once upon a time"
                 sequenceLength:50
              withTokenCallback:^(NSString *token) {
                NSLog(@"Generated token: %@", token);
              }
                          error:&error];
if (!success) {
  NSLog(@"Generation failed: %@", error);
}

Swift:

do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    print("Generated token:", token)
  }
} catch {
  print("Generation failed:", error)
}
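
Since the callback receives one token at a time, you can accumulate tokens to build the full response. A minimal sketch using the same generate call as above (the UI hand-off in the comment is a hypothetical hook, not part of the API):

var output = ""
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    // Append each streamed token to the running response.
    output += token
    // Forward partial output to your UI here if desired,
    // e.g. via DispatchQueue.main.async (hypothetical hook).
  }
} catch {
  print("Generation failed:", error)
}
print("Full response:", output)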

Stopping Generation

If you need to interrupt a long-running generation, call:

Objective-C:

[runner stop];

Swift:

runner.stop()
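
A minimal sketch of interrupting generation, assuming generate blocks the calling thread until it finishes or is stopped: run generation on a background queue so stop() can be called from elsewhere, such as a Cancel button on the main thread. The queue label is illustrative.

// Run generation off the main thread.
let generationQueue = DispatchQueue(label: "com.example.generation")
generationQueue.async {
  do {
    try runner.generate("Once upon a time", sequenceLength: 256) { token in
      print(token, terminator: "")
    }
  } catch {
    print("Generation failed:", error)
  }
}

// Later, e.g. from a Cancel action on the main thread:
runner.stop()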

Demo

Get hands-on with our LLaMA iOS Demo App to see the LLM runtime APIs in action.
