Running LLMs on iOS¶
ExecuTorch’s LLM-specific runtime components provide experimental Objective-C and Swift wrappers around the core C++ LLM runtime.
Prerequisites¶
Make sure you have the model and tokenizer files ready, as described in the Prerequisites section of the Running LLMs with C++ guide.
Runtime API¶
Once linked against the executorch_llm framework, you can import the necessary components.
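For example, the Swift import typically looks like this (assuming the module name is ExecuTorchLLM, matching the framework):
Swift:
import ExecuTorchLLM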
TextLLMRunner¶
The ExecuTorchTextLLMRunner class (bridged to Swift as TextLLMRunner) provides a simple Objective-C/Swift interface for loading a text-generation model, configuring its tokenizer with custom special tokens, generating token streams, and stopping execution.
This API is experimental and subject to change.
Initialization¶
Create a runner by specifying paths to your serialized model (.pte) and tokenizer data, plus an array of special tokens to use during tokenization. Initialization itself is lightweight and doesn’t load the program data immediately.
Objective-C:
NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llama-3.2-instruct" ofType:@"pte"];
NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"tokenizer" ofType:@"model"];
NSArray<NSString *> *specialTokens = @[ @"<|bos|>", @"<|eos|>" ];
ExecuTorchTextLLMRunner *runner = [[ExecuTorchTextLLMRunner alloc] initWithModelPath:modelPath
                                                                        tokenizerPath:tokenizerPath
                                                                        specialTokens:specialTokens];
Swift:
let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte")!
let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")!
let specialTokens = ["<|bos|>", "<|eos|>"]
let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: specialTokens
)
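Note that Bundle.main.path(forResource:ofType:) returns an optional, which the snippet above force-unwraps for brevity. In a real app you would typically guard against missing bundle resources; a minimal sketch:
Swift:
guard
  let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte"),
  let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")
else {
  fatalError("Model or tokenizer is missing from the app bundle")
}
let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: ["<|bos|>", "<|eos|>"]
)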
Loading¶
Explicitly load the model before generation to avoid paying the load cost during your first generate call.
Objective-C:
NSError *error = nil;
BOOL success = [runner loadWithError:&error];
if (!success) {
  NSLog(@"Failed to load: %@", error);
}
Swift:
do {
  try runner.load()
} catch {
  print("Failed to load: \(error)")
}
Generating¶
Generate up to a given number of tokens from an initial prompt. The callback block is invoked once per token as it’s produced.
Objective-C:
NSError *error = nil;
BOOL success = [runner generate:@"Once upon a time"
                 sequenceLength:50
              withTokenCallback:^(NSString *token) {
                NSLog(@"Generated token: %@", token);
              }
                          error:&error];
if (!success) {
  NSLog(@"Generation failed: %@", error);
}
Swift:
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    print("Generated token:", token)
  }
} catch {
  print("Generation failed:", error)
}
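Because the callback streams one token at a time, a common pattern is to accumulate tokens into a complete response; a minimal Swift sketch using the runner from above:
Swift:
// Accumulate streamed tokens into a full response string.
var response = ""
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    response += token
  }
  print("Full response:", response)
} catch {
  print("Generation failed:", error)
}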
Stopping Generation¶
If you need to interrupt a long-running generation, call:
Objective-C:
[runner stop];
Swift:
runner.stop()
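Assuming generate blocks the thread it is called on, a typical pattern is to run generation on a background queue so that stop() can be invoked from elsewhere, e.g. a cancel button on the main thread. A sketch (the queue choice and prompt are illustrative):
Swift:
// Run generation off the main thread (assumes generate() is blocking).
DispatchQueue.global(qos: .userInitiated).async {
  do {
    try runner.generate("Once upon a time", sequenceLength: 256) { token in
      print(token, terminator: "")
    }
  } catch {
    print("Generation failed:", error)
  }
}

// Later, e.g. from a cancel button handler on the main thread:
runner.stop()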
Demo¶
Get hands-on with our LLaMA iOS Demo App to see the LLM runtime APIs in action.