
Local LLM Inference

tauri-plugin-llm loads and runs large language model inference entirely on-device. Your Tauri app ships with (or downloads) a model, and the plugin handles tokenization, chat templates, streaming token generation, and tool calling — all in Rust, with hardware acceleration where available.

The model runs in a dedicated thread. Tokens stream back to your frontend via Tauri’s event system, exactly like a cloud API — except the inference happens locally, on the user’s hardware, and nothing ever leaves the device.

Who’s This For?

Desktop app developers building with Tauri who want to add LLM capabilities without cloud dependencies. You should be comfortable with Tauri 2.x (plugin registration, commands, the event system) and have a working knowledge of LLM concepts like chat messages, tokenization, and sampling parameters.

What This Is (and Isn’t)

This is:

  • A Tauri plugin that loads and runs LLM inference locally
  • A streaming API for real-time token generation
  • A multi-backend system supporting Llama 3.x, Qwen3, and Gemma 3 model families
  • Hardware-accelerated via Metal (macOS) and optionally CUDA (Linux/Windows)
  • An extensible architecture — new model backends can be added by implementing one trait

This is not:

  • A model training or fine-tuning framework
  • A cloud API proxy or wrapper
  • A model distribution system (you provide the model files)
  • A full AI agent framework (it generates tokens — what you do with them is up to you)
  • Production-tested at scale (this is early-stage, actively developed software)

Current limitations:

  • Only Safetensors weight format is supported in the inference path
  • Desktop only for now (mobile support is a stub)
  • One active model at a time (switching models shuts down the current runtime)
  • No conversation history management — your app handles chat state

Prerequisites

You need a Tauri 2.x application with Rust >= 1.77.2 (stable toolchain). If you don’t have one yet, follow the Tauri getting started guide.


Setup

  1. Add the dependency:

    src-tauri/Cargo.toml
    [dependencies]
    tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm" }
  2. Register the plugin:

    src-tauri/src/main.rs
    fn main() {
        tauri::Builder::default()
            .plugin(tauri_plugin_llm::init())
            .run(tauri::generate_context!())
            .expect("error while running tauri application");
    }
  3. Download a model from Hugging Face:

    Terminal window
    uvx hf download Qwen/Qwen3-4B-Instruct-2507
  4. Configure the model (optional — models load lazily on first query):

    src-tauri/tauri.conf.json
    {
      "plugins": {
        "llm": {
          "llmconfig": {
            "name": "Qwen/Qwen3-4B-Instruct-2507",
            "tokenizer_file": "./models/Qwen3-4B-Instruct-2507/tokenizer.json",
            "tokenizer_config_file": "./models/Qwen3-4B-Instruct-2507/tokenizer_config.json",
            "model_config_file": "./models/Qwen3-4B-Instruct-2507/config.json",
            "model_index_file": "./models/Qwen3-4B-Instruct-2507/model.safetensors.index.json",
            "model_dir": "./models/Qwen3-4B-Instruct-2507/"
          }
        }
      }
    }
  5. Grant permissions:

    src-tauri/capabilities/default.json
    {
      "permissions": [
        "llm:default"
      ]
    }

Usage

src/App.tsx
import { LLMStreamListener } from "tauri-plugin-llm-api";
const listener = new LLMStreamListener();

// Set up streaming callbacks
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);
    console.log(text);
  },
  onError: (msg) => console.error("Error:", msg),
  onEnd: (usage) => {
    if (usage) {
      console.log(`Done. ${usage.total_tokens} tokens used.`);
    }
  },
});

// Send a prompt
await listener.stream({
  type: "Prompt",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is Tauri?" },
  ],
  tools: [],
  max_tokens: 200,
  stream: true,
});

// Clean up when done
listener.teardown();

Configuration

LLMRuntimeConfig Reference

| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Model identifier. Must follow org/model format for HF Hub models. |
| tokenizer_file | PathBuf? | Yes | Path to tokenizer.json. |
| tokenizer_config_file | PathBuf? | No | Path to tokenizer_config.json. Provides the chat template and EOS token. |
| model_config_file | PathBuf? | No | Path to config.json (model architecture configuration). |
| model_index_file | PathBuf? | Conditional | Path to model.safetensors.index.json. |
| model_file | PathBuf? | Conditional | Path to a single model weight file. |
| model_dir | PathBuf? | No | Directory containing sharded .safetensors weight files. |
| template_file | PathBuf? | No | Custom chat template file. Ignored if tokenizer_config_file provides a chat_template. |

Constructor Functions

Rather than building configs by hand, use the provided constructors:

// From a previously downloaded HF model (recommended)
// From a previously downloaded HF model (recommended)
let config = LLMRuntimeConfig::from_hf_local_cache(
    "Qwen/Qwen3-4B-Instruct-2507",
    None::<&str>, // Uses default HF cache directory
)?;

// From a JSON file on disk
let config = LLMRuntimeConfig::from_path("./my-model-config.json")?;

// From a JSON string
let config = LLMRuntimeConfig::from_raw(r#"{"name": "Qwen/Qwen3-4B-Instruct-2507", ...}"#)?;

Programmatic Configuration

Use Builder::config() when you need dynamic configuration at startup. This takes precedence over tauri.conf.json.

src-tauri/src/main.rs
use tauri_plugin_llm::{Builder, LLMPluginConfig, LLMRuntimeConfig};

fn main() {
    let runtime_config = LLMRuntimeConfig::from_hf_local_cache(
        "Qwen/Qwen3-4B-Instruct-2507",
        None::<&str>,
    ).expect("model not found in HF cache");

    let config = LLMPluginConfig {
        llmconfig: runtime_config,
        ..Default::default()
    };

    tauri::Builder::default()
        .plugin(
            Builder::new()
                .config(config)
                .build()
        )
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}

Permissions

The llm:default permission set grants access to all four commands:

| Permission | Command | Description |
|---|---|---|
| allow-stream | stream | Send prompts and receive streamed responses |
| allow-switch-model | switch_model | Switch the active model at runtime |
| allow-list-available-models | list_available_models | Query registered model configurations |
| allow-add-configuration | add_configuration | Add new model configurations dynamically |

For apps where end users should not switch models, grant only specific permissions:

src-tauri/capabilities/default.json
{
  "permissions": [
    "llm:allow-stream",
    "llm:allow-list-available-models"
  ]
}

Supported Models

| Model Family | Tool Calling | Status |
|---|---|---|
| Llama 3.x (e.g., Llama-3.2-3B-Instruct) | Yes | Stable |
| Qwen3 (e.g., Qwen3-4B-Instruct) | Yes | Stable |
| Gemma 3 (e.g., Gemma-3-4B-IT) | Yes | In progress |

The backend is selected automatically based on the model name field in the configuration.

Hardware Acceleration

Metal acceleration is enabled by default on macOS. No additional configuration needed.


Memory Requirements

| Model Size | RAM (approx.) |
|---|---|
| 3B parameters | ~6 GB |
| 4B parameters | ~8 GB |
| 7B parameters | ~14 GB |

These estimates are for full-precision BF16 Safetensors weights. Quantized models require proportionally less memory.
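The figures in the table follow from BF16 storing two bytes per parameter, so resident weight memory is roughly 2 × the parameter count in GB, before KV cache and runtime overhead. A quick sanity check (illustrative helper, not part of the plugin):

```typescript
// Rough weight-memory floor for BF16 weights: 2 bytes per parameter.
// Ignores KV cache and runtime overhead, so real usage is higher.
function estimateBf16WeightsGB(paramsBillions: number): number {
  const bytes = paramsBillions * 1e9 * 2; // 2 bytes per BF16 value
  return bytes / 1e9; // decimal GB
}
```

For example, a 4B-parameter model needs about 8 GB for weights alone, matching the table above.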


Commands

The plugin exposes four Tauri commands, all permission-gated through the capability system.

stream

Send a prompt to the active model and receive streamed token chunks via events.

| Parameter | Type | Description |
|---|---|---|
| message | Query | The prompt query |

Returns: Result<()> — The actual response comes through events, not the return value.

switch_model

Switch the active model runtime. Shuts down the current model and activates a new one.

| Parameter | Type | Description |
|---|---|---|
| id | String | The model name to activate (must match a registered config name) |

Returns: Result<()>

list_available_models

Returns the names of all registered model configurations.

Returns: Result<Vec<String>>

add_configuration

Add a new model configuration at runtime without restarting the app.

| Parameter | Type | Description |
|---|---|---|
| config | String | JSON string representing an LLMRuntimeConfig |

Returns: Result<()>
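Putting the management commands together: the sketch below registers a model configuration at runtime, verifies it was registered, and activates it. The ModelCommands interface, the buildRuntimeConfig helper, and the file paths are all illustrative; in a real app these methods live on an LLMStreamListener instance.

```typescript
// Minimal structural type for the commands this sketch uses
// (mirrors LLMStreamListener's methods without importing the package).
interface ModelCommands {
  addConfiguration(config: string): Promise<void>;
  listAvailableModels(): Promise<string[]>;
  switchModel(id: string): Promise<void>;
}

// Build an LLMRuntimeConfig JSON string from a model directory.
// The file names follow the configuration reference above; the
// directory layout is an assumption about how the user stored files.
function buildRuntimeConfig(name: string, dir: string): string {
  return JSON.stringify({
    name,
    tokenizer_file: `${dir}/tokenizer.json`,
    tokenizer_config_file: `${dir}/tokenizer_config.json`,
    model_config_file: `${dir}/config.json`,
    model_index_file: `${dir}/model.safetensors.index.json`,
    model_dir: dir,
  });
}

async function activateModel(api: ModelCommands, name: string, dir: string): Promise<void> {
  await api.addConfiguration(buildRuntimeConfig(name, dir));
  const models = await api.listAvailableModels();
  if (!models.includes(name)) throw new Error(`model ${name} was not registered`);
  await api.switchModel(name); // shuts down the current runtime, activates the new one
}
```

Note that switching shuts down the active runtime, so avoid calling `switchModel` while a stream is in flight.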


TypeScript API

The frontend API is provided through the LLMStreamListener class in tauri-plugin-llm-api.

LLMStreamListener Methods

| Method | Signature | Description |
|---|---|---|
| setup | (callbacks: CallBacks) => Promise<void> | Register event listeners. Must be called before stream(). |
| stream | (message: Query) => Promise<void> | Send a prompt to the active model. |
| switchModel | (id: string) => Promise<void> | Switch the active model. |
| listAvailableModels | () => Promise<string[]> | List registered model configurations. |
| addConfiguration | (config: string) => Promise<void> | Add a model configuration (JSON string). |
| teardown | () => void | Remove all event listeners. |

CallBacks Interface

interface CallBacks {
  onData: (id: number, data: Uint8Array, timestamp?: number) => void;
  onError: (msg: string) => void;
  onEnd: (usage?: TokenUsage) => void;
}

Streaming Events

| Event Name | Payload | Description |
|---|---|---|
| query-stream-chunk | Query::Chunk | Token data. Decode data with new TextDecoder().decode(data). |
| query-stream-end | Query::End | Generation complete. Contains optional TokenUsage. |
| query-stream-error | string | Error message from the runtime. |
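Chunk payloads arrive as raw bytes. Decoding each chunk with a fresh new TextDecoder().decode(data) works for ASCII, but if the runtime ever splits a multi-byte UTF-8 character across two chunks, that character would be corrupted. The decoder's streaming mode avoids this; the ChunkAssembler class below is an illustrative sketch, not part of the plugin API:

```typescript
// Accumulate streamed chunk bytes into text. decode(..., { stream: true })
// holds incomplete UTF-8 sequences until the next chunk arrives.
class ChunkAssembler {
  private decoder = new TextDecoder();
  private parts: string[] = [];

  // Call from onData; returns the text decodable so far from this chunk.
  push(data: Uint8Array): string {
    const text = this.decoder.decode(data, { stream: true });
    this.parts.push(text);
    return text;
  }

  // Call from onEnd; flushes any pending bytes and returns the full reply.
  finish(): string {
    this.parts.push(this.decoder.decode());
    return this.parts.join("");
  }
}
```

Wire `push` into the `onData` callback and `finish` into `onEnd` to obtain the complete assistant message for your chat history.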

Types

Query (Discriminated Union)

| Variant | Direction | Purpose |
|---|---|---|
| Prompt | Frontend → Backend | Input query with messages, tools, and sampling parameters |
| Chunk | Backend → Frontend | Streamed token data |
| Response | Backend → Frontend | Response with optional error, messages, and tools |
| End | Backend → Frontend | Stream completion with token usage |
| Status | Backend → Frontend | Status/error message |
| Exit | Internal | Shutdown signal |

Prompt Fields

| Field | Type | Description |
|---|---|---|
| messages | QueryMessage[] | Chat messages with role and content |
| tools | string[] | MCP-compatible tool definitions (JSON strings) |
| max_tokens | number? | Maximum tokens to generate |
| temperature | number? | Sampling temperature (higher = more creative) |
| top_k | number? | Top-K sampling: number of candidates |
| top_p | number? | Top-P nucleus sampling threshold |
| think | boolean? | Enable thinking/reasoning mode |
| stream | boolean? | Enable streaming output |
| model | string? | Target model (for multi-model setups) |
| penalty | number? | Repetition penalty |
| seed | GenerationSeed? | "Random" or { Fixed: number } |
| sampling_config | SamplingConfig? | Sampling strategy |
| chunk_size | number? | Tokens per streamed chunk |
| timestamp | number? | Optional request timestamp |

SamplingConfig

| Value | Description |
|---|---|
| "ArgMax" | Deterministic: always picks the highest-probability token |
| "All" | Sample from the full distribution with temperature |
| "TopK" | Sample from the top-K most likely tokens |
| "TopP" | Sample from tokens until cumulative probability reaches P |
| "TopKThenTopP" | Apply Top-K first, then Top-P |
| "GumbelSoftmax" | Gumbel-Softmax sampling |

GenerationSeed

| Value | Description |
|---|---|
| "Random" | Random seed each generation (non-deterministic) |
| { Fixed: number } | Fixed seed for reproducible output |
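Tying the sampling types together, here is an illustrative Prompt payload that pins the seed for reproducible output and combines Top-K with Top-P. The specific values are examples, not recommendations:

```typescript
// Illustrative Prompt payload; field names follow the reference above.
const prompt = {
  type: "Prompt" as const,
  messages: [{ role: "user", content: "Summarize Tauri in one sentence." }],
  tools: [] as string[],
  max_tokens: 128,
  temperature: 0.7,
  top_k: 40,
  top_p: 0.9,
  sampling_config: "TopKThenTopP", // Top-K filter first, then nucleus sampling
  seed: { Fixed: 42 },             // same seed => reproducible output
  stream: true,
};
```

Pass this object to `listener.stream(prompt)` after calling `setup()`.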

Tool Calling

The plugin supports tool calling (function calling). Each supported model family uses a different format internally, and the plugin handles parsing automatically.

  1. Include tool definitions in the tools array of your prompt (MCP-compatible JSON strings).
  2. The model generates a response that may contain tool call(s).
  3. The plugin’s tool call parser detects and extracts tool calls from the raw output.
  4. Tool calls are sent as a Query::Chunk with kind: "toolcall".
src/tools.ts
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);
    try {
      const toolCalls = JSON.parse(text);
      if (Array.isArray(toolCalls) && toolCalls[0]?.name) {
        console.log("Tool calls detected:", toolCalls);
        // Execute the tool calls in your app
      }
    } catch {
      // Regular text chunk
      console.log(text);
    }
  },
  onError: (msg) => console.error(msg),
  onEnd: () => console.log("Done"),
});

Adding a New Model Backend

To support a new model architecture, implement the ModelBackend trait:

src/llm/runtime/backend/my_model.rs
use candle_core::Tensor;

use crate::error::Error;
use crate::runtime::tool_call::ToolCallParser;

pub struct MyModelBackend {
    // Model weights, config, KV cache...
}

impl ModelBackend for MyModelBackend {
    fn forward(&mut self, input: &Tensor, index: usize) -> Result<Tensor, Error> {
        // Run the model's forward pass and
        // return logits for the last token
        todo!()
    }

    fn clear_kv_cache(&mut self) {
        // Reset the KV cache for a new generation
    }

    fn tool_call_parser(&self) -> Option<&dyn ToolCallParser> {
        // Return a parser if the model supports tool calling
        None
    }
}

Then register it in the create_backend dispatcher in src/llm/runtime/backend.rs.


Troubleshooting

Common Errors

| Error | Cause | Solution |
|---|---|---|
| MissingConfig | No plugin config in tauri.conf.json or Builder | Add plugins.llm.llmconfig to your Tauri config |
| MissingConfigLLM(...) | Model files not downloaded or paths incorrect | Verify model file paths. Download with uvx hf download. |
| MissingActiveRuntime | No model activated before streaming | Ensure llmconfig.name is set and files exist |
| MissingDevice | Could not detect compute device | Check Metal/CUDA availability. Falls back to CPU. |
| TemplateError(...) | Chat template rendering failed | Verify tokenizer_config.json contains a valid chat_template |
| MessageEncodingError(...) | Tokenization failed | Check that tokenizer.json is valid and matches the model |
| StreamError(...) | Communication channel error | Check logs; usually indicates the worker thread panicked |

Debugging

Terminal window
RUST_LOG=tauri_plugin_llm=trace cargo tauri dev

End-to-End Testing

The repository includes an End2End.dockerfile for running integration tests in a container with all dependencies pre-installed.

Dependencies

| Crate | Version | Purpose |
|---|---|---|
| candle-core, candle-nn, candle-transformers | git (main) | ML inference framework (Metal on macOS, CUDA optional) |
| tokenizers | git (main) | Hugging Face tokenizer |
| hf-hub | 0.4.3 | Hugging Face Hub model downloads and cache |
| minijinja | 2.15.1 | Jinja2 chat template rendering |
| tauri | 2.x | Plugin framework |
| tokio | 1.x | Async runtime |
| tracing | 0.1 | Structured logging |
| thiserror | 2 | Error type derivation |

License

PolyForm Noncommercial License 1.0.0

Authors: Matthias Kandora, Fabian-Lars Scheidt, James Q Barclay

Repository: github.com/crabnebula-dev/tauri-plugin-llm