Local LLM Inference
tauri-plugin-llm loads and runs large language model inference entirely on-device. Your Tauri app ships with (or downloads) a model, and the plugin handles tokenization, chat templates, streaming token generation, and tool calling — all in Rust, with hardware acceleration where available.
The model runs in a dedicated thread. Tokens stream back to your frontend via Tauri’s event system, exactly like a cloud API — except the inference happens locally, on the user’s hardware, and nothing ever leaves the device.
Who’s This For?
Desktop app developers building with Tauri who want to add LLM capabilities without cloud dependencies. You should be comfortable with Tauri 2.x (plugin registration, commands, the event system) and have a working knowledge of LLM concepts like chat messages, tokenization, and sampling parameters.
What This Is (and Isn’t)
This is:
- A Tauri plugin that loads and runs LLM inference locally
- A streaming API for real-time token generation
- A multi-backend system supporting Llama 3.x, Qwen3, and Gemma 3 model families
- Hardware-accelerated via Metal (macOS) and optionally CUDA (Linux/Windows)
- An extensible architecture — new model backends can be added by implementing one trait
This is not:
- A model training or fine-tuning framework
- A cloud API proxy or wrapper
- A model distribution system (you provide the model files)
- A full AI agent framework (it generates tokens — what you do with them is up to you)
- Production-tested at scale (this is early-stage, actively developed software)
Current limitations:
- Only Safetensors weight format is supported in the inference path
- Desktop only for now (mobile support is a stub)
- One active model at a time (switching models shuts down the current runtime)
- No conversation history management — your app handles chat state
Prerequisites
You need a Tauri 2.x application with Rust >= 1.77.2 (stable toolchain). If you don’t have one yet, follow the Tauri getting started guide.
Setup

- Add the dependency in src-tauri/Cargo.toml:

  ```toml
  [dependencies]
  tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm" }
  ```

- Register the plugin in src-tauri/src/main.rs:

  ```rust
  fn main() {
      tauri::Builder::default()
          .plugin(tauri_plugin_llm::init())
          .run(tauri::generate_context!())
          .expect("error while running tauri application");
  }
  ```

- Download a model from Hugging Face:
  ```sh
  uvx hf download Qwen/Qwen3-4B-Instruct-2507
  ```
- Configure the model in src-tauri/tauri.conf.json (optional — models load lazily on first query):

  ```json
  {
    "plugins": {
      "llm": {
        "llmconfig": {
          "name": "Qwen/Qwen3-4B-Instruct-2507",
          "tokenizer_file": "./models/Qwen3-4B-Instruct-2507/tokenizer.json",
          "tokenizer_config_file": "./models/Qwen3-4B-Instruct-2507/tokenizer_config.json",
          "model_config_file": "./models/Qwen3-4B-Instruct-2507/config.json",
          "model_index_file": "./models/Qwen3-4B-Instruct-2507/model.safetensors.index.json",
          "model_dir": "./models/Qwen3-4B-Instruct-2507/"
        }
      }
    }
  }
  ```

- Grant permissions in src-tauri/capabilities/default.json:

  ```json
  {
    "permissions": ["llm:default"]
  }
  ```
Usage

```typescript
import { LLMStreamListener } from "tauri-plugin-llm-api";

const listener = new LLMStreamListener();

// Set up streaming callbacks
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);
    console.log(text);
  },
  onError: (msg) => console.error("Error:", msg),
  onEnd: (usage) => {
    if (usage) {
      console.log(`Done. ${usage.total_tokens} tokens used.`);
    }
  },
});

// Send a prompt
await listener.stream({
  type: "Prompt",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is Tauri?" },
  ],
  tools: [],
  max_tokens: 200,
  stream: true,
});

// Clean up when done
listener.teardown();
```

Configuration
LLMRuntimeConfig Reference
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | String | Yes | Model identifier. Must follow org/model format for HF Hub models. |
| `tokenizer_file` | PathBuf? | Yes | Path to tokenizer.json. |
| `tokenizer_config_file` | PathBuf? | No | Path to tokenizer_config.json. Provides the chat template and EOS token. |
| `model_config_file` | PathBuf? | No | Path to config.json (model architecture configuration). |
| `model_index_file` | PathBuf? | Conditional | Path to model.safetensors.index.json. |
| `model_file` | PathBuf? | Conditional | Path to a single model weight file. |
| `model_dir` | PathBuf? | No | Directory containing sharded .safetensors weight files. |
| `template_file` | PathBuf? | No | Custom chat template file. Ignored if tokenizer_config_file provides a chat_template. |
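As a concrete illustration of the two weight layouts, here is a hedged sketch of both config shapes, written as TypeScript objects that could be serialized (for example for addConfiguration, described later) or mirrored in tauri.conf.json. All paths and the second model name are illustrative placeholders.

```typescript
// Hedged sketch of the two weight layouts; paths and the Llama model
// name are placeholders, not tested values.

// Sharded weights: an index file plus a directory of .safetensors shards.
const shardedConfig = {
  name: "Qwen/Qwen3-4B-Instruct-2507",
  tokenizer_file: "./models/Qwen3-4B-Instruct-2507/tokenizer.json",
  tokenizer_config_file: "./models/Qwen3-4B-Instruct-2507/tokenizer_config.json",
  model_config_file: "./models/Qwen3-4B-Instruct-2507/config.json",
  model_index_file: "./models/Qwen3-4B-Instruct-2507/model.safetensors.index.json",
  model_dir: "./models/Qwen3-4B-Instruct-2507/",
};

// Single-file weights: point model_file at the one .safetensors file
// instead of providing model_index_file and model_dir.
const singleFileConfig = {
  name: "meta-llama/Llama-3.2-3B-Instruct",
  tokenizer_file: "./models/Llama-3.2-3B-Instruct/tokenizer.json",
  tokenizer_config_file: "./models/Llama-3.2-3B-Instruct/tokenizer_config.json",
  model_config_file: "./models/Llama-3.2-3B-Instruct/config.json",
  model_file: "./models/Llama-3.2-3B-Instruct/model.safetensors",
};
```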
Constructor Functions
Rather than building configs by hand, use the provided constructors:
```rust
// From a previously downloaded HF model (recommended)
let config = LLMRuntimeConfig::from_hf_local_cache(
    "Qwen/Qwen3-4B-Instruct-2507",
    None::<&str>, // Uses default HF cache directory
)?;

// From a JSON file on disk
let config = LLMRuntimeConfig::from_path("./my-model-config.json")?;

// From a JSON string
let config = LLMRuntimeConfig::from_raw(r#"{"name": "Qwen/Qwen3-4B-Instruct-2507", ...}"#)?;
```

Programmatic Configuration
Use Builder::config() when you need dynamic configuration at startup. This takes precedence over tauri.conf.json.
```rust
use tauri_plugin_llm::{Builder, LLMPluginConfig, LLMRuntimeConfig};

fn main() {
    let runtime_config = LLMRuntimeConfig::from_hf_local_cache(
        "Qwen/Qwen3-4B-Instruct-2507",
        None::<&str>,
    )
    .expect("model not found in HF cache");

    let config = LLMPluginConfig {
        llmconfig: runtime_config,
        ..Default::default()
    };

    tauri::Builder::default()
        .plugin(Builder::new().config(config).build())
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```

Permissions
The llm:default permission set grants access to all four commands:
| Permission | Command | Description |
|---|---|---|
| `allow-stream` | `stream` | Send prompts and receive streamed responses |
| `allow-switch-model` | `switch_model` | Switch the active model at runtime |
| `allow-list-available-models` | `list_available_models` | Query registered model configurations |
| `allow-add-configuration` | `add_configuration` | Add new model configurations dynamically |
For apps where end users should not switch models, grant only specific permissions:
{ "permissions": [ "llm:allow-stream", "llm:allow-list-available-models" ]}Supported Models
| Model Family | Tool Calling | Status |
|---|---|---|
| Llama 3.x (e.g., Llama-3.2-3B-Instruct) | Yes | Stable |
| Qwen3 (e.g., Qwen3-4B-Instruct) | Yes | Stable |
| Gemma 3 (e.g., Gemma-3-4B-IT) | Yes | In progress |
The backend is selected automatically based on the model name field in the configuration.
Hardware Acceleration
Metal acceleration is enabled by default on macOS. No additional configuration needed.
```toml
[dependencies]
tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm" }
```

Enable CUDA for NVIDIA GPU acceleration:

```toml
[dependencies]
tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm", features = ["cuda"] }
```

CPU inference works everywhere but is significantly slower. Suitable for testing and development.

```toml
[dependencies]
tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm" }
```

Memory Requirements
| Model Size | RAM (approx.) |
|---|---|
| 3B parameters | ~6 GB |
| 4B parameters | ~8 GB |
| 7B parameters | ~14 GB |
These estimates assume unquantized BF16 Safetensors weights. Quantized and dynamically quantized models require correspondingly less memory.
Commands
The plugin exposes four Tauri commands, all permission-gated through the capability system.
stream
Send a prompt to the active model and receive streamed token chunks via events.
| Parameter | Type | Description |
|---|---|---|
| `message` | Query | The prompt query |
Returns: Result<()> — The actual response comes through events, not the return value.
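The LLMStreamListener frontend API (documented below) wraps this command, but it can also be invoked directly. A hedged sketch, assuming Tauri's standard plugin:llm|stream command path and the message argument named in the table above:

```typescript
import { invoke } from "@tauri-apps/api/core";

// Hedged sketch: invoking the stream command directly. In most apps the
// LLMStreamListener wrapper is the more convenient entry point.
await invoke("plugin:llm|stream", {
  message: {
    type: "Prompt",
    messages: [{ role: "user", content: "Hello!" }],
    tools: [],
    max_tokens: 64,
    stream: true,
  },
});
```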
switch_model
Switch the active model runtime. Shuts down the current model and activates a new one.
| Parameter | Type | Description |
|---|---|---|
| `id` | String | The model name to activate (must match a registered config name) |
Returns: Result<()>
list_available_models
Returns the names of all registered model configurations.
Returns: Result<Vec<String>>
add_configuration
Add a new model configuration at runtime without restarting the app.
| Parameter | Type | Description |
|---|---|---|
| `config` | String | JSON string representing an LLMRuntimeConfig |
Returns: Result<()>
TypeScript API
The frontend API is provided through the LLMStreamListener class in tauri-plugin-llm-api.
LLMStreamListener Methods
| Method | Signature | Description |
|---|---|---|
| `setup` | `(callbacks: CallBacks) => Promise<void>` | Register event listeners. Must be called before `stream()`. |
| `stream` | `(message: Query) => Promise<void>` | Send a prompt to the active model. |
| `switchModel` | `(id: string) => Promise<void>` | Switch the active model. |
| `listAvailableModels` | `() => Promise<string[]>` | List registered model configurations. |
| `addConfiguration` | `(config: string) => Promise<void>` | Add a model configuration (JSON string). |
| `teardown` | `() => void` | Remove all event listeners. |
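For example, a hedged sketch of the model-management methods in use; the model names and config fields below are illustrative:

```typescript
import { LLMStreamListener } from "tauri-plugin-llm-api";

const listener = new LLMStreamListener();

// See which model configurations are registered, then activate one.
const models = await listener.listAvailableModels();
console.log("Registered models:", models);

if (models.includes("Qwen/Qwen3-4B-Instruct-2507")) {
  await listener.switchModel("Qwen/Qwen3-4B-Instruct-2507");
}

// Register an additional configuration at runtime (as a JSON string).
await listener.addConfiguration(
  JSON.stringify({
    name: "meta-llama/Llama-3.2-3B-Instruct",
    tokenizer_file: "./models/Llama-3.2-3B-Instruct/tokenizer.json",
    tokenizer_config_file: "./models/Llama-3.2-3B-Instruct/tokenizer_config.json",
    model_config_file: "./models/Llama-3.2-3B-Instruct/config.json",
    model_file: "./models/Llama-3.2-3B-Instruct/model.safetensors",
  })
);
```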
CallBacks Interface
```typescript
interface CallBacks {
  onData: (id: number, data: Uint8Array, timestamp?: number) => void;
  onError: (msg: string) => void;
  onEnd: (usage?: TokenUsage) => void;
}
```

Streaming Events
| Event Name | Payload | Description |
|---|---|---|
| `query-stream-chunk` | Query::Chunk | Token data. Decode `data` with `new TextDecoder().decode(data)`. |
| `query-stream-end` | Query::End | Generation complete. Contains optional TokenUsage. |
| `query-stream-error` | string | Error message from the runtime. |
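LLMStreamListener registers these listeners for you via setup(). If you need to observe the events directly, a hedged sketch with Tauri's event API might look like this; the exact payload field names of Query::Chunk are assumptions, not documented behavior:

```typescript
import { listen } from "@tauri-apps/api/event";

// Hedged sketch: listening to the raw chunk event. The payload shape
// (a byte array under `data`) is an assumption; LLMStreamListener's
// onData callback handles this decoding for you.
const unlisten = await listen<{ data: number[] }>("query-stream-chunk", (event) => {
  const text = new TextDecoder().decode(new Uint8Array(event.payload.data));
  console.log(text);
});

// Later, when the listener is no longer needed:
unlisten();
```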
Types
Query (Discriminated Union)
| Variant | Direction | Purpose |
|---|---|---|
| `Prompt` | Frontend → Backend | Input query with messages, tools, and sampling parameters |
| `Chunk` | Backend → Frontend | Streamed token data |
| `Response` | Backend → Frontend | Response with optional error, messages, and tools |
| `End` | Backend → Frontend | Stream completion with token usage |
| `Status` | Backend → Frontend | Status/error message |
| `Exit` | Internal | Shutdown signal |
Prompt Fields
| Field | Type | Description |
|---|---|---|
| `messages` | QueryMessage[] | Chat messages with role and content |
| `tools` | string[] | MCP-compatible tool definitions (JSON strings) |
| `max_tokens` | number? | Maximum tokens to generate |
| `temperature` | number? | Sampling temperature (higher = more creative) |
| `top_k` | number? | Top-K sampling — number of candidates |
| `top_p` | number? | Top-P nucleus sampling threshold |
| `think` | boolean? | Enable thinking/reasoning mode |
| `stream` | boolean? | Enable streaming output |
| `model` | string? | Target model (for multi-model setups) |
| `penalty` | number? | Repetition penalty |
| `seed` | GenerationSeed? | "Random" or { Fixed: number } |
| `sampling_config` | SamplingConfig? | Sampling strategy |
| `chunk_size` | number? | Tokens per streamed chunk |
| `timestamp` | number? | Optional request timestamp |
SamplingConfig
| Value | Description |
|---|---|
"ArgMax" | Deterministic — always picks the highest-probability token |
"All" | Sample from the full distribution with temperature |
"TopK" | Sample from the top-K most likely tokens |
"TopP" | Sample from tokens until cumulative probability reaches P |
"TopKThenTopP" | Apply Top-K first, then Top-P |
"GumbelSoftmax" | Gumbel-Softmax sampling |
GenerationSeed
| Value | Description |
|---|---|
"Random" | Random seed each generation (non-deterministic) |
{ Fixed: number } | Fixed seed for reproducible output |
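Putting the Prompt fields, SamplingConfig, and GenerationSeed together, a prompt tuned for reproducible output might look like this hedged sketch; the specific values are arbitrary, and listener is an LLMStreamListener set up as in the Usage section:

```typescript
// Assumes `listener` has been created and setup() called as shown in Usage.
await listener.stream({
  type: "Prompt",
  messages: [{ role: "user", content: "Summarize what Tauri is in two sentences." }],
  tools: [],
  max_tokens: 256,
  temperature: 0.7,
  top_k: 40,                        // candidates for Top-K
  top_p: 0.9,                       // nucleus threshold for Top-P
  sampling_config: "TopKThenTopP",  // apply Top-K first, then Top-P
  seed: { Fixed: 42 },              // fixed seed for reproducible output
  penalty: 1.1,                     // mild repetition penalty
  chunk_size: 8,                    // tokens per streamed chunk
  stream: true,
});
```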
Tool Calling
The plugin supports tool calling (function calling). Each supported model family uses a different format internally, and the plugin handles parsing automatically.
- Include tool definitions in the `tools` array of your prompt (MCP-compatible JSON strings); see the sketch after this list.
- The model generates a response that may contain tool call(s).
- The plugin's tool call parser detects and extracts tool calls from the raw output.
- Tool calls are sent as a `Query::Chunk` with `kind: "toolcall"`.
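As a sketch of the first step, a tool definition passed alongside a prompt might look like the following; the exact JSON fields are assumptions based on the MCP tool format rather than a documented contract of this plugin:

```typescript
// Hedged sketch of an MCP-style tool definition (field names assumed).
// Assumes `listener` has been created and setup() called as in Usage.
const getWeatherTool = JSON.stringify({
  name: "get_weather",
  description: "Get the current weather for a city",
  inputSchema: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name" },
    },
    required: ["city"],
  },
});

await listener.stream({
  type: "Prompt",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  tools: [getWeatherTool],
  max_tokens: 200,
  stream: true,
});
```

On the receiving side, tool calls arrive through onData and can be detected by trying to parse the chunk, as in the example below: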
```typescript
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);

    try {
      const toolCalls = JSON.parse(text);
      if (Array.isArray(toolCalls) && toolCalls[0]?.name) {
        console.log("Tool calls detected:", toolCalls);
        // Execute the tool calls in your app
      }
    } catch {
      // Regular text chunk
      console.log(text);
    }
  },
  onError: (msg) => console.error(msg),
  onEnd: () => console.log("Done"),
});
```

Adding a New Model Backend
To support a new model architecture, implement the ModelBackend trait:
```rust
use candle_core::Tensor;
use crate::error::Error;
use crate::runtime::tool_call::ToolCallParser;

pub struct MyModelBackend {
    // Model weights, config, KV cache...
}

impl ModelBackend for MyModelBackend {
    fn forward(&mut self, input: &Tensor, index: usize) -> Result<Tensor, Error> {
        // Run the model's forward pass.
        // Return logits for the last token.
        todo!()
    }

    fn clear_kv_cache(&mut self) {
        // Reset the KV cache for a new generation.
    }

    fn tool_call_parser(&self) -> Option<&dyn ToolCallParser> {
        // Return a parser if the model supports tool calling.
        None
    }
}
```

Then register it in the `create_backend` dispatcher in src/llm/runtime/backend.rs.
Troubleshooting
Common Errors
| Error | Cause | Solution |
|---|---|---|
| `MissingConfig` | No plugin config in tauri.conf.json or Builder | Add plugins.llm.llmconfig to your Tauri config |
| `MissingConfigLLM(...)` | Model files not downloaded or paths incorrect | Verify model file paths. Download with uvx hf download. |
| `MissingActiveRuntime` | No model activated before streaming | Ensure llmconfig.name is set and files exist |
| `MissingDevice` | Could not detect compute device | Check Metal/CUDA availability. Falls back to CPU. |
| `TemplateError(...)` | Chat template rendering failed | Verify tokenizer_config.json contains a valid chat_template |
| `MessageEncodingError(...)` | Tokenization failed | Check that tokenizer.json is valid and matches the model |
| `StreamError(...)` | Communication channel error | Check logs — usually indicates the worker thread panicked |
Debugging
```sh
RUST_LOG=tauri_plugin_llm=trace cargo tauri dev
```

End-to-End Testing
The repository includes an End2End.dockerfile for running integration tests in a container with all dependencies pre-installed.
Dependencies
| Crate | Version | Purpose |
|---|---|---|
| `candle-core`, `candle-nn`, `candle-transformers` | git (main) | ML inference framework (Metal on macOS, CUDA optional) |
| `tokenizers` | git (main) | Hugging Face tokenizer |
| `hf-hub` | 0.4.3 | Hugging Face Hub model downloads and cache |
| `minijinja` | 2.15.1 | Jinja2 chat template rendering |
| `tauri` | 2.x | Plugin framework |
| `tokio` | 1.x | Async runtime |
| `tracing` | 0.1 | Structured logging |
| `thiserror` | 2 | Error type derivation |
License
PolyForm Noncommercial License 1.0.0
Authors: Matthias Kandora, Fabian-Lars Scheidt, James Q Barclay
Repository: github.com/crabnebula-dev/tauri-plugin-llm