
Local LLM Inference

tauri-plugin-llm loads and runs large language model inference entirely on-device. Your Tauri app ships with (or downloads) a model, and the plugin handles tokenization, chat templates, streaming token generation, and tool calling — all in Rust, with hardware acceleration where available.

The model runs in a dedicated thread. Tokens stream back to your frontend via Tauri’s event system, exactly like a cloud API — except the inference happens locally, on the user’s hardware, and nothing ever leaves the device.

Who’s This For?

Desktop app developers building with Tauri who want to add LLM capabilities without cloud dependencies. You should be comfortable with Tauri 2.x (plugin registration, commands, the event system) and have a working knowledge of LLM concepts like chat messages, tokenization, and sampling parameters.

What This Is (and Isn’t)

This is:

  • A Tauri plugin that loads and runs LLM inference locally
  • A streaming API for real-time token generation
  • A multi-backend system supporting Llama 3.x, Qwen3, and Gemma 3 model families
  • Hardware-accelerated via Metal (macOS) and optionally CUDA (Linux/Windows)
  • An extensible architecture — new model backends can be added by implementing one trait

This is not:

  • A model training or fine-tuning framework
  • A cloud API proxy or wrapper
  • A model distribution system (you provide the model files)
  • A full AI agent framework (it generates tokens — what you do with them is up to you)
  • Production-tested at scale (this is early-stage, actively developed software)

Current limitations:

  • Only Safetensors weight format is supported in the inference path
  • Desktop only for now (mobile support is a stub)
  • One active model at a time (switching models shuts down the current runtime)
  • No conversation history management — your app handles chat state

Prerequisites

You need a Tauri 2.x application with Rust >= 1.77.2 (stable toolchain). If you don’t have one yet, follow the Tauri getting started guide.


Setup

  1. Add the dependency:

    src-tauri/Cargo.toml
    [dependencies]
    tauri-plugin-llm = { git = "https://github.com/crabnebula-dev/tauri-plugin-llm" }
  2. Register the plugin:

    src-tauri/src/main.rs
    fn main() {
        tauri::Builder::default()
            .plugin(tauri_plugin_llm::init())
            .run(tauri::generate_context!())
            .expect("error while running tauri application");
    }
  3. Download a model from Hugging Face:

    Terminal window
    uvx hf download Qwen/Qwen3-4B-Instruct-2507
  4. Configure the model (optional — models load lazily on first query):

    src-tauri/tauri.conf.json
    {
      "plugins": {
        "llm": {
          "llmconfig": {
            "name": "Qwen/Qwen3-4B-Instruct-2507",
            "tokenizer_file": "./models/Qwen3-4B-Instruct-2507/tokenizer.json",
            "tokenizer_config_file": "./models/Qwen3-4B-Instruct-2507/tokenizer_config.json",
            "model_config_file": "./models/Qwen3-4B-Instruct-2507/config.json",
            "model_index_file": "./models/Qwen3-4B-Instruct-2507/model.safetensors.index.json",
            "model_dir": "./models/Qwen3-4B-Instruct-2507/"
          }
        }
      }
    }
  5. Grant permissions:

    src-tauri/capabilities/default.json
    {
      "permissions": [
        "llm:default"
      ]
    }

Usage

src/App.tsx
import { LLMStreamListener } from "tauri-plugin-llm-api";
const listener = new LLMStreamListener();

// Set up streaming callbacks
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);
    console.log(text);
  },
  onError: (msg) => console.error("Error:", msg),
  onEnd: (usage) => {
    if (usage) {
      console.log(`Done. ${usage.total_tokens} tokens used.`);
    }
  },
});

// Send a prompt
await listener.stream({
  type: "Prompt",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is Tauri?" },
  ],
  tools: [],
  max_tokens: 200,
  stream: true,
});

// Clean up when done
listener.teardown();

Configuration

LLMRuntimeConfig Reference

| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Model identifier. Must follow org/model format for HF Hub models. |
| tokenizer_file | PathBuf? | Yes | Path to tokenizer.json. |
| tokenizer_config_file | PathBuf? | No | Path to tokenizer_config.json. Provides the chat template and EOS token. |
| model_config_file | PathBuf? | No | Path to config.json (model architecture configuration). |
| model_index_file | PathBuf? | Conditional | Path to model.safetensors.index.json. |
| model_file | PathBuf? | Conditional | Path to a single model weight file. |
| model_dir | PathBuf? | No | Directory containing sharded .safetensors weight files. |
| template_file | PathBuf? | No | Custom chat template file. Ignored if tokenizer_config_file provides a chat_template. |

Constructor Functions

Rather than building configs by hand, use the provided constructors:

// From a previously downloaded HF model (recommended)
// From a previously downloaded HF model (recommended)
let config = LLMRuntimeConfig::from_hf_local_cache(
    "Qwen/Qwen3-4B-Instruct-2507",
    None::<&str>, // Uses default HF cache directory
)?;

// From a JSON file on disk
let config = LLMRuntimeConfig::from_path("./my-model-config.json")?;

// From a JSON string
let config = LLMRuntimeConfig::from_raw(r#"{"name": "Qwen/Qwen3-4B-Instruct-2507", ...}"#)?;

Programmatic Configuration

Use Builder::config() when you need dynamic configuration at startup. This takes precedence over tauri.conf.json.

src-tauri/src/main.rs
use tauri_plugin_llm::{Builder, LLMPluginConfig, LLMRuntimeConfig};

fn main() {
    let runtime_config = LLMRuntimeConfig::from_hf_local_cache(
        "Qwen/Qwen3-4B-Instruct-2507",
        None::<&str>,
    ).expect("model not found in HF cache");

    let config = LLMPluginConfig {
        llmconfig: runtime_config,
        ..Default::default()
    };

    tauri::Builder::default()
        .plugin(
            Builder::new()
                .config(config)
                .build()
        )
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}

Permissions

The llm:default permission set grants access to all four commands:

| Permission | Command | Description |
|---|---|---|
| allow-stream | stream | Send prompts and receive streamed responses |
| allow-switch-model | switch_model | Switch the active model at runtime |
| allow-list-available-models | list_available_models | Query registered model configurations |
| allow-add-configuration | add_configuration | Add new model configurations dynamically |

For apps where end users should not switch models, grant only specific permissions:

src-tauri/capabilities/default.json
{
  "permissions": [
    "llm:allow-stream",
    "llm:allow-list-available-models"
  ]
}

Supported Models

| Model Family | Tool Calling | Status |
|---|---|---|
| Llama 3.x (e.g., Llama-3.2-3B-Instruct) | Yes | Stable |
| Qwen3 (e.g., Qwen3-4B-Instruct) | Yes | Stable |
| Gemma 3 (e.g., Gemma-3-4B-IT) | Yes | In progress |

The backend is selected automatically based on the model name field in the configuration.

Hardware Acceleration

Metal acceleration is enabled by default on macOS. No additional configuration needed.


Memory Requirements

| Model Size | RAM (approx.) |
|---|---|
| 3B parameters | ~6 GB |
| 4B parameters | ~8 GB |
| 7B parameters | ~14 GB |

These estimates are for full-precision BF16 Safetensors weights. Quantized models require proportionally less memory.
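The figures in the table follow from BF16 storing two bytes per parameter, so resident weight memory is roughly 2 × the parameter count in GB, before KV cache and runtime overhead. A quick sanity check (illustrative helper, not part of the plugin):

```typescript
// Rough weight-memory floor for BF16 weights: 2 bytes per parameter.
// Ignores KV cache and runtime overhead, so real usage is higher.
function estimateBf16WeightsGB(paramsBillions: number): number {
  const bytes = paramsBillions * 1e9 * 2; // 2 bytes per BF16 value
  return bytes / 1e9; // decimal GB
}
```

For example, a 4B-parameter model needs about 8 GB for weights alone, matching the table above.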


Commands

The plugin exposes four Tauri commands, all permission-gated through the capability system.

stream

Send a prompt to the active model and receive streamed token chunks via events.

| Parameter | Type | Description |
|---|---|---|
| message | Query | The prompt query |

Returns: Result<()> — The actual response comes through events, not the return value.

switch_model

Switch the active model runtime. Shuts down the current model and activates a new one.

| Parameter | Type | Description |
|---|---|---|
| id | String | The model name to activate (must match a registered config name) |

Returns: Result<()>

list_available_models

Returns the names of all registered model configurations.

Returns: Result<Vec<String>>

add_configuration

Add a new model configuration at runtime without restarting the app.

| Parameter | Type | Description |
|---|---|---|
| config | String | JSON string representing an LLMRuntimeConfig |

Returns: Result<()>
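Putting the management commands together: the sketch below registers a model configuration at runtime, verifies it was registered, and activates it. The ModelCommands interface, the buildRuntimeConfig helper, and the file paths are all illustrative; in a real app these methods live on an LLMStreamListener instance.

```typescript
// Minimal structural type for the commands this sketch uses
// (mirrors LLMStreamListener's methods without importing the package).
interface ModelCommands {
  addConfiguration(config: string): Promise<void>;
  listAvailableModels(): Promise<string[]>;
  switchModel(id: string): Promise<void>;
}

// Build an LLMRuntimeConfig JSON string from a model directory.
// The file names follow the configuration reference above; the
// directory layout is an assumption about how the user stored files.
function buildRuntimeConfig(name: string, dir: string): string {
  return JSON.stringify({
    name,
    tokenizer_file: `${dir}/tokenizer.json`,
    tokenizer_config_file: `${dir}/tokenizer_config.json`,
    model_config_file: `${dir}/config.json`,
    model_index_file: `${dir}/model.safetensors.index.json`,
    model_dir: dir,
  });
}

async function activateModel(api: ModelCommands, name: string, dir: string): Promise<void> {
  await api.addConfiguration(buildRuntimeConfig(name, dir));
  const models = await api.listAvailableModels();
  if (!models.includes(name)) throw new Error(`model ${name} was not registered`);
  await api.switchModel(name); // shuts down the current runtime, activates the new one
}
```

Note that switching shuts down the active runtime, so avoid calling `switchModel` while a stream is in flight.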


TypeScript API

The frontend API is provided through the LLMStreamListener class in tauri-plugin-llm-api.

LLMStreamListener Methods

| Method | Signature | Description |
|---|---|---|
| setup | (callbacks: CallBacks) => Promise<void> | Register event listeners. Must be called before stream(). |
| stream | (message: Query) => Promise<void> | Send a prompt to the active model. |
| switchModel | (id: string) => Promise<void> | Switch the active model. |
| listAvailableModels | () => Promise<string[]> | List registered model configurations. |
| addConfiguration | (config: string) => Promise<void> | Add a model configuration (JSON string). |
| teardown | () => void | Remove all event listeners. |

CallBacks Interface

interface CallBacks {
  onData: (id: number, data: Uint8Array, timestamp?: number) => void;
  onError: (msg: string) => void;
  onEnd: (usage?: TokenUsage) => void;
}

Streaming Events

| Event Name | Payload | Description |
|---|---|---|
| query-stream-chunk | Query::Chunk | Token data. Decode data with new TextDecoder().decode(data). |
| query-stream-end | Query::End | Generation complete. Contains optional TokenUsage. |
| query-stream-error | string | Error message from the runtime. |
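Chunk payloads arrive as raw bytes. Decoding each chunk with a fresh new TextDecoder().decode(data) works for ASCII, but if the runtime ever splits a multi-byte UTF-8 character across two chunks, that character would be corrupted. The decoder's streaming mode avoids this; the ChunkAssembler class below is an illustrative sketch, not part of the plugin API:

```typescript
// Accumulate streamed chunk bytes into text. decode(..., { stream: true })
// holds incomplete UTF-8 sequences until the next chunk arrives.
class ChunkAssembler {
  private decoder = new TextDecoder();
  private parts: string[] = [];

  // Call from onData; returns the text decodable so far from this chunk.
  push(data: Uint8Array): string {
    const text = this.decoder.decode(data, { stream: true });
    this.parts.push(text);
    return text;
  }

  // Call from onEnd; flushes any pending bytes and returns the full reply.
  finish(): string {
    this.parts.push(this.decoder.decode());
    return this.parts.join("");
  }
}
```

Wire `push` into the `onData` callback and `finish` into `onEnd` to obtain the complete assistant message for your chat history.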

Types

Query (Discriminated Union)

| Variant | Direction | Purpose |
|---|---|---|
| Prompt | Frontend → Backend | Input query with messages, tools, and sampling parameters |
| Chunk | Backend → Frontend | Streamed token data |
| Response | Backend → Frontend | Response with optional error, messages, and tools |
| End | Backend → Frontend | Stream completion with token usage |
| Status | Backend → Frontend | Status/error message |
| Exit | Internal | Shutdown signal |

Prompt Fields

| Field | Type | Description |
|---|---|---|
| messages | QueryMessage[] | Chat messages with role and content |
| tools | string[] | MCP-compatible tool definitions (JSON strings) |
| max_tokens | number? | Maximum tokens to generate |
| temperature | number? | Sampling temperature (higher = more creative) |
| top_k | number? | Top-K sampling: number of candidates |
| top_p | number? | Top-P nucleus sampling threshold |
| think | boolean? | Enable thinking/reasoning mode |
| stream | boolean? | Enable streaming output |
| model | string? | Target model (for multi-model setups) |
| penalty | number? | Repetition penalty |
| seed | GenerationSeed? | "Random" or { Fixed: number } |
| sampling_config | SamplingConfig? | Sampling strategy |
| chunk_size | number? | Tokens per streamed chunk |
| timestamp | number? | Optional request timestamp |

SamplingConfig

| Value | Description |
|---|---|
| "ArgMax" | Deterministic: always picks the highest-probability token |
| "All" | Sample from the full distribution with temperature |
| "TopK" | Sample from the top-K most likely tokens |
| "TopP" | Sample from tokens until cumulative probability reaches P |
| "TopKThenTopP" | Apply Top-K first, then Top-P |
| "GumbelSoftmax" | Gumbel-Softmax sampling |

GenerationSeed

| Value | Description |
|---|---|
| "Random" | Random seed each generation (non-deterministic) |
| { Fixed: number } | Fixed seed for reproducible output |
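Tying the sampling types together, here is an illustrative Prompt payload that pins the seed for reproducible output and combines Top-K with Top-P. The specific values are examples, not recommendations:

```typescript
// Illustrative Prompt payload; field names follow the reference above.
const prompt = {
  type: "Prompt" as const,
  messages: [{ role: "user", content: "Summarize Tauri in one sentence." }],
  tools: [] as string[],
  max_tokens: 128,
  temperature: 0.7,
  top_k: 40,
  top_p: 0.9,
  sampling_config: "TopKThenTopP", // Top-K filter first, then nucleus sampling
  seed: { Fixed: 42 },             // same seed => reproducible output
  stream: true,
};
```

Pass this object to `listener.stream(prompt)` after calling `setup()`.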

Tool Calling

The plugin supports tool calling (function calling). Each supported model family uses a different format internally, and the plugin handles parsing automatically.

  1. Include tool definitions in the tools array of your prompt (MCP-compatible JSON strings).
  2. The model generates a response that may contain tool call(s).
  3. The plugin’s tool call parser detects and extracts tool calls from the raw output.
  4. Tool calls are sent as a Query::Chunk with kind: "toolcall".
src/tools.ts
await listener.setup({
  onData: (id, data, timestamp) => {
    const text = new TextDecoder().decode(data);
    try {
      const toolCalls = JSON.parse(text);
      if (Array.isArray(toolCalls) && toolCalls[0]?.name) {
        console.log("Tool calls detected:", toolCalls);
        // Execute the tool calls in your app
      }
    } catch {
      // Regular text chunk
      console.log(text);
    }
  },
  onError: (msg) => console.error(msg),
  onEnd: () => console.log("Done"),
});

Adding a New Model Backend

To support a new model architecture, implement the ModelBackend trait:

src/llm/runtime/backend/my_model.rs
use candle_core::Tensor;

use crate::error::Error;
use crate::runtime::tool_call::ToolCallParser;

pub struct MyModelBackend {
    // Model weights, config, KV cache...
}

impl ModelBackend for MyModelBackend {
    fn forward(&mut self, input: &Tensor, index: usize) -> Result<Tensor, Error> {
        // Run the model's forward pass and
        // return logits for the last token
        todo!()
    }

    fn clear_kv_cache(&mut self) {
        // Reset the KV cache for a new generation
    }

    fn tool_call_parser(&self) -> Option<&dyn ToolCallParser> {
        // Return a parser if the model supports tool calling
        None
    }
}

Then register it in the create_backend dispatcher in src/llm/runtime/backend.rs.


Troubleshooting

Common Errors

| Error | Cause | Solution |
|---|---|---|
| MissingConfig | No plugin config in tauri.conf.json or Builder | Add plugins.llm.llmconfig to your Tauri config |
| MissingConfigLLM(...) | Model files not downloaded or paths incorrect | Verify model file paths. Download with uvx hf download. |
| MissingActiveRuntime | No model activated before streaming | Ensure llmconfig.name is set and files exist |
| MissingDevice | Could not detect compute device | Check Metal/CUDA availability. Falls back to CPU. |
| TemplateError(...) | Chat template rendering failed | Verify tokenizer_config.json contains a valid chat_template |
| MessageEncodingError(...) | Tokenization failed | Check that tokenizer.json is valid and matches the model |
| StreamError(...) | Communication channel error | Check logs; usually indicates the worker thread panicked |

Debugging

Terminal window
RUST_LOG=tauri_plugin_llm=trace cargo tauri dev

End-to-End Testing

The repository includes an End2End.dockerfile for running integration tests in a container with all dependencies pre-installed.

Dependencies

| Crate | Version | Purpose |
|---|---|---|
| candle-core, candle-nn, candle-transformers | git (main) | ML inference framework (Metal on macOS, CUDA optional) |
| tokenizers | git (main) | Hugging Face tokenizer |
| hf-hub | 0.4.3 | Hugging Face Hub model downloads and cache |
| minijinja | 2.15.1 | Jinja2 chat template rendering |
| tauri | 2.x | Plugin framework |
| tokio | 1.x | Async runtime |
| tracing | 0.1 | Structured logging |
| thiserror | 2 | Error type derivation |

License

PolyForm Noncommercial License 1.0.0

Authors: Matthias Kandora, Fabian-Lars Scheidt, James Q Barclay

Repository: github.com/crabnebula-dev/tauri-plugin-llm