host.infer

host.infer runs speech-to-text, LLM generation, and text-to-speech on-device, inside the app's own iframe. It spawns a local Web Worker that runs models with transformers.js on WebGPU, falling back to WASM. Unlike kv, collab, or live, it never round-trips to the host: compute stays in the browser, so model input and output never leave the device. The worker is spawned lazily on the first call, and streaming results arrive through callbacks.

API

interface InferNamespace {
  LLM_MODELS: InferModelPreset[]; // curated on-device LLM presets
  STT_MODELS: InferModelPreset[]; // speech-to-text presets
  TTS_MODELS: InferModelPreset[]; // text-to-speech presets (clonable entries accept a reference voice)

  hasWebGPU(): Promise<boolean>;
  load(task: 'stt' | 'llm' | 'tts', opts?: { model?: string; onProgress?: (p: InferProgress) => void }): Promise<{ model: string; device: 'webgpu' | 'wasm' }>;
  transcribe(audio: Float32Array, sampleRate: number, opts?: { model?: string; onProgress?: (p: InferProgress) => void }): Promise<string>;
  generate(messages: Array<{ role: string; content: string }>, opts?: {
    model?: string; maxNewTokens?: number; temperature?: number;
    onToken?: (delta: string) => void; onProgress?: (p: InferProgress) => void;
  }): Promise<string>;
  synthesizeStream(text: string, opts?: {
    voiceId?: string; model?: string;
    onAudio?: (chunk: InferAudioChunk) => void; onProgress?: (p: InferProgress) => void;
  }): Promise<void>;
  setVoice(audioFloat32: Float32Array, sampleRate: number): Promise<boolean>; // clone the active clonable TTS voice
  cancel(): void;   // barge-in: interrupt in-flight generate/synthesize
  dispose(): void;  // tear down the worker (e.g. on unmount)
}

interface InferModelPreset { id: string; label: string; size: string; clonable?: boolean }
interface InferProgress { stage: 'stt' | 'llm' | 'tts'; file?: string; progress?: number; status?: string }
interface InferAudioChunk {
  audio: Float32Array; sampleRate: number; index: number;
  phonemes?: string;   // IPA sequence for the chunk (Kokoro only): drives phoneme-accurate lip-sync
  durationMs?: number; // playback duration for aligning a viseme timeline
}

Detect support and preload

Using WebGPU requires the top-level document to delegate the webgpu feature to the sandbox origin (the platform does this for first-party apps). Without WebGPU the worker falls back to WASM, which is much slower and uses far more memory. Call hasWebGPU() first, and preload the model while you show a progress UI:

await host.ready();

if (!(await host.infer.hasWebGPU())) {
  showNotice('On-device acceleration unavailable; falling back to a slower path.');
}

const { model, device } = await host.infer.load('llm', {
  model: host.infer.LLM_MODELS[0].id,
  onProgress: (p) => updateBar(p.file, p.progress),
});
console.log(`loaded ${model} on ${device}`); // device: 'webgpu' | 'wasm'

The *_MODELS arrays are curated presets you can feed straight into a model picker. Each is { id, label, size, clonable? }.
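
When building a model picker, it can help to choose the default entry based on the hasWebGPU() result, so that a large preset is not selected on the WASM path. The following is a minimal sketch, not part of the host API: pickDefaultPreset and the isLightweight predicate are hypothetical names, and because the format of the size string is not specified here, the sketch leaves the "small enough" decision to the caller. It also assumes the first preset in an array is a reasonable default, as in the load example above.

```typescript
// Mirrors the InferModelPreset shape from the API section.
interface InferModelPreset { id: string; label: string; size: string; clonable?: boolean }

/**
 * Pick a default preset for a model picker (hypothetical helper).
 * `isLightweight` is supplied by the caller; this sketch does not parse `size`.
 * Returns undefined when nothing suitable exists, so the app can show a notice.
 */
function pickDefaultPreset(
  presets: InferModelPreset[],
  webgpu: boolean,
  isLightweight: (p: InferModelPreset) => boolean,
): InferModelPreset | undefined {
  if (presets.length === 0) return undefined;
  // WebGPU available: assume the first curated entry is a sensible default.
  if (webgpu) return presets[0];
  // WASM fallback: only offer a preset the caller considers small enough.
  return presets.find(isLightweight);
}
```

A caller might pass host.infer.LLM_MODELS, the awaited result of host.infer.hasWebGPU(), and a predicate based on a list of preset ids it has tested under WASM.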

Speech to text

const text = await host.infer.transcribe(monoFloat32, 16000, {
  onProgress: (p) => updateBar(p.status, p.progress),
});

transcribe takes mono audio as a Float32Array together with its sample rate (any rate is accepted) and resolves with the recognized text.
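
Since transcribe expects mono samples, multi-channel audio has to be mixed down first. The sketch below averages the channels sample by sample; downmixToMono is a hypothetical helper, not part of host.infer, and plain averaging is only one reasonable way to mix down.

```typescript
/**
 * Average N channels into one mono Float32Array (hypothetical helper).
 * If the channels differ in length, the result is truncated to the shortest.
 */
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  let length = channels[0].length;
  for (let c = 1; c < channels.length; c++) {
    length = Math.min(length, channels[c].length);
  }
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (let c = 0; c < channels.length; c++) sum += channels[c][i];
    mono[i] = sum / channels.length;
  }
  return mono;
}
```

With a Web Audio AudioBuffer, the channels can be collected by calling getChannelData(c) for each c below numberOfChannels, and the buffer's sampleRate passed as the second argument to transcribe.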

LLM chat (streaming)

const reply = await host.infer.generate(
  [
    { role: 'system', content: 'You are a terse assistant.' },
    { role: 'user', content: prompt },
  ],
  {
    maxNewTokens: 256,
    temperature: 0.7,
    onToken: (delta) => appendToken(delta), // streams tokens as they decode
  },
);

generate streams tokens through onToken and resolves with the final assistant text.

Text to speech (streaming)

await host.infer.synthesizeStream(text, {
  voiceId: 'af_heart',
  onAudio: (chunk) => {
    // chunk.audio is a Float32Array at chunk.sampleRate; queue it for playback.
    enqueueAudio(chunk.audio, chunk.sampleRate);
    if (chunk.phonemes) driveLipSync(chunk.phonemes, chunk.durationMs);
  },
});

onAudio fires per sentence; the promise resolves when synthesis is done. For Kokoro models each chunk carries an IPA phonemes string and a durationMs, enough for phoneme-accurate lip-sync.
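
To align lip-sync with playback, each chunk needs a start time on a shared timeline. The sketch below is one way to derive it, under two assumptions that the text above does not confirm: that index gives the playback order of chunks, and that chunks play back to back with no gaps. scheduleChunks and its types are hypothetical names. When durationMs is absent (non-Kokoro models), the duration is computed from the sample count and sample rate.

```typescript
// The subset of InferAudioChunk fields this sketch needs.
interface TimedChunk { audio: Float32Array; sampleRate: number; index: number; durationMs?: number }
interface ScheduledChunk { index: number; startMs: number; endMs: number }

/** Duration from durationMs when present, otherwise samples / rate. */
function chunkDurationMs(chunk: TimedChunk): number {
  if (chunk.durationMs !== undefined) return chunk.durationMs;
  return (chunk.audio.length / chunk.sampleRate) * 1000;
}

/**
 * Lay chunks end to end in index order (assumed playback order)
 * and return each chunk's start and end offset in milliseconds.
 */
function scheduleChunks(chunks: TimedChunk[]): ScheduledChunk[] {
  const ordered = chunks.slice().sort((a, b) => a.index - b.index);
  const scheduled: ScheduledChunk[] = [];
  let cursorMs = 0;
  for (let i = 0; i < ordered.length; i++) {
    const endMs = cursorMs + chunkDurationMs(ordered[i]);
    scheduled.push({ index: ordered[i].index, startMs: cursorMs, endMs });
    cursorMs = endMs;
  }
  return scheduled;
}
```

A viseme timeline for a chunk can then be offset by that chunk's startMs before being handed to the renderer.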

Voice cloning

For a TTS preset whose clonable flag is set, clone the active voice from a short mono reference clip, then synthesize in that voice:

const ok = await host.infer.setVoice(referenceFloat32, 24000);
if (ok) await host.infer.synthesizeStream('Now in the cloned voice.', { onAudio });

Barge-in and teardown

host.infer.cancel();  // interrupt the in-flight generate/synthesize (barge-in)
host.infer.dispose(); // tear down the worker; call on unmount to free GPU memory
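
Whether onToken or onAudio can still fire briefly after cancel() is not stated here, so a barge-in handler may want to ignore callbacks that belong to an interrupted turn. The sketch below is a defensive pattern for that, not something the API requires; createTurnGuard is a hypothetical helper.

```typescript
/**
 * Tracks which conversational turn is current so that callbacks from an
 * interrupted turn can be ignored (hypothetical helper).
 */
function createTurnGuard() {
  let current = 0;
  return {
    /** Start a new turn; every earlier turn becomes stale. */
    begin(): number {
      current += 1;
      return current;
    },
    /** True only for the most recently started turn. */
    isCurrent(turn: number): boolean {
      return turn === current;
    },
  };
}

// Possible usage (appendToken is the app's own function, as in the example above):
//   host.infer.cancel();            // interrupt whatever is in flight
//   const turn = guard.begin();     // mark the new turn as current
//   await host.infer.generate(messages, {
//     onToken: (delta) => { if (guard.isCurrent(turn)) appendToken(delta); },
//   });
```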

Notes

All compute is local to the app iframe. Model weights download from a CDN on first use (watch onProgress for file progress), then cache.

WebGPU is the fast path; WASM is the fallback. Large LLM presets can exhaust memory under WASM, so gate big models behind hasWebGPU().

Capture mic audio with the Web Audio API (the app needs the microphone permission grant; see the security model) and feed the Float32Array straight into transcribe.
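
On the first note: InferProgress reports progress per file, so a single progress bar needs some aggregation. The sketch below keeps the latest value for each file and returns their unweighted average. createProgressTracker is a hypothetical helper, and it assumes progress uses the same scale for every file; the scale itself is not specified here. Because the average is not weighted by file size, the bar can move backwards when a new file starts downloading.

```typescript
// Mirrors the InferProgress shape from the API section.
interface InferProgress { stage: 'stt' | 'llm' | 'tts'; file?: string; progress?: number; status?: string }

/** Combine per-file progress events into one overall value (hypothetical helper). */
function createProgressTracker() {
  const perFile = new Map<string, number>();
  return {
    /** Record an event and return the average progress across files seen so far. */
    update(p: InferProgress): number {
      if (p.file !== undefined && typeof p.progress === 'number') {
        perFile.set(p.file, p.progress);
      }
      if (perFile.size === 0) return 0;
      let sum = 0;
      perFile.forEach((value) => { sum += value; });
      return sum / perFile.size;
    },
  };
}
```

The tracker's update method can be passed the event from any onProgress callback, with its return value driving the bar.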

host.live: for provider-hosted realtime voice over a relay

Security model: the webgpu delegation and microphone grant
