2026-05-09

On-device speech recognition vs Whisper API: when to use which

Building a voice-first iOS app in 2026 forces an early decision: do you transcribe on-device (Apple's `SFSpeechRecognizer` with the `requiresOnDeviceRecognition` flag) or server-side (OpenAI's Whisper API, or your own self-hosted whisper.cpp)?

Which one you pick will set most of your app's cost, performance, and product positioning downstream. Here's the practical breakdown, learned from shipping blip, a voice note app for Apple Watch where this decision is the entire architecture.

TL;DR

In a freemium app, you almost certainly want both: on-device for free tier (privacy + cost), Whisper for paid tier (accuracy + language coverage).

Apple's `SFSpeechRecognizer` in 2026

iOS 17 made on-device transcription actually viable. Before that, on-device was "supported" but quality lagged the cloud path significantly; as of iOS 17+, on-device English quality is good enough to ship as a default.

Code-wise, the surface area is small:

```swift
import Speech

// Force-unwrap is fine for common locales like en-US; in production, check
// for nil and for recognizer.supportsOnDeviceRecognition first.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true   // audio never leaves the device
request.shouldReportPartialResults = true    // stream interim hypotheses

let task = recognizer.recognitionTask(with: request) { result, error in
    guard let result = result else { return }
    let transcription = result.bestTranscription.formattedString
    // result.isFinal == true when the recognition completes
}
```
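
`SFSpeechAudioBufferRecognitionRequest` only transcribes what you feed it, so the request above needs audio buffers appended from somewhere. A minimal sketch of wiring it to `AVAudioEngine`, assuming the audio session is already configured and microphone/speech permissions are granted:

```swift
import AVFoundation

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode

// Tap the microphone and forward every buffer to the recognition request.
inputNode.installTap(onBus: 0, bufferSize: 1024,
                     format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
    request.append(buffer)
}

audioEngine.prepare()
try audioEngine.start()

// When the user stops recording:
// audioEngine.stop()
// inputNode.removeTap(onBus: 0)
// request.endAudio()   // tells the recognizer no more audio is coming
```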

The `requiresOnDeviceRecognition` flag is the entire privacy story: when `true`, audio never leaves the device, and Apple's documentation guarantees this.

Whisper API in 2026

OpenAI's Whisper has been the gold standard for transcription since whisper-1 shipped, and the large-v3 model (late 2023) closed the remaining gaps in non-English languages.
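
The API surface is a single multipart upload. A minimal sketch of calling the hosted endpoint from Swift; `apiKey` and `fileURL` are placeholders for this example, and error handling and retries are omitted:

```swift
import Foundation

func transcribe(fileURL: URL, apiKey: String) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

    let boundary = UUID().uuidString
    request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")

    var body = Data()
    // "model" field: whisper-1 is the hosted Whisper model name.
    body.append("--\(boundary)\r\nContent-Disposition: form-data; name=\"model\"\r\n\r\nwhisper-1\r\n".data(using: .utf8)!)
    // "file" field: the recorded audio (m4a shown here).
    body.append("--\(boundary)\r\nContent-Disposition: form-data; name=\"file\"; filename=\"audio.m4a\"\r\nContent-Type: audio/m4a\r\n\r\n".data(using: .utf8)!)
    body.append(try Data(contentsOf: fileURL))
    body.append("\r\n--\(boundary)--\r\n".data(using: .utf8)!)
    request.httpBody = body

    let (data, _) = try await URLSession.shared.data(for: request)
    // Default response shape: {"text": "..."}
    struct Response: Decodable { let text: String }
    return try JSONDecoder().decode(Response.self, from: data).text
}
```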

When to use which (decision matrix)

| Scenario | Recommendation |
| --- | --- |
| Free tier of a voice app | On-device — cost goes to zero, privacy is bulletproof, English quality is fine |
| Paid tier with multilingual users | Whisper (cloud or self-hosted) — non-English accuracy is the differentiator users pay for |
| Real-time captioning / live transcription | On-device streaming — latency wins; Whisper streaming exists but is harder to operationalize |
| Long-form transcription (interviews, podcasts) | Whisper — better punctuation, paragraph structure, speaker hints |
| Offline-first product (no network assumption) | On-device only |
| Privacy-sensitive content (legal, medical) | On-device only — or self-hosted Whisper with explicit data agreements |
| Highest absolute accuracy is the product | Whisper large-v3 |

The architecture in blip

For reference, here's the actual decision blip ships with:

- Free tier: on-device `SFSpeechRecognizer` with `requiresOnDeviceRecognition = true`
- Pro tier: server-side Whisper, for the multilingual accuracy and 50+ language coverage heavier users pay for

The free tier keeps cost-per-user at near-zero (no transcription bill) and lets us promise "we never see your audio" honestly. The Pro tier is where users who need 50+ language support pay for the heavier infrastructure.
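
In code, the tier split is ultimately one routing decision. A hypothetical sketch, with names that are illustrative rather than blip's actual code:

```swift
// Hypothetical routing for the two-tier setup described above.
enum TranscriptionEngine {
    case onDevice   // free tier: SFSpeechRecognizer, requiresOnDeviceRecognition = true
    case whisper    // Pro tier: server-side Whisper

    static func select(isPro: Bool, networkAvailable: Bool) -> TranscriptionEngine {
        // Pro users get Whisper when online; everyone else (and everyone
        // offline) falls back to on-device recognition.
        (isPro && networkAvailable) ? .whisper : .onDevice
    }
}
```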

Why this matters for the product

If you're building a voice app and the first thing prospective users ask is "where does my audio go?" — and they will — you have two paths:

  1. Lead with on-device. The privacy story writes itself. Trade-off: English quality is fine but multilingual users will notice the gap.
  2. Lead with Whisper-anywhere. Better quality but you owe users a clear data story (where's the audio, who can read it, do you train on it).

There's no middle ground that also feels honest to users. Pick one as your default, the other as a paid tier.

A small architectural tip

If you go on-device-first: don't store the raw audio file longer than you have to. The audio is the most sensitive piece of data your app holds; the transcript is much less so. blip keeps audio for 7 days by default, then deletes the audio and keeps just the transcript. Users opt-in to longer retention.
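
A sketch of what that retention sweep can look like, assuming recordings live in one directory and transcripts are stored separately; the function name and 7-day default mirror the policy above, but this is illustrative, not blip's code:

```swift
import Foundation

func purgeExpiredAudio(in audioDirectory: URL, olderThan days: Int = 7) throws {
    let cutoff = Date().addingTimeInterval(-Double(days) * 24 * 60 * 60)
    let fm = FileManager.default
    let files = try fm.contentsOfDirectory(
        at: audioDirectory,
        includingPropertiesForKeys: [.creationDateKey]
    )
    for file in files {
        let created = try file.resourceValues(forKeys: [.creationDateKey]).creationDate
        // Delete only the audio; transcripts are stored elsewhere and kept.
        if let created, created < cutoff {
            try fm.removeItem(at: file)
        }
    }
}
```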


If you're shipping a voice app on iOS in 2026 and want to talk through the architecture trade-offs, marcelo@tapblip.com.

tapblip.com is the app this post came out of building.
