On-device speech recognition vs Whisper API: when to use which
Building a voice-first iOS app in 2026 forces an early decision: do you transcribe on-device (Apple's SFSpeechRecognizer with the requiresOnDeviceRecognition flag) or server-side (OpenAI's Whisper API, or your own self-hosted whisper.cpp)?
Most of your app's cost, performance, and product positioning downstream of this decision is set by which path you pick. Here's the practical breakdown, learned from shipping blip — a voice note app for Apple Watch where this decision is the entire architecture.
TL;DR
- On-device (Apple Speech) wins on: latency, privacy, offline reliability, free tier economics
- Whisper wins on: accuracy, language support beyond English, punctuation and formatting
In a freemium app, you almost certainly want both: on-device for free tier (privacy + cost), Whisper for paid tier (accuracy + language coverage).
Apple's `SFSpeechRecognizer` in 2026
iOS 17 made on-device transcription actually viable. Before that, on-device was "supported" but quality lagged the cloud path significantly. As of iOS 17+:
- Set `requiresOnDeviceRecognition = true` to force the local path. Apple ships a per-language model on the device.
- Quality: in English, near-parity with Apple's cloud transcription. Real-time. Punctuation included.
- Latency: streaming, ~100-300ms behind speech, depending on device.
- Limits: about 60 seconds per recognition request (longer sessions require re-arming with a fresh request). Watch has a smaller model than iPhone. iPhone Air (M-series) handles an hour without breaking a sweat.
- Languages: roughly 60 supported on-device, but quality drops outside English/Spanish/Mandarin.
Code-wise, the surface area is small:

```swift
import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true
request.shouldReportPartialResults = true

let task = recognizer.recognitionTask(with: request) { result, error in
    guard let result = result else { return }
    let transcription = result.bestTranscription.formattedString
    // result.isFinal == true when the recognition completes
}
```
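The ~60-second per-request ceiling means long recordings need re-arming: when a task finalizes (or errors), start a fresh request and keep feeding it buffers. A minimal sketch under that assumption — the `SegmentedTranscriber` class and its re-arm-on-error behavior are illustrative, not blip's actual code:

```swift
import Speech
import AVFoundation

/// Restartable recognition: each segment is one SFSpeechAudioBufferRecognitionRequest.
/// When a segment finalizes (or errors near the ~60 s ceiling), we keep its text
/// and immediately arm the next request so the recording continues uninterrupted.
final class SegmentedTranscriber {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?
    private(set) var fullTranscript = ""
    private var stopped = false

    func startSegment() {
        let req = SFSpeechAudioBufferRecognitionRequest()
        req.requiresOnDeviceRecognition = true
        req.shouldReportPartialResults = true
        request = req
        task = recognizer.recognitionTask(with: req) { [weak self] result, error in
            guard let self, !self.stopped else { return }
            if let result, result.isFinal {
                // Segment done: append its text and re-arm for the next chunk.
                self.fullTranscript += result.bestTranscription.formattedString + " "
                self.startSegment()
            } else if error != nil {
                // On failure, re-arm rather than losing the rest of the recording.
                self.startSegment()
            }
        }
    }

    // Call from your AVAudioEngine tap for every captured buffer.
    func append(_ buffer: AVAudioPCMBuffer) {
        request?.append(buffer)
    }

    func stop() {
        stopped = true
        request?.endAudio()
        task = nil
    }
}
```

In practice you'd also want to de-duplicate the partials across segment boundaries; this sketch only handles the re-arming itself.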
The `requiresOnDeviceRecognition` flag is the entire privacy story: when it's true, audio never leaves the device, and Apple's documentation guarantees this.
Whisper API in 2026
OpenAI's Whisper has been the gold standard for transcription since whisper-1 shipped, and the whisper-large-v3 model (late 2023) closed the remaining gaps in non-English languages.
- Accuracy: state-of-the-art across ~100 languages. Outperforms Apple Speech particularly on accented English, code-switching (mixing languages in one sentence), and uncommon proper nouns.
- Latency: API call: 200-800ms for a short clip, 2-4 seconds for a 60-second clip. Streaming via WebSocket reduces this for live transcription, but adds complexity.
- Cost (2026): $0.006 per minute via OpenAI's API. For a freemium app, this is the single biggest cost line if you let Whisper run on every free user: 10,000 free users averaging 5 minutes a day is 50,000 minutes, or $300/day — $9,000/month. Cap the free tier or keep it on-device.
- Privacy: by default, OpenAI processes your audio. Their data policy (as of 2026) is "we don't train on API data," but the audio still touches their servers. For sensitive content, this is a deal-breaker.
- Self-hosted alternative: `whisper.cpp` with the `medium` model runs comfortably on a Mac mini or a mid-tier Hetzner CPX21, and gets you ~80% of the cloud quality. This is the setup blip's Pro tier runs on (with the larger `large-v3` model).
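For the cloud path, the API surface is a single multipart upload. A hedged sketch of calling OpenAI's `/v1/audio/transcriptions` endpoint from Swift — the `apiKey` and `audioURL` values are placeholders, and real code needs HTTP-status and error handling beyond what's shown:

```swift
import Foundation

struct WhisperResponse: Decodable {
    let text: String
}

func transcribe(audioURL: URL, apiKey: String) async throws -> String {
    var request = URLRequest(
        url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

    let boundary = UUID().uuidString
    request.setValue("multipart/form-data; boundary=\(boundary)",
                     forHTTPHeaderField: "Content-Type")

    var body = Data()
    // "model" form field.
    body.append("--\(boundary)\r\n".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"model\"\r\n\r\n".data(using: .utf8)!)
    body.append("whisper-1\r\n".data(using: .utf8)!)
    // "file" form field with the raw audio bytes.
    body.append("--\(boundary)\r\n".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"file\"; filename=\"audio.m4a\"\r\n"
        .data(using: .utf8)!)
    body.append("Content-Type: audio/m4a\r\n\r\n".data(using: .utf8)!)
    body.append(try Data(contentsOf: audioURL))
    body.append("\r\n--\(boundary)--\r\n".data(using: .utf8)!)

    let (data, _) = try await URLSession.shared.upload(for: request, from: body)
    return try JSONDecoder().decode(WhisperResponse.self, from: data).text
}
```

The same shape works against a self-hosted `whisper.cpp` server — swap the URL and drop the bearer token.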
When to use which (decision matrix)
| Scenario | Recommendation |
|---|---|
| Free tier of a voice app | On-device — cost goes to zero, privacy is bulletproof, English quality is fine |
| Paid tier with multilingual users | Whisper (cloud or self-hosted) — non-English accuracy is the differentiator users pay for |
| Real-time captioning / live transcription | On-device streaming — latency wins; Whisper streaming exists but is harder to operationalize |
| Long-form transcription (interviews, podcasts) | Whisper — better punctuation, paragraph structure, speaker hints |
| Offline-first product (no network assumption) | On-device only |
| Privacy-sensitive content (legal, medical) | On-device only — or self-hosted Whisper with explicit data agreements |
| Highest absolute accuracy is the product | Whisper large-v3 |
The architecture in blip
For reference, here's the actual decision blip ships with:
- Free tier: 100% on-device via `SFSpeechRecognizer` with `requiresOnDeviceRecognition = true`. No network call ever. Recordings save to local SQLite, transcripts attach inline. Audio never leaves the device.
- Pro tier ($4.99/mo): same on-device flow, plus an opt-in Whisper re-transcription pass that runs server-side for the user's chosen languages. The server is our own infrastructure (not OpenAI), running `whisper.cpp` large-v3, encrypted at rest, no training. Pro users can also enable webhooks that ship the transcript to their S3 / Notion / etc.
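The tier split above reduces to a few lines of routing. A sketch of how it might look — the `Tier` enum and `TranscriptionPlan` struct are illustrative assumptions, not blip's actual types:

```swift
import Foundation

enum Tier {
    case free, pro
}

struct TranscriptionPlan {
    let onDevice: Bool      // both tiers transcribe locally first
    let serverRePass: Bool  // Pro-only, and only if the user opted in
}

func plan(for tier: Tier, userOptedIntoCloud: Bool) -> TranscriptionPlan {
    switch tier {
    case .free:
        // Free tier never makes a network call: audio stays on the device.
        return TranscriptionPlan(onDevice: true, serverRePass: false)
    case .pro:
        // Pro adds the opt-in server-side whisper.cpp re-transcription pass.
        return TranscriptionPlan(onDevice: true, serverRePass: userOptedIntoCloud)
    }
}
```

The design point worth copying is that `serverRePass` is an explicit opt-in even for paying users: the default for every tier is that audio stays local.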
The free tier keeps cost-per-user at near-zero (no transcription bill) and lets us promise "we never see your audio" honestly. The Pro tier is where users who need 50+ language support pay for the heavier infrastructure.
Why this matters for the product
If you're building a voice app and the first thing prospective users ask is "where does my audio go?" — and they will — you have two paths:
- Lead with on-device. The privacy story writes itself. Trade-off: English quality is fine but multilingual users will notice the gap.
- Lead with Whisper everywhere. Better quality, but you owe users a clear data story (where the audio goes, who can read it, whether you train on it).
There's no middle ground that also feels honest to users. Pick one as your default, the other as a paid tier.
A small architectural tip
If you go on-device-first: don't store the raw audio file longer than you have to. The audio is the most sensitive piece of data your app holds; the transcript is much less so. blip keeps audio for 7 days by default, then deletes the audio and keeps just the transcript. Users opt-in to longer retention.
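That retention policy is a few lines of `FileManager` work. A sketch of a cleanup pass that deletes audio older than a cutoff while leaving transcripts (stored elsewhere) untouched — the directory layout and the 7-day default are assumptions matching the policy described above:

```swift
import Foundation

/// Delete audio files older than `days` in `audioDir`. Transcripts live in a
/// separate store and are untouched, so the least-sensitive artifact survives.
func pruneAudio(in audioDir: URL, olderThan days: Int = 7) throws {
    let cutoff = Date().addingTimeInterval(-Double(days) * 24 * 60 * 60)
    let files = try FileManager.default.contentsOfDirectory(
        at: audioDir,
        includingPropertiesForKeys: [.contentModificationDateKey]
    )
    for file in files {
        let values = try file.resourceValues(forKeys: [.contentModificationDateKey])
        if let modified = values.contentModificationDate, modified < cutoff {
            try FileManager.default.removeItem(at: file)
        }
    }
}
```

Run it on app launch or from a background task; either way, make the user-facing retention number and this cutoff come from the same setting.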
If you're shipping a voice app on iOS in 2026 and want to talk through the architecture trade-offs, marcelo@tapblip.com.
→ tapblip.com for the app this post came out of building.