ENGINEERING · 05/24/26 · 15 MIN

TapTalk: Building a Free, Local, Open-Source Alternative to SuperWhisper and Wispr Flow

How I built TapTalk - a fully local, privacy-first speech-to-text app for macOS in Rust and SwiftUI. On-device Whisper, Core ML on the Neural Engine, chip-tuned inference, and a global push-to-talk hotkey. No cloud, no subscription.

TL;DR

I built TapTalk - a free, open-source dictation app for macOS that runs Whisper entirely on your Mac. Hold a hotkey, speak, and the text gets pasted into whatever app you're in. No audio leaves the device, no subscription, no account. It's a native SwiftUI + Rust app that runs the Whisper encoder on Apple's Neural Engine via Core ML, detects your exact chip (M1 through M5), and tunes inference parameters to match. This is how I planned it and how I built it - the architecture, the audio pipeline, the chip-tuning, and the parts that were genuinely hard.

The frustration that started this

There's a specific kind of friction I'd made my peace with: I dictate a lot. Notes, commit messages, half-formed thoughts I want to get out of my head before they evaporate. And every good tool for it wanted something from me. SuperWhisper wants a subscription. Wispr Flow is slick but it's a paid product with its own pricing. And a lot of the "free" options route your audio to a server somewhere - which, when the thing you're dictating is a password reset email or an unreleased product idea, is exactly the wrong default.

So the question I kept coming back to was: why is this not just a local app? Whisper runs fine on a laptop. Apple ships a Neural Engine in every Mac I own. The model weights are open. There is no technical reason my voice needs to touch the internet to become text.

TapTalk is my answer. It's a fully local, privacy-first dictation app for macOS. You hold a key, you talk, you release, and the transcription lands in the active text field. Every byte of audio stays on your machine. There's no account, no API key, no recurring cost. It's MIT-licensed and the whole thing is on GitHub.

This post is the build log: how I planned it, the architecture I landed on, and the engineering that actually matters - the audio capture tricks, the voice-activity trimming, and the chip-tuned Core ML inference that makes it fast enough to feel instant. If you want to skip ahead and just use it, there's a download link at the bottom.


The constraints I set before writing a line of code

I've learned the hard way that the interesting decisions in a project are the ones you make at the start, when you're deciding what the thing won't be. Before I wrote any code, I wrote down the rules:

  • Nothing leaves the device. No cloud transcription as the default, no telemetry, no network calls in the hot path. The only time TapTalk touches the network is to download a model the first time you ask for one.
  • No web tech. No Electron, no WebView, no bundled Chromium. A dictation app should not cost 300 MB of RAM at idle. This had to be native - SwiftUI and AppKit, nothing else on the UI side.
  • No bundled models. Whisper Large v3 is 3 GB. Shipping that inside the app bundle is absurd. Models download at runtime to Application Support, and you only pull the tiers you actually want.
  • Apple Silicon only, for v1. I wasn't going to pretend to support Intel Macs I can't optimize for. The whole performance story depends on the Neural Engine and Metal. Scoping to M-series let me go deep instead of wide.
  • The core should be portable. The compute - audio, voice detection, inference - should be a standalone library that doesn't know SwiftUI exists. If I ever want a CLI or a Linux build, the core shouldn't have to change.

That last rule is the one that shaped everything. It's why this isn't a pure-Swift app.


How I planned the architecture

The natural instinct on macOS is to build the whole thing in Swift. But Swift is not where I want to live for real-time audio resampling, ring buffers, and FFI into a C++ inference engine. That's Rust's home turf - predictable memory, zero-cost abstractions, and a mature audio ecosystem.

So I split the app cleanly in two: a Rust core that does all the heavy lifting, and a SwiftUI app that owns the experience. Between them sits UniFFI, Mozilla's tool for generating a typed Swift bridge from annotated Rust. I write the Rust API, run a code generator, and get Swift bindings that feel hand-written. No manual C headers, no @_silgen_name incantations.

The dependency direction is strict and one-way:

[Diagram: SwiftUI views → services → UniFFI bridge → Rust core]

Views depend on services. Services depend on the core through the bridge. The core depends on nothing app-specific - it doesn't import a single Apple UI framework. You could lift core/ out and link it into a command-line tool tomorrow.

Each side plays to its strengths. Rust handles the things that need to be correct and fast under the hood: pulling audio off CoreAudio, resampling it, trimming silence, and driving whisper.cpp. Swift handles the things that need to feel like macOS: a global hotkey, accessibility, pasting into other apps, and the settings UI.

The build glues it together with one command. make run compiles the Rust static library, runs uniffi-bindgen to regenerate the Swift bindings, uses xcodegen to produce the Xcode project from a YAML spec, and then xcodebuild to compile and launch. No checked-in .xcodeproj, no merge conflicts in project files.

cargo build --release   →  libtap_talk_core.a
uniffi-bindgen          →  TapTalkCore.swift (generated bindings)
xcodegen                →  TapTalk.xcodeproj (from project.yml)
xcodebuild              →  TapTalk.app

The pipeline, end to end

Here's what actually happens between you pressing a key and the text appearing. It's worth seeing the whole flow before I dig into each piece, because a few of the design decisions only make sense in context.

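[Diagram: hotkey held → audio capture → resample to 16 kHz → VAD trim → Whisper transcription → paste into the active app]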

Note what's not there: no network call, no upload, no waiting on a server. The whole loop is local. The only round-trip is from your microphone to your Neural Engine and back to your cursor.

Let me walk through the interesting parts.


The global hotkey, and why it fights you

The first thing TapTalk needs is to know when you're holding the key - from any app, even when TapTalk isn't focused. macOS does not make this easy, and for good reason: an app that can see every keystroke system-wide is exactly what a keylogger is.

The mechanism is a CGEventTap installed at the session level. It requires the user to grant Accessibility permission, and then macOS calls a C callback for the events you ask for. TapTalk only listens for flagsChanged - modifier key transitions - because the default hotkey is a modifier (Right Command to dictate, Option for smart mode). The tap is listenOnly, so it observes events without consuming them; your keystrokes still reach whatever app you're using.

โ— โ— โ—SWIFT
let mask: CGEventMask = 1 << CGEventType.flagsChanged.rawValue

guard let tap = CGEvent.tapCreate(
    tap: .cgSessionEventTap,
    place: .headInsertEventTap,
    options: .listenOnly,
    eventsOfInterest: mask,
    callback: hotkeyCallback,
    userInfo: Unmanaged.passUnretained(self).toOpaque()
) else {
    return false
}

That's the easy part. The hard part is that an event tap is not a fire-and-forget thing - it can silently die. macOS will disable a tap if the callback takes too long (tapDisabledByTimeout), and it can disable one when system policy changes. There's also a nastier failure mode where the tap goes inert without ever firing the disable event, usually after the system has been throttling your process.

I handle the explicit disable events inline - when the callback receives tapDisabledByTimeout or tapDisabledByUserInput, it just re-enables the tap and resets the held-key state:

โ— โ— โ—SWIFT
if type == .tapDisabledByTimeout || type == .tapDisabledByUserInput {
    if let tap = service.eventTap {
        service.isHeld = false
        service.smartIsHeld = false
        CGEvent.tapEnable(tap: tap, enable: true)
    }
    return Unmanaged.passUnretained(event)
}

For the silent-death case, there's a health check: a 5-second timer that asks CGEvent.tapIsEnabled(tap:) and, if the tap has gone inert, re-enables it - or rebuilds it from scratch if re-enabling doesn't take. It's defensive, but the alternative is a dictation app that just stops responding to its hotkey until you relaunch it. That's the kind of bug that makes people uninstall.

There's one more subtlety. macOS App Nap throttles apps that aren't focused, and a throttled app's event tap can stall. Since TapTalk lives in the menu bar and is almost never the focused app, this would have been fatal. The fix is to tell the OS the app is doing user-initiated work via ProcessInfo.processInfo.beginActivity(options:reason:), which keeps it out of the nap state.


Capturing audio without re-triggering the mic prompt

Once the key goes down, the Rust core starts recording. This is where I hit a problem that doesn't show up in any tutorial.

The obvious design is: on key-down, build a CoreAudio input stream; on key-up, tear it down. Clean, symmetric. But the first time a stream touches the microphone, macOS may need to validate microphone permission through its privacy system (TCC). If that validation happens inside the hotkey callback, it can block the main thread - and you lose the key-up event entirely. The user holds the key, talks, releases, and nothing happens.

The fix is a persistent stream. TapTalk builds the CoreAudio stream once and keeps it alive across recordings. Recording is just a flag the audio callback checks; starting and stopping is play() and pause(), not create and destroy. The AudioUnit stays warm, so there's no permission round-trip in the hot path.

โ— โ— โ—RUST
// Creates the CoreAudio stream once; subsequent calls are a no-op.
// Keeping the stream alive across recordings prevents macOS TCC
// from re-validating mic permission on each start().
fn ensure_stream(&self) -> Result<u32, String> {
    let mut guard = self.stream_state.lock()
        .map_err(|e| format!("lock: {e}"))?;
    if let Some(ref s) = *guard {
        return Ok(s.source_sample_rate);
    }
    // ... build the stream once, store it ...
}

I take it a step further and pre-warm the stream at launch, in the background, so the very first dictation is as fast as the hundredth:

โ— โ— โ—RUST
/// Pre-creates the CoreAudio stream so TCC validation happens early,
/// not inside the hotkey callback where it would block the main thread.
pub fn warm_up(&self) -> Result<(), String> {
    self.ensure_stream().map(|_| ())
}

The audio callback itself is deliberately dumb - it runs on the real-time audio thread, so it does the absolute minimum: check the recording flag, downmix to mono if needed, and append to a shared buffer. No allocation games and no blocking locks: it uses try_lock and drops the frame rather than stall the audio thread.
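A minimal sketch of that callback body, using only the standard library. The names CaptureState and on_audio_block are hypothetical, not TapTalk's actual types, and the real callback is invoked by the audio backend rather than called by hand.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical state shared between the audio thread and the rest of the core.
pub struct CaptureState {
    pub recording: AtomicBool,
    pub samples: Mutex<Vec<f32>>,
}

/// Body of the audio callback. Returns false when the block was dropped:
/// either we are not recording, or the buffer lock was unavailable.
pub fn on_audio_block(state: &CaptureState, interleaved: &[f32], channels: usize) -> bool {
    if !state.recording.load(Ordering::Relaxed) {
        return false;
    }
    // Never wait on the real-time thread: if the lock is busy (or poisoned),
    // drop this block instead of blocking.
    let Ok(mut buf) = state.samples.try_lock() else {
        return false;
    };
    if channels <= 1 {
        buf.extend_from_slice(interleaved);
    } else {
        // Downmix by averaging the channels of each interleaved frame.
        buf.extend(
            interleaved
                .chunks_exact(channels)
                .map(|frame| frame.iter().sum::<f32>() / channels as f32),
        );
    }
    true
}
```

Dropping a block under contention costs a few milliseconds of audio at worst, whereas a blocked audio thread can glitch the whole stream.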

Resampling to what Whisper wants

Whisper is hard-wired to 16 kHz mono. Your microphone almost certainly isn't - most Macs capture at 48 kHz. So on stop, if the device rate isn't already 16 kHz, the samples get resampled with rubato using a high-quality sinc interpolator:

โ— โ— โ—RUST
let params = SincInterpolationParameters {
    sinc_len: 256,
    f_cutoff: 0.95,
    interpolation: SincInterpolationType::Linear,
    oversampling_factor: 256,
    window: WindowFunction::BlackmanHarris2,
};

Resampling quality matters more than you'd think for transcription accuracy - a cheap linear resample introduces aliasing artifacts that a model trained on clean 16 kHz audio reads as noise. (The Linear setting above is not that kind of resample: it only interpolates between neighboring points of the 256x-oversampled sinc.) A 256-tap windowed sinc is overkill for a phone call and exactly right for not degrading recognition.


Trimming silence with Silero VAD

People don't start talking the instant they press a key, and they don't stop the instant they release. There's dead air on both ends. Feeding that silence to Whisper is wasted compute and an invitation for the model to hallucinate words into the quiet.

So before transcription, the audio passes through a voice activity detector - Silero VAD, via the voice_activity_detector crate. It chops the audio into 512-sample chunks (32 ms windows at 16 kHz), scores each chunk for the probability of speech, and finds the first and last chunk above a 0.5 confidence threshold.

โ— โ— โ—RUST
const SAMPLE_RATE: i64 = 16_000;
const CHUNK_SIZE: usize = 512;
const SPEECH_THRESHOLD: f32 = 0.5;
const PAD_CHUNKS: usize = 4;

The one detail I care about here is the padding. I don't trim exactly to the speech boundaries - I keep 4 chunks (about 128 ms) of audio on either side. Cut too tight and you clip the leading consonant of the first word or the trailing sound of the last one. That little bit of breathing room is the difference between "transcribe this" and "ranscribe thi". If the detector finds no speech at all, TapTalk returns nothing rather than handing Whisper pure silence to invent from.
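The trimming logic can be sketched as a pure function over the per-chunk speech probabilities. This assumes the detector has already produced one score per 512-sample chunk; speech_range is a hypothetical name, and the constants mirror the ones shown above.

```rust
use std::ops::Range;

const CHUNK_SIZE: usize = 512;
const SPEECH_THRESHOLD: f32 = 0.5;
const PAD_CHUNKS: usize = 4;

/// Given one speech probability per chunk, returns the range of samples to
/// keep, or None when no chunk crosses the threshold.
pub fn speech_range(probs: &[f32], total_samples: usize) -> Option<Range<usize>> {
    let first = probs.iter().position(|&p| p > SPEECH_THRESHOLD)?;
    let last = probs.iter().rposition(|&p| p > SPEECH_THRESHOLD)?;

    // Widen the window by PAD_CHUNKS on both sides, clamped to the clip.
    let start_chunk = first.saturating_sub(PAD_CHUNKS);
    let end_chunk = (last + 1 + PAD_CHUNKS).min(probs.len());

    let start = (start_chunk * CHUNK_SIZE).min(total_samples);
    let end = if end_chunk == probs.len() {
        // Padding reached the end of the clip: keep any trailing partial chunk.
        total_samples
    } else {
        (end_chunk * CHUNK_SIZE).min(total_samples)
    };
    Some(start..end)
}
```

With speech in chunks 6 through 9 of a 20-chunk clip, the kept range runs from chunk 2 to chunk 14, which is the 128 ms of breathing room on each side.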


The part I'm most proud of: chip-tuned Core ML inference

This is where TapTalk goes from "works" to "feels instant," and it's the part that took the most iteration. The headline of the latest release is a single commit: Core ML encoder download + chip-tuned whisper params. Here's what's behind it.

Two engines in one model

Whisper is an encoder-decoder model. The encoder turns your audio into a dense representation - it's the expensive half, especially for the larger models. The decoder turns that representation into text token by token - comparatively cheap.

The inference backend is whisper.cpp (through whisper-rs), and the key feature I lean on is that whisper.cpp can run the encoder as a Core ML model on Apple's Neural Engine - a dedicated accelerator that is separate from both the CPU and the GPU. The decoder stays on the CPU (GGML). Offloading the heavy encoder to the Neural Engine frees the CPU and is dramatically faster than running it on CPU cores. The in-app banner that nudges you to enable it isn't exaggerating when it says transcription can run several times faster - the encoder is the bottleneck, and the Neural Engine eats it for breakfast.

That's why a model in TapTalk is actually two downloads: a .bin GGML file (the full model, required) and a .mlmodelc Core ML bundle (the accelerated encoder, optional). Which brings up a design decision I'm happy with.

Core ML is an optimization, not a dependency

The Core ML encoder is a performance feature, not a correctness feature. The GGML model alone transcribes perfectly well - just slower. So the download is split into two phases, and a failure in the second phase doesn't sink the install:

โ— โ— โ—RUST
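pub fn download(&self, tier: u8, progress_cb: &dyn Fn(DownloadProgress)) {
    // `t` is this tier's entry in the model table (it holds the filenames);
    // the lookup is elided in this excerpt.
    let ggml_result = self.download_ggml(tier, t.ggml_filename, progress_cb);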

    if ggml_result.is_ok() {
        // Core ML encoder is a perf optimization, not a correctness requirement.
        // Failures here must not abort the install - `.bin` alone still transcribes.
        if let Err(e) = self.download_coreml_inner(tier, t.coreml_filename, progress_cb) {
            eprintln!("tt-coreml: download failed for tier {tier}: {e}");
        }
    }
    // ...
}

There's also a download_coreml_only path so you can retroactively add Neural Engine acceleration to a model you already installed - the UI surfaces it as an "optimize" prompt rather than forcing a full re-download.

Detecting the exact chip

Here's the insight that makes the tuning work: not all Apple Silicon behaves the same. An M1 is not an M2 is not an M5. They have different numbers of performance cores and meaningfully different GPU behavior. A one-size-fits-all parameter set leaves performance on the table on the big chips and risks instability on the small ones.

So TapTalk asks the OS exactly what it's running on, via sysctl:

โ— โ— โ—RUST
fn detect_uncached() -> ChipInfo {
    let brand = sysctl_string("machdep.cpu.brand_string").unwrap_or_default();
    let family = parse_family(&brand);             // "Apple M3 Pro" -> ChipFamily::M3
    let performance_cores = sysctl_u32("hw.perflevel0.physicalcpu").unwrap_or(4);
    ChipInfo { family, performance_cores }
}

machdep.cpu.brand_string gives the marketing name ("Apple M3 Pro"), which I parse down to a family. hw.perflevel0.physicalcpu gives the count of performance cores specifically - not the efficiency cores, which you don't want Whisper scheduled onto. The result is cached in a OnceLock, so the sysctl syscalls happen exactly once for the lifetime of the process.
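The brand-string parsing step could look like the sketch below. It assumes the string always has the shape "Apple M<number>" with an optional suffix such as Pro or Max; the Unknown variant is my addition for anything that doesn't match, and the real parse_family may differ.

```rust
/// Chip generations named in this post, plus a hypothetical catch-all.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChipFamily {
    M1,
    M2,
    M3,
    M4,
    M5,
    Unknown,
}

/// Maps a brand string such as "Apple M3 Pro" to its chip family.
pub fn parse_family(brand: &str) -> ChipFamily {
    let Some(rest) = brand.trim().strip_prefix("Apple M") else {
        return ChipFamily::Unknown;
    };
    // Take the generation digits and ignore suffixes like " Pro" or " Max".
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    match digits.as_str() {
        "1" => ChipFamily::M1,
        "2" => ChipFamily::M2,
        "3" => ChipFamily::M3,
        "4" => ChipFamily::M4,
        "5" => ChipFamily::M5,
        _ => ChipFamily::Unknown,
    }
}
```

Matching on the full digit run rather than the first character means a future two-digit generation falls through to Unknown instead of being mistaken for an M1.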

The sysctl calls themselves are the one place I reach for unsafe, since they're raw libc. They're scoped tightly, with a // SAFETY: comment on each, and wrapped so the rest of the codebase never sees a raw pointer.

Tuning the knobs

With the chip identified, three parameters get tuned.

Flash attention - but only where it's safe. Flash attention is a faster GPU attention implementation that gives roughly a 20% encoder speedup. The catch: the M1 GPU is flaky with flash attention on certain attention shapes - it can produce garbage or crash. M2 and later are stable. So I gate it on the chip family:

โ— โ— โ—RUST
let chip = platform::detect();
let mut params = WhisperContextParameters::default();
// M1 GPU is known to be flaky with flash-attention on some attention shapes.
// Enable only on M2+ where it's stable and yields ~20% encoder speedup.
if matches!(chip.family, ChipFamily::M2 | ChipFamily::M3 | ChipFamily::M4 | ChipFamily::M5) {
    params.flash_attn(true);
}

Thread count matched to real cores. whisper.cpp defaults to a fixed thread count. But spawning more threads than you have performance cores just adds scheduling overhead, and spawning fewer leaves the chip idle. So the thread count is set to the actual performance-core count the OS reported - 4 on a base M1, more on the Pro and Max parts:

โ— โ— โ—RUST
params.set_n_threads(self.chip.performance_cores.max(2) as i32);

Dynamic audio context - the one I like most. This is a small formula with an outsized effect. Whisper's encoder cost scales with how many audio-context tokens it processes, and it works out to roughly 50 tokens per second of audio. The default budget is 1500 tokens - enough for a full 30-second window. But dictation is short. If you said one sentence, paying for 30 seconds of encoder work is pure waste.

So I size the audio context to the actual length of what you said:

โ— โ— โ—RUST
// Encoder cost scales with audio_ctx tokens. ~50 tokens per second of audio.
// Default 1500 over-processes short utterances; cap trims it without affecting long clips.
let secs = (samples.len() as f32 / 16_000.0).ceil() as i32;
let audio_ctx = ((secs * 50) + 64).clamp(256, 1500);
params.set_audio_ctx(audio_ctx);

A two-second "yes, ship it" gets the 256-token floor, about a sixth of the default, and finishes almost instantly. A long monologue gets the full budget. The floor of 256 keeps very short clips from being under-processed, and the ceiling of 1500 is Whisper's full 30-second window, so long utterances never get less context than the default. The whole thing is two lines of arithmetic, and it's one of the biggest perceived-latency wins in the app.
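To make the numbers concrete, here is the same arithmetic pulled into a standalone function. The name audio_ctx_for is hypothetical; the formula is the one shown above.

```rust
/// Audio context size for a clip of `sample_count` samples at 16 kHz:
/// about 50 tokens per second plus 64 of headroom, clamped to 256..=1500.
pub fn audio_ctx_for(sample_count: usize) -> i32 {
    let secs = (sample_count as f32 / 16_000.0).ceil() as i32;
    ((secs * 50) + 64).clamp(256, 1500)
}
```

A 2-second clip (32,000 samples) computes 164 and is raised to the 256 floor. A 10-second clip gets 564, roughly a third of the default. A full 30 seconds computes 1564 and is capped at 1500. Because the duration is rounded up to a whole second, anything up to 3 seconds sits on the floor.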

Here's the decision flow the engine runs through every time it loads a model and transcribes:

[Diagram: decision flow for loading a model and transcribing]

A note on honesty: I haven't published a rigorous benchmark suite, so I'm not going to throw a precise milliseconds table at you. What I can say is that these changes - Neural Engine encoding plus matched threads plus right-sized audio context - are the difference between dictation that feels like a network request and dictation that feels like the text was already there.


Model tiers: download what you need

There are four tiers, and they trade speed for accuracy. They live in a single table in the Rust core, and you pull only the ones you want:

Tier  Model           GGML size  Best for
1     Tiny            75 MB      Fast notes, low-stakes text
2     Small           466 MB     A good everyday balance
3     Large v3 Turbo  1.6 GB     Recommended - near-Large accuracy, much faster
4     Large v3        3.0 GB     Maximum accuracy

Models download from the public whisper.cpp repository on Hugging Face into ~/Library/Application Support/talk.tap.app/models/. Nothing is bundled in the app - a fresh TapTalk install is small, and your disk only fills up with the tiers you actually chose. The download streams in chunks with progress reported back to the UI, writes to a .partial file, and atomically renames on completion so a half-finished download can never masquerade as a real model.
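The write-then-rename pattern can be sketched with the standard library alone. This is an illustration under assumptions rather than TapTalk's downloader: install_atomically is a hypothetical name, the source is any reader (an HTTP response body in practice), and the progress callback just receives the running byte count.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

/// Streams `source` into "<dest>.partial", then renames it onto `dest`.
/// A crash or error mid-download leaves only the .partial file behind.
pub fn install_atomically<R: Read>(
    mut source: R,
    dest: &Path,
    mut on_progress: impl FnMut(u64),
) -> io::Result<()> {
    let mut partial_name = dest.as_os_str().to_owned();
    partial_name.push(".partial");
    let partial = PathBuf::from(partial_name);

    let mut file = File::create(&partial)?;
    let mut buf = [0u8; 64 * 1024];
    let mut written = 0u64;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        file.write_all(&buf[..n])?;
        written += n as u64;
        on_progress(written);
    }
    // Flush to disk before the rename makes the file visible under its real name.
    file.sync_all()?;
    drop(file);
    fs::rename(&partial, dest)
}
```

The rename is atomic only because the .partial file sits in the same directory, and therefore on the same filesystem, as the final path.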


Smart Mode: transcription that knows where you are

Plain dictation gives you exactly what you said, filler words and all. That's perfect for some apps and wrong for others. A Slack message wants to be casual. A terminal wants a single shell command, not a sentence about a command. An email wants to be cleaned up and professional.

So TapTalk has a second hotkey - Smart Mode (Option by default) - that runs the raw transcript through a local LLM before pasting. And the clever bit is that the rewrite is context-aware: it looks at which app is frontmost and adjusts the instruction accordingly.

โ— โ— โ—SWIFT
return "\(base) The user is currently in \(app). Based on what this application is typically used for, decide the most appropriate output format: if it is a code editor, output code for coding instructions or clean prose for comments/docs; if it is a terminal, output a shell command on one line; if it is a messaging app, write a casual concise message; if it is an email client, write a professional email body; if it is a note-taking app, structure as clean notes with bullets where appropriate. ..."

The LLM backend is pluggable. You can point it at any OpenAI-compatible endpoint - a local Ollama or LM Studio server, or a hosted API if you choose to. Or you can let TapTalk run a small model entirely on-device: it manages a local llama-server subprocess running Qwen2.5-1.5B, so even the rewriting stays local if you want it to. Either way, the default of plain transcription touches no model beyond Whisper.

Before any of that, there's a deterministic layer: a word dictionary. You define replacement rules grouped by category - technical jargon, proper nouns, your own shorthand - and they're applied as a longest-match-first pass on the transcript. It's the reliable way to make sure "kubernetes" and your colleague's unusual name come out right every time, without depending on a model to guess.
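A longest-match-first pass can be sketched in a few lines. The assumptions here are mine: rules are plain (from, to) string pairs, matching is case-sensitive, and word boundaries are ignored, none of which necessarily matches TapTalk's dictionary. The name apply_dictionary is hypothetical.

```rust
/// Applies replacement rules in a single left-to-right scan, trying longer
/// patterns first so that "cube control" wins over "cube".
pub fn apply_dictionary(text: &str, rules: &[(&str, &str)]) -> String {
    let mut sorted: Vec<(&str, &str)> = rules
        .iter()
        .copied()
        .filter(|(from, _)| !from.is_empty())
        .collect();
    // Longest pattern first; the stable sort keeps rule order among equal lengths.
    sorted.sort_by_key(|rule| std::cmp::Reverse(rule.0.len()));

    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    'scan: while let Some(ch) = rest.chars().next() {
        for &(from, to) in &sorted {
            if rest.starts_with(from) {
                out.push_str(to);
                rest = &rest[from.len()..];
                continue 'scan;
            }
        }
        // No rule matched here: copy one character and move on.
        out.push(ch);
        rest = &rest[ch.len_utf8()..];
    }
    out
}
```

Because replaced text is written straight to the output and never rescanned, one rule's result cannot trigger another rule, which keeps the pass deterministic.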


What I'd do differently, and what's next

A few honest edges:

  • It isn't notarized yet. Right now Gatekeeper will grumble on first launch and you have to right-click → Open (or strip the quarantine attribute). Notarization is a paperwork problem, not an engineering one, and it's on the list.
  • The benchmarks are vibes, not numbers. I tuned against perceived latency and my own daily use. A proper benchmark harness - latency per tier per chip, with and without Core ML - would let me make precise claims instead of qualitative ones, and would catch regressions.
  • Apple Silicon only. The whole performance story is built on the Neural Engine and Metal. An Intel or Linux build would mean a CPU-only path and a different perf profile. The core is portable enough to make it possible; it just wasn't the point of v1.
  • The persistent-stream trick keeps the mic warm. The stream is paused, not destroyed, between recordings, so the OS mic indicator behaves - but it's the kind of thing I want to keep an eye on across macOS releases, since TCC behavior shifts.

The architecture leaves room for all of it. Because the core is a standalone Rust library behind a thin FFI, the hard parts - audio, VAD, inference - don't care whether the front end is SwiftUI, a CLI, or something I haven't built yet.


Conclusion

The thing I keep coming back to is that none of this needed the cloud. The frustration I started with - pay a subscription, or send my voice to a server - was never a technical necessity. It was a default. Whisper runs locally. The Neural Engine is sitting there in every Mac. The weights are open. Put a careful Rust core under a native SwiftUI shell, teach it which chip it's on, and you get dictation that's private, free, and fast enough that you stop thinking about it.

That's TapTalk. It's open source, it's MIT-licensed, and it's the tool I now use every day instead of the paid ones.

KEY TAKEAWAY

Local-first isn't a compromise - it's often the better architecture. On-device Whisper with a Neural Engine encoder, parameters tuned to the exact chip, and a tiny formula that sizes inference to what you actually said add up to dictation that's private and fast. No subscription, no audio leaving your Mac, no excuse for the cloud default.


Download

TapTalk is free and open source. Grab the pre-built app or build it yourself.

Once the repository is cloned, building from source is one command:

โ— โ— โ—BASH
git clone https://github.com/vakharwalad23/tap-talk.git
cd tap-talk
make run

If macOS flags the app as "damaged" on first launch, that's just Gatekeeper being cautious about an unnotarized build - clear the quarantine flag:

โ— โ— โ—BASH
xattr -dr com.apple.quarantine "/Applications/TapTalk.app"
