Since launching Slipbox on macOS, we’ve seen steady early adoption. But with 100.4 million Mac users compared to 1.46 billion iPhone users, an iOS version of Slipbox was the obvious next step. The iOS app offers the same private, personalized, and unlimited transcription as the macOS app, and adds the mobility only a phone can provide. Beyond the iOS app, we've also added Speaker Identification, which automatically labels different voices, and BYOK, which gives users control over their AI providers. Running both Speaker Identification and iOS transcription locally was made possible by Fluid Inference, a platform for running AI models efficiently on-device. Their FluidAudio package enabled both features to run on Apple's Neural Engine.
Slipbox on iOS
Since people carry their phones everywhere, they can transcribe conversations wherever they happen. This isn't practical with a Mac: you can't predict when you'll need to transcribe something, and carrying a laptop everywhere isn't realistic. Macs also have to stay awake, because background processes like transcription are suspended when the machine sleeps. The iOS app also includes mobile-specific features like a Live Activity that lets you watch your conversation being transcribed right from your lock screen.
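
For a sense of what powers the lock-screen view, here's a minimal ActivityKit sketch. The `TranscriptionAttributes` type and its fields are hypothetical stand-ins rather than Slipbox's actual code; the request/update flow is the standard ActivityKit pattern.

```swift
import ActivityKit

// Hypothetical attributes for a live transcription session.
struct TranscriptionAttributes: ActivityAttributes {
    struct ContentState: Codable, Hashable {
        var latestLine: String   // most recent transcribed sentence
        var elapsedSeconds: Int  // recording duration shown on the lock screen
    }
    var sessionTitle: String
}

// Start the Live Activity when recording begins.
func startTranscriptionActivity() throws -> Activity<TranscriptionAttributes> {
    let attributes = TranscriptionAttributes(sessionTitle: "New Recording")
    let initial = TranscriptionAttributes.ContentState(latestLine: "Listening…", elapsedSeconds: 0)
    return try Activity.request(
        attributes: attributes,
        content: ActivityContent(state: initial, staleDate: nil)
    )
}

// Push each newly transcribed line to the lock screen.
func update(_ activity: Activity<TranscriptionAttributes>, line: String, elapsed: Int) async {
    let state = TranscriptionAttributes.ContentState(latestLine: line, elapsedSeconds: elapsed)
    await activity.update(ActivityContent(state: state, staleDate: nil))
}
```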

To make real-time transcription with a Live Activity possible, we needed to run transcription locally while the app is backgrounded. Initially we tried Whisper models, but ran into several issues. The larger Whisper models had the accuracy we needed, but they were too slow for real-time transcription and took minutes to load; the smaller models loaded quickly and ran in real time, but lacked accuracy. On top of that, Apple prohibits GPU usage while apps are backgrounded. That's a huge problem, because falling back to the CPU for real-time transcription drains the battery fast, making long recordings difficult. Together, these factors make Whisper models impractical for real-time transcription on iOS.
Fluid Inference solved this for us through their FluidAudio package. It let us run smaller yet accurate models like nvidia/parakeet-tdt-0.6b-v3 in real time on Apple’s Neural Engine (ANE). Unlike the GPU, the ANE isn’t blocked when the app is backgrounded, and it's designed to be energy efficient, which was exactly what we needed for the iOS app. Although Parakeet solved our performance issues, it doesn't support as many languages as Whisper, so the iOS app will support fewer languages for the foreseeable future; adding more language support to Parakeet would make the model too big for phones.
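
The mechanism that makes this possible is Core ML's compute-unit selection, which FluidAudio manages for us. As a generic illustration (the loader below is our own placeholder, not FluidAudio's API), pinning a model to the CPU and Neural Engine keeps it off the GPU, so inference stays eligible to run while the app is backgrounded:

```swift
import CoreML

// Load a Core ML model pinned to the CPU and Neural Engine.
// Excluding the GPU matters on iOS: GPU work is disallowed while
// an app is backgrounded, but the ANE remains available and is
// far more power-efficient for sustained real-time inference.
func loadANEModel(at url: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine  // never dispatch to the GPU
    return try MLModel(contentsOf: url, configuration: config)
}
```
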
Speaker Identification
With speaker identification, Slipbox detects different voices in real time and assigns each one a label like Speaker 1 or Speaker 2. You can then rename these labels to real names, and since the labels appear in real time, you know exactly who Speaker 1 and Speaker 2 are. This not only makes the transcript much easier to read, but also improves the accuracy of summaries. Additionally, Slipbox stores each speaker's voice fingerprint privately on-device, so it can recognize the same voice across future recordings.

Once you've renamed Speaker 1 to 'Sarah', Slipbox will recognize Sarah's voice in future meetings. Even if Slipbox misidentifies someone, you can easily correct it with the reassign button to keep your transcript accurate.
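
Conceptually, recognizing a returning speaker is an embedding lookup. The sketch below shows the general technique rather than Slipbox's actual implementation: each fingerprint is a float vector, and a new voice is matched to the stored speaker with the highest cosine similarity above a threshold.

```swift
import Foundation

// A voice fingerprint (embedding) stored privately on-device.
struct SpeakerFingerprint {
    let name: String        // e.g. "Sarah", after the user renames the label
    let embedding: [Float]  // vector produced by the diarization model
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
    let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
    return dot / (normA * normB + 1e-9)  // epsilon guards against zero vectors
}

// Match a new voice against stored fingerprints; below the threshold,
// the voice gets a fresh "Speaker N" label instead of a known name.
func identifySpeaker(_ embedding: [Float],
                     known: [SpeakerFingerprint],
                     threshold: Float = 0.7) -> String? {
    let scored = known.map { ($0.name, cosineSimilarity(embedding, $0.embedding)) }
    guard let best = scored.max(by: { $0.1 < $1.1 }), best.1 >= threshold else { return nil }
    return best.0
}
```
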
For real-time speaker identification to work, we needed to run diarization locally, in keeping with Slipbox's local-first approach to protecting user privacy. At first we tried sherpa-onnx to run diarization on the CPU, but it added too much latency and was too power-hungry for real-time use, which was a big problem for the iOS app. Instead, we used Fluid Inference’s platform to convert existing diarization models to run on the ANE, which made speaker detection fast and efficient enough for Slipbox on macOS and, most importantly, on iOS, making Slipbox one of the first iOS apps to run diarization completely locally.
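
To make the real-time flow concrete, here's a schematic written against a hypothetical `Diarizer` protocol (we won't quote FluidAudio's exact API here): microphone chunks go in, labeled segments with embeddings come out, and those embeddings feed the fingerprint matching sketched above.

```swift
import AVFoundation

// Hypothetical output of an on-device diarizer.
struct DiarizedSegment {
    let speakerLabel: String   // "Speaker 1", "Speaker 2", ...
    let embedding: [Float]     // fingerprint for cross-recording matching
    let start: TimeInterval
    let end: TimeInterval
}

// Hypothetical abstraction over the ANE-backed diarization model.
protocol Diarizer {
    func process(chunk: [Float], sampleRate: Double) throws -> [DiarizedSegment]
}

// Tap the microphone and feed fixed-size chunks to the diarizer.
func startDiarization(engine: AVAudioEngine, diarizer: Diarizer) throws {
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
        guard let channel = buffer.floatChannelData?[0] else { return }
        let samples = Array(UnsafeBufferPointer(start: channel, count: Int(buffer.frameLength)))
        guard let segments = try? diarizer.process(chunk: samples, sampleRate: format.sampleRate) else { return }
        for segment in segments {
            // Resolve the label to a known name via fingerprint matching,
            // or keep the generic "Speaker N" label.
            print("\(segment.speakerLabel): \(segment.start)s–\(segment.end)s")
        }
    }
    try engine.start()
}
```
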
BYOK (Bring Your Own Key)
BYOK, sometimes also known as BYOC (Bring Your Own Cloud), lets you use your preferred LLM provider for summaries and chat with your own API keys. Instead of transcripts going to Slipbox servers and then on to an LLM provider, they go directly to your chosen provider's endpoints. On Macs, BYOK also lets you use open-source models running locally for summarization and chat. With either option, you don't have to worry about privacy: none of your transcripts touch our servers.
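
Mechanically, BYOK just means the app talks to the provider itself. The sketch below illustrates the idea against an OpenAI-compatible chat-completions endpoint; the function, model name, and prompt are placeholders rather than Slipbox's actual networking code.

```swift
import Foundation

// Send a summarization request straight to the user's chosen provider.
// No Slipbox server sits in the middle: the transcript travels only
// between the device and the endpoint the user configured.
func summarize(transcript: String,
               baseURL: URL,     // the provider's OpenAI-compatible endpoint
               apiKey: String,   // the user's own key, stored on-device
               model: String) async throws -> Data {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/chat/completions"))
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": model,
        "messages": [
            ["role": "system", "content": "Summarize the meeting transcript."],
            ["role": "user", "content": transcript],
        ],
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)
    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}
```

Many local model servers expose this same OpenAI-compatible shape, which is presumably why a single request path can serve both cloud providers and local open-source models.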

One limitation of BYOK is compatibility with open-source models. They all behave a little differently, which can lead to different bugs. For example, during testing we found that some Phi models occasionally produced a summary with a blank title, while other models like Qwen behaved as expected. With so many widely used models, guaranteeing 100% compatibility across all of them has been challenging.
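
In practice, quirks like the blank title get handled with small defensive checks. A hypothetical example of the kind of guard this takes:

```swift
import Foundation

// Some models occasionally return an empty title; fall back to a
// timestamp-based default so the summary is never left unnamed.
func resolvedTitle(from modelOutput: String, recordedAt: Date) -> String {
    let trimmed = modelOutput.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmed.isEmpty else {
        return "Meeting on \(recordedAt.formatted(date: .abbreviated, time: .shortened))"
    }
    return trimmed
}
```
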
Even if open-source models improve enough to handle summaries and chat entirely on-device, BYOK will remain valuable for the foreseeable future. Users who still prefer to use more powerful LLMs can do so with their API keys, while others can select from different open-source models based on their preferences. BYOK ensures you're never locked into a single approach.
Performance
iOS transcription is faster and more accurate than macOS transcription because it uses Parakeet, which outperforms Whisper on OpenASR benchmarks with 9-24x faster transcription and slightly better accuracy. However, the iOS app can only transcribe microphone input since iOS doesn't allow system audio recording. For everyday use, the iOS app is better for in-person conversations, while the macOS app is better for virtual meetings or anything that involves system audio.
Next Steps
With Slipbox now available on both iOS and macOS, many users will want their transcripts accessible across both platforms. To support this, we're adding optional iCloud syncing, so users can choose whether their transcripts and summaries are uploaded to iCloud. Support for European languages is also coming to iOS, thanks to Fluid Inference's ongoing work. We're also expanding platform support to Windows, with an early preview now available for Intel AI PCs on the Microsoft Store. Finally, we’re improving BYOK with support for more cloud providers like Anthropic and Gemini, along with better compatibility with open-source models. If there's a specific model or provider you'd like to see supported, please reach out at [email protected].