Izwi Update: Local Speaker Diarization, Forced Alignment, and Better Model Support
Here's something that caught my attention: Izwi, the local audio inference engine, just shipped a major update, and it's a genuinely useful one for anyone working with speech tech. According to /u/zinyando on Reddit, the release adds speaker diarization: Sortformer models automatically tell multiple voices apart, which makes multi-speaker material like meeting recordings much easier to transcribe. Forced alignment is new as well, producing word-level timestamps with Qwen3-ForcedAligner, exactly what you need for subtitles and video captions (see the sketches below).

The update also brings real-time streaming for transcription, chat, and text-to-speech, with responses delivered incrementally rather than all at once. Audio handling improves too: WAV, MP3, and OGG are now decoded natively via Symphonia. On the performance side, /u/zinyando points to parallel execution and Metal optimizations, and model support keeps expanding across TTS, ASR, and chat, including Qwen3 and Gemma 3.

If you're into speech tech, give it a spin: check out the docs, try it on your own audio, and send feedback. Stars on GitHub are encouraged.
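To make the Symphonia point concrete, here's a minimal sketch of how a Rust tool might decode a WAV, MP3, or OGG file into PCM buffers with the Symphonia crate before handing samples to an ASR front end. This uses Symphonia's standard probe/decode API, but it's my own illustration, not Izwi's actual code; the file name and the processing step are placeholders.

```rust
use std::fs::File;
use symphonia::core::codecs::DecoderOptions;
use symphonia::core::formats::FormatOptions;
use symphonia::core::io::MediaSourceStream;
use symphonia::core::meta::MetadataOptions;
use symphonia::core::probe::Hint;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the input and wrap it in Symphonia's media source stream.
    let file = File::open("meeting.wav")?; // placeholder path
    let mss = MediaSourceStream::new(Box::new(file), Default::default());

    // Hint the container format from the extension, then probe it.
    let mut hint = Hint::new();
    hint.with_extension("wav");
    let probed = symphonia::default::get_probe().format(
        &hint,
        mss,
        &FormatOptions::default(),
        &MetadataOptions::default(),
    )?;
    let mut format = probed.format;

    // Pick the default audio track and build a decoder for its codec.
    let track = format.default_track().ok_or("no default track")?;
    let track_id = track.id;
    let mut decoder = symphonia::default::get_codecs()
        .make(&track.codec_params, &DecoderOptions::default())?;

    // Pull packets until the stream ends, decoding each into PCM.
    while let Ok(packet) = format.next_packet() {
        if packet.track_id() != track_id {
            continue;
        }
        let _decoded = decoder.decode(&packet)?;
        // Hand the decoded buffer to resampling / feature extraction here.
    }
    Ok(())
}
```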
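And since word-level timestamps are the headline of the forced-alignment feature, here's a small sketch of the kind of post-processing they enable: turning (word, start, end) records into SRT subtitle cues. The `Word` struct and the sample data are hypothetical stand-ins for whatever Izwi actually emits; only the SRT timestamp format is standard.

```rust
/// Hypothetical word-level alignment record; Izwi's real output type may differ.
struct Word {
    text: String,
    start: f64, // seconds
    end: f64,   // seconds
}

/// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm.
fn srt_time(t: f64) -> String {
    let ms = (t * 1000.0).round() as u64;
    format!(
        "{:02}:{:02}:{:02},{:03}",
        ms / 3_600_000,
        ms / 60_000 % 60,
        ms / 1000 % 60,
        ms % 1000
    )
}

fn main() {
    // Placeholder alignment output; in practice this comes from the aligner.
    let words = vec![
        Word { text: "hello".into(), start: 0.12, end: 0.48 },
        Word { text: "world".into(), start: 0.55, end: 0.93 },
    ];

    // Emit one SRT cue per word; real subtitles would group words into lines.
    for (i, w) in words.iter().enumerate() {
        println!(
            "{}\n{} --> {}\n{}\n",
            i + 1,
            srt_time(w.start),
            srt_time(w.end),
            w.text
        );
    }
}
```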