Izwi Update: Local Speaker Diarization, Forced Alignment, and Better Model Support

February 17, 2026

Here's something that caught my attention: Izwi, the local audio inference engine, just shipped a major update, and it's a big one for anyone working with speech tech. According to /u/zinyando on Reddit, it now does speaker diarization, automatically telling apart multiple voices with Sortformer models, which makes meeting transcripts far easier to produce. Forced alignment with Qwen3-ForcedAligner adds word-level timestamps, handy for subtitles and video captions. There's also real-time streaming for transcription, chat, and text-to-speech, with responses delivered incrementally, plus native decoding of WAV, MP3, FLAC, and OGG via Symphonia. As /u/zinyando points out, performance gets a boost too, with parallel execution, batch ASR, a paged KV cache, and Metal optimizations. On the model side there's broad coverage for TTS, ASR, and chat, including Qwen3 and Gemma 3. If you're into speech tech, check out the docs and give it a spin; feedback and stars on GitHub are encouraged. Truly impressive stuff.

Quick update on Izwi (local audio inference engine) - we've shipped some major features:

What's New:

Speaker Diarization - Automatically identify and separate multiple speakers using Sortformer models. Perfect for meeting transcripts.
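
As a concrete picture of what diarization output is good for, here is a minimal sketch that renders per-speaker segments into a readable meeting transcript. The segment shape (speaker, start, end, text) is an assumption for illustration, not Izwi's documented output schema; check the docs for the real format.

```python
# Sketch: format diarization segments into a meeting transcript.
# The segment fields below are assumed, not Izwi's actual schema.

def format_transcript(segments):
    """Render one line per segment: [start - end] speaker: text."""
    lines = []
    for seg in segments:
        stamp = f"[{seg['start']:7.2f}s - {seg['end']:7.2f}s]"
        lines.append(f"{stamp} {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

# Hypothetical segments in the assumed shape.
segments = [
    {"speaker": "speaker_0", "start": 0.00, "end": 4.20, "text": "Let's get started."},
    {"speaker": "speaker_1", "start": 4.35, "end": 9.80, "text": "Quick status update first?"},
]
print(format_transcript(segments))
```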

Forced Alignment - Word-level timestamps between audio and text using Qwen3-ForcedAligner. Great for subtitles.
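
To make "word-level timestamps" concrete, here is a small sketch that turns aligned words into an SRT subtitle file. The word-entry shape ({"word", "start", "end"} in seconds) and the cue-grouping rule are assumptions for illustration, not the actual Qwen3-ForcedAligner output format.

```python
# Sketch: convert word-level alignment into SRT subtitle cues.
# The input word format is assumed, not Izwi's documented output.

def to_srt_time(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_words_per_cue=7):
    """Group aligned words into numbered subtitle cues of a few words each."""
    cues = []
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i:i + max_words_per_cue]
        text = " ".join(w["word"] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{to_srt_time(chunk[0]['start'])} --> {to_srt_time(chunk[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)

# Hypothetical aligned words.
words = [
    {"word": "Quick", "start": 0.00, "end": 0.30},
    {"word": "update", "start": 0.32, "end": 0.70},
    {"word": "on", "start": 0.72, "end": 0.85},
    {"word": "Izwi.", "start": 0.87, "end": 1.40},
]
print(words_to_srt(words))
```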

Real-Time Streaming - Stream responses for transcribe, chat, and TTS with incremental delivery.
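
For a sense of what incremental delivery looks like on the client side, here is a sketch that reads a streaming transcription response chunk by chunk and prints partial text as it arrives. The endpoint path, request shape, and newline-delimited JSON framing are assumptions for illustration; Izwi's actual streaming API may differ, so see the docs for the real interface.

```python
# Sketch: consume a streaming transcription endpoint incrementally.
# URL, request body, and line-delimited JSON framing are assumed here.
import json
import requests

def stream_transcription(audio_path, url="http://localhost:8080/v1/transcribe"):
    """POST an audio file and print partial transcript text as it streams in."""
    with open(audio_path, "rb") as f:
        resp = requests.post(url, data=f, stream=True, timeout=300)
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # assumed: one JSON object per streamed line
            print(chunk.get("text", ""), end="", flush=True)
    print()

# Example (hypothetical): stream_transcription("meeting.wav")
```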

Multi-Format Audio - Native support for WAV, MP3, FLAC, OGG via Symphonia.

Performance - Parallel execution, batch ASR, paged KV cache, Metal optimizations.

Model Support:

  • TTS: Qwen3-TTS (0.6B, 1.7B), LFM2.5-Audio
  • ASR: Qwen3-ASR (0.6B, 1.7B), Parakeet TDT, LFM2.5-Audio
  • Chat: Qwen3 (0.6B, 1.7B), Gemma 3 (1B)
  • Diarization: Sortformer 4-speaker

Docs: https://izwiai.com/
GitHub Repo: https://github.com/agentem-ai/izwi

Give us a star on GitHub and try it out. Feedback is welcome!!!

submitted by /u/zinyando