21 comments

  • jwr 18 hours ago
    That is very, very interesting. I've been hoping to have an assistant in the workshop (hands-free!) that I could talk to and have it help me with simple tasks: timers, calculating, digging up notes, etc. — basically, what the phone assistants were supposed to be, but aren't.

    "You will have to unlock your iphone first" is kind of a deal-breaker when you are in the middle of mixing polyurethane resin and have gloves and a mask on.

    More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers, preventing us from using the technological advances and holding us back years behind the state of the art.

    I'll be trying this out on my Macbook, looks very promising!

    • gtowey 14 hours ago
      The computing power we all have in our pockets is staggering. It could be a tool that truly makes our lives easier, but instead it's mostly a device that is frustrating to use. Companies have decided to make it simply another conduit for advertising. It's a tool for them to sell us more stuff. Basic usability be damned.
    • jamilton 10 hours ago
      Siri does have a setting that'll activate it if you say "hey siri" while the phone is locked. Obvious privacy and battery usage concerns though, and it's still Siri, so it's a little clunky.
      • jwr 10 hours ago
        Mhm. I think I use that. But then I say "call my wife" and it says "you'll need to unlock your iPhone first".

        It's clear Tim Cook doesn't ever try to use Siri wearing gloves. Or ever, for that matter :-)

        • mft_ 9 hours ago
          Siri (on iOS 18, at least) will call people for me without unlocking, in response to a voice command only - I just double-checked...
    • mentalgear 15 hours ago
      You might be interested in the open-source https://www.home-assistant.io/voice-pe/ .
      • QuercusMax 10 hours ago
        I've been replacing my Google Homes and Chromecasts with Snapcast streamers, and this is the next thing I've been planning to look into.

        It's truly absurd how the Google voice assistant USED to work properly for setting timers, playing music, etc, and then they had to break it 15 times and finally replace it with much slower AI that only kinda does what you want. I'm done.

        Self-hosted is the way to go if you want to keep your sanity. My wife has basically given up on any Google/Apple voice assistants being able to do anything useful beyond "set a 10 minute timer".

    • huijzer 14 hours ago
      > More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers

      Yes, same with RSS readers being dropped by large companies. Worked too well, I guess!

  • magzter 15 hours ago
    This is so cool. I'm always telling people that the advances in SOTA hosted AI are also happening in the local model space: what the SOTA hosted models could do 6-12 months ago is what we're now seeing run locally on average hardware. This is such an amazing way to actually demo it.
  • dvt 22 hours ago
    Solid work and great showcase, I've done a bunch of stuff with Kokoro and the latency is incredible. So crazy how badly Apple dropped the ball... feels like your demo should be a Siri demo (I mean that in the most complimentary way possible).
  • myultidevhq 12 hours ago
    This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.

    Curious about one thing though, how does it handle switching between languages? I work with both Greek and English daily and local models usually struggle with that.

    Great work, bookmarking this.

    • karimf 11 hours ago
      During my limited testing, it works better than I expected at handling multiple languages in a single session. Perhaps I just had low expectations, since I've mostly worked with English-only STT models.
  • est 17 hours ago
    I am making something similar. Also been using Kokoro for TTS. Very cool project!

    Gemma 4 is kinda too heavyweight even with E2B. I am sticking with Qwen 0.8B at the moment.

  • zerop 18 hours ago
    I have been looking forward to building something like this using open models: a voice assistant I can talk to while driving, since I have a long commute. I do use ChatGPT voice mode and it works great for querying information or having discussions. But I want to do tasks like browsing the web, acting as a social media manager for my business, etc.
  • crsAbtEvrthng 12 hours ago
    If I run this without internet connection it says "loading..." at the bottom of the localhost site and won't work.

    If I run this with internet connected it works flawlessly. Even if I disconnect my internet afterwards it still goes on working fine.

    Why does there have to be an internet connection established at the time I open the localhost site, when all of this should work purely on-device?

    Despite this, I am really impressed that this actually works so fast with video input on my M4 Pro 48 GB.

    • matula 1 hour ago
      The index.html loads remote JS files: https://github.com/fikrikarim/parlor/blob/main/src/index.htm...

      I saved them locally and changed the reference, and it worked perfectly.
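      In case it helps anyone, a minimal sketch of that vendoring step (the CDN URL and filename here are placeholders, not the actual ones from parlor's index.html):

      ```shell
      # Keep local copies of remote scripts so the page loads offline.
      mkdir -p vendor
      # While still online, fetch each script once (placeholder URL):
      # curl -L -o vendor/app.js https://cdn.example.com/app.js

      # Then point the <script> tag at the local copy instead of the CDN:
      printf '<script src="https://cdn.example.com/app.js"></script>\n' > index.html
      sed -i.bak 's|https://cdn.example.com/app.js|vendor/app.js|' index.html
      cat index.html
      ```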

    • karimf 11 hours ago
      Huh that's weird. I just tried it and it works on my machine. Could you perhaps create a GitHub issue and share the reproduction steps and any relevant logs?
      • crsAbtEvrthng 10 hours ago
        Don't have the time right now, but I'll play around with it next weekend for sure and will give you more feedback with logs once I confirm I can reproduce it.

        For now what I did was:

        - Tested in Chrome/Safari/Firefox on Tahoe.

        - Followed the quick start install instructions from github repo

        - Everything worked

        - Closed terminal

        - Disconnected internet (Wifi off)

        - Opened terminal

        - Started server again (uv run server.py)

        - Opened localhost in browser, it asked for camera/mic normally, granted access, saw camera live feed but "loading..." at bottom center of the site and AI did not listen/respond

        - Reproduced this about 3 times with switching between wifi on/off before starting the server, always the same (working with internet; not working without)

        - Figured it also works fine if I start the server with internet connected and disconnect it afterwards

  • noodlebreak 7 hours ago
    I have to try it out on my idle laptops. I've been meaning to run some models on them for low-cost tasks that need AI, like sorting and filtering photos from the hundreds of thousands that I have amassed over the years, and applying general size-reduction compression to the filtered ones.

    Btw if anyone has already created such a pipeline/workflow using such models, please lmk!

  • rubicon33 11 hours ago
    Is there anything unique here happening for the video aspect or is it just taking snapshots over and over?

    I’ve been looking for a good video summarizing / understanding model!

    • karimf 11 hours ago
      Nothing unique, it's just taking a snapshot when it's processing the input. Even processing a single image increases the TTFT by ~0.5s on my machine, so for now it seems impossible to feed it live video and expect a real-time response.

      Regarding the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]

      [0] https://huggingface.co/blog/gemma4#video-understanding
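      To illustrate the pattern (a generic sketch, not parlor's actual code; the class and field names here are made up):

      ```python
      class SnapshotOnTurn:
          """Grab one camera frame per user turn instead of streaming video,
          since each extra image adds ~0.5s to time-to-first-token."""

          def __init__(self, get_frame):
              self.get_frame = get_frame  # callable returning the latest camera frame

          def build_prompt(self, user_text, include_image=True):
              # One text part plus, optionally, a single snapshot taken right now.
              parts = [{"type": "text", "text": user_text}]
              if include_image:
                  parts.append({"type": "image", "image": self.get_frame()})
              return parts

      # Dummy frame source standing in for the webcam:
      latest = {"frame": "frame-001"}
      prompt = SnapshotOnTurn(lambda: latest["frame"]).build_prompt("what do you see?")
      print(len(prompt), prompt[1]["image"])
      ```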

      • rubicon33 11 hours ago
        I totally get that these are very hard problems to solve and that we're on the bleeding edge of what's possible, but I can't help but wonder when someone is going to crack real video understanding.

        Sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions.

        "How many packages were delivered over the last hour?", etc.

  • inzlab 5 hours ago
    Real-time AI sounds like the future.
  • logicallee 15 hours ago
    It might interest people to know you can also easily fine-tune the text portion of this specific model (E2B) to behave however you want! I fine-tuned it to talk like a pirate, but you can get it to do anything you have (or can generate) training data for. (This wouldn't carry over to the text-to-speech portion, though.) So you can easily train it to act a certain way or give certain types of responses.

    Video: https://www.youtube.com/live/WuCxWJhrkIM

    Generated writeup: https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...

  • divan 17 hours ago
    Can someone quickly vibe-code a macOS native app for this so it doesn't require running terminal commands and searching for that browser tab? (: (also for iOS, pls)
    • duartefdias 16 hours ago
      Would you pay $2 for that macOS native desktop app?
  • spwa4 6 hours ago
    I've been trying to do this, but I can't get voice recognition to work fast enough (meaning live) with Gemma E2B, on either an M1 Max (64GB), a 5060 Ti (16GB), or a Snapdragon 8 Gen 2.

    Any pointers?

  • an0n-elem 12 hours ago
    Cool work buddy:)
  • jareklupinski 4 hours ago
    just make it say "Uh...", "umm...", or "hmmm..." once or twice halfway between processing and finish :D
  • k-almuraee 19 hours ago
    Amazing, love your work!