Vibe coding update: voice assistant that types anywhere with screenshot context
Following up on my previous article about Vibe Coding, I’m excited to announce a major enhancement to Vibevoice that takes voice interaction to the next level: AI Command Mode with screenshot context.

Vibevoice in action: now with the option to use an LLM and send your screen as context
Beyond Simple Dictation: AI Commands with Visual Context
While dictation mode lets you quickly write text by speaking it, the new AI Command Mode addresses a different challenge: getting intelligent assistance based on what I’m looking at.
Here’s how it works:
- Hold down the Scroll Lock key (chosen because it’s rarely used in modern applications)
- Speak your command or question naturally
- Release the key
- Vibevoice will:
  - Transcribe your speech
  - Take a screenshot of your current view
  - Send both to a local LLM (running through Ollama)
  - Stream the AI’s response directly to your cursor position
This new mode turns Vibevoice from a dictation tool into a context-aware AI assistant that can see what you’re working on and respond intelligently.
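To make the hold-to-talk flow concrete, here is a minimal sketch of how the command key could be wired up with pynput. The start_recording and stop_and_process functions are hypothetical placeholders standing in for the real recording and transcription pipeline, not the actual Vibevoice internals:
```python
# Minimal sketch of the hold-to-talk flow (illustrative; not the actual Vibevoice code).
# Assumes pynput for global key events; start_recording() and stop_and_process()
# are hypothetical placeholders for the real recording/transcription pipeline.
from pynput import keyboard

CMD_KEY = keyboard.Key.scroll_lock
recording = False

def start_recording():
    print("listening...")   # placeholder: begin capturing microphone audio

def stop_and_process():
    print("processing...")  # placeholder: transcribe, screenshot, query the LLM, type the reply

def on_press(key):
    global recording
    if key == CMD_KEY and not recording:   # guard against key-repeat events while holding
        recording = True
        start_recording()

def on_release(key):
    global recording
    if key == CMD_KEY and recording:
        recording = False
        stop_and_process()

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()
```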
Real-World Applications
This opens up some powerful use cases:
- Code assistance: “Explain what this function does” while looking at code
- Data analysis: “What trends do you see in this chart?”
- Troubleshooting: “Why am I getting this error?” while looking at an error message
- Document editing: “Suggest a better way to phrase this paragraph”
- Email composition: “Draft a response to this email that’s professional but firm”
All of these commands work with any application on your system - your browser, code editor, terminal, or document editor.
Privacy-First Implementation
Unlike cloud-based solutions:
- All processing happens locally on your machine
- Your screenshots never leave your computer
- The LLM runs locally through Ollama
- No data is sent to external servers
So you can even use it offline!
Technical Implementation
The AI Command Mode builds on the same foundation as the dictation mode, with a few key additions:
- Screenshot capability: Using gnome-screenshot to capture your current view
- Local LLM integration: Connecting to Ollama to run models like Gemma3
- Smart context handling: Resizing images and formatting prompts for optimal LLM processing
- Response streaming: Character-by-character typing that mimics natural typing
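To illustrate how these pieces can fit together, here is a rough sketch of the command pipeline: capture and downscale a screenshot, send it along with the transcribed prompt to Ollama’s /api/generate endpoint, and type out the streamed reply. The function names and defaults below are mine, not necessarily what the repo uses:
```python
# Rough sketch of the AI command pipeline (illustrative; not the exact Vibevoice code).
# Assumes gnome-screenshot on PATH, Pillow, requests, pynput, and Ollama on localhost:11434.
import base64
import json
import os
import subprocess
import tempfile

import requests
from PIL import Image
from pynput.keyboard import Controller

def capture_screenshot(max_width=1024):
    """Capture the screen, downscale it, and return it as a base64 string."""
    path = os.path.join(tempfile.gettempdir(), "vibevoice_screenshot.png")
    subprocess.run(["gnome-screenshot", "-f", path], check=True)
    img = Image.open(path)
    img.thumbnail((max_width, max_width))   # preserves aspect ratio
    img.save(path)
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def run_ai_command(prompt, model="gemma3:27b"):
    """Send the transcribed prompt plus a screenshot to Ollama and type the streamed reply."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [capture_screenshot()],
        "stream": True,
    }
    kb = Controller()
    with requests.post("http://localhost:11434/api/generate",
                       json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                kb.type(chunk.get("response", ""))   # types the answer at the current cursor
```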
Getting Started with AI Command Mode
To use this new feature, you’ll need:
- Ollama installed (https://ollama.com)
- A multimodal LLM that supports both text and images:
ollama pull gemma3:27b # Recommended model for RTX 3090 or similar
- Start Ollama in the background:
ollama serve
- Install gnome-screenshot for the screenshot functionality:
sudo apt install gnome-screenshot
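Before launching, you can optionally confirm that Ollama is reachable and the model is available; its /api/tags endpoint lists the locally installed models:
```python
# Quick check that Ollama is running and the model has been pulled.
import requests

models = requests.get("http://localhost:11434/api/tags", timeout=5).json().get("models", [])
print([m["name"] for m in models])   # should include e.g. "gemma3:27b"
```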
Then simply run Vibevoice as usual:
python src/vibevoice/cli.py
Customizing Your AI Experience
You can customize the AI Command Mode with these environment variables:
# Change the AI command key
export VOICEKEY_CMD="f12" # Use F12 instead of Scroll Lock
# Use a different Ollama model
export OLLAMA_MODEL="gemma3:4b" # Smaller model for less powerful GPUs
# Disable screenshots (text-only mode)
export INCLUDE_SCREENSHOT="false"
# Adjust screenshot resolution
export SCREENSHOT_MAX_WIDTH="800" # Smaller screenshots
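For reference, here is how such variables might be picked up on the Python side; the defaults shown are illustrative guesses rather than the values Vibevoice actually ships with:
```python
# How these variables might be read on the Python side; the defaults here are my guesses.
import os

cmd_key = os.environ.get("VOICEKEY_CMD", "scroll_lock")
model = os.environ.get("OLLAMA_MODEL", "gemma3:27b")
include_screenshot = os.environ.get("INCLUDE_SCREENSHOT", "true").lower() != "false"
max_width = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))
```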
What’s Up Next
This update is a big step towards more natural human-computer interaction. By combining voice input, visual context, and local AI processing, Vibevoice creates a workflow that feels almost like collaborating with an intelligent assistant who can see what you see and understand what you need.
As LLMs continue to improve in their multimodal capabilities, the potential for this approach will only grow. I’m particularly excited about future possibilities like:
- More specialized visual understanding (code, charts, diagrams)
- Memory of previous interactions within a session
- More sophisticated capabilities beyond text responses, e.g. performing actions on your computer
Try It Today
If you haven’t tried Vibevoice yet, or if you’re using an older version, now’s the perfect time to update.
The combination of voice dictation and context-aware AI assistance brings us one step closer to Karpathy’s vision of vibe coding - where we can focus on what we want to create, not on the mechanics of creating it.