Vibe coding update: voice assistant that types anywhere with screenshot context
Following up on my previous article about Vibe Coding, I’m excited to announce a major enhancement to Vibevoice that takes voice interaction to the next level: AI Command Mode with screenshot context.

Vibevoice in action: now with the option to use an LLM and send your screen as context
Beyond Simple Dictation: AI Commands with Visual Context
While dictation mode lets you quickly write text by speaking it, the new AI Command Mode addresses a different challenge: getting intelligent assistance based on what I’m looking at.
Here’s how it works:
- Hold down the Scroll Lock key (chosen because it’s rarely used in modern applications)
- Speak your command or question naturally
- Release the key
- Vibevoice will:
  - Transcribe your speech
  - Take a screenshot of your current view
  - Send both to a local LLM (running through Ollama)
  - Stream the AI’s response directly to your cursor position
This new mode turns Vibevoice from a dictation tool into a context-aware AI assistant that can see what you’re working on and respond intelligently.
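To make the hold-to-talk flow concrete, here is a minimal sketch of how the command key could be wired up with pynput. The start_recording and stop_and_process functions are hypothetical placeholders standing in for the real recording and transcription pipeline, not the actual Vibevoice internals:
```python
# Minimal sketch of the hold-to-talk flow (illustrative; not the actual Vibevoice code).
# Assumes pynput for global key events; start_recording() and stop_and_process()
# are hypothetical placeholders for the real recording/transcription pipeline.
from pynput import keyboard

CMD_KEY = keyboard.Key.scroll_lock
recording = False

def start_recording():
    print("listening...")   # placeholder: begin capturing microphone audio

def stop_and_process():
    print("processing...")  # placeholder: transcribe, screenshot, query the LLM, type the reply

def on_press(key):
    global recording
    if key == CMD_KEY and not recording:   # guard against key-repeat events while holding
        recording = True
        start_recording()

def on_release(key):
    global recording
    if key == CMD_KEY and recording:
        recording = False
        stop_and_process()

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()
```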
Real-World Applications
This opens up some powerful use cases:
- Code assistance: “Explain what this function does” while looking at code
- Data analysis: “What trends do you see in this chart?”
- Troubleshooting: “Why am I getting this error?” while looking at an error message
- Document editing: “Suggest a better way to phrase this paragraph”
- Email composition: “Draft a response to this email that’s professional but firm”
All of these commands work with any application on your system - your browser, code editor, terminal, or document editor.
Privacy-First Implementation
Unlike cloud-based solutions:
- All processing happens locally on your machine
- Your screenshots never leave your computer
- The LLM runs locally through Ollama
- No data is sent to external servers
So you can even use it offline!
Technical Implementation
The AI Command Mode builds on the same foundation as the dictation mode, with a few key additions:
- Screenshot capability: Using gnome-screenshot to capture your current view
- Local LLM integration: Connecting to Ollama to run models like Gemma3
- Smart context handling: Resizing images and formatting prompts for optimal LLM processing
- Response streaming: Character-by-character typing that mimics natural typing
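To illustrate how these pieces can fit together, here is a rough sketch of the command pipeline: capture and downscale a screenshot, send it along with the transcribed prompt to Ollama’s /api/generate endpoint, and type out the streamed reply. The function names and defaults below are mine, not necessarily what the repo uses:
```python
# Rough sketch of the AI command pipeline (illustrative; not the exact Vibevoice code).
# Assumes gnome-screenshot on PATH, Pillow, requests, pynput, and Ollama on localhost:11434.
import base64
import json
import os
import subprocess
import tempfile

import requests
from PIL import Image
from pynput.keyboard import Controller

def capture_screenshot(max_width=1024):
    """Capture the screen, downscale it, and return it as a base64 string."""
    path = os.path.join(tempfile.gettempdir(), "vibevoice_screenshot.png")
    subprocess.run(["gnome-screenshot", "-f", path], check=True)
    img = Image.open(path)
    img.thumbnail((max_width, max_width))   # preserves aspect ratio
    img.save(path)
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def run_ai_command(prompt, model="gemma3:27b"):
    """Send the transcribed prompt plus a screenshot to Ollama and type the streamed reply."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [capture_screenshot()],
        "stream": True,
    }
    kb = Controller()
    with requests.post("http://localhost:11434/api/generate",
                       json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                kb.type(chunk.get("response", ""))   # types the answer at the current cursor
```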
Getting Started with AI Command Mode
To use this new feature, you’ll need:
- Ollama installed (https://ollama.com)
- A multimodal LLM that supports both text and images:
ollama pull gemma3:27b # Recommended model for RTX 3090 or similar
- Start Ollama in the background:
ollama serve
- Install gnome-screenshot for the screenshot functionality:
sudo apt install gnome-screenshot
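Before launching, you can optionally confirm that Ollama is reachable and the model is available; its /api/tags endpoint lists the locally installed models:
```python
# Quick check that Ollama is running and the model has been pulled.
import requests

models = requests.get("http://localhost:11434/api/tags", timeout=5).json().get("models", [])
print([m["name"] for m in models])   # should include e.g. "gemma3:27b"
```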
Then simply run Vibevoice as usual:
python src/vibevoice/cli.py
Customizing Your AI Experience
You can customize the AI Command Mode with these environment variables:
# Change the AI command key
export VOICEKEY_CMD="f12" # Use F12 instead of Scroll Lock
# Use a different Ollama model
export OLLAMA_MODEL="gemma3:4b" # Smaller model for less powerful GPUs
# Disable screenshots (text-only mode)
export INCLUDE_SCREENSHOT="false"
# Adjust screenshot resolution
export SCREENSHOT_MAX_WIDTH="800" # Smaller screenshots
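For reference, here is how such variables might be picked up on the Python side; the defaults shown are illustrative guesses rather than the values Vibevoice actually ships with:
```python
# How these variables might be read on the Python side; the defaults here are my guesses.
import os

cmd_key = os.environ.get("VOICEKEY_CMD", "scroll_lock")
model = os.environ.get("OLLAMA_MODEL", "gemma3:27b")
include_screenshot = os.environ.get("INCLUDE_SCREENSHOT", "true").lower() != "false"
max_width = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))
```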
What’s Up Next
This update is a big step towards more natural human-computer interaction. By combining voice input, visual context, and local AI processing, Vibevoice creates a workflow that feels almost like collaborating with an intelligent assistant who can see what you see and understand what you need.
As LLMs continue to improve in their multimodal capabilities, the potential for this approach will only grow. I’m particularly excited about future possibilities like:
- More specialized visual understanding (code, charts, diagrams)
- Memory of previous interactions within a session
- More sophisticated capabilities beyond text responses, e.g. performing actions on your computer
Try It Today
If you haven’t tried Vibevoice yet, or if you’re using an older version, now’s the perfect time to update.
The combination of voice dictation and context-aware AI assistance brings us one step closer to Karpathy’s vision of vibe coding - where we can focus on what we want to create, not on the mechanics of creating it.