March 19, 2026

How to Use Voice Notes With Your AI Agent

Send a voice note. Get a text response. It's the fastest way to interact with your agent — and most people don't know it works.

Team Tulip

Quick Answer

Send a voice note to your agent via WhatsApp or Telegram. The agent transcribes it, processes the request, and responds in text. This works out of the box with OpenClaw. Voice is faster than typing for task delegation, brain dumps, and giving detailed instructions. Accuracy depends on how clearly you speak and how quiet your surroundings are, but clear English speech is typically transcribed at 95% accuracy or better. Always-on hosting means your agent processes voice notes even when your laptop is off.

Why Voice is Faster Than Typing

Most people think of AI agents as text interfaces: you type a message, the agent responds. This is inefficient for several reasons:

Typing is slow: The average person types 40-60 words per minute but speaks 120-150 words per minute.
Context switching: Stopping to type interrupts your workflow.
Details: Complex instructions are easier to talk through than to type out.
Hands-free: Voice works while you're walking, driving, or have your hands full.

Voice notes eliminate these friction points. You talk like you normally do. The agent transcribes and acts. No typing, no interruption, 2-3x faster input.
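
As a rough check on that figure, the ratio can be worked out from the typing and speaking rates quoted above. This is only a back-of-the-envelope sketch, and the function name is illustrative.

```python
def speedup_range(type_wpm, speak_wpm):
    """Return (lowest, highest) ratio of speaking speed to typing speed."""
    low = min(speak_wpm) / max(type_wpm)   # slow speaker vs. fast typist
    high = max(speak_wpm) / min(type_wpm)  # fast speaker vs. slow typist
    return low, high


low, high = speedup_range(type_wpm=(40, 60), speak_wpm=(120, 150))
print(f"Voice input is roughly {low:.1f}x to {high:.2f}x faster")
```

The low end pairs a slow speaker with a fast typist, so most people should land somewhere between the two numbers.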

How Voice Notes Work With Your Agent

The Flow

You send a voice note via WhatsApp or Telegram.
Agent receives the voice file and audio metadata.
Agent transcribes the audio to text (using OpenAI Whisper or similar).
Agent processes the transcribed text as a normal request.
Agent responds in text (or audio if configured).

Transcription Models

OpenAI Whisper is the current standard. It's open-source and highly accurate:

Accuracy: 95-99% for clear English speech
Accents: Handles most accents reasonably well
Background noise: Tolerates light background noise (office, street) but struggles with heavy noise (concert, traffic)
Cost: ~$0.006 per minute of audio through the OpenAI API (check current pricing)
Speed: Near-instant (typically a second or two for a short voice note)

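You can also run models locally for privacy (Vosk, or the open-source Whisper models on your own hardware). Expect slower transcription without a GPU, and lower accuracy if you use Vosk or one of the smaller Whisper models.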
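
If you are calling the hosted Whisper API yourself, the transcription step might look something like the sketch below. It assumes the official `openai` Python package; the helper name `transcribe_voice_note` and the file name are hypothetical, and the client is passed in so the function can be tried with a stand-in object.

```python
def transcribe_voice_note(client, audio_path, model="whisper-1"):
    """Send an audio file to a speech-to-text endpoint and return the text.

    `client` is expected to behave like an `openai.OpenAI()` instance.
    """
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text.strip()


# Usage (needs `pip install openai` and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   print(transcribe_voice_note(OpenAI(), "voice_note.ogg"))
```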

Setup: Getting Voice Notes to Work

With OpenClaw (Easiest)

Voice notes work out of the box on OpenClaw with WhatsApp and Telegram:

Connect your agent to WhatsApp or Telegram.
Send a voice note to your agent.
The agent automatically transcribes and responds.

No additional setup required. OpenClaw handles transcription and integration automatically.

Configuration

In your agent's SOUL.md, you can customize voice behavior:

## Voice Notes
- Accept voice notes via WhatsApp and Telegram
- Transcribe using OpenAI Whisper
- If transcription confidence is below 80%, ask for clarification
- Respond in text (voice responses not enabled)
- Maximum voice note duration: 5 minutes
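
The rules above are plain-language instructions for the agent. If you were enforcing the same limits in your own code, the logic might look like the sketch below. It assumes your transcription step can supply a confidence score between 0 and 1 (how that score is derived is outside this sketch). `Transcript` and `next_action` are hypothetical names, not part of OpenClaw.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.80       # "below 80%, ask for clarification"
MAX_DURATION_SECONDS = 300  # "maximum voice note duration: 5 minutes"


@dataclass
class Transcript:
    text: str
    confidence: float        # hypothetical 0..1 score from the transcription step
    duration_seconds: float


def next_action(t: Transcript) -> str:
    """Return "reject", "clarify", or "process" for a transcribed note."""
    if t.duration_seconds > MAX_DURATION_SECONDS:
        return "reject"
    if t.confidence < MIN_CONFIDENCE:
        return "clarify"
    return "process"


print(next_action(Transcript("book a call with Sarah", 0.93, 12)))  # process
print(next_action(Transcript("book a cold with Sara", 0.61, 12)))   # clarify
```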

Custom Setup

If you're building a custom agent, you need:

A messaging platform integration (WhatsApp, Telegram, etc.).
Audio file handling (receive, store, retrieve voice files).
A transcription service (Whisper API, Google Cloud Speech, etc.).
Text processing (your normal agent logic).

Pseudocode:

def handle_voice_note(audio_file):
    # Transcribe
    text = transcribe_audio(audio_file)
    
    # Process like normal text
    response = agent.process(text)
    
    # Send back
    send_response(response)

That's the entire flow.
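
For readers who want something they can run, here is one way the same flow could be written so it works without any external service. The three callables are stand-ins you would replace with your real transcription, agent, and messaging code. The empty-transcript check is an addition not present in the pseudocode.

```python
from typing import Callable


def handle_voice_note(
    audio_path: str,
    transcribe: Callable[[str], str],
    process: Callable[[str], str],
    send: Callable[[str], None],
) -> str:
    """Transcribe a voice note, run the agent on the text, and send the reply."""
    text = transcribe(audio_path).strip()
    if not text:
        # Empty transcript (silence, noise): ask the user instead of guessing.
        reply = "I couldn't make out that voice note. Could you resend it?"
    else:
        reply = process(text)
    send(reply)
    return reply


# Stand-in callables so the sketch runs on its own.
sent = []
handle_voice_note(
    "note.ogg",
    transcribe=lambda path: "add a meeting with Sarah Tuesday at 2pm",
    process=lambda text: f"Done: {text}",
    send=sent.append,
)
print(sent)  # ['Done: add a meeting with Sarah Tuesday at 2pm']
```

Passing the steps in as arguments keeps the handler independent of any one messaging platform or transcription provider.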

Best Use Cases for Voice Notes

Quick Task Delegation

You're busy. You have a quick task for your agent. Rather than stop to type:

"Hey, add a meeting with Sarah Tuesday at 2pm and send her a calendar invite."

Spoken in about 5 seconds. Typed on a phone keyboard: closer to 30 seconds. That makes voice roughly 6x faster for this type of request.

Brain Dumps and Journaling

You have half an hour to think. You want to capture ideas:

"Okay, thinking about the new product. Feature ideas: better onboarding, dark mode, API access, batch operations, pricing tiers. Also thinking competitor X just released something similar. Need to research and summarize by Friday."

Voice note: 30 seconds. Your agent extracts and structures the ideas, and can write them up as a document later.

Dictating Emails and Messages

You need to reply to an email but you're not at your desk:

"Hey, reply to John's email. Thanks for the proposal. Looks good, let's schedule a call next week. Ask about timeline and integrations."

Agent drafts the email and sends it (or queues it for review). You just spoke 15 seconds of instructions instead of typing an email.

Detailed Instructions

You're explaining a complex workflow:

"Okay, so the monthly reconciliation works like this. We pull data from Stripe, match it against the accounting system, flag any discrepancies, and generate a report. If there are discrepancies over 100 dollars, pause and ask me. Otherwise, automatically post the reconciling entries. I need this done by the 3rd of every month, so set it up to run on the 2nd at 6am."

This is one long voice note. Typing it out would take a couple of minutes at typical typing speeds, and longer on a phone. Speaking: about 45 seconds.

Accuracy and Troubleshooting

High Accuracy (95%+)

Most of the time, transcription is near-perfect. Whisper is excellent.

What makes transcription accurate:

Clear speech: Normal speaking voice, not mumbling
Quiet environment: Home office, empty room
Familiar language: English with a common accent
Technical terms: Whisper is surprisingly good at technical vocabulary

Low Accuracy (70-85%)

Sometimes transcription misses words or gets context wrong:

Noisy environment: Busy cafe, car traffic, street noise
Thick accents: Very strong regional or non-native accent
Proper nouns: Names, company names, specialized terms might be wrong
Homophones: "to", "too", "two" sound the same; the transcriber guesses from context
Fast speech: Talking too quickly degrades accuracy

Improving Accuracy

If you're getting transcription errors:

Slow down slightly. Normal pace is fine, but rushing hurts.
Enunciate. Clear pronunciation helps.
Find a quiet spot. Voice notes from a busy cafe will be rough.
Keep notes short. 30-60 seconds is optimal. Longer notes accumulate more errors.
Spell out proper nouns: "The client is Smith, S-M-I-T-H, from Acme Corp."
Confirm misheard words: Have the agent ask, "Did you mean 'analytics' or 'analysis'?"

For high-stakes transcription (legal, medical, financial), consider having the agent ask for confirmation: "I transcribed this as '[text]'. Is that correct?"

Voice Note Tips and Tricks

Formatting Hints

When giving instructions, structure them clearly:

"Three things. One: find competitors in the AI space. Two: summarize their pricing models. Three: identify gaps where we could compete. Go."

This works better than rambling: the numbered structure helps the agent parse what you want.

Clarity Over Speed

You don't need to rush. Speak at normal pace with clear pronunciation. Rushing introduces more errors than it saves time.

Confirm Complex Information

For important details:

"Send an invoice to John Smith, email john@example.com, amount 5000 dollars. Repeat that back to me."

Agent confirms: "Sending invoice to john@example.com for $5,000." You confirm it's right.

Context Helps

If you're asking about an ongoing project, remind the agent:

"Hey, we're still working on the Q2 product roadmap. Add these features: dark mode, SSO, batch API operations."

Instead of just: "Add dark mode, SSO, batch API operations."

The context helps the agent know what project you're referring to.

Limitations of Voice Notes

Transcription Accuracy is Not 100%

Most of the time, you'll get 95%+ accuracy. Sometimes: 85%. Rarely: 70%. Edge cases exist. Always proofread if the task is important.
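
Percentages like these are easier to reason about with a concrete measure. The article doesn't say how accuracy is calculated, so the sketch below makes an assumption: accuracy is treated as 1 minus the word error rate (WER), a common metric for speech recognition. The sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


spoken = "send the invoice to john smith for five thousand dollars by friday"
heard = "send the invoice to jon smith for five thousand dollars by friday"
print(f"accuracy = {1 - word_error_rate(spoken, heard):.0%}")
```

Only one word out of twelve is wrong here, but it happens to be the recipient's name, which is why proofreading matters even when the headline accuracy is high.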

Accents and Unfamiliar Speech

Whisper is trained on a lot of English, but speakers with very strong accents, including some non-native speakers, might see around 85% accuracy instead of 95%.

Technical Terms and Jargon

Whisper is good with tech vocabulary, but rare or proprietary terms might be misheard. "Kubernetes" is usually transcribed correctly. "SentientAI" (a made-up company) might become "Sentimental AI".
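
One possible mitigation, if you call the Whisper API yourself, is its optional `prompt` parameter, which can nudge the model toward particular spellings. The sketch below assumes the official `openai` Python package. The helper name and the "Glossary:" wording are illustrative choices, and how much the hint helps will vary.

```python
def transcribe_with_glossary(client, audio_path, glossary, model="whisper-1"):
    """Transcribe audio, hinting at spellings the model may not know."""
    hint = "Glossary: " + ", ".join(glossary) + "."
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model, file=audio_file, prompt=hint
        )
    return result.text


# Usage (needs `pip install openai` and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   text = transcribe_with_glossary(OpenAI(), "note.ogg", ["SentientAI", "Kubernetes"])
```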

No Streaming Response

You speak. The agent transcribes, processes, and responds. Total time: 2-5 seconds. It's fast, but not instant like talking to a person. For rapid back-and-forth conversation, those round trips add up, and voice notes can feel slower than texting.

Hosting on Tulip: Always-On Voice Processing

Running your agent on Tulip (or similar always-on hosting) means:

Your agent processes voice notes even when your laptop is off.
Voice notes are processed instantly (no delay waiting for your computer to wake up).
Transcription happens in the cloud, so CPU-heavy processing doesn't slow your machine.
Logs and records are stored in the cloud, so you can review what your agent did.

Without always-on hosting, an agent running on your laptop is asleep whenever the laptop is: a voice note arrives, but nothing happens until you open the lid. You lose the real-time responsiveness that makes voice notes useful.

Tulip's pricing model (per-agent, not per-voice-note) means processing 100 voice notes per month costs the same as processing 10. It scales.

Combining Voice with Other Inputs

Your agent can accept voice notes, text messages, emails, and Slack messages simultaneously. Different input types for different contexts:

Voice notes: Quick tasks while away from desk.
Text messages: When you need to be quiet (meetings, shared spaces).
Email: Longer, formal requests.
Slack: Work tasks within your workspace.

The agent treats them all equally. One agent, multiple input channels, same logic.
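
In a custom build, "same logic" usually means reducing every input to plain text before the agent sees it. A minimal sketch of that idea follows. The `Inbound` type and `to_text` function are hypothetical, and the transcriber is a stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Inbound:
    channel: str                      # e.g. "whatsapp", "telegram", "email", "slack"
    text: Optional[str] = None        # set for typed messages
    audio_path: Optional[str] = None  # set for voice notes


def to_text(msg: Inbound, transcribe: Callable[[str], str]) -> str:
    """Reduce any inbound message to plain text for the agent."""
    if msg.text is not None:
        return msg.text
    if msg.audio_path is not None:
        return transcribe(msg.audio_path)
    raise ValueError(f"empty message from {msg.channel}")


def fake_transcribe(path: str) -> str:
    # Stand-in for a real transcription call.
    return "add dark mode to the roadmap"


inbox = [
    Inbound("slack", text="add dark mode to the roadmap"),
    Inbound("whatsapp", audio_path="note.ogg"),
]
print({to_text(m, fake_transcribe) for m in inbox})  # one distinct request
```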

FAQ

Does Whisper require internet?

The Whisper API requires internet. If you want to transcribe offline, use Vosk or run Whisper locally (typically slower, and less accurate with Vosk or the smaller Whisper models). For always-on agents on Tulip, internet is already required, so the Whisper API adds no extra connectivity requirement.
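
For reference, running Whisper locally can be as short as the sketch below. It assumes the open-source `openai-whisper` package and ffmpeg are installed. The helper name is hypothetical, and the optional `model` argument exists only so an already-loaded model can be reused. Note that the first `load_model` call downloads the model weights, so it needs a connection once.

```python
def transcribe_locally(audio_path, model_name="base", model=None):
    """Transcribe on this machine with the open-source `whisper` package."""
    if model is None:
        import whisper  # pip install openai-whisper (also needs ffmpeg on PATH)
        model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"].strip()


# Usage: print(transcribe_locally("voice_note.ogg"))
```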

Is my voice data stored or used for training?

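By default, OpenAI doesn't train its models on data sent through the API, including Whisper transcription requests. Its stated policy is to retain API data for up to 30 days for abuse monitoring and then delete it. Your voice note itself isn't used to improve Whisper. If you run Whisper locally, the audio never leaves your own machine for transcription.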

Can the agent respond with voice?

Yes. Text-to-speech (TTS) lets the agent send back audio responses instead of text. Combine voice-in + voice-out for a truly hands-free interaction. TTS adds about $0.01 per minute of response.

What file formats work?

Whisper accepts: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, OGG. WhatsApp and Telegram send audio in their native formats; OpenClaw converts automatically.
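
In a custom pipeline you would need to make that check yourself. A small sketch, using only the extensions listed above (the function name is illustrative, and a file extension is only a hint about the actual codec):

```python
from pathlib import Path

# Extensions taken from the list above; check the API docs for the current set.
SUPPORTED = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm", ".ogg"}


def needs_conversion(filename: str) -> bool:
    """True if the file extension is not in the supported list."""
    return Path(filename).suffix.lower() not in SUPPORTED


print(needs_conversion("note.ogg"))   # False
print(needs_conversion("note.opus"))  # True
print(needs_conversion("memo.M4A"))   # False
```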

How long can a voice note be?

The Whisper API accepts files up to 25 MB, which is roughly an hour of audio at typical compressed bitrates. In practice, keep voice notes short (under 5 minutes). Longer notes degrade accuracy and are harder to process.
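
How much audio fits in 25 MB depends on the bitrate, which the sketch below makes explicit. It assumes a constant bitrate and decimal megabytes; the bitrates are example values, not what any particular messaging app uses.

```python
def max_minutes(size_limit_mb: float, bitrate_kbps: float) -> float:
    """Minutes of audio that fit in size_limit_mb at a constant bitrate."""
    bits = size_limit_mb * 1_000_000 * 8
    return bits / (bitrate_kbps * 1_000) / 60


for kbps in (32, 64, 128):
    print(f"{kbps} kbps: about {max_minutes(25, kbps):.0f} minutes")
```

At 64 kbps this works out to a little under an hour. Lower-bitrate voice recordings fit considerably more.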

What if the agent misunderstands a voice note?

Build in confirmation. Have the agent transcribe the note, repeat it back, and ask "Is this correct?". If the user corrects it, the agent works from the corrected text. Explicitly prompt the agent to ask for confirmation on important tasks.

Can I use voice notes in other languages?

Whisper supports roughly 99 languages. Accuracy is highest for English, high for Spanish and French, and decent for most others. Whisper can often handle a note that mixes languages, though results are less predictable than with a single language.
