The Surface Area of Disappointment: Hard Voice AI production lessons from a $3.2M a16z-backed founder
Hi, I’m Piotr. I recently sat down with Ophir Samson, founder and CEO of Ezra AI Labs, a Voice AI interviewing platform backed by a $3.2M seed round co-led by Penny Jar Capital and LMNT Ventures, with additional investment from a16z Speedrun and Telegraph Hill Capital. Ezra is built on three years of R&D into Voice AI for recruiting and is live with large enterprise customers who have run detailed pilots. Ophir has built systems very similar to the ones I have shipped, and the conversation mapped almost one-to-one onto the failure modes I see with clients. Here are the lessons worth writing down.
Every Voice AI team rebuilds their stack three times
Ophir described a journey every engineer I have worked with recognizes:
- Stage 1: “This is three APIs in a trench coat.” Speech to text, LLM, text to speech. Chain them, ship it. Then discover that latency is absurd and you need orchestration, turn detection, voice activity detection, interruption handling, and about twelve other things.
- Stage 2: “The framework will solve it.” Try LiveKit, Pipecat, or something similar. Plug in Deepgram, OpenAI, ElevenLabs. Realize that the framework handles plumbing, not product. The real engineering work starts here; it does not end here.
- Stage 3: “Speech-to-speech will save us.” Drop in a magical model that takes audio in and returns audio out. Discover that it is not reliable, you cannot control it, function calling does not work, and your guardrails evaporate.
Most teams I see are somewhere between Stage 1 and Stage 2 and think they are close. They are not. The stack that wins in production is, in Ophir’s words, “extremely complicated.” You cannot skip the complexity; you have to earn your way through it.
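To make the Stage 1 latency problem concrete, here is a minimal sketch comparing a fully chained pipeline with one that streams between stages. The stage names and every millisecond figure are made-up assumptions for illustration, not measurements from Ezra or from any provider, and the model is deliberately simplified: in the chained case each stage waits for the previous one to finish completely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first chunk
    full_output_ms: int   # time until the stage has finished entirely


def chained_latency_ms(stages: List[Stage]) -> int:
    """Every stage waits for the complete output of the previous stage."""
    return sum(s.full_output_ms for s in stages)


def streamed_latency_ms(stages: List[Stage]) -> int:
    """Every stage starts as soon as the previous stage emits its first chunk."""
    return sum(s.first_output_ms for s in stages)


# Hypothetical numbers, chosen only to show the shape of the problem.
PIPELINE = [
    Stage("speech_to_text", first_output_ms=150, full_output_ms=600),
    Stage("llm", first_output_ms=300, full_output_ms=1500),
    Stage("text_to_speech", first_output_ms=200, full_output_ms=900),
]

if __name__ == "__main__":
    print("chained: ", chained_latency_ms(PIPELINE), "ms")
    print("streamed:", streamed_latency_ms(PIPELINE), "ms")
```

Even in this toy model, the gap between the two numbers is the reason orchestration shows up on the to-do list so quickly. It says nothing yet about turn detection or interruptions, which is where the rest of the work hides.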
The surface area of disappointment
This is the line from the conversation I will be stealing for years.
“The better the Voice AI sounds, the larger the surface area of disappointment. People assume it can do so much more than what it can actually do.”
Humans have spent hundreds of thousands of years learning what a good audio conversation sounds like. We are extremely sensitive to weirdness in voice. Websites have existed for a few decades. The quality bar on a Voice AI interaction is much higher than on a website, and the gap between “sounds human” and “behaves human” is where users break.
I have seen this first hand. One of my own deployments sounded better when we used ElevenLabs, and that was exactly the problem. Every glitch became jarring because the baseline was so convincing. Picking a slightly less polished voice raised tolerance and reduced frustration.
If you are building, plan for this asymmetry. The closer you get to human-level output, the more your failures stand out.
Start from the user, not the stack
The right sequence is:
- Talk to real end users. Ophir did deep research with candidates before touching the product. Candidates told him exactly what they hated: interruptions mid-thought, being rushed when they were still thinking, sounding like an interrogation, robotic voices, uncanny AI avatars.
- Translate that into concrete product constraints.
- Then pick the stack that can deliver those constraints.
Most teams do it backwards. They pick a framework, wire in three providers, demo it, and only then discover that their users hate it for reasons they never tested.
Once you know what good looks like, you need evals. I have written about this before and Ophir put it bluntly: someone will find the bugs in your product, and you want that someone to be an eval agent, not your customer.
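As a rough illustration of what a first eval can look like, the sketch below checks a logged conversation for two of the complaints candidates raised: being interrupted mid-thought and being rushed. The `Segment` structure, the function names, and the 0.5 second threshold are my own assumptions, not Ezra's eval suite. A real eval agent would generate the conversations as well as score them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    speaker: str    # "candidate" or "agent"
    start_s: float  # when the speaker started talking
    end_s: float    # when the speaker stopped talking


def find_interruptions(segments: List[Segment]) -> List[Segment]:
    """Agent segments that begin while a candidate segment is still active."""
    candidate = [s for s in segments if s.speaker == "candidate"]
    agent = [s for s in segments if s.speaker == "agent"]
    return [
        a for a in agent
        if any(c.start_s < a.start_s < c.end_s for c in candidate)
    ]


def find_rushed_replies(segments: List[Segment], min_gap_s: float = 0.5) -> List[Segment]:
    """Agent segments that start sooner than min_gap_s after the candidate stops."""
    candidate_ends = [s.end_s for s in segments if s.speaker == "candidate"]
    rushed = []
    for a in (s for s in segments if s.speaker == "agent"):
        earlier = [end for end in candidate_ends if end <= a.start_s]
        if earlier and a.start_s - max(earlier) < min_gap_s:
            rushed.append(a)
    return rushed


def run_eval(segments: List[Segment]) -> Dict[str, int]:
    """Summarize both checks as simple counts that a CI job could assert on."""
    return {
        "interruptions": len(find_interruptions(segments)),
        "rushed_replies": len(find_rushed_replies(segments)),
    }


# Hypothetical conversation log used only to exercise the checks.
SAMPLE = [
    Segment("agent", 0.0, 3.0),
    Segment("candidate", 3.8, 9.0),
    Segment("agent", 7.5, 8.5),    # talks over the candidate
    Segment("candidate", 10.0, 14.0),
    Segment("agent", 14.2, 17.0),  # replies 0.2 s after the candidate stops
]

if __name__ == "__main__":
    print(run_eval(SAMPLE))
```

The point is not these two particular checks. Each complaint from user research becomes a check that runs on every build, so regressions surface in the eval report instead of in a customer call.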
There is no magic VAD, no magic framework
Everyone wants a silver bullet for turn detection. “Which VAD model? Deepgram? Speechmatics? Something else?” The honest answer is that none of those choices will get you where you need to be on their own. The product has to know when a candidate is pausing to think, when they are done, when they are hedging, and when they are done for real. No off-the-shelf VAD does this out of the box.
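Here is a toy sketch of why silence alone is not enough. It layers one product-level rule on top of whatever VAD reports: if the candidate's last word suggests they are mid-thought, wait longer before taking the turn. The cue list, the thresholds, and the function name are all assumptions for illustration. This is not how Ezra or any vendor does it, and a production system would need far more signal than the last word.

```python
import re

# Hypothetical trailing words that hint the speaker is not finished yet.
TRAILING_CUES = {"um", "uh", "and", "but", "so", "because", "like"}


def end_of_turn(
    transcript: str,
    silence_ms: int,
    base_ms: int = 700,       # assumed wait after a finished-sounding sentence
    thinking_ms: int = 2500,  # assumed wait when the speaker seems mid-thought
) -> bool:
    """Decide whether the agent may speak, given the transcript so far
    and how long the VAD has reported silence."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        # Nothing said yet: treat it as thinking time.
        return silence_ms >= thinking_ms
    threshold = thinking_ms if words[-1] in TRAILING_CUES else base_ms
    return silence_ms >= threshold


if __name__ == "__main__":
    print(end_of_turn("I led the migration and", silence_ms=1000))
    print(end_of_turn("I led the migration.", silence_ms=1000))
```

The same second of silence gives a different answer depending on what was said. That distinction between pausing and finishing has to be engineered at the product level, on top of whichever VAD you pick.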
Ophir put it perfectly: “90% of the work is the perspiration, not the inspiration.” This matches my own experience. I tried Google STT on a small project recently and it worked on the first try, because Google has wrapped years of infra around a decent model. A raw state-of-the-art model with no supporting system around it loses to a mediocre model wrapped in good engineering almost every time.
The long tail is the business
The first 60 to 90% of any Voice AI use case is easy. The remaining long tail is where the business actually lives:
- Compliance and legal risk. Healthcare, recruiting, financial services. The happy path is not what gets you sued.
- Nefarious callers. Candidates who try to jailbreak the system, cheat, or otherwise abuse it. Enterprise buyers care deeply about this.
- Accents, code-switching, background noise. The audio conditions your demo never tests.
- Structure and guardrails. Recruiters want exactly seven questions asked, in a specific order, with follow-ups that stay on rails. That is not a thing you get from an LLM making decisions in the moment. It is a thing you engineer deterministically around the LLM.
Demos can pretend the long tail does not exist. Production cannot.
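A minimal sketch of what “engineer deterministically around the LLM” can mean for the structure requirement: plain code owns the question order and the follow-up budget, and the model is only allowed to propose a follow-up. `InterviewController`, `wants_follow_up`, and the stub model are hypothetical names of mine, not part of any framework or of Ezra's product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InterviewController:
    questions: List[str]
    # Hypothetical hook: (question, answer) -> follow-up text, or None for no follow-up.
    wants_follow_up: Callable[[str, str], Optional[str]]
    max_follow_ups: int = 1
    _index: int = field(default=0, init=False)
    _follow_ups_used: int = field(default=0, init=False)
    _started: bool = field(default=False, init=False)

    def next_prompt(self, last_answer: Optional[str] = None) -> Optional[str]:
        """Return the next thing to say, or None when the interview is over."""
        if self._index >= len(self.questions):
            return None
        if not self._started:
            self._started = True
            return self.questions[self._index]
        # The model may suggest a follow-up, but only within the fixed budget.
        if last_answer is not None and self._follow_ups_used < self.max_follow_ups:
            follow_up = self.wants_follow_up(self.questions[self._index], last_answer)
            if follow_up:
                self._follow_ups_used += 1
                return follow_up
        # Budget spent or no follow-up wanted: move to the next planned question.
        self._index += 1
        self._follow_ups_used = 0
        if self._index >= len(self.questions):
            return None
        return self.questions[self._index]


def eager_model(question: str, answer: str) -> Optional[str]:
    """Stub standing in for an LLM that always wants to dig deeper."""
    return "Can you give a concrete example?"


def run_interview(questions: List[str]) -> List[str]:
    controller = InterviewController(questions, eager_model, max_follow_ups=1)
    prompts = []
    prompt = controller.next_prompt()
    while prompt is not None:
        prompts.append(prompt)
        prompt = controller.next_prompt(last_answer="a scripted answer")
    return prompts


if __name__ == "__main__":
    for line in run_interview(["Question 1", "Question 2", "Question 3"]):
        print(line)
```

Even with a stub model that asks for a follow-up every single time, every planned question is asked once, in order, and the interview ends. The LLM contributes wording and judgment inside each step, while ordinary code guarantees the structure the recruiter asked for.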
No shortcuts to production
“That is not stuff you can vibe code.”
Latency obsession, follow-up structure, interruption handling, background noise, guardrails, observability. These are engineering problems that require understanding the use case, the failure modes, and the customers. Cursor will not do it for you. Claude Code will not do it for you. Maybe in two years. Not today.
Ophir spends hours every day watching real interviews and looking at data. He has a full observability layer on top of the live conversation plus a separate monitoring system for post-interview evaluation. This is the boring part nobody shows at demo day, and it is the part that wins enterprise deals.
The enterprise lesson: expectation management
Asked for his single biggest piece of advice for anyone selling Voice AI into enterprises, Ophir said:
“Do not underestimate the importance of expectation management. Before you try to build anything, be super clear and on the same page with the customer about what Voice AI can and cannot do.”
Back to the surface area of disappointment. If your customer walks in expecting a five-year-old’s reasoning skills and you ship something that can only reliably do the specific task you scoped, you have a CSAT problem that no amount of latency tuning will fix. Manage expectations first, build second.
The takeaway
- Expect to rebuild your Voice AI stack about three times. Do not try to skip the complexity; earn your way through it.
- Plan for the surface area of disappointment. The better your voice sounds, the more visible your failures become.
- Start from the user. Do research, define what “good” looks like, then pick the stack.
- Ship evals. Find the bugs before your customers do.
- Engineer the long tail. Compliance, nefarious use, guardrails, and accents are where production lives.
- Before you sell into an enterprise, spend as much time on expectation management as you do on the tech.
Thanks to Ophir for the conversation. He is one of the sharper operators in this space and worth following if you are building anything in Voice AI.