Calloptima
Piotr
System Optimization

Speech-to-Speech, No-Code -> Great Tools or Just Hype?

Why no-code fails, bypassing LLMs to cut 700ms latency, and the limits of speech-to-speech models.

Article TL;DR

  • No-code builders hit a hard ceiling; enterprise voice agents demand code-first handlers.
  • Exact string matching for common responses bypasses LLMs, saving 700ms+ latency.
  • Speech-to-speech models remain too expensive and uncontrollable for enterprise.

I spoke with Nick Leonard, CEO and co-founder of Voice Run, about deploying voice agents at scale. Here are the core architectural constraints.

The No-Code Ceiling

No-code voice builders fail in production. They work for demos, but break on complex error handling and async tasks.

Enterprise voice agents require code, typically 500 to 5,000 lines of Python for the handler. If an on-prem database query takes 10 seconds, the handler needs an asynchronous background task that runs the query while the agent keeps prompting the user. Visual flowcharts lack this control.
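
To make the background-task point concrete, here is a minimal sketch using Python's asyncio. Everything named here is an assumption for illustration: slow_database_query stands in for the on-prem lookup, say is whatever coroutine your runtime uses to speak to the caller, and the filler phrases and intervals are made up.

```python
import asyncio

async def slow_database_query(delay_s: float) -> str:
    """Hypothetical stand-in for a slow on-prem lookup."""
    await asyncio.sleep(delay_s)
    return "record found"

async def handle_turn(say, query_delay_s: float = 10.0,
                      filler_interval_s: float = 3.0) -> str:
    """Run the query in the background and keep prompting until it finishes."""
    # Start the query without blocking the conversation loop.
    task = asyncio.create_task(slow_database_query(query_delay_s))
    fillers = [
        "One moment while I look that up.",
        "Still checking, thanks for holding.",
    ]
    spoken = 0
    while True:
        # Wait for the query, but give up after filler_interval_s and speak.
        done, _pending = await asyncio.wait({task}, timeout=filler_interval_s)
        if done:
            break
        # Repeat the last filler once the list is exhausted.
        await say(fillers[min(spoken, len(fillers) - 1)])
        spoken += 1
    result = task.result()  # re-raises here if the query failed
    await say(f"Thanks for waiting: {result}.")
    return result
```

A handler would await handle_turn(say) inside its own event loop. The point of the sketch is only that the wait and the prompting run concurrently, which is the kind of control the paragraph above says flowcharts lack.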

Hacking Latency: Skip the LLM

Latency breaks voice UX. When a caller interrupts over a high-latency connection, the agent and the caller end up talking over each other, and the conversation loops.

The hack: skip the LLM by using exact string matching inside a state machine. Call centers ask predictable questions, such as “If disconnected, can I call you back?”

  1. Enumerate valid answers (“Yes,” “No,” “Sure”).
  2. Normalize input (lowercase, strip punctuation).
  3. Exact string match against the dictionary.
  4. Fall back to the LLM only when the match fails (typically 10% of calls).
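
The four steps above can be sketched in a few lines of Python. This is an illustration, not Voice Run's implementation: the answer table holds only the three answers listed above, and llm_classify is a hypothetical placeholder for whatever slow-path model call your handler makes.

```python
import string

# Step 1: enumerate valid answers for "If disconnected, can I call you back?"
# (only the three examples from the list above; a real table would be larger).
CALLBACK_ANSWERS = {"yes": True, "sure": True, "no": False}

# Removes ASCII punctuation only; curly quotes and the like would need extra handling.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(utterance: str) -> str:
    """Step 2: lowercase, strip punctuation, collapse whitespace."""
    return " ".join(utterance.lower().translate(_PUNCT_TABLE).split())

def llm_classify(utterance: str) -> bool:
    """Hypothetical slow path; a real handler would call an LLM here."""
    raise NotImplementedError("plug in your LLM call")

def interpret_callback_answer(utterance: str, fallback=llm_classify):
    """Return (answer, path), where path is 'exact' or 'llm'."""
    key = normalize(utterance)
    if key in CALLBACK_ANSWERS:  # Step 3: exact dictionary lookup
        return CALLBACK_ANSWERS[key], "exact"
    return fallback(utterance), "llm"  # Step 4: LLM only on a miss
```

Returning the path alongside the answer makes it easy to log how often the fast path actually hits, which is how you would check the fallback rate on your own traffic.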

An exact match takes about 900 nanoseconds, so bypassing the LLM saves roughly 700 milliseconds per turn. Combine this with cached audio from TTS providers like ElevenLabs or Cartesia to hit sub-300 ms latency.
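
Cached audio follows the same idea: if the reply text is known in advance, synthesize it once and replay the bytes. The sketch below assumes a synthesize callable that wraps your TTS provider; AudioCache and its in-memory dictionary are hypothetical, and no real provider API is shown.

```python
import hashlib
from typing import Callable, Dict

class AudioCache:
    """In-memory cache of synthesized audio, keyed by a hash of the reply text."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # injected wrapper around a TTS provider
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # number of times the provider was actually called

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._store:
            # Slow path: one synthesis call, then reuse the result.
            self.misses += 1
            self._store[key] = self._synthesize(text)
        return self._store[key]
```

In practice the canned replies paired with exact-match answers could be warmed into the cache at startup, so the first caller never pays the synthesis cost.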

Model Agnosticism

Never lock yourself into an audio provider’s proprietary LLM. Decouple the audio runtime from the reasoning layer, and orchestrate across providers so that each task goes to the most cost-effective model.
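
One way to picture this decoupling is a small routing table that picks the cheapest model registered for a task. The sketch is an assumption throughout: ModelRoute, the task labels, and the cost figures are illustrative, and the complete callable stands in for a provider-specific client.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

@dataclass(frozen=True)
class ModelRoute:
    name: str                       # hypothetical provider/model label
    cost_per_1k_tokens: float       # illustrative number, not real pricing
    tasks: FrozenSet[str]           # task types this model is trusted with
    complete: Callable[[str], str]  # provider-specific call, injected

def pick_route(task: str, routes: List[ModelRoute]) -> ModelRoute:
    """Return the cheapest route that is registered for the given task."""
    candidates = [r for r in routes if task in r.tasks]
    if not candidates:
        raise LookupError(f"no model registered for task {task!r}")
    return min(candidates, key=lambda r: r.cost_per_1k_tokens)
```

Because the handler only ever sees pick_route and complete, swapping a provider means editing the table, not the conversation logic.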

Speech-to-Speech Limitations

Speech-to-speech models sound natural but fail enterprise constraints:

  • Cost: Drastically higher than cascaded systems.
  • Reliability: Unpredictable behavior and tone.
  • Control: Limited to system prompt tuning. Large prompts cause regressions.

Cascaded systems (STT → LLM → TTS) with a code intelligence layer remain the only viable architecture for integrated enterprise deployments.
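
As a rough sketch of what one cascaded turn with a code layer looks like, the function below wires the stages together. All five callables are assumptions injected by the caller; code_layer returns a reply when deterministic logic can answer and None when the LLM should take over.

```python
from typing import Callable, Optional

def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    code_layer: Callable[[str], Optional[str]],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One cascaded turn: STT, then code layer, then LLM if needed, then TTS."""
    transcript = stt(audio_in)
    reply = code_layer(transcript)  # deterministic path (state machine, exact match)
    if reply is None:
        reply = llm(transcript)     # reasoning path, only when code cannot answer
    return tts(reply)
```

Each stage is a plain function boundary, so it can be logged, tested, or replaced on its own, which is the control the bullets above say a single speech-to-speech model does not offer.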


Need help with voice architecture?

If your voice agent is hitting a reliability ceiling or latency issues, let’s fix the architecture. Get a free consultation.