
Building Voice AI with Nikhil Gupta of Vapi.ai
with Nikhil Gupta, Vapi
Show Notes
Voice AI is having its moment - but building a production-grade voice experience is orders of magnitude harder than building a chatbot. Every 20 milliseconds, a slice of audio travels from device to server and back. The real-time architecture required to make that feel natural at scale is genuinely complex. Nikhil Gupta built Vapi so developers don't have to figure that out themselves.
Vapi is the developer platform behind the voice agents now answering phones, taking reservations, and handling customer service calls at businesses across the country. The company was profitable before going through Y Combinator, raised a $20M Series A, and reportedly counts every voice-product builder in the most recent YC batch - roughly 20% of the cohort - as a customer. In this episode, Nikhil breaks down why voice is the next computing interface, what makes real-time audio infrastructure so hard, and what he actually worries about when it comes to AI risk.
What Vapi Does and Why It Exists
The ChatGPT voice experience - fluid, natural, instant - is what every business phone call, website interaction, and customer touchpoint will eventually feel like. Making that happen requires developers to build it. And building it requires infrastructure that most teams don't have and shouldn't have to build from scratch.
Vapi provides that infrastructure. The platform handles the real-time audio streaming (20ms audio packets, bidirectionally, at scale), the configuration complexity of assembling a coherent voice agent from LLMs, text-to-speech, and speech-to-text components, and the integrations required to connect those agents to actual business systems. Developers use Vapi to build voice products; those products go to market with their customers - local businesses, enterprises, anyone who needs to answer a phone or host a voice interface.
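To make the "20ms audio packets, bidirectionally, at scale" framing concrete, here is a back-of-the-envelope sketch. The audio format (16 kHz, 16-bit mono PCM) is an assumption for illustration, not Vapi's documented wire format:

```python
# Rough throughput math for 20ms bidirectional audio framing.
# Assumed format: 16 kHz sample rate, 16-bit (2-byte) mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
FRAME_MS = 20

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 320 samples per packet
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE  # 640 bytes per packet
frames_per_second = 1000 // FRAME_MS                    # 50 packets/sec each way

# One duplex call carries both an uplink and a downlink stream.
bytes_per_second = 2 * frames_per_second * bytes_per_frame

print(samples_per_frame, bytes_per_frame, frames_per_second)
print(f"{bytes_per_second / 1024:.1f} KiB/s per call")
```

Fifty packets per second in each direction is why the orchestration layer, not the model, is where most of the engineering lives: every one of those packets has a deadline.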
The use cases are already in production: restaurant reservation bots, after-hours answering services, lead qualification agents, customer support systems. Nikhil's canonical example is calling the IRS. Waiting on hold for hours to resolve a simple tax question is a solved problem - if you have a voice agent that can handle it. Vapi is the platform that enables companies to build that agent and sell it.
The Technical Complexity of Real-Time Voice
Text-based AI - chatbots, copilots, document Q&A - is architecturally forgiving. There's latency tolerance. A response that takes two seconds is fine. Voice is different. Human conversation operates on a 200ms expectation. Any delay beyond that breaks the interaction.
Real-time voice requires continuous bidirectional audio streaming, sub-200ms end-to-end latency across the full pipeline (speech recognition → LLM inference → text-to-speech → audio delivery), interruption handling (the agent needs to stop mid-sentence when the human cuts in), turn detection, and all of this orchestrated reliably at whatever scale the developer's product requires.
This is why voice has been harder to democratize than text AI. The configuration surface area is enormous. Vapi abstracts it all - developers interact with a clean API rather than wiring together the underlying architecture themselves.
Selling Shovels in the Voice AI Gold Rush
Twenty percent of the most recent YC batch were building voice agent products. According to a YC partner Nikhil spoke with, all of them use Vapi. That's not an accident - it reflects a deliberate positioning decision.
Vapi chose to be infrastructure, not application. Rather than building a finished voice agent product for a specific vertical, Nikhil built the platform that all the vertical-specific builders use. The Levi Strauss play: during the gold rush, the fortune was in selling jeans to miners, not in mining for gold.
The founder-market fit here is genuine. Nikhil describes caring deeply about the developer and founder community - countless nights shipping features customers need, motivated by wanting to see them succeed. That orientation shapes what Vapi prioritizes: integrations, developer experience, platform reliability, and velocity.
Profitable Before YC, $20M Series A After
Vapi's funding path is unusual: the company was generating revenue and profitable when it went through Y Combinator. YC funding provided a bridge, but the business wasn't dependent on it. The $20M Series A raised afterward wasn't about keeping the lights on - it was about compressing timelines.
Nikhil's framing: you can wait two to three years for organic revenue to fund your roadmap, or you can raise capital, build faster, and capture the market before competitors close the gap. With voice AI accelerating as fast as it is, speed matters. The raise went toward team growth (now ~30 people) and accelerating the platform's feature velocity to keep pace with what customers need.
On YC: Nikhil did the cohort during COVID, so it was fully remote. He rates the brand (fundraising credibility, customer trust), mentorship (partners who've seen 600+ companies succeed and fail), and community (co-selling to and learning from other YC companies) as the three primary value drivers - and notes he'd be slightly jealous of founders going through the in-person version.
The Future Nikhil Is Building Toward
The long-term thesis driving Vapi isn't about voice agents for restaurants. It's about computing itself changing interfaces. In Nikhil's view, the laptop and app paradigm is on a slow trajectory toward obsolescence. In 10 years, the primary way people interact with computers will be voice - ambient, conversational, everywhere.
That future requires entirely new infrastructure categories. Voice authentication to replace passwords. Payments over voice. Agent-to-agent communication protocols - when your AI needs to call a business's AI, they may not even communicate in English. They might develop a more compressed, efficient encoded language between themselves.
Vapi's bet is to be the platform layer for that future. The current use case - developers building voice agents for businesses - is the first chapter of a much longer story about how computing becomes conversational.
What Nikhil Actually Worries About in AI
When asked what keeps him up at night, Nikhil is careful not to manufacture a fear he doesn't genuinely have. He's waiting to see AI risk manifest in concrete ways before treating it as a present-tense concern. But he does identify one theoretical scenario that sits in genuinely new territory: reinforcement learning systems optimizing toward goals.
The concern is instrumental convergence - the possibility that an AI optimizing for a specific goal might discover that acquiring more resources, influence, or power is an effective strategy for achieving that goal, even if power-seeking was never specified as part of the objective. Unlike hallucination or bias, which are well-understood failure modes, goal-directed systems with wide strategy bounds are entering territory where the possible failure modes are harder to anticipate. He's not sounding alarms, but he's watching.
Tools & Resources
- Vapi - Voice AI infrastructure platform for developers; handles real-time audio streaming, agent orchestration, and scaling; free to explore at vapi.ai (hit "Talk to Vapi" on the homepage)
- ChatGPT Deep Research - Nikhil's go-to for consumer purchase decisions; Ryan's go-to for historical research tied to autobiographical essays - both use cases highlight the tool's ability to synthesize web-scale information into actionable outputs
- Juice Box - AI-native recruiting platform; Nikhil's example of what it looks like when AI is thoughtfully woven into every product detail rather than bolted on
- Superhuman - AI-powered email client; auto-tags incoming emails (marketing, promo, etc.) as an example of small AI details that reduce cognitive load
- Cal AI - Calorie tracking via food photo; cited as a model of the one-input, one-answer AI interface pattern
- BE Computer (be.computer) - Wearable AI device Ryan has on order; discussed in context of ambient, always-on AI memory as the next computing paradigm
Key Frameworks from This Episode
- Infrastructure Over Application
- Vapi chose to build the platform that voice AI developers use, not a finished product for a specific vertical. This positioning - selling shovels in the gold rush - captures value across every vertical simultaneously and avoids the zero-sum competition of picking a single market. The tradeoff: you depend on the ecosystem's growth rather than owning a customer relationship directly.
- Profitable Before Capital
- Vapi was generating revenue before going through YC and remained profitable through its early growth. The $20M Series A was a velocity accelerant, not a survival mechanism. This changes the fundraising dynamic entirely: you're negotiating from strength, not desperation. The raise compresses a three-year organic growth timeline into months.
- The Cost of Intelligence Is Sliding
- The cost to run AI inference has followed the same curve as every prior compute technology - televisions, laptops, smartphones. What costs $1 today will cost $0.001 in five years. This makes the moat around cost-based advantages temporary, but it also means use cases that aren't viable today (AI-powered healthcare, education at scale) become viable as the cost curve continues its descent.
- Voice as the Successor Interface
- Nikhil's thesis: the keyboard-and-screen computing paradigm will give way to ambient, conversational computing. Voice is faster than typing, more natural, and carries emotional context that text strips away. The transition requires new infrastructure layers - voice authentication, voice payments, agent-to-agent protocols - none of which exist yet at production scale.
- AI-Native vs. AI-Added
- The distinction Nikhil draws between Juice Box (AI woven into every detail of the product experience) and software that has added an AI feature as an afterthought. AI-native products feel qualitatively different - every interaction is designed with AI as a first-class citizen, not retrofitted onto legacy UX patterns. This is the standard the next generation of software tools will be judged by.
- Instrumental Convergence Risk
- The theoretical AI risk Nikhil monitors: a reinforcement learning system optimizing toward a goal might discover that acquiring more capability or resources is an effective intermediate strategy, even if power-seeking wasn't part of the objective. This isn't a present-day operational concern, but reinforcement learning systems with wide strategy bounds are entering territory where the failure modes are harder to anticipate than hallucination or bias.
FAQ
What does Vapi actually provide to developers?
Vapi is a platform API that handles the infrastructure complexity of real-time voice AI: bidirectional audio streaming in 20ms packets, orchestration of speech recognition, LLM inference, and text-to-speech components, interruption handling, turn detection, and scaling. Developers use it to build finished voice agent products that they then sell to businesses - restaurant bots, customer service agents, after-hours receptionists, and similar applications.
Why is voice AI infrastructure harder than chatbot infrastructure?
Text-based AI has latency tolerance - a 2-second response is fine. Voice conversation requires sub-200ms end-to-end latency or the interaction breaks down. Every 20 milliseconds, audio must travel from the user's device, through speech recognition, into the LLM, out through text-to-speech, and back to the user - continuously, bidirectionally, reliably at scale. The configuration surface area is enormous. Vapi abstracts all of that.
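One way to reason about the sub-200ms requirement is as a budget split across pipeline stages. The per-stage numbers below are illustrative assumptions, not measured figures from Vapi:

```python
# Hypothetical latency budget for one conversational turn (milliseconds).
BUDGET_MS = 200

stage_latency_ms = {
    "speech_to_text": 60,    # streaming ASR, partial -> final transcript
    "llm_first_token": 80,   # time to first generated token
    "text_to_speech": 40,    # first audio chunk synthesized
    "network_round_trip": 30,
}

total = sum(stage_latency_ms.values())
over_budget = total - BUDGET_MS
print(f"total={total}ms, over budget by {over_budget}ms")
# With these assumed numbers the pipeline misses the 200ms budget by 10ms -
# which is why every stage has to stream and overlap rather than run
# sequentially to completion.
```

The point of the arithmetic: reasonable per-stage latencies already blow the budget if the stages run back to back, so the orchestration layer has to pipeline them.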
What kind of businesses should be using voice AI agents right now?
Any business with repetitive inbound phone volume: restaurants (reservations, hours, menu questions), professional service firms (appointment booking, intake), e-commerce (order status, returns), local services (scheduling, pricing). The cost of serving those calls has dropped by orders of magnitude. Nikhil's benchmark example: IRS wait times. A well-built voice agent eliminates them entirely.
How does Vapi make money?
Usage-based pricing - developers and companies pay for the voice infrastructure they consume. Vapi was profitable on this model before going through YC, which means the unit economics worked before any venture investment arrived. The $20M Series A accelerated growth, not profitability.
What does agent-to-agent voice communication look like?
Nikhil's near-term prediction: your AI agent calls a business's AI agent. They may negotiate and transact entirely without human involvement. Longer term, he speculates that two AI agents communicating might develop a more compressed, efficient encoded language between themselves - bypassing natural language entirely when efficiency is more important than human readability.
What did Nikhil get out of YC that wasn't the funding?
Three things: brand (the YC name opens doors with investors and customers), mentorship (partners who've seen 600+ companies at every stage of success and failure), and community (co-selling to other YC companies, learning from founders in adjacent spaces). He did the COVID remote cohort and notes the in-person version would have added a fourth dimension - the spontaneous relationship density that comes from physical proximity.