Building voice AI at scale: A conversation with Jake Chao

Whitney Rose
Content Marketing
January 9, 2026
2 min read

Voice AI gets a lot of attention for flashy demos. Running it reliably in production is a very different challenge.

In this Q&A, Jake Chao, Engineering Manager on Assembled’s Voice AI team, shares what it’s like to build voice agents that handle millions of real customer calls — from balancing latency and quality to scaling infrastructure and building a team that thrives on ownership.

This conversation has been edited for brevity and clarity.

Q: You worked at Meta earlier in your career. What's the biggest difference between building software at a fast-growing startup versus a company operating at a global scale?

The level of individual responsibility is night and day. At Meta, everything's nicely abstracted away for you. You can focus on one particular vertical and never really worry about stepping outside that lane.

At a fast-paced company like Assembled that works closely with customers, you're constantly getting opportunities to try new things, talk directly with customers, and do work outside your traditional role. Even as an engineer, I'm doing way more than writing code. I'm regularly on calls with customers, gathering feedback and helping with sales demos to make sure they work seamlessly. As a result, I'm learning a ton while driving impact across multiple parts of the company.

Q: What's something you've taken on since joining Assembled that you didn't expect?

I joined about a year ago on the AI agents team, initially focused on building Copilot — a product that helps support agents using AI — and email automations. Then, in the past year, I helped spin up our voice agents at Assembled. That was completely unexpected.

I never thought I'd work with voice AI in my career, but now it's all I do every day, and it's incredibly exciting. It's one of the fastest-growing parts of the AI industry. If you'd asked me a year ago whether I'd be working on a voice AI team handling millions of calls per year in production, I would've said no way.

Q: Can you give us a sense of the voice team — how you work together and what the dynamic is like day to day?

We're a small but mighty team. Right now, we have four engineers, one designer, and one product manager. What's really exciting is that we've been running voice agents in production for quite a while. We started some of our first voice agent calls earlier this year, and now we're handling millions of calls per year. This isn't just a demo — we're helping real people solve real problems.

Our top priority is creating the most conversational call possible for customers. All of our objectives and roadmap planning boil down to one question: Is this going to help our customer either solve their problem or make the conversation feel more human and well-paced? That makes it really easy for our team to stay focused.

The most special part is that everyone on the team has the independence to identify problems our voice agent is facing. Everyone has enough experience and context to have informed opinions on which problems are most important and well-thought-out solutions for tackling them. We're all strong engineers, designers, and product managers, and we do a great job of encouraging that kind of discourse.

Q: Voice AI is still a young space. What's it like working on this cutting-edge technology in production with real customers?

It's both really exciting and really hard. Voice presents a tough problem set. You're walking a fine line between wanting really high-quality calls and really low latency. It needs to feel conversational, be reliable, and maintain 99.99% uptime. When you combine all those requirements with really new technology, it creates exciting and challenging problems.

For example, how can we reduce the perceived latency for customers on calls? Even if our model needs time to think, how can we maintain that conversational pace while keeping quality high?

We're at the frontier of the frontier — voice AI is about as new and hot as it gets. There are so many people working on it, making it a really competitive space. We're getting pushed to move fast and try new things. New techniques, frameworks, and models are being released every day by new companies. Getting to evaluate all of those and figure out where we think the industry is going is really exciting.

At the same time, it's very challenging, especially when trying to balance things like latency and quality. But if it wasn't hard, it wouldn't be fun.

Q: Assembled is handling millions of calls with voice AI. What kinds of technical challenges show up as that volume scales?

The biggest thing is reliability. When you're handling phone calls with customers, your leeway with downtime is extremely limited. Obviously, our goal with any service at Assembled is 100% uptime, but the bar with live customers is so much higher. If we lose a call with a customer, that's a real person who needed help that we weren't able to support.

One of the biggest lessons I've learned working on this project from the ground up is how to build infrastructure that supports millions of calls a year. A big part of that was pairing with the fantastic infrastructure teams at Assembled.

Another challenge is building that level of resilience while working with external systems. If an upstream provider experiences problems and our services are impacted, how do we make sure our system can handle it? Ultimately, that comes down to two goals: maintaining the highest uptime possible for our voice agent, and ensuring that if it is impacted, calls always reach a live agent so customers still get the help they need.
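As a rough illustration of that failover shape (not Assembled's actual code — `upstream_tts` and `transfer_to_live_agent` are hypothetical stand-ins for real provider integrations), the key idea is that a provider error or timeout degrades to a human handoff instead of a dropped call:

```python
import asyncio


class ProviderError(Exception):
    """Raised when a hypothetical upstream provider fails."""


async def upstream_tts(text: str) -> str:
    # Hypothetical upstream text-to-speech call; in production this
    # could raise ProviderError, hit network errors, or stall.
    return f"audio({text})"


async def transfer_to_live_agent(reason: str) -> str:
    # Hypothetical escalation path: route the call to a human agent.
    return f"transferred to live agent ({reason})"


async def speak_with_fallback(text: str, timeout: float = 2.0) -> str:
    """Try the upstream provider, but never drop the caller: on error
    or timeout, hand the call to a live agent instead."""
    try:
        return await asyncio.wait_for(upstream_tts(text), timeout=timeout)
    except (ProviderError, asyncio.TimeoutError, OSError):
        return await transfer_to_live_agent("provider unavailable")
```

The point of the sketch is the shape of the `except` clause: every failure mode of the dependency maps to an explicit escalation, so the worst case for the caller is a human, not silence.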

Q: Is there an interesting technical problem your team has cracked lately that you can talk about?

There's a really voice-specific problem I'm particularly excited about. Say your voice agent's brain needs time to think through the answer to a question — how do you buy time for that brain to think? The alternative is using a really fast but less intelligent model that gives quick answers without substantive content, which doesn't help the customer.

So if you do need time to think or perform an action, like looking up a customer's account, the question is: how do you generate messages that acknowledge the customer, let them know they're heard, and communicate what you're doing? And how do you make sure that message is still conversational?

From a customer's perspective, the goal is that the acknowledgement message is cohesive with the actual reply, so the two feel like one continuous response. The customer shouldn't know these are two different modes of communication.

It involves a lot of parallelization within the LLM's brain in creative ways so you can still get a really high-quality answer while making the conversation feel natural. 

The challenge is contextualizing them properly. If a customer says, "I want help updating my account balance," the best message is probably, "I'm happy to help you with updating your account balance," because it acknowledges them with a positive tone. But if you're mid-conversation and they say, "My name is Jake, and my email is jake@gmail.com," it's probably best to just say "Okay" or "Got it." Contextualizing that to the conversation is actually quite difficult to get right.
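A toy sketch of that acknowledge-while-thinking pattern (hypothetical function names and simulated latencies, not the production system): a fast, cheap path produces a context-appropriate acknowledgement while the slower, smarter path computes the real answer in parallel.

```python
import asyncio


async def generate_acknowledgement(utterance: str) -> str:
    # Fast, cheap path: pick a context-appropriate filler — a warm
    # acknowledgement for a request, a brief "Got it" for plain info.
    await asyncio.sleep(0.01)  # simulated fast-model latency
    if any(w in utterance.lower() for w in ("help", "want", "need")):
        return "I'm happy to help you with that."
    return "Got it."


async def generate_answer(utterance: str) -> str:
    # Slow, smart path: the substantive reply (simulated here).
    await asyncio.sleep(0.05)  # simulated slow-model latency
    return f"Here's what I found about: {utterance}"


async def respond(utterance: str) -> list[str]:
    """Start the real answer immediately, speak the acknowledgement
    while it computes, then append the full reply."""
    answer_task = asyncio.create_task(generate_answer(utterance))
    spoken = [await generate_acknowledgement(utterance)]  # plays first
    spoken.append(await answer_task)                      # then the reply
    return spoken
```

The contextualization problem from the example above lives in `generate_acknowledgement`: in a real system that branch would itself be a fast model call conditioned on the conversation so far, not a keyword check.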

Q: Can you tell us about our multi-agent architecture and why it's important?

Within the brain of our voice agent, there are actually multiple LLM calls happening, both in parallel and in sequence. We call it our multi-agent architecture. Depending on the context of the call — whether the caller is asking about a certain procedure, requesting a transfer, or asking a question covered by the customer's knowledge base — we can hand them off to agents specialized for those tasks.

This is important because low-latency LLMs like GPT-4o mini or Gemini 2.5 Flash are really fast, but you hit the limits of how smart they are and how well they can follow instructions pretty quickly. The more specialized you can make your prompts and agents for specific tasks, the better your outcomes.

We have an entire orchestration layer around determining what the customer is asking about, how to transition from one agent to another, and how to transition back if needed. This is one of the bigger technical pieces we started really early and saw a lot of results on, even when many people in the voice industry were still on the "throw all your context into one prompt and hope it works" approach. We realized pretty quickly at the scale we were at that wasn't going to work, and we needed to build more complexity.
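A minimal sketch of that orchestration idea (hypothetical agents; a keyword check standing in for what would really be a fast intent-classification model): classify what the caller wants, then dispatch to a narrowly prompted specialist.

```python
from typing import Callable

# Hypothetical specialized agents, each of which would carry a narrow,
# focused prompt in a real system.
def procedure_agent(utterance: str) -> str:
    return f"[procedure agent] walking caller through: {utterance}"

def transfer_agent(utterance: str) -> str:
    return "[transfer agent] connecting you to a live agent"

def knowledge_agent(utterance: str) -> str:
    return f"[knowledge agent] answering from the knowledge base: {utterance}"

AGENTS: dict[str, Callable[[str], str]] = {
    "procedure": procedure_agent,
    "transfer": transfer_agent,
    "knowledge": knowledge_agent,
}

def classify_intent(utterance: str) -> str:
    """Stand-in for the orchestration layer's intent classifier; in
    practice this would itself be a fast LLM call with conversation
    context, not a keyword match."""
    text = utterance.lower()
    if "agent" in text or "human" in text:
        return "transfer"
    if "how do i" in text or "steps" in text:
        return "procedure"
    return "knowledge"

def route(utterance: str) -> str:
    # Dispatch to the specialist; a fuller version would also handle
    # transitioning back to a previous agent mid-conversation.
    return AGENTS[classify_intent(utterance)](utterance)
```

The design point is the one from the answer above: each specialist's prompt stays small and task-specific, which is what lets fast, low-latency models follow instructions reliably.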

Q: What are the pros and cons of being early in voice AI with a relatively small team?

The advantage of being early is that by being out there in production, you learn things quickly. When you're scaling to enterprise customers, you make a lot of assumptions about how to build demos or what systems will work for a simple knowledge base or set of procedure flows. We were only able to surface certain problems because we scaled quickly and did deep quality audits of our calls. That helped us see where things were working well, where there was room for improvement, and what the real limits of our system were.

I don’t think we would’ve gone through that feedback cycle as fast if we hadn’t been operating in production. In voice, it’s especially easy to build an exciting demo — there are tools that let you spin one up without writing a line of code. But once you put it into the wild, you start dealing with poor audio, frustrated callers, and companies with very specific expectations for how their agents should sound. Those realities are hard to anticipate without real usage.

Being early gave us momentum to work through those challenges sooner and build a mindset we still use today: when we see problems in our calls, how do we solve them creatively?

As for being a small team, most of us have been on the voice team for as long as the product has existed. We switched some of our most experienced engineers onto the team, and everyone has deep context on Assembled’s mission, the product, and what it takes to debug calls in production. That context lets us move faster than a larger team without it.

As an engineer, that’s especially exciting — every project you take on has a direct impact on how quickly the product improves and how fast Assembled can grow.

Q: What do you look for when interviewing candidates for the voice team?

If I could boil it down to one thing, it's ownership. How much care and pride does a person put into what they work on?

Every project on the voice team at Assembled has a direct, tangible impact on how well our product and company do. Having people who care deeply about their work and have a strong sense of ownership — "I'm going to do what it takes to get this project to completion, support our customers in the best way possible, and build the best voice agent and product at Assembled that I possibly can" — is crucial.

Seeing signals of that in an interview is always exciting because it makes it easy to imagine giving that engineer a project, even if it's ambiguous or just a problem statement. I want someone with strong ownership who will listen to calls, ask questions, figure out what the problems are, come up with solutions, get feedback, iterate, and execute.

Obviously, we're looking for really strong technical people too. But it all boils down to ownership — what are you willing to do to get this product to the best state it can be?

Q: When somebody on the team has an idea for improving your voice agents, how does it move from interesting thought to shipping in production?

It's fairly unbureaucratic. We're such a small team that if you have an idea for improving the voice agent, it's really just a matter of telling the team, "Hey, I noticed on some of these calls that our voice agent is giving the wrong answer or isn't as conversational as it could be. I have some ideas I want to focus on."

Usually, the only process is understanding what we've tried in the past. If someone's new on the team and doesn't have that context, it's important to know what we tried and what didn't work. That's where it's exciting to have done this for a while — you can say, "We tried that before, but this might work this time."

From there, it's just a matter of prioritization. We balance supporting our customers' incoming requests with pushing out new feature work. But really, if someone has an idea, there's not much process beyond, "What's the idea? What's the problem? How do you think you're going to fix it? That sounds great. Let's do it."

We do have some structure — weekly sprint planning where we discuss what folks will work on, what the most important top-level items are that drive us toward our quarterly goals, and how each project contributes to those goals. We also do quarterly goal planning together, so we're all on the same page about the top things we want to focus on.

My goal as a manager is to give as much independence and freedom as I can to my team so they can execute. They're on the team for good reason and have the skills to do that. As long as we're all working toward the same goal with a clear line of sight on what matters to our product, that's the best thing we can do as a team.

Q: The voice AI landscape is changing so quickly. How do you and the team stay on top of all the innovation?

A lot of it is knowing who the big players are and who's putting out the best models. We have connections through our founders and investors to the CEOs and founders of startups working on speech-to-text or text-to-speech models. We attend a lot of voice conferences — last week we went to the ElevenLabs summit, and I'm going to LiveKit Dev Day today. It's about going out and hearing what everyone else is doing. That's really the best way to filter through all the hot new shiny things you see on LinkedIn or X.

The best filter is talking to people who are doing this work too. When I'm talking to another voice developer, the first question I ask is: where are you in your product journey compared to where we are or where we've been? What technologies are you using?

Listening to talks at these conferences is always helpful too — these are folks building the voice models, and it's valuable to understand where they think the current state of the art is. It's always helpful to measure ourselves against that.

Q: Looking ahead to next year, where do you think voice agents have the most room to meaningfully improve or differentiate?

Speech-to-speech is the frontier of the frontier of the frontier — it's about as new as you can get. We've experimented with it and used it in some places. It's really exciting because it's about as low latency as you can get, and because there's no lossy conversion from speech to text and back from text to speech, the model can genuinely understand the tone of the person.

This is something I'm particularly excited about because we've seen folks like OpenAI invest time into it. Seeing how far we can push that, and where we land in that journey, is going to be really exciting. That's going to make Assembled's voice agents more conversational and lower latency.

In terms of how voice agents differentiate themselves, it's really about making our agents smarter and giving them an entire tool set — "I can look up your account order, start a return for you, start that claim for you" — while still maintaining a really conversational call. These are the pillars of voice agents.

I fundamentally believe that no matter how much hot new shiny technology comes out — speech-to-speech or anything else — these pillars aren't going to change. The future of voice agents is about taking these pillars we already know and applying all these new, exciting technologies to them to keep leveling up.

Interested in building voice AI that actually runs in production? We’re hiring.

Tags
Life at Assembled