"Scale, Failure, And The Hard Parts Nobody Teaches"

Sachidananda Singh is a Senior Engineering Manager at Wayfair, leading the engineering teams behind the scheduling, routing, and delivery infrastructure for one of the largest home goods networks in the United States. His background is in distributed systems, not logistics, and when he arrived, he did something that shapes everything he talks about here: instead of pretending to understand the domain, he went to the warehouses, rode along on delivery routes, and treated the workarounds drivers had built around the software as the real spec.

In this interview, we get into how to design distributed systems that hold under pressure, what it takes to lead multiple engineering teams through complex cross-functional work, and which skills will matter most for engineering leaders in this space in the years ahead.

As I understand, you joined Wayfair without prior experience in last-mile logistics. Where did you start to quickly get up to speed in this new domain, and what approaches helped you grasp the complexities of delivery operations?

Honestly, the first thing I did was stop pretending I understood the domain. I came in from AWS, where I had led the time sync team. There, correctness was everything, and the customer was another engineer. Large-parcel delivery is almost the opposite. The “customer” is a driver in a truck trying to back into a narrow driveway in Boston at 4 p.m. on a Friday. No textbook gets you there.

It also helped to know, coming in, that the industry was about to get harder, not easier. The World Economic Forum was already projecting that without intervention, urban delivery vehicles and related carbon emissions could rise by up to 60% by 2030, with congestion increasing by 14% and healthcare costs rising by 12%. When you join a logistics org with those headwinds against you, “business as usual” is not a neutral choice. It’s a losing one. That framing shaped how I prioritized from day one.

And how did that translate into your first weeks on the ground?

So I did two things in parallel.

The first was going to the warehouses. I shadowed dispatchers, rode along on delivery routes, and watched what happened when a stop failed: who got paged, which screen they looked at, and which field they retyped. You learn more in an afternoon on a loading dock than in a week of dashboards. You see the workarounds people have built for the software, and those workarounds are the real spec.

The second was mapping the system the way an engineer does. I traced a single order end-to-end, from checkout to POD (proof of delivery), and wrote down every service it touched, every team that owned it, and every place state changed hands. By the end of week three, I had an ugly diagram on the wall of my office that I kept pointing at for the next two years.

The thing that sped me up the most wasn’t a doc. It was asking operations people “what do you wish the system did that it doesn’t?” and treating every answer as a technical requirement I just hadn’t earned yet. Engineers are taught to solve. Managers new to a domain should be taught to listen first, then solve.

Wayfair projects involved driver-facing mobile apps, an internal fleet management system, and telemetry. Which engineering principles and design patterns proved most valuable in assessing and improving such complex systems?

I’d say three stood out.

The first principle was to decouple ingestion from everything else. Telematics is messy. GPS pings from trucks, check-ins from mobile apps, and third-party carrier feeds all arrive in different formats, at different cadences, and drop whenever a driver rolls into a connectivity dead zone. We put Kafka in front as a buffer and normalizer. That one decision absorbed traffic spikes that would otherwise have cascaded into the planning systems.
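To make the "normalizer" half of that idea concrete, here is a minimal sketch of mapping two differently shaped feeds onto one canonical event before anything downstream reads them. It is not Wayfair's code: the `LocationEvent` shape, the source names, and every field name are hypothetical, and the Kafka producer and consumer plumbing is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LocationEvent:
    """Canonical event shape that downstream consumers would read (hypothetical)."""
    vehicle_id: str
    lat: float
    lon: float
    observed_at: datetime
    source: str


def normalize(raw: dict) -> LocationEvent:
    """Map one raw payload from an assumed source format to the canonical shape."""
    source = raw.get("source")
    if source == "truck_gps":
        # Assumed format: epoch seconds and short field names.
        return LocationEvent(
            vehicle_id=str(raw["vid"]),
            lat=float(raw["lat"]),
            lon=float(raw["lng"]),
            observed_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
            source=source,
        )
    if source == "driver_app":
        # Assumed format: ISO-8601 timestamp and a nested position object.
        return LocationEvent(
            vehicle_id=str(raw["vehicle"]),
            lat=float(raw["position"]["latitude"]),
            lon=float(raw["position"]["longitude"]),
            observed_at=datetime.fromisoformat(raw["time"]),
            source=source,
        )
    # Unknown feeds fail loudly instead of leaking odd shapes downstream.
    raise ValueError(f"unknown source: {source!r}")
```

With something like this at the edge, the planning systems only ever see one event shape, whatever the trucks, apps, and carriers happen to send.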

The second was to separate the read and write paths. Real-time dashboards and transactional writes have completely different requirements. Trying to serve both from the same store is how you end up with a dispatcher refreshing a tab while a scheduler times out. The database migration I led, moving off a tightly-coupled SQL setup onto PostgreSQL on GCP with event-driven sync, was specifically to make that separation real.

The third, and the one people most often underrate, was to design for the unhappy path first. In logistics, the happy path is easy. The box goes on the truck, the truck goes to the house, the house accepts the box. The system earns its keep on the 5% where the customer isn’t home, or the truck breaks down, or the carrier’s API stops responding halfway through a batch. I made my teams start every design doc with “what happens when this fails?” and only then write the success flow. It changed the shape of what we built.

Patterns like CQRS, idempotent writes, and graph-based route queries were just tools. The principle underneath all of them was that state transitions should be observable, replayable, and safe to retry.
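One way to picture "observable, replayable, and safe to retry" is an order whose state changes only through logged transitions guarded by an idempotency key. The sketch below is illustrative and in-memory only. The state names, the `ALLOWED` table, and the `Order` class are assumptions, not the system described in the interview.

```python
from dataclasses import dataclass, field

# Hypothetical delivery states and the transitions allowed between them.
ALLOWED = {
    ("scheduled", "loaded"),
    ("loaded", "out_for_delivery"),
    ("out_for_delivery", "delivered"),
    ("out_for_delivery", "failed_attempt"),
    ("failed_attempt", "scheduled"),
}


@dataclass
class Order:
    order_id: str
    state: str = "scheduled"
    log: list = field(default_factory=list)   # append-only record: observable and replayable
    seen: set = field(default_factory=set)    # idempotency keys that were already applied

    def apply(self, key: str, new_state: str) -> bool:
        """Apply a transition once; a retry with the same key is a no-op."""
        if key in self.seen:
            return False
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.log.append((key, self.state, new_state))
        self.seen.add(key)
        self.state = new_state
        return True


def replay(order_id: str, log: list) -> Order:
    """Rebuild an order's state purely from its transition log."""
    order = Order(order_id)
    for key, _old_state, new_state in log:
        order.apply(key, new_state)
    return order
```

A caller that times out can resend the same key without fear of a double transition, and the log alone is enough to reconstruct where the order stands.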

Is there a specific failure pattern that pushed you toward that kind of thinking?

One thing I’d add. Tail latency is where large systems actually fail their users. There’s a line from Azul Systems CTO Gil Tene, “outliers are the norm”, that gets quoted a lot at P99 CONF for a reason. In any fleet with tens of services and thousands of nodes, the low-probability events stop being low-probability in aggregate. TigerBeetle and PayPal have both talked publicly about engineering for determinism (single-threaded scheduling, predictable I/O, batching everywhere) to tame the long tail. Not every system needs that rigor, but every system needs the mental model. Design for the worst 1% of requests, not the average.
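The "in aggregate" point is easy to check with arithmetic. Assuming, purely for illustration, that a request fans out to a number of downstream calls whose latencies are independent, the chance that at least one of them lands beyond its own 99th percentile is 1 minus 0.99 raised to the number of calls. The function name below is made up for this sketch.

```python
def p_any_slow(n_calls: int, percentile: float = 0.99) -> float:
    """Probability that at least one of n independent calls exceeds the given percentile."""
    return 1.0 - percentile ** n_calls


# Print the effect of fan-out on how often a request sees a tail-latency call.
for n in (1, 10, 100):
    print(f"{n:>3} calls: {p_any_slow(n):.1%} of requests hit a p99 outlier")
```

Under that independence assumption, a request touching 100 calls meets a p99 outlier roughly 63% of the time, which is why the worst 1% ends up describing the typical experience.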

In logistics, critical failures can have serious consequences. What does a “high-severity incident” mean to you, and how did you design processes to minimize cascading effects from errors?

A “high sev” for me isn’t defined by the technical blast radius. It’s defined by who can’t do their job right now. If a dispatcher in a hub can’t assign a driver for the next thirty minutes, that is a high-severity incident even if every service is technically returning 200s. The user-facing definition of broken has to win over the infrastructure-facing definition.

On containment, the thing I pushed hardest for was circuit-breaker discipline between services. A slow downstream dependency should not be allowed to hold a thread open indefinitely on the caller side. It sounds obvious. It is rarely done consistently until you’ve been burned. We got burned, and then we did it consistently.
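For readers who have not built one, a circuit breaker can be sketched in a few lines. This is a simplified, single-threaded illustration, not the implementation discussed here. The class and parameter names are hypothetical, and it assumes the wrapped call enforces its own timeout so that a slow dependency surfaces as an exception.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are being rejected without touching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before a trial call
        self.clock = clock               # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: the caller is not left waiting on a sick dependency.
                raise CircuitOpenError("downstream circuit is open")
            # Cool-down elapsed: allow one trial call (half-open). A single
            # failure from here reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The timeout on the wrapped call is what frees the caller's thread. The breaker's job is to stop sending new requests into a dependency that is already failing.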

Beyond that, every service had an explicit degradation mode. If the optimizer was down, the routing system would fall back to the last-known-good plan rather than refuse to serve. Stale is almost always better than blank in operations.
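A degradation mode of the "stale over blank" kind might look like the sketch below: keep the last plan that was produced successfully and serve it, clearly flagged, when the optimizer fails. The `RoutePlanner` class and the optimizer callable are hypothetical stand-ins, and a real system would also need to decide how old a plan may be before it stops being useful.

```python
class RoutePlanner:
    def __init__(self, optimizer):
        self.optimizer = optimizer   # hypothetical callable: stops -> plan
        self.last_good = None        # most recent plan that was produced successfully

    def plan(self, stops):
        try:
            plan = self.optimizer(stops)
        except Exception:
            if self.last_good is None:
                # Nothing to fall back to yet, so the failure has to surface.
                raise
            # Serve the stale plan, but say so, so a dispatcher can see it is stale.
            return {"plan": self.last_good, "stale": True}
        self.last_good = plan
        return {"plan": plan, "stale": False}
```

The explicit `stale` flag matters as much as the fallback itself, because it lets the dispatcher's screen show that the plan is old rather than silently passing it off as fresh.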

On top of that, we ran blameless postmortems, and this is the part most teams skip: we tracked the action items out of them with the same rigor as product work. An incident you don’t fix the root cause of is one you’ve agreed to have again.

The other piece was cross-team drills. The 99.999% SLA work at AWS taught me that you can’t hold five nines by writing good code. You hold it by rehearsing failure until the humans are as reliable as the systems.
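It helps to see how little room five nines leaves. The helper below is plain arithmetic, assuming a 365-day year and counting any unavailability against the budget. The function name is invented for this sketch.

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Allowed downtime, in minutes, over the given number of days."""
    return (1.0 - availability) * days * 24 * 60


# Compare the yearly downtime budget at three, four, and five nines.
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):.2f} minutes per year")
```

At 99.999% the budget is a little over five minutes a year. A human response measured in tens of minutes spends several years of that budget in one incident, which is the case for rehearsal.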

The measure of a good incident process isn’t how fast you resolve the first one. It’s whether the second one with the same signature even happens.

Wayfair’s systems combine mobile apps, API gateways, cloud services, ML components, and telemetry. Technical debt is inevitable. How do you determine what’s critical versus what can be deferred, and what methods help maintain the balance between development speed and reliability?

I stopped calling it “technical debt” internally. The phrase lets people wave at a list of grievances and feel responsible without doing anything. I made teams be specific. Was this a correctness risk, a velocity tax, or a hiring tax?

A correctness risk is where the code can silently produce the wrong answer. This jumps the queue every time. In a delivery network, the wrong answer is a misrouted truck and a customer calling support.

A velocity tax is where the problem makes every new feature cost 30% more. You fix it when the next roadmap item would trip over it. Not before, not after.

A hiring tax is where a new engineer takes three weeks instead of three days to be productive. You fix it when onboarding becomes the bottleneck, not when it becomes annoying. Forcing the category up front stops the “everything is important” spiral.

The other thing I did was bake reliability work into the planning cycle, not bolt it on. We ran a standardized 6- and 12-month planning cadence across the Wayfair Delivery Network org. Every team had to allocate capacity to what we called “load-bearing work” (migrations, decoupling, and observability) before filling in features. That discipline is the main reason our project success rate moved from 58% to 90%. Not because we got better at shipping features. Because we stopped making promises we couldn’t keep.

Though I won’t pretend I got it right every time. I had a legacy scheduler component I knew was brittle, and I let it ride for an extra quarter because the team needed the air. It blew up. We recovered. But the lesson was that “we’re too busy to pay this down” is sometimes code for “we’re about to find out why this matters.”

Your projects spanned mobile development, backend architecture, cloud infrastructure, and routing logic. What challenges arose when integrating teams with different areas of expertise, and which communication and coordination methods proved most effective for accelerating progress without sacrificing quality?

The hardest part of cross-disciplinary work isn’t the technology. It’s that a mobile engineer, a backend engineer, and an ML engineer use the same word to mean three different things. “Latency” to a mobile engineer is the time to interact. To a backend engineer, it’s p95 of a service call. To an ML engineer, it’s inference time. Three teams, three definitions, one word, and a month of confusion if you don’t catch it early.

One thing that worked was a single design doc for multiple audiences. Any cross-team initiative had a single source-of-truth doc with a glossary at the top. Boring. Effective. I refuse to start a cross-team project without one now.

Beyond that, protocols over meetings. Instead of adding a weekly sync, we agreed on explicit handoff contracts, spelling out what artifacts each team owed each other and when. The carrier selection platform spanned nine engineering teams. That would have been unmanageable with standing meetings. It worked because the contracts were in writing.

The other principle was delegating outcomes, not tasks. Each team owned its slice end to end. My job was to unblock, not to assign. The moment a manager starts dispatching work across team boundaries, two things happen: velocity drops and ownership evaporates.

On speed versus quality, it’s a false trade. In my experience, the teams shipping the fastest are also the ones with the tightest feedback loops: good tests, good observability, and good incident hygiene. Sloppiness is slow. I’d rather spend two weeks getting the seams right than six months unpicking them.

Many engineers aspire to move into management, but not everyone fully understands how the role changes. Which aspects of leadership were most unfamiliar or challenging for you when transitioning from an individual contributor to leading multiple engineering teams?

The first thing nobody tells you is that you lose your feedback loop. As an IC, you write code, you run tests, and something works or doesn’t within minutes. As a manager of multiple teams, the signal from a decision you made in January might not show up until June. You have to learn to act on incomplete information and be patient with ambiguity, and those are not natural skills for people who got into engineering to eliminate ambiguity.

The second thing. Your calendar stops being yours. I was running four teams, two direct-report managers, 25-plus engineers, and the math of one-on-ones, skip-levels, reviews, and cross-functional commitments fills a week before you’ve done any actual thinking. I had to get disciplined about protecting blocks for the slow work, like reading design docs, talking to operations people in the warehouses, and writing long-form strategy. If I didn’t guard those blocks, nothing of consequence got decided.

That gap between “I make things” and “I make things happen through other people” is also what the current literature on the IC-to-EM transition keeps returning to. LeadDev and Pragmatic Engineer have both written about the identity shift and the delegation gap as the two places where most new managers stall. I stalled there too. Pretending otherwise would be dishonest.

Where did you feel that most acutely?

Oh, that’s another part that was genuinely uncomfortable for me. When a team hit a hard technical wall, my instinct was to dive in. Satisfying, and wrong. The right move is to ask the questions that unblock them without taking the wheel. I got better at this over time. I mentored two ICs into management roles, and watching them solve things I would have solved differently was one of the more useful teaching experiences of my career.

And then there’s the part about people. Retention is a trailing indicator of whether you’re actually doing the job. Our teams held 95% retention through mentorship and career development, and I’m proud of that number, but more than the number, I trust the fact that engineers kept bringing me hard problems instead of hiding them. That’s the signal I watch for.

Looking back at Wayfair and your overall leadership journey, which personal principles and habits helped you make decisions, set priorities, and achieve results in highly uncertain environments?

The one I’d put first is anchoring on the user of the system, not the system. In logistics, that’s the dispatcher, the driver, the supplier. Every architectural choice gets tested against “Does this make their next action easier?” It’s a simple question, and it cuts through an astonishing amount of noise.

Closely behind that, measurable outcomes or it didn’t happen. I can say the platform unification work was valuable, or I can say it increased deliveries per route by 50% and consolidated 14 tools down to 5. The second one survives a change of leadership, a reorg, and a budget review. The first one doesn’t. I push my teams to quantify impact even when it feels premature, because the habit of measuring shapes the habit of thinking.

Then there’s naming the risk early. In uncertain environments, people default to optimism because optimism is less socially expensive than bad news. I learned to be the person who said, “I think this is going to slip” early. That sounds pessimistic. In practice, it’s the opposite. Naming the risk early is what lets you do something about it.

And of course, investing in people beyond the current project. The ML carrier selection work delivered roughly 7% annual logistics savings, but the thing I’m most durably proud of is the managers I promoted from inside those teams. Projects end. Careers compound.

One habit I’d add. Write more than you think you need to. Strategy documents, decision records, and postmortems. Writing forces clarity in a way that talking does not. A lot of the decisions I got right, I got right because I was forced to articulate them on paper first and noticed the holes.

Finally, looking ahead, which engineering and management skills do you see as critical for leaders working with distributed systems and logistics, and how would you recommend developing these capabilities?

To begin with, real AI fluency. Not “I’ve used ChatGPT.” I mean knowing when a large language model actually belongs in a production path and when a boring rules engine is the right answer. At Wayfair, we built an AI-powered Supplier Experience Platform on top of Gemini Pro with vector embeddings to answer supplier policy questions, and it moved the Supplier NPS by 10 points. It worked because the problem was a genuine retrieval-and-summarization problem. The same architecture would have been a disaster for scheduling, where correctness has to be provable. Leaders have to be able to tell those two situations apart, in the room, in real time. You develop that by actually building with these tools, not by reading about them.
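The retrieval half of a "retrieval-and-summarization" setup can be pictured with a toy example: rank stored passages by cosine similarity to the question's embedding and hand the best few to the model. The sketch below uses hand-written three-dimensional vectors in place of real embeddings. It does not call Gemini or any embedding API, and the passages and function names are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, docs, k=2):
    """Return the k passages whose vectors are most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]


# Toy "embeddings": in practice these would come from an embedding model.
docs = [
    ("Returns policy for damaged items", [0.9, 0.1, 0.0]),
    ("Pickup window scheduling rules", [0.1, 0.9, 0.1]),
    ("Invoice dispute process", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], docs, k=1))
```

Similarity ranking of this kind is approximate by design. That is acceptable when a wrong passage costs a slightly worse answer, and it is the reason the same approach is a poor fit for scheduling, where the result has to be provably correct.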

And the market is making the bet for us. McKinsey puts AI-based route optimization savings at 10 to 30% of delivery cost. DHL’s Greenplan has publicly reported roughly 20% cost reductions from continuous route adjustment. UPS’s ORION is cited as saving over 10 million gallons of fuel per year. The global route-optimization software market is projected to grow from roughly $8B in 2025 to nearly $16B by 2030. The point isn’t the specific numbers. It’s that the direction is unambiguous. Leaders who can’t reason about when and where AI earns its keep in this stack will be making decisions downstream of people who can.

Beyond AI fluency, what else do you see as non-negotiable for the next generation of engineering leaders?

Cross-functional fluency is the next one. Modern distributed systems in logistics don’t live inside one team or one language. The carrier selection platform spanned nine engineering teams. The 99.999% SLA work I did on Chronos at AWS needed coordination across multiple org boundaries. You can’t fake your way through that with technical depth alone. You have to be able to translate between engineering, operations, and business in a way each group trusts. The way you develop this is to volunteer for the cross-team initiative nobody else wants. Painful, and the fastest training available.

Sustainability as a first-class design concern. Not a slogan, but a KPI alongside latency and cost. Route optimization will increasingly include emissions per delivery, and the leaders who already think that way will have a head start.

And then there’s comfort with your own toolkit being temporary. What I used five years ago is already out of date. What I’m using now will be. The habit to develop isn’t mastery of any specific stack. It’s the meta-skill of learning a new one quickly without losing your principles.

If I had to compress it to one sentence: get extremely good at separating what’s changing (tools, frameworks, vendors) from what isn’t (users, trade-offs, trust). The leaders who do that well will be fine. The ones who don’t, won’t. It’s as simple as it sounds.


Sachin Reddy
