SiliconANGLE theCUBE
Clip #7: Uday describes how Uber managed to establish system-wide coherency across its data elements
Clip Duration 04:32 / June 17, 2023
Breaking Analysis: Uber’s architecture represents the future of data apps…meet its architects
Video Duration: 51:44

From the Cube Studios in Palo Alto and Boston, bringing you data-driven insights from the Cube and ETR, this is Breaking Analysis with Dave Vellante. Uber has one of the most amazing business models ever created. The company's mission is underpinned by technology that helps people go anywhere and get anything. The results have been stunning, and in just over a decade, Uber has become a firm with more than $30 billion in annual sales and a market capitalization of nearly $90 billion as of today. Moreover, the company's productivity metrics, when you measure things like revenue per employee, are three to five times greater than what you'd expect to find at a typical technology company. In our view, Uber's technology stack represents the future of enterprise data apps, where organizations will essentially create real-time digital twins of their businesses, and in doing so, deliver enormous customer value. Hello and welcome to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, Cube analyst George Gilbert and I will introduce you to one of the architects behind Uber's groundbreaking fulfillment platform.

We're going to explore their objectives, the challenges they had to overcome, and how Uber has done it. And we believe the company is a harbinger for the future of technology. Now, the technical team behind Uber's fulfillment platform went on a two-year journey to create what we see as the future of data apps. And it's our distinct pleasure to welcome to the program Uday Kiran Medisetty, who's a distinguished engineer at Uber. He's led, bootstrapped, and scaled major real-time platform initiatives in his time at Uber and has agreed to share how the team actually accomplished this impressive feat of software and networking engineering. Uday, welcome to the program. It's great to see you. Hello, George. All right, Uday, start if you would, by telling us a little bit about yourself and your role at Uber. Yeah, hi George. Hi Dave. Super nice to be here. I joined Uber back in 2015 when we were primarily doing on-demand UberX and we were primarily in North America. And over the last eight years, I have witnessed Uber's tremendous growth. You know, how we have expanded from on-demand mobility to all kinds of personal mobility, how we have expanded from just mobility to all kinds of delivery. And the mission that you just said, go anywhere and get anything, that is the total addressable market that we see in the world. And that's what drives us here. And that's what has kept me here with the same energy even after eight years. So I work on the core mobility business and a bunch of foundational business platforms that are leveraged across mobility and delivery. I also lead an Uber-wide senior engineering community where we set best practices so that we can move at the same pace across all of the engineering teams at Uber. So yeah, that's my quick intro. Yeah, I remember the first time I ever used the Uber app, I was stuck in the hinterlands outside of Milan. Couldn't get a cab. And I said, "I'm going to try this Uber thing." And this was like the early part of last decade, and it was like a ChatGPT moment. Now, back in March, and just last week as well, George and I introduced to the audience this idea of Uber as the future of enterprise data apps. And we put forth the premise that the future of digital business is going to manifest itself as a digital twin that represents people, places, and things. And that increasingly, business logic is going to be embedded into data versus the way it works today, and applications are going to be built from this set of coherent data elements. So when we go back and look at the progression of enterprise apps throughout history, we think it's useful to share where we think we are on this journey.

So George put together this graphic to describe the history in simple terms, starting with 1.0, which was departments and back-office automation, and then in the orange is the sort of ERP movement, where a company like Ford, for example, could integrate all its financials and supply chain and all its internal resources into a coherent set of data and activities that really drove productivity in the nineties. And then Web 2.0 for the enterprise. So here we're talking about using data and machine intelligence in a custom platform to manage an internal value chain using modern techniques. And we use here the example of amazon.com, not AWS, but the retail side of the operation. And then in the blue, we show enterprise ecosystem apps, which is where we place Uber today, really one of the first, if not the first, to build a custom platform to manage an external ecosystem.

Different, of course, from the gaming industry that we show there on the right-hand side. And our fundamental premise is that what Uber has built, and we're going to get into this, because Uber is on its own journey, even within that blue ellipse, but our premise is that eventually mainstream companies are going to want to use AI to orchestrate an Uber-like ecosystem experience using packaged, off-the-shelf software and services. And so you see most organizations, they don't have a team of Udays. They can't afford it, they can't attract the talent. So we think this is where the industry is headed and Uber is a harbinger example. And George, you have a burning question for Uday. So go ahead. So Uday, it's a big-picture question, but it has to do with helping people understand not just the consumer experience of the app, but the architecture of an application that is trying to orchestrate an ecosystem, and how different that is from where we've been, which are these packaged apps that managed repeatable processes that were, you know, pretty much the same across different businesses, with maybe room for customization. It's so radical, and we are so accustomed to living in it out here in tech bubble land, but tell us, you know, help us understand, big picture, what a big transformation that is from the application's point of view. Yeah, so one of the fascinating things about building any platform for Uber is how we need to interconnect what's happening in the real world and build large-scale, real-time applications that can orchestrate all of this at scale. You know, like there is a real person waiting in the real world to get a response from our application on whether they can continue with the next step or not. If you think about our scale, with, you know, the last FIFA World Cup, we had 1.6 million concurrent consumers interacting with our platform at that point in time. This includes riders, eaters, merchants, drivers, couriers, and all of these different entities are trying to do things in the real world, and our applications have to be real time, they need to be consistent, they need to be performant, and on top of all of this, we need to be cost effective at scale.

Because if we are not performant, if we're not leveraging the right set of resources, then we can explode our overall cost of managing the infrastructure. So these are some unique challenges in building an Uber-like application. And we can go into more details on various aspects, both at breadth and also in depth. Right, so Uday, I mean, this vision that you sort of laid out, it requires an incredible amount of data to be available, as you said, in real time or near real time. Uday's team wrote a couple of key blogs that we'll put into the show notes, but I mean, I've probably got seven hours into them and I'm still going back and trying to squint through them. So we really appreciate you sort of up-leveling it here and helping our audience understand it. But what was it about the earlier 2014 architecture, which you described in one of your blogs, that limited the realization of your mission at scale and catalyzed this architectural rewrite? And we're particularly interested in the trade-off that you had to make, which you talked about in your blog, to optimize for availability over consistency. Why was that problematic, and let's talk about how you solved that. Yeah, you know, if you think about back in 2014 and what were the most production-ready databases available at that point, we could not have used traditional SQL-like systems at that point in time because of the scale that we had even then. And the only option we had, which provided us some sort of scalable, real-time database, was NoSQL kind of systems. So we were leveraging Cassandra, and the entire application that drives the state of the online orders, the state of the driver sessions, all of the jobs, all of the waypoints, all of that was stored in Cassandra. And over the last eight years we have seen, you know, the kind of fulfillment use cases that we need to build.

That has changed a lot. So whatever assumptions we had made in our core data models and what kind of entities we can interact with, that has completely changed. So we had to, if nothing else, change our application just for that reason. The second thing: because the entire application was designed with availability as the main requirement, and latency and consistency were more of a best-effort mechanism, whenever things went wrong, it made it really hard to debug. For example, we don't want a scenario where you request a ride and two drivers show up at your pickup point, because the system could not reconcile whether this trip was already assigned to a particular driver or it wasn't assigned to anyone. And those were real problems that would happen if we didn't have a consistent system.

And so there were three main areas of problems at the infrastructure layer at that point. One is the consistency that I mentioned already, and because we didn't have any atomicity, we had to make sure the system automatically reconciles and patches the data when things go out of sync, based on what we expect the data to be. There were a lot of scalability issues. Because we were settling for best-effort consistency, we were using some sort of hash ring at the application layer. And what we would do is, oh, let's get all of the updates for a given user routed to the same instance and have a queue in that instance, so that even if the database is not providing consistency, we have a queue of updates.
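
Uday doesn't show the old routing code, but a minimal sketch of the application-layer pattern he describes, hashing a user ID onto a ring of instances and funneling that user's updates through a single FIFO queue per instance, might look something like this in Go (the instance names, update type, and hash choice are illustrative assumptions, not Uber's code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// update is an illustrative per-user state change (e.g. a trip status write).
type update struct {
	UserID string
	Body   string
}

// ring is a minimal hash ring mapping user IDs to application instances.
type ring struct {
	hashes    []uint32
	instances map[uint32]string
}

func newRing(names []string) *ring {
	r := &ring{instances: make(map[uint32]string)}
	for _, inst := range names {
		h := hashKey(inst)
		r.hashes = append(r.hashes, h)
		r.instances[h] = inst
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// route returns the instance that owns a given user's updates.
func (r *ring) route(userID string) string {
	h := hashKey(userID)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.instances[r.hashes[i]]
}

func main() {
	names := []string{"instance-a", "instance-b", "instance-c"}
	r := newRing(names)

	// One FIFO channel per instance: all updates for a user land on the same
	// queue, so they are applied one at a time even without a consistent store.
	queues := map[string]chan update{}
	var wg sync.WaitGroup
	for _, inst := range names {
		q := make(chan update, 16)
		queues[inst] = q
		wg.Add(1)
		go func(name string, q chan update) {
			defer wg.Done()
			for u := range q {
				fmt.Printf("%s applied %s for user %s\n", name, u.Body, u.UserID)
			}
		}(inst, q)
	}

	for _, u := range []update{
		{"rider-42", "request_trip"},
		{"rider-42", "cancel_trip"},
		{"driver-7", "go_online"},
	} {
		queues[r.route(u.UserID)] <- u
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```

As Uday notes next, this only serializes updates per key; it does nothing for transactions that span multiple entities.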

So we make sure there's only one update at any point in time. That works when you have updates only on two entities, where at least you can do application-level orchestration to ensure, you know, they might eventually get in sync, but it doesn't scale beyond that. And because you are using a hash ring, we could not scale our cluster beyond the vertical limit. And that also limited our scale, especially for the large cities that we want to handle; we couldn't go beyond a certain scale. So these were the key infrastructure problems that we had to fundamentally fix so that we can set ourselves up for the next decade or two. Yeah, makes sense. So when the last update wins, it may not be the most accurate update, so. All right. And then George, when you and I were talking about this, you said, Dave, you know, it might not just be scale, it was sort of Uber thinking about the future, but elaborate on that, George. So Uday, what I wanted to know was, you guys had to think about a platform more broadly than just drivers and riders, 'cause you had new verticals, new businesses that you wanted to support. And you know, while the application layer manages things, the database generally manages strings, but the new capabilities in the database allowed you, as you were describing, to think of consistency differently, and latency. But can you talk about also how you generalized the platform to support new businesses? Yeah, so that's a great question. You know, one of the things we had to make sure was, as the kinds of entities change within our system, as we have to build new fulfillment flows, we need to build a modular and leverageable system at the application level. At the end of the day, we want the engineers building core applications and core fulfillment flows abstracted away from all of the underlying complexities around infrastructure, scale, provisioning, latency, consistency. Like, they should get all of this for free, and they don't need to think about it. When they build something, they get the right experience out of the box. So what we had to do was, at our programming layer, we had a modular architecture where every entity, let's say there is an order, there is an order representation; there's a merchant, there's a user or an organization representation, and we can store these objects as individual tables, and we can store the relationships between them in another table.
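
Uber's actual schema isn't public, but the shape Uday describes, one table per object type holding the key columns that need indexing plus the rest of the entity as a serialized blob, and a separate relationship table committed in the same transaction, could be sketched roughly like this (every type, field, and the in-memory store are assumptions for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Object is one row in a per-entity table: a few key columns that need to be
// indexed, plus the rest of the entity serialized as an opaque blob, so adding
// new attributes doesn't require a schema change.
type Object struct {
	Type string // "order", "job", "merchant", ...
	ID   string
	Blob []byte
}

// Relationship is a row in a single relationships table linking two objects.
type Relationship struct {
	FromType, FromID string
	ToType, ToID     string
	Kind             string // e.g. "order_has_job"
}

// Store stands in for the transactional key-value layer Uday describes.
type Store struct {
	mu            sync.Mutex
	objects       map[string]Object // key: type/id
	relationships []Relationship
}

func NewStore() *Store { return &Store{objects: map[string]Object{}} }

// Commit writes a set of objects and the relationships between them as one
// atomic unit, mimicking "transactionally committed" in the real system.
func (s *Store) Commit(objs []Object, rels []Relationship) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, o := range objs {
		s.objects[o.Type+"/"+o.ID] = o
	}
	s.relationships = append(s.relationships, rels...)
}

func main() {
	store := NewStore()

	orderBlob, _ := json.Marshal(map[string]any{"item": "pad thai", "dropoff": "Market St"})
	jobBlob, _ := json.Marshal(map[string]any{"kind": "courier_delivery"})

	store.Commit(
		[]Object{
			{Type: "order", ID: "o-1", Blob: orderBlob},
			{Type: "job", ID: "j-1", Blob: jobBlob},
		},
		[]Relationship{{FromType: "order", FromID: "o-1", ToType: "job", ToID: "j-1", Kind: "order_has_job"}},
	)
	fmt.Printf("stored %d objects and %d relationships\n", len(store.objects), len(store.relationships))
}
```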

So whenever new objects get into the system and whenever we need to introduce new relationships, they are stored transactionally within our system. We use the core database, which you can think of as a transactional key-value store. At the database layer, we still only store the key columns that we need, and the rest of the data is stored as a serialized blob, so that we don't have to continuously update the database schema anytime we add new attributes for a merchant or for a user; we don't want to take on that operational overhead. But at a high level, every object is a table, every relationship is a row in another table, and whenever new objects or relationships get introduced, they are transactionally committed. Got it. Dave, I just want to add that what's interesting is he just described an implementation of a semantic layer in the database. Right, right. We've been talking about this for months, George, and the importance of it. And I want to come back to that. Let's help the audience understand at a high level, Uday, the critical aspects and principles of the new architecture. What we're showing here is a chart from Google engineering in one of the blogs. And we want to understand how your approach, again, differs from your previous architecture. And you touched on some of that. So the way we understand this is the green is the application layer, which is sort of intermixed. The left-hand side shows that. And on the right-hand side you've separated the application services at the top from the data management below, and that's where Spanner comes in. So how should we understand this new architecture in terms of how it's different than the previous architecture? Yeah, so in the previous architecture, we went through some of the details, right? Like, the core data was stored in Cassandra, and because we want to have low-latency reads, we had a Redis cache as a backup, whenever Cassandra fails or whenever we want some low-latency reads, and we went through Ringpop, which is the application-layer shard management, so that requests get routed to the instance we need. And there was one pattern I didn't mention, which was the Saga pattern, which came from a paper a few decades ago. Ultimately there was a point in time where the kind of transactions that we had to build evolved beyond just two objects. Like, imagine a case where we want to have a concept of a batch offer, which means a single driver should accept multiple trips at the same time, or not.

Now you don't have a one-to-one association; you have a single driver who has maybe two trips, four trips, five trips, and you have some other object that is establishing this association. Now, if we need to create a transaction across all of these objects, we tried using Saga as a pattern, extending our application-layer transaction coordination. But again, it became even more complex, because if things go wrong, we have to also write compensating actions so that the system is always in a state where it can proceed. We don't want users to get stuck and then not get new trips. So in the new architecture, the key foundations we mentioned: one was around strong consistency and linear scalability. So the NewSQL kind of databases provide that.
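
For readers who haven't run into it, the Saga pattern Uday says they outgrew coordinates a multi-entity change as a sequence of local updates, each paired with a compensating action that undoes it if a later step fails. A stripped-down sketch of the idea (the step names are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a local update with the compensating action that undoes it.
type step struct {
	name       string
	apply      func() error
	compensate func()
}

// runSaga applies steps in order; on failure it runs compensations in reverse
// so no entity is left half-updated (e.g. a driver stuck holding a dead offer).
func runSaga(steps []step) error {
	var done []step
	for _, s := range steps {
		if err := s.apply(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				done[i].compensate()
			}
			return fmt.Errorf("saga aborted at %q: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil
}

func main() {
	err := runSaga([]step{
		{
			name:       "reserve driver",
			apply:      func() error { fmt.Println("driver reserved"); return nil },
			compensate: func() { fmt.Println("driver released") },
		},
		{
			name:       "attach trip 1",
			apply:      func() error { fmt.Println("trip 1 attached"); return nil },
			compensate: func() { fmt.Println("trip 1 detached") },
		},
		{
			name:       "attach trip 2",
			apply:      func() error { return errors.New("trip 2 already taken") },
			compensate: func() {},
		},
	})
	fmt.Println("result:", err)
}
```

The compensation code is exactly what Uday calls out as the maintenance burden: every new flow needs its own carefully written undo logic.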

And we went through an exhaustive evaluation in 2018 across the multiple choices we had. And at that point in time we picked Spanner as the option. And so we moved all of the transaction coordination and scalability concerns to the database layer, and at the application layer, we focus on building the right programming model for building new fulfillment flows. And the core transactional data is stored in Spanner. We limit the number of RPCs that go from our on-prem data centers to Google Cloud because it's a latency-sensitive operation, right? And we don't want to have a lot of chatter between these two worlds. And we have an on-prem cache which will still provide you point-in-time snapshot reads across multiple entities so that they're consistent with each other.

So for most use cases, they can read from the cache, and Spanner is only used if I want strong reads for a particular object. And if I want cached reads across multiple objects, I go to my cache. If I want to search across multiple objects, then we have our own search system, which is indexed on the specific properties that we need, so that if I want to get all of the nearby orders that are currently not assigned to anyone, we can do that low-latency search at scale. And obviously we also emit Kafka events within the Uber stack, so then we can build all sorts of near real-time or OLAP applications, and these events also flow into raw tables, from which you can build more derived tables using Spark jobs. But all of those things are happening within Uber's infrastructure, and we use Spanner for strong reads and the core transactions that we want to commit across all of the entities, and for establishing those relationships that I mentioned. All right, so George, coming back to the sort of premise, this is how you've taken, Uday, these business entities, the drivers, riders, routes, ETAs, orders, and you've reconciled the trade-offs between latency, availability, and consistency. Would it be fair to say, Uday, that because you did such a good job matching between the things in the application and the things in the database, you were able to inherit the transactional strength of the database at both layers, at the database level and to simplify that coordination at the application level, and that you also did something that people talk about but don't do much, which is a deep hybrid architecture, where you had part of the application on-prem and part, you know, using a Google service that you couldn't get elsewhere, on Google Cloud? Yeah, absolutely, absolutely. And then I think one more interesting fact is how, for most engineers, they don't even need to understand that behind the scenes it's being powered by Spanner or any particular database. The guarantee that we provide to the application developers who are building, you know, fulfillment flows is: they have a set of entities, and they say, hey, for this user action, these are the entities that need to be transactionally consistent and these are the updates I want to make to them. And then behind the scenes, our application layer leverages Spanner's transaction buffering, makes updates to each and every entity, and then once all the updates are made, we commit, so all the updates are reflected in storage and the next strong read will see the latest update. So the database decision obviously was very important. We are curious, what was it about Spanner that led you to that choice? It's a globally consistent database. What about it made it easier for all the application's data elements to share their status? You said you did a detailed evaluation. How did you land on Spanner? Yeah, for any kind of choice there are a lot of dimensions that we evaluate, but one is we wanted to build using a NewSQL database, because we want to have the mix of, you know, the ACID guarantees that SQL systems provide and the horizontal scalability that NoSQL kind of systems provide. And for building large-scale applications using NewSQL databases, at least around the time we started, we didn't have that many examples to choose from. Even within Uber we were kind of the first application managing live orders using a NewSQL-based system. But the specific properties that, you know, in some sense we need are external consistency, right?

Like I kind of mentioned, Spanner provides the strictest concurrency control guarantee for transactions, so that when transactions are committed in a certain order, any read after that sees the latest data. That is very important because, you know, imagine we assigned a particular job to a specific driver or courier, and then the next moment, if we see that, oh, this driver is not assigned to anyone, we might make a wrong business decision and assign them one more trip, and that will lead to wrong outcomes. And then horizontal scalability, because Spanner automatically shards and then it'll rebalance the shards. And so then we have this horizontal scalability; in fact, we have our own autoscaler that listens to our load and Spanner signals and constantly adds new nodes and removes nodes, because the traffic pattern Uber has changes based on the time of day, the hour of the day, and also the day of the week.
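
Uber's autoscaler and the Spanner signals it watches aren't public, so purely as an illustration, a toy control loop that sizes a node pool against a "curvy" daily load might look like this (the target throughput per node and the traffic numbers are made up):

```go
package main

import "fmt"

// desiredNodes is one tick of a toy autoscaler: size the pool so each node
// stays near a target throughput, within fixed bounds.
func desiredNodes(qps float64) int {
	const (
		targetQPSPerNode = 1000.0
		minNodes         = 3
		maxNodes         = 100
	)
	n := int(qps/targetQPSPerNode) + 1
	if n < minNodes {
		return minNodes
	}
	if n > maxNodes {
		return maxNodes
	}
	return n
}

func main() {
	// A "curvy" day: quiet overnight, a morning commute bump, an evening peak.
	hourlyQPS := []float64{
		800, 600, 500, 700, 1500, 4200, 9000, 12000,
		9500, 7000, 6500, 7200, 8000, 7800, 7500, 8200,
		9800, 13500, 15000, 14000, 11000, 7000, 4000, 2000,
	}
	for hour, qps := range hourlyQPS {
		fmt.Printf("hour %02d: load %6.0f qps -> %3d nodes\n", hour, qps, desiredNodes(qps))
	}
}
```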

It's very curvy. So then we can make sure we have the right number of nodes provisioned to handle the scale at that point in time. I kind of mentioned the server-side transaction buffering; that was very important for us so that we can have a modular application, so that each entity that I'm representing can commit its update independently, and a layer above is coordinating across all of these entities. And once all of these entities have updated their part, then we can commit the overall transaction. So the transaction buffering on the server side helped us at the application layer to make it modular. Then all the things around stale reads, point-in-time reads, bounded staleness reads, these help us build the right caching layer, so that for most reads, our cache hit rate is probably in the high 60s, 70 percent.
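
The read split Uday describes could be sketched as a cache that answers from a local snapshot when it is within a staleness bound and otherwise falls back to the authoritative store (Spanner, in Uber's case). Everything below, the interface, the bound, and the fake backing store, is an assumption for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// StrongReader is whatever answers authoritative reads (Spanner, in Uber's case).
type StrongReader interface {
	Read(key string) (string, error)
}

type cached struct {
	value  string
	readAt time.Time
}

// StaleTolerantCache serves reads from a local snapshot if it is within the
// staleness bound, and only goes to the strong store on a miss or when the
// caller explicitly needs the latest committed value.
type StaleTolerantCache struct {
	mu       sync.Mutex
	source   StrongReader
	maxStale time.Duration
	entries  map[string]cached
}

func NewStaleTolerantCache(src StrongReader, maxStale time.Duration) *StaleTolerantCache {
	return &StaleTolerantCache{source: src, maxStale: maxStale, entries: map[string]cached{}}
}

func (c *StaleTolerantCache) Read(key string, strong bool) (string, error) {
	c.mu.Lock()
	e, ok := c.entries[key]
	c.mu.Unlock()
	if ok && !strong && time.Since(e.readAt) <= c.maxStale {
		return e.value, nil // bounded-staleness hit: no round trip to the database
	}
	v, err := c.source.Read(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[key] = cached{value: v, readAt: time.Now()}
	c.mu.Unlock()
	return v, nil
}

// fakeStore stands in for the remote database in this sketch.
type fakeStore struct{}

func (fakeStore) Read(key string) (string, error) { return "state-of-" + key, nil }

func main() {
	cache := NewStaleTolerantCache(fakeStore{}, 2*time.Second)
	v1, _ := cache.Read("order/o-1", false) // miss: strong read, then cached
	v2, _ := cache.Read("order/o-1", false) // bounded-staleness hit
	v3, _ := cache.Read("order/o-1", true)  // caller wants the latest commit
	fmt.Println(v1, v2, v3)
}
```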

So for most reads, we can go to our on-prem cache, and only when there's a cache miss, or for strong reads, do we go to the strong read system. So these were the key things we wanted from NewSQL, and then Spanner was the one because of time to market, because it's already productionized and we can leverage that solution. But all of these interactions are behind an ORM layer with the guarantees that we need. So this will help us, you know, over time figure out whether we need to evaluate other options or not. But right now, for most developers, they don't need to understand what is powering it behind the scenes. Yeah, and the outcome for your customers is pretty remarkable. I mean, George and I, Uday, were really interested, George was sort of alluding to this before, in the aspects of the system that enable this coherency across all these data elements that the system has to manage. In other words, your ability to get agreement on the meaning of a driver, a rider, a price, et cetera. And how you design and achieve that layer to enable that coherence, that is tech that you had to develop, correct? Yeah, absolutely. You know, I think there are many objects, and we need to really think about which attributes of what a user sees in the app need to be coherent and which can be kind of stale, where you don't necessarily notice, because not everything needs to have the same guarantees, the same latency, and so on, right? So if you think about some of the attributes that we manage, we talked about the concept of orders: if a consumer places any intent, that is an order within our system, and a single intent might require us to decompose that intent into multiple sub-objects. For example, if you place an Uber Eats order, there is one job for the restaurant to prepare the food and there is one job object for the courier to pick up and then drop off.

And within the courier job object, we have many waypoints, like the pickup waypoint and the drop-off waypoint, and each waypoint can have its own set of tasks that you need to perform. For example, it could be taking a signature, taking a photo, paying at the store, all sorts of tasks, right? And all of these are composable and leverageable, so I can build new things using the same set of objects. And in any kind of marketplace, we have supply and demand, and we need to ensure there are the right kinds of dispatching and matching paradigms. In some cases, you know, we offer one job to one supply. In some cases it could be image to end, in some cases it is blast to many supplies. In some cases, they might see some other surface where these are all of the nearby jobs that you can potentially handle.
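
Backing up a step, the decomposition Uday walks through, an intent becoming an order that fans out into jobs, each with waypoints and per-waypoint tasks, maps naturally onto nested value types. A hypothetical Eats order might decompose like this (types and field names are illustrative, not Uber's):

```go
package main

import "fmt"

// Task is one action a courier or merchant performs at a waypoint.
type Task struct{ Kind string } // "take_photo", "collect_signature", "pay_at_store", ...

// Waypoint is a stop with its own ordered set of tasks.
type Waypoint struct {
	Kind  string // "pickup", "dropoff"
	Tasks []Task
}

// Job is one unit of work offered to one party (merchant, courier, driver).
type Job struct {
	Assignee  string
	Waypoints []Waypoint
}

// Order is the consumer's intent, decomposed into jobs.
type Order struct {
	ID   string
	Jobs []Job
}

func main() {
	order := Order{
		ID: "eats-o-1",
		Jobs: []Job{
			{Assignee: "merchant", Waypoints: []Waypoint{
				{Kind: "prepare", Tasks: []Task{{Kind: "prepare_food"}}},
			}},
			{Assignee: "courier", Waypoints: []Waypoint{
				{Kind: "pickup", Tasks: []Task{{Kind: "pay_at_store"}, {Kind: "take_photo"}}},
				{Kind: "dropoff", Tasks: []Task{{Kind: "collect_signature"}}},
			}},
		},
	}
	for _, j := range order.Jobs {
		fmt.Printf("job for %s: %d waypoint(s)\n", j.Assignee, len(j.Waypoints))
	}
}
```

Because new products reuse the same building blocks, a grocery run or a reservation is just a different composition of jobs, waypoints, and tasks.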

So this is another set of objects which is super real time, because when a driver sees an offer card in the app, it goes away in 30 seconds, and in 30, 40 seconds they need to make a decision, and based on that we have to figure out the next step, because, you know, within Uber's application we have changed users' expectations of how quickly we can perform things. If we are off by a few seconds, people will start canceling. Then, Uber is hyper-local, so we have a lot of attributes around latitude, longitude, the route line, the driver's current location, our ETAs. These are probably some of the hardest to get right, because we constantly ingest the current driver location every four seconds, so we have a lot of latitude/longitude data; the throughput of this system itself is like hundreds of thousands of updates per second.

But not every update will require us to change the ETA, right? Like, your ETA is not changing every four seconds. Your routing is not changing every four seconds. So we do some magic behind the scenes to make sure that, okay, have you crossed city boundaries, only then we might require you to update something. Have you crossed some product boundaries, only then we require you to do some things. So we do those inferences to limit the number of updates that we are making to the core transactional system, and then we only store the data that we need. And then there's a complete parallel system that manages the whole pipeline of, you know, how we receive the driver side of the equation and generate navigation and things for drivers, and then how we convert these updates and show them on the rider app. That stream is completely decoupled from the core orders and jobs.
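
The "magic behind the scenes" is essentially an admission filter in front of the transactional store: ingest every ping, but only write through when something that matters has changed, such as a boundary crossing or a meaningful move. A crude sketch of that idea, with invented thresholds and a rough distance approximation:

```go
package main

import (
	"fmt"
	"math"
)

type location struct{ Lat, Lon float64 }

// roughMeters is a small-distance approximation, good enough for a filter.
func roughMeters(a, b location) float64 {
	const metersPerDegree = 111_000
	dLat := (a.Lat - b.Lat) * metersPerDegree
	dLon := (a.Lon - b.Lon) * metersPerDegree * math.Cos(a.Lat*math.Pi/180)
	return math.Hypot(dLat, dLon)
}

// updateFilter decides whether a ping should touch the core transactional
// system or stay in the decoupled real-time location stream.
type updateFilter struct {
	lastWritten location
	lastCity    string
}

func (f *updateFilter) shouldWrite(loc location, city string) bool {
	const minMove = 250.0 // meters; invented threshold
	crossedBoundary := city != f.lastCity
	movedEnough := roughMeters(loc, f.lastWritten) >= minMove
	if crossedBoundary || movedEnough {
		f.lastWritten, f.lastCity = loc, city
		return true
	}
	return false
}

func main() {
	f := &updateFilter{lastWritten: location{37.7749, -122.4194}, lastCity: "san_francisco"}
	pings := []struct {
		loc  location
		city string
	}{
		{location{37.7750, -122.4195}, "san_francisco"}, // ~15 m: dropped
		{location{37.7790, -122.4194}, "san_francisco"}, // ~450 m: written through
		{location{37.8044, -122.2712}, "oakland"},       // crossed a boundary: written through
	}
	for _, p := range pings {
		fmt.Printf("%v in %s -> write=%v\n", p.loc, p.city, f.shouldWrite(p.loc, p.city))
	}
}
```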

And, you know, if you think about Uber's system, it's not just about building the business platform layer; we have a lot of our own sync infrastructure at the edge API layer, because we need to make sure all of the application's data is kept in sync. The apps are going through choppy network conditions, they might be unreliable, and we need to make sure that they get the updates as quickly as possible, with low latency, irrespective of what kind of network condition they are in. So there's a lot of engineering challenges at that layer as well. Ultimately, all of this is working together to provide you the visibility that, hey, you know, I can see exactly what's going on, because if you're waiting for your driver and they don't move, you might cancel, assuming that, hey, they might not show up. And we need to make sure that those updates flow through, not just through our system, but also from our system back to the rider app, as quickly as possible. So hopefully... George, you had a question? Yeah, I mean, this is something new. We're on new territory, at least as far as what we've explored before, Dave. What I'm taking away is that you're not just managing this layer at the application, where you've got Uber's entities, or things, but you're also translating that down to the database, and the database has, you know, transactional semantics, making it sort of easier to manage and orchestrate those things. But what you're describing is something where the data's sort of liveness is an attribute that makes managing it separate from just mapping it down to the database. You manage how it gets updated and how it gets communicated separately, based on properties that are specific to each data element.

And by data element, I mean property, not, like, a driver, you know, or courier. And that is interesting, 'cause Dave, just as a comment, Walmart talked about prioritizing data, you know, for communications from stores and the edge. And that may lead into a follow-on question. Sorry for the long preamble, but the question I have, Uday, is: what happens when you are orchestrating an ecosystem with 10 or a hundred times as many things as you have now, and more data on all those things than you have now? Have you thought about what a world looks like where the centralized database may not be the central foundation? See, I think that's where the trade-offs come in. We need to be really careful about not putting so much data in the core system that manages these entities and these relationships that we overwhelm it; otherwise we'll end up hitting scale bottlenecks. You know, for example, the fare item that you see both on the rider app and on the driver app, that item is made up of hundreds of line items with different business rules, different geos, different localities, different tax items. We don't store all of that in the core object. But one attribute of a fare that we can leverage is that a fare only changes if the core properties of the rider's object, the rider's requirements, change.

So every time you change your drop-off, we regenerate the fare. So I have one fare UID. Every time we regenerate, we create a new version of that fare and store these two IDs along with my core order object, so that I can store in a completely different system my fare UID, fare version, and all of the data, with all of the line items and all of the context that we used to generate those line items. Because what we need to save transactionally is the fare UID and the fare version. When we save the order, we don't need to save all of the fare attributes along with it. So these are some design choices that we make to ensure that, you know, we limit the amount of data that we store for these entities. In some cases we might store the data, in some cases we might version the data and then store the version along with it.
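
One way to read the fare example is as a pointer-plus-version pattern: the order transactionally stores only a small fare reference, while the hundreds of line items behind it live in a separate store keyed by that reference. A minimal sketch of the idea (the types and the stand-in store are invented):

```go
package main

import "fmt"

// FareRef is all the core order object stores about a fare: a stable ID plus
// a version that bumps whenever the rider's requirements change.
type FareRef struct {
	FareID  string
	Version int
}

// LineItem is one of the many components of a fare (base, fees, taxes, ...).
type LineItem struct {
	Name   string
	Amount float64
}

// fareStore is a stand-in for the separate system that holds full fare detail.
type fareStore struct {
	detail map[FareRef][]LineItem
}

func (s *fareStore) put(ref FareRef, items []LineItem) { s.detail[ref] = items }
func (s *fareStore) get(ref FareRef) []LineItem        { return s.detail[ref] }

func main() {
	fares := &fareStore{detail: map[FareRef][]LineItem{}}

	// First quote for the trip.
	ref := FareRef{FareID: "fare-abc", Version: 1}
	fares.put(ref, []LineItem{{"base", 7.50}, {"booking_fee", 2.75}, {"city_tax", 0.60}})

	// Rider changes the drop-off: regenerate the fare and bump the version.
	// Only the small FareRef needs to be saved transactionally with the order.
	ref.Version = 2
	fares.put(ref, []LineItem{{"base", 11.20}, {"booking_fee", 2.75}, {"city_tax", 0.90}})

	order := struct {
		OrderID string
		Fare    FareRef
	}{"o-1", ref}

	total := 0.0
	for _, li := range fares.get(order.Fare) {
		total += li.Amount
	}
	fmt.Printf("order %s references %s v%d, total %.2f\n", order.OrderID, order.Fare.FareID, order.Fare.Version, total)
}
```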

In some cases, if it is okay to tolerate some staleness in that data and it doesn't need to be coherent with the core orders and jobs, it can be saved in a completely different online storage system. And then we have the presentation layer, where we generate the UI screen; there, we can enrich this data and then generate the screen that we need. So all of this makes sure that we limit the growth of the core transactional system, and then we leverage other systems that are more suited to the specific needs of those data attributes. But still, all of them tie into the order object, and there's an association that we maintain. So this is really important, and we're going to actually revisit this as a guide to the future, but I just want to take a pause and reset here, and hopefully the audience understands that what Uber has built is different, of course, than conventional apps. We tried to sort of put this together in a slide to describe these 3.0 apps. Alex, if you bring up the next one. So starting at the bottom, you have the platform resources, and then the data layer to provide that single version of the truth, and then the application services that govern and orchestrate the digital representations of the real-world entities, drivers, riders, packages, et cetera, and that all supports what the customer sees in the Uber app. So the big difference from the cloud stack that we all know and love is, you know, Uber's not selling us compute or storage.

We don't even see that. Rather, Uber's offering up things: access to drivers and merchants and services. And so Uday, where are the lines between sort of your thinking on commercial off-the-shelf software that you were able to use versus the IP that Uber had to develop itself to achieve these objectives? Can you describe sort of that thinking and what went into that build versus buy? Yeah, you know, in general, we rely on a lot of open source technologies, commercial off-the-shelf software, and in some cases, in-house developed solutions. Ultimately it depends on, you know, the specific use case, time to market, maybe you want to optimize for cost, optimize for maintainability. All of these factors come into the picture. For the core orders and the core fulfillment system, we talked about Spanner and how we leverage that with some specific guarantees. We use Spanner even for our identity use cases, where we want to manage, you know, especially in large organizations, your business rules, your AD groups, and how we capture that for our consumers; that has to be in sync. But there are a lot of other services across microservices at Uber that leverage Cassandra.

That is, if their use case is high write throughput. And we leverage Redis for all kinds of caching needs. We leverage ZooKeeper for low-level infrastructure and platform storage needs. And we also have a system that is built on top of MySQL with a Raft-based algorithm, called Docstore. So for the majority of use cases, that is our go-to solution; it provides you shard-local transactions and it's a multi-model database. So it's useful for most kinds of use cases and it's optimized for cost, because we manage the stateful layer and we deploy it on our own nodes. So for most applications, that will give us the balance of cost and efficiency; for applications that have the strongest requirements, like fulfillment or identity, we use Spanner; and for high write throughput, we use Cassandra.

And beyond this, you know, when we think about our metrics system, M3DB, it's open source software, open sourced by Uber and contributed to the community a few years ago; it's a time series database. We ingest millions of metric data points per second, and we had to build something on our own. And now it's an active community, and there are a bunch of other companies leveraging M3DB for metric storage. So ultimately, you know, in some cases we might have built something and open sourced it, in some cases we leverage off-the-shelf, and in some cases we use something completely open source and contribute some new features. For example, for the data lake, Uber pioneered Apache Hudi back in 2016 and contributed it. So then we have one of the largest transactional data lakes, with maybe 200-plus petabytes of data that we manage. Got it. Okay. This next snippet that we're going to share comes from an ETR round table; ETR is our data partner, and they do these private round tables. We'll pull it up and I'll read the quote from a pretty famous technical guru who's going to remain unnamed, only 'cause I'm not sure I have permission to name this individual, but he says, "everybody in the world is thinking about real time data, and whether it's Kafka specifically or something that looks like Kafka, real time stream processing is fundamental. When people talk about data-driven businesses, they very quickly come to the realization that they need real time, because that's where there's more value. Architectures built for batch don't do real time well." The person mentioned Cockroach, said it's super exciting. "I feel weird endorsing a small startup," he said, "but Google Spanner is amazing and Cockroach is the closest thing that you could actually buy off the shelf and run yourself rather than be married to a managed service from a single cloud vendor." So Uday, you know, a couple of questions here.

I'm curious as to how you changed the engine in mid-flight, going from the previous architecture, you know, pre-2014 to post, and George mentions what happens when real time overwhelms the centralized database's ability to manage all this data in real time, and it sounds like you architected at least quite a runway to avoid that. But two questions there: how do you change the engine in mid-flight, and when do you see it running out of gas? Yeah, on the first question, you know, one of the things I think is, designing a new greenfield system is one thing, but moving from whatever you have to the greenfield system is 10X harder. And the hardest engineering challenges that we had to solve were about how we go from A to B without impacting any user. We don't have the luxury of taking downtime, where, "hey, you know, we're going to shut off Uber for an hour and then let's do this migration behind the scenes." And then, the previous system was using Cassandra with some in-memory queue, and the new system is strongly consistent. The core database guarantees are different, the application APIs are different, so what we had to build was a proxy layer where, for any user request, we have backward compatibility, so that we shadow what is going to the old system and the new system.

But then, because the properties of what transaction gets committed in the old and new systems are also different, it's extremely hard to even shadow and get the right metrics for us to gain confidence. But ultimately, that is the shadowing part. And then what we did was we tagged a particular driver and a particular order that gets created, whether it's created in the old system or the new system, and then we kind of gradually migrate all of the drivers and orders from old to new. So at a point in time you might see that the marketplace is kind of split, where half of those orders and earners are in the old system and half of them are in the new, and then once all of the orders are moved, we switch over the state of the remaining earners from old to new.
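
The tagging scheme Uday describes amounts to a sticky routing decision per entity: new orders and drivers are assigned to the old or new stack at creation time, and an entity only flips once it has no in-flight work. A simplified sketch of that routing logic (the stack names, percentages, and safe-point rule are illustrative):

```go
package main

import "fmt"

type stack string

const (
	oldStack stack = "cassandra-stack"
	newStack stack = "spanner-stack"
)

// router pins each entity to a stack and only moves it at a safe point.
type router struct {
	assignment map[string]stack // entityID -> stack
	rolloutPct int              // share of brand-new entities sent to the new stack
	inFlight   map[string]bool  // entityID -> has an active trip or order
}

// stackFor returns the sticky assignment, tagging unseen entities on first use.
func (r *router) stackFor(entityID string, hashBucket int) stack {
	if s, ok := r.assignment[entityID]; ok {
		return s
	}
	s := oldStack
	if hashBucket%100 < r.rolloutPct {
		s = newStack
	}
	r.assignment[entityID] = s
	return s
}

// tryMigrate flips an entity to the new stack only when nothing is in flight,
// so a trip never has its state transferred midway.
func (r *router) tryMigrate(entityID string) bool {
	if r.inFlight[entityID] {
		return false
	}
	r.assignment[entityID] = newStack
	return true
}

func main() {
	r := &router{assignment: map[string]stack{}, rolloutPct: 25, inFlight: map[string]bool{"driver-7": true}}

	fmt.Println("order o-1 ->", r.stackFor("o-1", 12))                // lands in the new stack (12 < 25)
	fmt.Println("order o-2 ->", r.stackFor("o-2", 60))                // stays on the old stack
	fmt.Println("driver-7 migrate now?", r.tryMigrate("driver-7"))    // false: trip in flight
	r.inFlight["driver-7"] = false
	fmt.Println("driver-7 migrate later?", r.tryMigrate("driver-7"))  // true: safe point reached
}
```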

So one, we had to solve a lot of unique challenges around shadowing, and two, we had to use a lot of unique tricks to make sure that we give the perception that there is no downtime, and then move that state without losing any context, without losing any jobs in flight, and so on. And if there is a driver who's currently completing a trip in the old stack, we let that complete, and the moment they're done with that trip, we switch them to the new stack, so that their state is not transferred midway through a trip. And so once you create new trips and new earners in the new system, and switch earners over after they complete a trip, we have a safe point to migrate. You know, this is similar to 10 years ago, when I was at VMware and we used to work on how you do vMotion, virtual machine migration, from one host to another host. So this was kind of that kind of challenge.

What is the point at which you can split, at which you can move the state without having any application impact? So those are the kinds of tricks that we had to do. And the second question, on how we make sure we don't run out of gas? You know, we kind of went through it already. One, obviously we are doing our own scale testing, our own projections, to make sure that we are constantly ahead of our growth and make sure the system can scale. And then we are also very diligent about looking at the properties of the data, choosing the right technology so that we limit the amount of data that we store in that system, and then using specific kinds of systems that are catered to those use cases.

For example, all of our matching system: if it wants to query all of the nearby jobs and nearby supplies, we don't go to the transactional system to query that. We have our own inbuilt search platform, where we are doing real-time ingestion of all of this data using CDC, and so then we have all kinds of anchors so that we can do real-time, on-the-fly generation of all of the jobs, because the more context you have, the better marketplace optimization we can make, and that can give you the kind of efficiency at scale, right? Otherwise, we'll make imperfect decisions, which will hurt the overall marketplace efficiency. Yeah, and in your blog post, you had said that you had to build this architecture to support your business for the next decade. So I'm inferring you don't see, at least in the near term, all these data elements and all this real-time data overwhelming the system, because of the way you've architected it. Is that a fair assertion? Yeah, absolutely. I think we're confident, at least, you know, for the foreseeable future. What we have is a stable foundation, and, you know, since then you can see the kinds of new use cases that we are building, right? Like Uber Reserve, where now you can reserve 30 days in advance. Now we have entered into grocery, where a courier is going and shopping for you. Recently you might've seen announcements on Party City, on PetSmart. So we want to make sure that we can go anywhere and get anything. We can unbundle every use case that you need a car for and then provide affordable, scalable transportation solutions, so that we can handle all of your mobility needs on demand, at scale, at your fingertips. And then we can capture every single merchant in the world, and capture in our system every single catalog, every single item, and manage relationships across all of them.

We have millions and millions of catalog items around the world, so that you can go and get anything that you need, whether it is food, whether it's alcohol, whether it is some party item, whether it's some pet food, whether it's convenience, whether it's pharmacy, everything is handled. So at least right now, I'm confident that we can scale to those needs, and we have the system that can scale to those needs. Right. You know, the last question is, George and I have been sort of looking to the future, using Uber as an example of the future. So what do you see coming, or what do you hope to see, if you think about just the broader industry, with respect to commercial tools over the next, say, three to five years, that might make it dramatically easier for a mainstream company that doesn't necessarily have Uber's technical bench and depth to build this type of application? In particular, how might other companies that need to manage hundreds of thousands of digital twins design their applications using more off-the-shelf technology? Do you expect that will be possible in, let's call it, the midterm future? Yeah, you know, I think the whole landscape around developer tools and applications is a rapidly evolving space. You know, what is possible now was not possible five years ago. And it's constantly changing. But what we see is, you know, we need to provide value at the upper layers of the stack, right? And then if there is some solution that can provide something off the shelf, we move to that, so then we can focus up the stack. It's not just about taking off-the-shelf IaaS or PaaS solutions. Just take the sheer complexity of representing configuration, representing the geodiversity around the world, and then building something that can work for any use case in any country, adhering to those specific local rules; that is what I see as the core strength of Uber.

Like, we can manage any kind of payment disbursements or payments in the world. We have support for, like, any payment method in the world for earners, and we are disbursing billions of payouts to whatever bank account and whatever payment method they need their money in. We have a risk system that can handle nuanced use cases around risk and fraud. Our system around fulfillment is managing this; our system around maps is managing all of the ground truth, tolls, surcharges, navigation, all of that, so we have probably one of the largest global map stacks, where we manage our own navigation while leveraging some data from external providers. So this is the core IP and core business strength of Uber, and that is what is allowing us to do many verticals. But again, the systems that I can use to build this, over time, absolutely, I see, you know, it getting easier for many companies to leverage them. Maybe 15 years ago we didn't have Spanner, so it was much harder to build this. Now with Spanner, or with similar NewSQL off-the-shelf databases, that solves one part of the challenge, but now we need to think about the other layers of the challenge. Yeah, I'm so excited that Uday was able to come on, because George, you and I have been talking about this as the future, and I think Uday just solidified it. I think, George, we set a new record for Breaking Analysis in terms of time, but George, what are your takeaways? Any last words that you would add before we break? I think the takeaways are, I think this is one of those applications that people will look back on many years from now and say, you know, that really was the foundation for a new way of doing business, not just of building software, but of doing business. Like, Amazon was the first one to manage their own internal processes, you know, where they're orchestrating the people, places, and things with an internal platform, but you guys did it for an external ecosystem and, you know, made it accessible to consumers, you know, in real time. And I think the biggest question I have, and it's not really one that you can answer, but it's one that we will have to see the industry answer, is to what extent the industry will make technology that makes it possible for mainstream companies to start building their own Uber platforms to manage their own ecosystems. That's my takeaway and my question. Yeah, so, okay, we're going to leave it there. Uday, thanks so much. I really appreciate your time and your insights, and we'd love to have you back. Yeah, absolutely. Anytime. Bring me up, I'll be there anytime. Excellent. >> Thanks, Uday. Thank you so much. It was a pleasure talking to both of you today and being on Breaking Analysis. Fantastic. On behalf of George Gilbert, I want to thank Uday and his team for these amazing insights on the past, present, and future of data-driven apps. I also want to thank Alex Myerson, who's on production and manages the podcast, Ken Schiffman as well, Kristen Martin and Cheryl Knight, who help get the word out on social media and in our newsletters, and Rob Hof, who is our editor-in-chief over at siliconangle.com. Thank you so much, everybody. Remember, all these episodes are available as podcasts. All you've got to do is search "Breaking Analysis podcast," pop in the headphones, and go for a long walk on this one. I publish each week on wikibon.com and siliconangle.com.
Or you can email me directly at david.vellante@siliconangle.com, DM me @dvellante, or comment on our LinkedIn posts, and check out etr.ai. They've got great survey data on enterprise tech. This is Dave Vellante for the Cube Insights, powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis. (upbeat synth music)