From theCUBE Studios in Palo Alto and Boston, (enlightening music) bringing you data-driven insights from theCUBE and ETR, this is "Breaking Analysis" with Dave Vellante. In the next three to five years, we believe a set of intelligent data apps will emerge that will require a new type of modern data platform, a sixth data platform to support them. In our research, we've used the metaphor of Uber for all to describe this vision, meaning software-based systems that enable a digital representation of a business, a digital twin, if you will, where various data elements like people, places, and things from the real world are ingested, made coherent, and combined to take action in real time. Now, rather than requiring thousands of engineers to build this system, like Uber did, we've said this capability will come in the form of commercial off-the-shelf software. Hello and welcome to this week's CUBE Insights powered by ETR.
To explore this premise, George Gilbert and I welcome Ryan Blue to this, our 201st episode. Ryan is the co-creator and PMC chair of Apache Iceberg and co-founder and the CEO of Tabular, a universal open table store that connects to any compute layer, built by the creators of Iceberg. George, Ryan, welcome to the show. Good to be here. Yeah, likewise, thanks for having me on. Okay, before we get into it, let's review what we see as today's sort of five most prominent data platforms. So you know, here we list them. Snowflake, they've got the Data Cloud. Databricks's Lakehouse, Delta, and all the other things that we heard about at the recent Databricks Summit: Unity, the Spark execution engine. Google's got BigQuery, you know, making a lot of noise with their AI, Vertex AI in particular, their sort of LLMs that allow you to, you know, customize your own. Microsoft's got Synapse and the recently announced Fabric. Amazon's got its bespoke, you know, set of tools: Redshift, Lake Formation, Glue that sticks 'em together. But interestingly, we've got this emerging open multi-vendor modular set of platforms. Iceberg is an example.
We're using Starburst, Dagster, dbt Labs, whom we've had on before. And the question that we're asking here is, can a sixth data platform emerge from these components? But before we get into it, you know, George, anything you'd add to these, you know, five great modern data platforms? Well, there's a big delta between the data platforms we have today and that vision of the real-time intelligent data app. And the biggest question I have is, are the components, of which Ryan and Tabular with Iceberg are, you know, representative, going to offer plugin alternatives to components of today's data platforms? Or will there be the ability to compose and assemble an alternative data platform? And then what are the trade-offs between that open assembly, if it's possible, and the integrated, which is typically simpler to operate but usually more expensive to, you know, to buy? That's the big question I hope we explore with Ryan. Okay, great, and Ryan, anything you want to say about sort of today's modern data platforms? I mean, you know, we moved on from Hadoop, we saw the separation of compute from storage, you got a lot of VC investment, a lot of, you know, IPOs that were, I guess, successful. I guess we'll see how that all ends up, but certainly a lot of market momentum around those five. So when you guys talk about these platforms, I think that we've already seen the modularity starting to creep in. So if you look at just the last three of these, Google, Microsoft, Amazon, and to a certain extent, Databricks and Snowflake coming on board, we're already seeing that these are not platforms that are a single system, like a single engine, a single storage layer. These are already complex things made of very diverse products. So you're already seeing engines from the Hadoop space in Microsoft Synapse, right? It's Spark, right? Spark is also part of the Databricks platform. A huge imperative for Snowflake these days is how do we get data to people that want to use Spark with it?
And so I think we've already started to see these systems decomposing and, you know, becoming a collection of projects that all work together rather than one big monolithic system. The question is just, you know, with all the VC investment that you were alluding to, how long are we going to wait until all of those components work together really well? And what needs to change? Yeah, you bring up some really interesting points. Let's take a look at some of the spending data to put these in context. Ken, if you could bring up the next chart. And I want to sort of riff on some of the things that Ryan just talked about. So this is data from ETR, our survey partner. And it focuses on the cloud computing data platforms. It's the data warehouse and database sector. So that's why we have the asterisk specifically next to Microsoft. Microsoft is ubiquitous, and there's probably a lot of on-prem SQL Server that creeps its way into the survey, but it's a sizable survey of over 1,100 respondents. The vertical axis is spending momentum or net score. It's a representation of the net percent of customers in the dataset that are spending more on a particular platform. And the horizontal axis, it's listed there as overlap, but it's really a proxy for penetration into the dataset.
And that red dotted line at 40% is a high watermark. So you can see all five of these data platforms are above that high watermark. So they have strong representation in the market, and they have good momentum. The only one that's relatively, you know, small is Databricks, but it's kind of new to this space. We put Oracle on there because certainly Oracle is trying to keep pace with, "Hey, we've got that too." But you can see it doesn't have nearly the momentum, and it gives the other five context. And so that's sort of one data point. If you go onto the next slide, Ken, this is a slide from the ETS, the Emerging Technology Survey, and it basically looks at market sentiment. So it's net sentiment, it's not spending momentum, it's intent to engage with the platform.
These are all private companies, and then it's mind share on the horizontal axis. And you can see we've just picked out a couple of these sort of component players: Fivetran; AtScale, sort of the semantic layer; Starburst, doing some things with data and data mesh; dbt Labs, which is sort of API-fying metrics inside of data warehouses. And you can see that squiggly line is the progress made by Starburst since November of 2020. You would see similar lines with these other players as well. So they're all sort of moving fast, gaining market momentum. They're reasonably well funded. And we see them as sort of components in this whole picture. So maybe starting again with George and then Ryan and see if you have any thoughts on the data that we just shared. Well, my big question is something, Ryan, you brought up before, which is, are we seeing the modularization of previously integrated proprietary data platforms, or are we seeing a bunch of multi-vendor modules emerging? And if so, what are the use cases that are drawing those multi-vendor components into the marketplace? And starting with Iceberg, you know, where is it being drawn in? And then, you know, what use cases and what other modules that are open and multi-vendor do you think might emerge? And then picking up on that, I mean, just something that you sort of alluded to before, Ryan. You were saying, "Hey, we're already seeing that sort of modularity." I mean, look at Snowflake, it's got this single integrated system, and they announced, you know, last year, support for Iceberg. They've improved upon that this year. They announced Unistore, which is still not in general availability, allowing them to bring in transaction data. So to your point, Ryan, that single integrated system, which is often the way that these markets sort of get traction, and then they sort of decompose. But maybe some of your thoughts on this. Yeah, absolutely, if you take both Databricks and Snowflake, they want access to all of the data out there. And so they are very much incentivized to add support for projects like Iceberg. You know, Databricks and Snowflake have recently announced support for Iceberg, and that's just from the monolithic vendor angle, right? I don't think anyone would've expected Snowflake to add, you know, full write support for a project like Iceberg two years ago. That was pretty surprising. And then if you look at the other vendors, they're able to compete in this space because they're taking all of these projects that they don't own, and they're packaging them up as, you know, out-of-the-box data architecture.
One of the critical pieces of this is that we're sharing storage across these projects, and that has, for the data warehouse vendors like Snowflake, the advantage that they can get access to all of the data that's not stored in Snowflake. But I think a more important lens is from the buyer or the user's perspective, where no one wants siloed data anymore. And they want, essentially, what Microsoft is talking about with Fabric. They want to be able to access the same datasets through any different engine or means, whether that's a Python program on some developer's laptop or a data warehouse at the other end of the spectrum, like Snowflake.
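As a minimal sketch of that "Python program on a developer's laptop" end of the spectrum, reading a shared table can look like the following, assuming a PyIceberg-style client and a REST catalog; the URI, credential, and table name are placeholders, not anything referenced on the show.

```python
# A minimal sketch, assuming PyIceberg and an Iceberg REST catalog.
# The URI, credential, and table name below are placeholders.
from pyiceberg.catalog import load_catalog

# Connect to a shared catalog rather than to any single query engine.
catalog = load_catalog(
    "shared",
    **{
        "uri": "https://catalog.example.com",     # hypothetical endpoint
        "credential": "client-id:client-secret",  # hypothetical credential
    },
)

# The same table Spark or a warehouse would query, loaded from plain Python.
table = catalog.load_table("analytics.events")

# Filter and materialize locally as Arrow for ad hoc work on a laptop.
events = table.scan(row_filter="event_date >= '2023-01-01'").to_arrow()
print(events.num_rows)
```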
So it's very important that all of those things can share data. Of course, that's where I'm coming from. So that's what I'm most familiar with. You know, if you go above the layer of those engines, I'm less familiar, but we even see that consolidation with integrations like Fivetran writing directly to Iceberg tables. Right, and it's interesting 'cause Snowflake initially got some traction with this concept of data sharing, which they were early on, basically saying, "Bring everything into Snowflake, and it'll be governed, and you'll be safe, and we'll break down all those silos within our big silo." And they realized, "Oh, well, actually, not all the data's going to end up here." So they had to, you know, kind of begin to open up. And to your point, the end customer ultimately will be the arbiter of how these shifts occur. I wonder, Ken, if we could bring up the key questions on the next slide, and we're going to come back to this slide. I'm going to keep asking you to bring this up over the course of this conversation. So the first one that we wanted to explore with Ryan is, you know, you had Hadoop, which was actually quite atomic, all these piece parts, and it was very service heavy.
You brought in some outside experts, some guys in lab coats who knew how to do this stuff, and it essentially, you know, just never was able to get traction, a function of, you know, the Spark disruption and of course, the cloud too. And then we moved to this sort of highly integrated, you know, Snowflake. And the question we had is: what are the implications for usability, cost, and value? And the reason we ask that is 'cause at the most recent Snowflake Summit, we talked to customers, we talked to some of the ecosystem partners, who said, "You know, we're seeing a lot of our customers that are doing the data pipeline or the data engineering outside of Snowflake, saying it's too expensive." And we've seen that. Now we've talked to Snowflake about it, and they've said, "Well, actually if you do this inside of Snowpark, you get the full value of the TCO, et cetera." And so there's that interesting debate that's going on, but we've clearly seen in the ETR data that some of Snowflake's momentum, in terms of the percentage of customers spending more, has come down.
But at the same time, there is that value proposition. So what are the implications, in your view, around that usability, the cost, and ultimately, the value? I think you believe in an open modular world. How do you see that playing out? I do believe in the modular world. I think that the biggest change in databases in the last, you know, 10 years easily, if not longer, is the ability to share storage underneath databases. And that really came from the Hadoop world because those of us over there, and we weren't wearing lab coats, sort of had as an assumption that many different engines, whether they're streaming, or batch, or ad hoc, all needed to use the same datasets. So that was an assumption when we built Iceberg and other similar formats. And that's what really drove this change to be able to share data. Now, in terms of cost and usability, that is a huge, huge disruption to the business model of basically every established database vendor, all of whom have, you know, lived on the Oracle model.
And it's not really even the Oracle model, it's just that storage and compute have always been so closely tied together that you couldn't separate them. And so by winning workloads, winning datasets, you also were winning future compute, whether or not you were the best option for it. Best option meaning what people knew how to use, best in terms of performance, and cost, and things like that. And so the shift to sharing data means we can essentially move workloads without forklifting data, without needing to worry about how do we keep it in sync across these products, how do we secure it, all of those problems that have been inhibitors to moving data to the correct engine that is the most efficient or the most cost-effective, et cetera. So I think that this shift to open storage, and independent storage in particular, is going to drive a lot of that value and basically, cost efficiency. Dave, let me follow up on that 'cause you said two things in there which were critical. The shared storage, which means you compete, you know, for what's the best compute for that workload, but someone's still got to govern that data. And let me define what I mean by govern. It's got to make sure that multiple people reading and writing don't conflict, and I don't mean just on a single table, 'cause you might be ingesting into multiple tables. Someone's got to apply permissions and their broader governance policies. So it's not just enough to say we've standardized the table format. Something else in there has to make sure that everything's, in the non-technical word, copacetic. What are those services that are required to make things copacetic? Whether it's, you know, transactional support, whether it's, you know, permissions or broader governance. Tell us what you think needs to go along with this open data so that everyone can share it. That's a great question. I think that access controls are one of the biggest blind spots here. So the data lake world, which is essentially what I define as the Hadoop ecosystem after it moved to cloud and started using S3 and other object stores, this is a massive gap for the data lakes and for shared storage in general. You know, spoiler or disclaimer, this is what my company Tabular does. We provide this independent data platform that actually secures data no matter what you're using to access it, whether that is that Python process or, you know, Starburst Galaxy, for example. So I think that that is a really critical piece. What we've done is actually taken access controls, which, if you have a tied-together or glued-together storage and query layer, go in the query layer because the query layer understands the query and can say, "Oh, you're trying to use this column that you don't have access to." In the more modern, you know, sixth platform that you guys are talking about, that has to move down into the storage layer in order to really universally enforce those access controls without syncing between Starburst, and Snowflake, and Microsoft, and Databricks, and whatever else you want to use at the query layer. Right, so if you bring up the slide again, the question slide, it sounds like you buy the premise that a sixth data platform will emerge. You started getting into sort of the components. Obviously, Iceberg is one of the components that are going to enable this vision. And the role that Iceberg plays, it sounds like you're sort of aligned with that.
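To make the storage-layer enforcement idea Ryan just described concrete, here is a purely hypothetical sketch of column-level access control living beside the table metadata, where every engine asks the same policy object before reading; all names are invented for illustration, and none of this is Tabular's actual API.

```python
# Hypothetical sketch: column-level access control enforced at the
# catalog/storage layer, so Spark, Trino, and a laptop script all hit
# the same policy. Every name here is invented for illustration.
from dataclasses import dataclass, field

@dataclass
class TablePolicy:
    # role -> columns that role may read, checked before any engine sees data
    readable_columns: dict[str, set[str]] = field(default_factory=dict)

    def authorize(self, role: str, requested: set[str]) -> set[str]:
        allowed = self.readable_columns.get(role, set())
        denied = requested - allowed
        if denied:
            raise PermissionError(f"role {role!r} may not read {sorted(denied)}")
        return requested

policy = TablePolicy(readable_columns={
    "analyst": {"order_id", "amount", "region"},
    "support": {"order_id"},
})

policy.authorize("analyst", {"order_id", "amount"})  # permitted
# policy.authorize("support", {"amount"})            # raises PermissionError

# Because the check lives with the storage, there is no per-engine policy
# to sync between Starburst, Snowflake, Microsoft, and Databricks.
```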
There may be some fuzziness as to how that all plays out, but this is something that George and I definitely want to explore. Yeah, I think that the sixth data platform is as good a name as you can come up with, right? I don't think that we know what it's going to be called quite yet. I don't know that I would consider it distinct because I think what's going to happen is all five of those players that you were talking about, plus Oracle, plus IBM, and others are going to standardize on certain components within this architecture. Certainly shared storage is going to be one. And I believe that Iceberg is probably that shared storage format because it seems to have the broadest adoption. You know, if Databricks and Snowflake can agree on something, then that's probably the de facto standard.
But, you know, I think that that is Iceberg. We're still going to see what we can all agree on as we build more and more of those standards. In the Iceberg community, we're working on shared things like views, standardized encryption, you know, ways of interacting with a catalog no matter who owns or runs that catalog. And I think those components are going to be really critical in the sixth data platform because it's going to be an amalgam of all of those players sharing data and, you know, being part of the exact same architecture. And you know, George, you and I sort of debate, or at least discuss, quite frequently, like I always say, what about Oracle? And you're like, "Yeah, okay." And to Ryan's point, the existing five are clearly evolving. You've seen Oracle. I mean, somebody announces something, then Larry will announce it and act like Oracle invented it. And so they've maintained, at least, their spending on R&D, maintaining relevance, which is a good thing. But I think, George, the thing that we point to is the real-time nature of that Uber for all, which we feel like many of these platforms are not in a position to capture. And that's maybe a little more forward thinking or years out. But the idea that, and we've talked about things like the expressiveness of graph and knowledge databases with the query flexibility of SQL, or some of the things that, like, Fauna is doing by being able to join across multiple documents in real time. Those are some of the things that we're thinking about and why we feel that some of the five are going to be challenged. But George, I'd love for you to pick up on that and maybe follow up. You're talking, Dave, now closer to that real world of, like, intelligent data apps that we were describing with, like, Uber for all, where, you know, you've got digital twins fed in real time. And let me sort of map that back down to Ryan and where we are today, which is, do you see a path, Ryan, towards adding enough transactional consistency or a transactional model that you can take in real-time data that might be ingested across multiple tables, and you understand windows, but at the same time, you can feed a real-time query that's looking at historical context, you know, and the real-time window.
And then I guess what I'm asking is sort of how far up this stack you're going? And you know, the other part is, again, related to sort of who's keeping track of, you know, data integrity and essentially, access integrity, you know? Where are the boundaries? Are there boundaries? Is it going to be all in the storage manager? You know, help enlighten us as to, essentially, the integrity, you know, the real-time nature, and then how you start mapping that to higher-level analytics and digital representations, digital twins. It's a big question, but help us unwrap it. Yeah, so I think you brought up a couple of different areas, and I'll address those somewhat separately. So first of all, in terms of transactions, that's one of the things that the open data formats, and I'm including just Delta and Iceberg in that, solve. What they do is essentially have a transaction protocol that allows you to safely modify data without knowing about any of the other people trying to modify or read data. They're the two formats that do that and are open source. So that I think is a solved problem. Now, the issue that you then have is they do that by writing everything to an object store and cutting a new version, which is inherently a batch process.
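A toy, in-memory sketch of that kind of optimistic, snapshot-based commit protocol follows; it assumes nothing about Iceberg's actual internals beyond the description above. Writers stage immutable files, then atomically advance a version pointer, retrying if another writer got there first.

```python
# Toy sketch of an optimistic, snapshot-based commit protocol in the spirit
# of open table formats. Real implementations commit against a catalog and
# object store; this in-memory version only illustrates the shape.
import threading

class CommitConflict(Exception):
    pass

class Table:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap
        self.version = 0
        self.files: list[str] = []     # immutable data files in the current snapshot

    def try_commit(self, base_version: int, new_files: list[str]) -> int:
        with self._lock:
            if base_version != self.version:  # someone committed before us
                raise CommitConflict(f"expected v{base_version}, found v{self.version}")
            self.files = self.files + new_files  # snapshots only add immutable files
            self.version += 1
            return self.version

def append(table: Table, new_files: list[str]) -> int:
    # Optimistic retry loop: read the current version, attempt the swap,
    # and retry from the new state on conflict. Writers never coordinate
    # beyond the single atomic pointer.
    while True:
        base = table.version
        try:
            return table.try_commit(base, new_files)
        except CommitConflict:
            continue

t = Table()
print(append(t, ["s3://bucket/data-0001.parquet"]))  # -> 1
```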
And that's where you start having this mismatch between modern streaming, which is a micro-batch operation, and efficiency, because you need to, you know, at each point in time, commit to the table. Every single commit incurs more work. So in order to get towards real time, you're simply doing more work, and you're also adding more complexity to that process. So essentially, you're seeing cost rise, at least linearly if not exponentially, as you get closer and closer to real time. I think that, you know, just the basic economics of that makes it infeasible in the long term to really make real time something that you're going to use 100% of the time.
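Some back-of-the-envelope arithmetic illustrates that pressure; the per-commit overhead here is a made-up constant, and real costs depend on table layout, manifest churn, and compaction.

```python
# Illustrative arithmetic only: commit frequency as you push a table-format
# sink toward real time. The per-commit overhead is a made-up constant.
SECONDS_PER_DAY = 86_400
PER_COMMIT_OVERHEAD_S = 2.0  # hypothetical: metadata writes, manifest churn

for latency_s in (3600, 60, 1):  # hourly, minutely, per-second micro-batches
    commits = SECONDS_PER_DAY // latency_s
    overhead_h = commits * PER_COMMIT_OVERHEAD_S / 3600
    print(f"latency {latency_s:>5}s -> {commits:>6} commits/day, "
          f"~{overhead_h:.1f}h/day of pure commit overhead")

# latency  3600s ->     24 commits/day, ~0.0h/day of pure commit overhead
# latency    60s ->   1440 commits/day, ~0.8h/day of pure commit overhead
# latency     1s ->  86400 commits/day, ~48.0h/day of pure commit overhead
```

At per-second commits, the hypothetical overhead alone exceeds the hours in a day, which is the infeasibility Ryan is pointing at.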
And this is the age-old trade-off between latency and throughput. If you want lower latency, you have lower throughput. If you're willing to take higher latency, you have higher throughput and thus better efficiency. So I think that where we need that streaming and the sort of constantly fed data applications, those are certainly going to get easier to build, but I think that behind those is probably where you're going to store data and make it durable, in this sixth data platform. And it'll be interesting to see the interplay between those real-time sort of streaming applications and how we, you know, sort of merge that data with data from the lake or warehouse. Okay. And George, you know, when we had Uber on, they were sort of describing how they sort of attack that trade-off. And now that's a unique application, riders, drivers, ETAs, destinations, you know, prices, et cetera, but you could see sort of in, you know, Industry 4.0 applications and IoT, you know, some potential there, but George, your premise has always been it's going to apply to sort of mainstream businesses. Ken, if you bring that slide back up, the question slide, George and I love to go right to dessert. I think we've hit on all of these, Ryan, you know, what other building blocks beyond Iceberg and Tabular are going to enable choice and modularity in this world? What are the implications for today's modern data platforms? And we're talking about how applications will evolve, where we can support this physical-world model.
I think I'm hearing from Ryan, George, that, you know, that's kind of somewhat limited, at least today, in terms of our visibility and scope. The activity is going to be really along these analytic applications, but you're seeing a lot of analytic and transaction activity coming together. So George, pick it up from there. Let me follow up on that, Dave, which is, if I parse, Ryan, what you were saying, you only get so far trying to get to, you know, asymptotic efficiency in terms of real-time, say, ingest. What I think I'm inferring from what you're saying is if you want a real-time system of truth, you're going to use an operational database. If you want a historical system of truth that's being, you know, continuously hydrated, that might be a separate analytic system. Am I understanding that right? Is that how you see the platform evolving, where you're going to have a separate operational database and a separate sort of historical system of truth, to borrow the term from Bob Muglia? So yes, I think I would subscribe to that view, at least in the short term. The way that we are tackling the challenge of tearing down the data silos between these data platforms, you know, the major players that you had up on the earlier slide, is to essentially trade off some of the machinery that would go to support those real-time, like, you know, sub-second latency applications. So I don't think that we're going to approach merging those two worlds anytime soon. However, I'm a big believer in usability, and it's not that we need to merge those two worlds, it's that it needs to appear like we are, right?
We can have separate systems that make a single use case work across that boundary. Netflix did this, I think, classically with logs coming from our applications. We had telemetry and logs coming from runtime applications that are powering the Netflix product, you know, worldwide. We need access to that information with, you know, millisecond latencies. Iceberg was not, you know, a runtime system that provided that. We kept it all in memory because, you know, that was what made the most sense. But for historical applications, absolutely, we did use Iceberg. And to a user, that trade-off was seamless. So you would go and query logs from your running application, and you'd get, you know, logs that are fresh to the millisecond, but you could also go back in time two years. You know, things like that, those are the applications that are coming. How did you provide that seamless simplicity to the developer so that they didn't know they were going against two different databases? Through building a data app, which was a ton of work, that understood both of those data storage domains. So the app, or the backend at least, was responsible for receiving all of the logs in, you know, almost real time, and for storing them, and managing the handoff between in-memory and Iceberg. And George, I would, again, use the Uber example. Correct me if I'm wrong, but they're essentially using Google Spanner, that globally distributed, you know, strictly consistent database, but then they've got, you know... I think they used the example that, "Hey, we don't update, you know, the pricing in real time." And of course, we can go back and look at historical data as well, which is in a separate data store. I forget what they were using, BigQuery or maybe it was some Postgres hack, I can't remember. But George, is that not a similar analogy? It's a perfect analogy. But a follow-up question for Ryan, which is I just want to understand the separation that has to go down into storage of, like, permissions and maybe technical metadata that, I guess, is stored as part of Iceberg. You know, like what the tables are, what columns are there, what tables are connected, whether they are in this one repository, or, I don't even know if it's called a database. And then does that interoperate with a higher-level open set of operational catalogs, like a Unity or, today, what's inside Snowflake? In other words, is there a core set of governing metadata that's associated with the storage layer? And then everything else, all the operational metadata above that, is interoperable with these open, you know, open modular compute engines. So yeah, in terms of Iceberg, we do make a distinction between technical metadata required to read and write the format properly and higher-level metadata. And that higher-level metadata, we think of as business operations info. Even things like access controls, and RBAC policy, and all of that, that's a higher level that we don't include in even the Iceberg open-source project. The technical metadata is quite enough to manage for an open-source project. We need to build higher-level structures on top of that. The reason I ask, Ryan, this is critical, 'cause you identified how the business model of all the data management vendors changes when data is open, and they don't get to attach all workloads because they manage that data. Now they have to compete for each workload on price and performance.
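A hypothetical sketch of the handoff pattern Ryan described in the Netflix example: one query path stitches a millisecond-fresh in-memory buffer onto historical data in a durable table, so callers never see the seam. All names here are illustrative.

```python
# Hypothetical sketch of a "seamless" log backend: recent events live in
# memory, history lives in a durable (e.g., Iceberg) table, and one query
# function hides the boundary. All names are illustrative.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(minutes=10)  # how far back the in-memory buffer reaches

class HistoryStub:
    """Stand-in for a durable table scanned through some engine."""
    def __init__(self, rows: list[dict]):
        self.rows = rows

    def scan_between(self, start: datetime, end: datetime) -> list[dict]:
        return [r for r in self.rows if start <= r["ts"] < end]

class LogStore:
    def __init__(self, history: HistoryStub):
        self.buffer: list[dict] = []  # millisecond-fresh events
        self.history = history       # years of durable history

    def query(self, since: datetime) -> list[dict]:
        cutoff = datetime.now(timezone.utc) - RETENTION
        results: list[dict] = []
        if since < cutoff:
            # Older than the buffer: read the durable table for that range.
            results.extend(self.history.scan_between(since, cutoff))
        # Stitch on the fresh tail from memory, avoiding overlap with history.
        results.extend(e for e in self.buffer if e["ts"] >= max(since, cutoff))
        return sorted(results, key=lambda e: e["ts"])
```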
So my question is, and this goes back to the question about who keeps the data copacetic: the more metadata you have that, essentially, governs all that data, it sounds like that's the new gateway. Is that a fair characterization or is there no new gateway? I think that it is absolutely critical to have a plan around that metadata. We don't really know how far we're going to go down that road. So I think that today, there's a very good argument that access controls and certain governance operations need to move into a catalog and storage layer that is uniform across all of those different methods of access and compute providers. What else goes into that, I think, is something that we're going to see. I think that that is by far the biggest blocker that I see across customers that I talk to. You know, everyone already has Databricks, and Snowflake, and possibly some AWS or Microsoft services in their architecture, and they're wondering, "How do I stop copying data?
How do I stop syncing policy between these systems?" We know that we need to solve that problem today, but the higher-level stuff is less settled, 'cause usually, you know... like we work with a financial regulator that has their own method of tracking policy and translates that into the RBAC access controls underneath. You know, how you manage that policy may be organization-specific. It might be something that evolves over time. I think, you know, we're just at the start of this market, where people are starting to think about data holistically across their entire enterprise, including the transactional systems: how that data flows into the analytic systems, how we secure it, and how we have the same set of users, and roles, and potentially access controls that follow that data around. This is a really big problem. And I wish I had a crystal ball, but I just know that the next step is to centralize, and have fewer permission systems, and, you know, just having one that maybe covers 95% of your analytic data is going to be a major step forward. And you're definitely seeing, as you said earlier, Ryan, signs of this, where you see, you know, at least most of the top five, if not all, are beginning to, you know, adopt these sort of modular, open stances. You see things, like, sort of brute force, but look at MySQL HeatWave, where Oracle has brought transactional and analytic together with, you know, monster memories. Ken, bring back the questions, if you would. I mean, I think, George, we hit on all of these. I don't know if you had any final questions for Ryan before we went into... I want you to lay out, George, your sort of vision of what this modularity looks like, and then give Ryan the last word here. Well, I guess my follow-up question for Ryan would be, now that we're prying open some of these existing data platforms, they had modularity, but generally, they were working with their own modules. And you're now providing open storage that can work across everyone so that we have a data-centric world instead of a data-platform-centric world. And my question would be, and you really helped crystallize it, where everyone's now got to compete on workloads: what are the workloads that are first moving to a modular world, where people are saying, "Let me choose a more appropriate price or performance point for data that I can maintain in an open format"? I think right now, companies have a different problem. It's not like they're looking at this and saying, "Ooh, I want to move this workload." Or maybe they are, but it depends on where they're coming from. If you already have Spark and you need a much better data warehouse option, then you might be adding Snowflake. You might also be coming from Snowflake and going towards ML in Spark. You know, those sorts of things; I can't really summarize that. I would say that the biggest thing that we see in large organizations is that you have these pockets of people where, you know, this group really likes Spark, this group is perfectly happy running on Redshift, someone else needed the performance of Snowflake, and the, you know, CIO level is looking at this as, "What is our data story?
How do we fix this problem of needing to sync data into five different places?" I was talking to someone at our company that used to work in adtech just this morning, and he said that they had, you know, five or six different copies of the exact same dataset, where different people would go to the same vendor, buy the dataset, massage it slightly differently to meet their own internal model, and then use it. And it's those sorts of problems where it's like, "Let's just store all of this data once and make it universally accessible." Then we don't have to worry about copies. We don't have to worry about silos. We don't have to worry about, you know, copying governance policy. And is it leaking because, you know, Spark has no access controls while I've locked everything down in, say, Snowflake or Starburst?
You know, it's just a mess. And so the first thing that people are doing is trying to get a handle on it and say, "We know we need to consolidate this. We know we need to share data across all of these things." And thankfully, we can these days. You know, 10 years ago, the choice was we can share data but we have to use Hadoop, and it's unsafe, and unreliable, and very hard to use. Or we can continue having both Redshift, and Snowflake, and Netezza in our architecture, you know. So we're just now moving to where it's possible, and we're discovering a lot along the way. Hey, guys, first of all, I have to congratulate you: 39 minutes in, and we haven't said AI, so well done. I wonder, Ken, if you could bring up the last slide. George, I'd like you to unpack this. Explain what we're looking at here, what your vision is, and I want to get Ryan to comment, and then we'll wrap. Okay, and I would say, you know, this is not really my vision. I think this is my attempt to illustrate what we see unfolding in the market that, like, Ryan is helping us articulate, which is that we start on the left, and we had a DBMS or data-platform-centric view of the world, where it might be, you know, Redshift or Snowflake, where the state of the art of technology required us to integrate all the pieces to provide the simplicity and performance that customers needed. But as technology matures, we can start to modularize things. And the first thing we're modularizing actually is storage. Now, it means we can open up and offer a standardized interface to tables, whether it's Iceberg, or Delta tables, or Hudi.
And what Ryan is helping us articulate is that permissions have to go along with that, and there's some amount of transaction support in there. And then we're sort of taking apart the components that were in an integrated DBMS. Now, it doesn't mean you're going to necessarily get all the components from different vendors, but let's just go through them. There's, like, a control plane that orchestrates the work. Today, we know these as dbt or Dagster, Airflow or Prefect. There's the query optimizer, which is the hardest thing to do, where you can just say what you want, and it figures out how to get it. And that is also, you know, part of Snowflake, of BigQuery, of Azure Synapse and Fabric; Databricks built their own Databricks SQL, which was separate from the Spark execution engine.
The execution engine is an SDK, a sort of non-SQL way of getting at the data. That was the original Spark engine. Snowflake has an API now, but I think it goes through their query optimizer. And there's the metadata layer, which is, you know, beyond just the technical metadata. And this is what we were talking about with Ryan, which is, like, how do you essentially describe your data estate and all the information you need to know about how it hangs together? And that's like AtScale with their metrics semantic layer. There's Atlan, and Alation, and Collibra, which are sort of user and administrator catalogs. But the point of all this is to say that we're beginning to unbundle what was once in the DBMS, just the way Snowflake unbundled what was once Oracle, which had compute and storage together.
They separated compute and storage, and now, we're separating compute into multiple things, to Ryan's point, so that we can use potentially different tools for different workloads, and that they all work on one shared data estate, that the data is not embedded in and trapped inside one platform or engine. The question is: how are we getting there? Do we have the components right? What is that world going to look like when we get there? It has big implications for the products customers buy and for the business model of the vendors selling them. So Ryan, I mean, I feel like George's picture, thank you George for sharing that, you know, maps pretty well to our conversation, but anything you'd add or any final thoughts that you want to bring forth? So just on the high-level modularity versus simplicity question, I think we absolutely need both, right? The modularity is clearly being demanded. The simplicity always follows afterwards. You know, databases power applications, so there's always been a gap in code, and who controls what, and these things. We're just adding layers, you know? We've done pretty good at knowing the boundaries between a database and an application on top of it and making that a smooth process. I think that we're doing that again, you know, separating that storage and compute layer. But we absolutely need both. And this is where some of the newer things that we've been doing in the Apache Iceberg community come into play: standardizing how we talk with catalogs, making it possible to actually secure that layer, and say, "Hey, this is how you pass identity across that boundary."
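As a minimal sketch of that identity handoff, along the lines of the OAuth 2.0 flow Ryan describes next: a query engine exchanges client credentials for a token, then presents it on catalog requests. The endpoints, scope, and secret are hypothetical stand-ins, though the listing path mirrors the style of the Iceberg REST catalog spec.

```python
# Minimal sketch of an OAuth 2.0 client-credentials handshake between a
# query engine and a catalog/storage layer. Endpoints, scope, and secret
# are hypothetical; real services define their own.
import requests

CATALOG = "https://catalog.example.com"  # hypothetical catalog endpoint

# 1. The engine trades its client credentials for a bearer token.
resp = requests.post(
    f"{CATALOG}/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "query-engine",
        "client_secret": "example-secret",  # issued when an admin grants access
        "scope": "catalog:read catalog:write",
    },
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# 2. Every catalog request then carries the token, so the storage layer
#    can enforce access controls for this engine's identity.
tables = requests.get(
    f"{CATALOG}/v1/namespaces/analytics/tables",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
).json()
print(tables)
```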
We're also, you know, moving database services, like maintenance, from the query layer into the storage layer. That's another thing that's moving. That modularity needs to be followed by the simplicity: things that use OAuth, you know. We're pioneering a way to use OAuth 2.0 to connect the query engine to our storage layer so that you can just click through and have an administrator say, "Yes, I want to give access to this data warehouse to Starburst," or something like that. That ease of use, I think, is really the only thing that is going to make modularity happen because, I mean, again, the big failure of the Hadoop ecosystem was around that simplicity and that usability. We've only been able to see the benefits of that by layering on and maturing those products to add the simplicity. And so that's absolutely a part of where we're going. Excellent, well, Ryan, I really want to thank you for coming on the program and sharing your perspectives. And wish you all the best with Tabular. Thank you, it was a lot of fun. I appreciate you guys having me on. Yeah, you bet, and of course, thanks to my colleague George Gilbert, and Alex Myerson, who's on production and manages the podcast, and Ken Schifman, who's solo today. Good job, Ken, with the slides. Kristen Martin and Cheryl Knight helped get the word out on social media and in our newsletters. And Rob Hof is our EiC over at SiliconANGLE. Thank you for all the good editing. And remember, all these episodes are available as podcasts. All you got to do is just search "Breaking Analysis Podcast." We publish each week on wikibon.com and siliconangle.com, and you can email me directly, david.vellante@siliconangle.com, or DM me @dvellante if you want to get in touch. Comment on our LinkedIn posts and please check out etr.ai, get some great survey data. They're accelerating their survey plans, and so definitely check that out. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching. And we'll see you next time on "Breaking Analysis." (enlightening music)