SiliconANGLE theCUBE
246 | Breaking Analysis | Is the Modern Data Stack Out Over its Skis?
September 7, 2024 | Video Duration: 30:53

From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante.

Dave Vellante: Data from the community suggests that the acceleration in compute performance and the sophistication of the modern data stack are outpacing the needs of many traditional analytic workloads. Most analytic workloads today are performed on small data sets and generally can run on a single node, neutralizing the value of distributed, highly scalable data platforms. As such, we believe today's modern stack, which started out serving dashboards, must evolve into an intelligent application platform that harmonizes the data estate and supports a multi-agent application system. Hello and welcome to this week's theCUBE Research Insights powered by ETR. In this Breaking Analysis, George Gilbert and I welcome FiveTran CEO George Fraser. FiveTran, as you may know, is a foundational software company with a front-row seat and exceptional visibility into data flows: the size of data, the sources of data, how data is being used, and how that's changing.

In this episode we explore the thesis that much of the analytics work being done on data platforms is being commoditized by good-enough, cost-effective tooling, which is forcing today's modern data stack leaders to transform. Before we get into it, many folks in our audience are familiar with the epic series of lectures by Clay Christensen at Oxford University, where he so brilliantly described his theory of disruptive innovation using the steel industry example. For those not familiar with the lectures, Christensen explained his model in the context of the steel industry, where historically most steel was made by big integrated steel mills that cost, at the time he gave the lecture, $10 billion to build. But these things called mini-mills emerged in the late 1960s that melted scrap metal in electric furnaces and could make steel for 20% less than the integrated mills.

Now the mini-mills became viable in the late 1960s. And because the quality of steel made from scrap was uniformly poor, mini-mills went after the low end of the market, the rebar market, that low-margin business we show here. And the integrated mills said, "Hey, that's great. Let the mini-mills have that crummy 7% gross margin." And by getting out of the rebar market, the integrated steel companies improved their margins, which faked them out, making them think, "Hey, we're doing better now." But when the last of the integrated mills exited the rebar market in the late 1970s, prices tanked. It commoditized rebar. So what happened? The mini-mills moved up the stack, up market, where there was a price umbrella from the integrated mills. And the same thing happened with angle iron steel and then structural steel. And it kept going until the integrated steel model collapsed and all but one manufacturer went bankrupt.

And as Christensen points out, this happened in autos with Toyota, we've seen it in many other industries, and we've certainly seen it in computer systems. We think a similar dynamic could take place in the software industry generally, and specifically in the data platform stack. So if we apply Christensen's model to the modern data stack, we show this chart as the data equivalent of the steel industry: ingest, transform, and BI are rebar, angle iron, and structural steel, respectively. Now just to be clear, we're talking about the demands on the data platform in this analogy. The connectors that extract data from the applications are sophisticated, but the demand they place on the system to land the data is minimal relative to the increasing scale and sophistication of today's modern data platforms.

In other words, what we're saying is that the sophistication of distributed, scale-out workload management has outstripped what most workloads on the modern data platform actually need. To justify the investments in that sophistication, these platforms have to move up the stack to new workloads. Where the analogy with steel breaks down, of course, is that the integrated steel mills hit a ceiling; there was nowhere else to go unless they wanted to start doing construction. Firms like Snowflake and Databricks have TAM expansion opportunities that we're going to discuss. Okay, so with that as background, let's bring George Fraser into the conversation. George is the CEO of FiveTran, as we said, one of the iconic companies that defined the modern data stack. Many consider Snowflake, FiveTran, DBT, and Looker the four horsemen, if you will, of the original MDS.

And this slide from ETR will give you a sense of how prominent a position FiveTran has in the market. The data is from the August survey, so just last month, of 1,349 IT decision makers, and it shows emerging companies, meaning privately held companies, in the data analytics and integration sector within the ETR data set. Net sentiment, a measure of intent to engage, is shown on the vertical axis, and the horizontal axis is mindshare. You can see the dramatic moves FiveTran has made since 2020, shown by that squiggly blue line shifting to the right. George Fraser, welcome to the program. First, congratulations on the company that you and the team have built. As we said, you have unique visibility into what's happening in the market.

The question is, are we out over our skis with today's modern data stack? Specifically with respect to the sophistication and scale of that distributed data platform we were talking about, relative to the vast majority of workload requirements running out there? Please share your thoughts.

George Fraser: Yeah, great to be with you. Thanks for <inaudible> and for showing that cool chart. I think it's a really interesting moment. The data platforms are stronger than they've ever been, but there is this interesting question of whether, after a long period, really 10 years, of bundling, who knows, it might be a moment of unbundling coming up again. And there is this very interesting fact that hangs over the whole world of data platforms that we participate in, which is that most data sets are much smaller than I think people realize. We see this at FiveTran as a company putting data into data platforms. What we find is that a lot of these huge data sets that get talked about really originate from inefficient data pipelines that are doing things like storing another copy of every record every night when the data pipeline runs.

And if you have an efficient data pipeline, the data sizes end up not being that big. There are a lot of workloads running on these data platforms, and in many cases you're looking at many small queries rather than one huge query. One of the things happening right now is the emergence of data lakes. Data lakes, Iceberg and Delta data lakes, are the fastest growing destination for FiveTran. We think of data lakes as a thing used to store huge data sets, but data lakes have other characteristics. One of them is that if you have a data lake, you can bring multiple different compute engines to bear on the same data, and that opens up the possibility of using specialized compute engines that are maybe more optimized for the data sizes people are actually working with every day. There's lots more I can say about this, but I'll pause there for a moment. It's an interesting moment, and there's an interesting mismatch between the way these systems get talked about and the way they actually get used in practice.
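George's point that a data lake lets multiple compute engines share one copy of the data is easy to picture in code. Below is a minimal sketch, not anything FiveTran ships, of a single-node engine (DuckDB) querying a Delta table on object storage; the bucket, table, and column names are hypothetical, and it assumes S3 credentials are already configured.

```python
import duckdb

con = duckdb.connect()

# DuckDB's delta extension can scan Delta Lake tables directly.
con.execute("INSTALL delta;")
con.execute("LOAD delta;")

# Query a lake table that distributed engines (Spark, a warehouse, etc.)
# could also read; the path and columns are hypothetical.
top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM delta_scan('s3://example-lake/orders')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchall()
print(top_customers)
```

The same files could be hit by a massively parallel engine for the rare huge query, which is the mix-and-match pattern George describes.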

Dave Vellante: Great. Well, thank you for that. And George, let's double-click on that notion of data sizes. This slide shows that only 10% of queries scan more than 100 megabytes, and of that 10%, nine points fall between 100 megabytes and 10 gigabytes, so 99% of all queries scan less than 10 gig. And that's because, think about it, most queries are done on data that's fresh, and that's the valuable data, as we've talked about many times on this program. This data comes from MotherDuck, and specifically Jordan Tigani, who was the head of product for BigQuery at Google. So the data is only a couple of years old, but we believe it reflects today's market. George Fraser, are we correct that Jordan's data reflects today's reality? Based on what you just said, I think you would agree with it.

George Fraser: Yeah. Jordan's spoken a lot about this, and I've talked to Jordan about this as well. And I've recently been looking at data that Snowflake, and Redshift, actually, published a few years ago: a representative sample of what real-world queries look like. The data is a little older; for Snowflake it goes back to, I think, 2017. The data from Redshift is a little bit fresher. You can't see the actual queries, but you can see summary statistics, including things like how much data they scanned. And what I saw when I looked at this data was that the median query was a little higher than the numbers you showed: on both Snowflake and Redshift, the median query scanned 64 megabytes of data. I was astonished when I saw that. Your iPhone could run that query.

And there are lots of queries, so it's not as though you could run the workloads these customers are running on your iPhone. But the point is that it's the volume of queries, rather than the size of the data, that is the challenge. And that's consistent with what we've seen in our customers at FiveTran. It opens up the possibility of this other architecture where, in the system we envision in the future, a data lake with diverse compute engines talking to it, a lot of those compute engines might be single-node engines, because most of these workloads don't actually need a massively parallel processing system. And one of the cool things about data lakes is that you can have multiple systems collaborating over the same data sets. So for that extreme tail workload, where you have a query that really does need massive resources in order to run, you can have a massively parallel processing system participate in the same database as other query engines, or not even query engines, data frame libraries, things like that, that are more optimized for the small queries that actually represent most of the work people do.
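To make that split concrete, here's a toy routing sketch, not any vendor's API, that sends a query to a single-node engine unless its estimated scan size suggests it belongs on a distributed (MPP) engine. The 10 GB threshold and the function names are invented for illustration.

```python
import duckdb

# Illustrative cutoff; the distribution discussed above suggests ~99%
# of queries scan well under this.
SINGLE_NODE_LIMIT_BYTES = 10 * 1024**3  # 10 GB

def run_on_single_node(sql: str):
    # One process on one box handles the vast majority of queries.
    return duckdb.sql(sql).fetchall()

def run_on_mpp_cluster(sql: str):
    # Placeholder: hand the "extreme tail" query to a distributed engine.
    raise NotImplementedError("submit to your MPP engine here")

def route_query(sql: str, estimated_scan_bytes: int):
    """Pick an engine based on how much data the query is expected to scan."""
    if estimated_scan_bytes <= SINGLE_NODE_LIMIT_BYTES:
        return run_on_single_node(sql)
    return run_on_mpp_cluster(sql)

# A 64 MB scan, like the median query George describes, stays single-node.
print(route_query("SELECT 40 + 2 AS answer", 64 * 1024**2))
```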

Dave Vellante: So just following up on that, I want to explore that market dynamic with some additional data from MotherDuck that we're showing here. What this shows is that a standard AWS EC2 instance comes with 256 gigabytes of RAM, so you can put the entire data set in memory, meaning you only need a single node, as you were just saying, for the vast majority of queries and workloads. Now to be clear, you're not necessarily going to do all these queries in memory, but the point is you don't need a distributed multi-node cluster for most situations today. So George F., again, you mentioned you analyzed that Snowflake and Redshift data about workload sizing. In your experience, what does all this tell you about where the costs are incurred and the scope of the queries, especially now that customers have a lot of choice in compute and execution engines?

George Gilbert: Dave, let me jump in.

George Fraser: I think we're going to see more <inaudible>-

George Gilbert: George, I know I'm interrupting, but let me add to Dave's question, because you had talked about the choice of execution engines before, but I know you've also done benchmarking on the cost. I was wondering if you could append that to what you were going to say to Dave: from what you guys have seen, what share of total workloads is ingest, what share is transformation, and then the traditional analytics? And then, if you've done benchmarking on cost, now you could have an integrated data platform at the high end, you could have Databricks, which provides an integrated data platform but is still modular, and then MotherDuck or something like that at the low end.

George Fraser: Well, I started to look at this data because I wanted to better understand what ingest costs are as a percentage of people's total cost. We had seen examples in individual customers' usage where ingest, just getting the data into the data platform, was about 20% of the total workload. And that's consistent across the larger population; that is what you see. Which surprised me. I would've thought it would be much smaller, but ingest is actually a very large fraction of what people do on these platforms. In terms of what it all means, I think it's very hard to say what the future holds. There's an interesting tension embedded in these facts we're observing. I think what we will see is more diversity of compute engines. The great platforms of today are not going anywhere.

I think what you will see is that as these vendor-neutral formats become more and more popular, you're going to see pieces of the workload start to peel off and use a more special-purpose compute engine for that particular workload. For example, if you sign up today for the FiveTran data lake offering, the way that works is we ingest the data into the data lake using a service that we built, which is actually powered under the hood by DuckDB. So that is an example of a special-purpose execution engine, designed to do one particular, common thing more efficiently, participating in a data lake that is then shared with data platforms like Databricks and Snowflake and whatever else the user might do with it.

Dave Vellante: This is a great segue, George, coming back to this idea that, I call it good enough, but maybe it's really purpose-built, tooling that has the potential, again using the steel industry analogy, to commoditize the data stack. So this is current data from DB-Engines, which, if you don't know them, is the go-to site for database engine popularity; I think many in our audience know them. What this shows is the increasing score of DuckDB relative to other data platforms: DuckDB's score has grown by almost two orders of magnitude since 2020. And the point is that DuckDB is open source and single node. In our steel analogy, it's the mini-mill, whereas Snowflake and Databricks and BigQuery would be the equivalent of the integrated steel mills. George Fraser, what are you seeing? Is there a looming shift? You mentioned that you've sort of bundled in, if you will, a purpose-built DuckDB, but do you see a more broad-based shift in adoption of open source analytic databases, and do you see it potentially eating into the popularity of the integrated data platform model?

George Fraser: The thing about data workloads is that the demand is infinite, and so the dynamics of our industry are always less competitive than people think. Because if you find a way to do things more efficiently, what will happen is that customers will simply ask more questions of the data. For example, Snowflake was very much a drop-in replacement for the previous generation of data warehouses. It was basically better in every way and worse in none, a replacement for a certain category of things that enterprises already had, and it was 10 times more cost-efficient at doing those workloads. And what we saw is that people replaced their legacy data warehouses with Snowflake, but the budgets did not go down. They just did more stuff. So the short answer to your question is maybe. But remember that the dynamic in the world of data management is that, as we find more efficient ways to do things, people simply do more things.

Dave Vellante: Yeah. So the <inaudible>-

George Fraser: And I would also add that it's not just about database engines. It's also about things like data frame execution engines. Something that's not on that graph but is getting a lot more popular is, for example, <inaudible>, which is a very fast single-node data frame execution engine. So there are a lot of players running around in the ecosystem, and as the data formats become more open, there's a lot more opportunity for customers to mix and match.
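The transcript audio drops the engine's name, but Polars is one well-known example of the category George describes. Here's a hedged sketch of a small analytic query on shared lake files using a single-node data frame engine; the path and column names are hypothetical.

```python
import polars as pl

# Lazily scan shared Parquet files on the lake (hypothetical path), let the
# engine push the filter and aggregation down, then execute on one node.
daily_events = (
    pl.scan_parquet("s3://example-lake/events/*.parquet")
      .filter(pl.col("event_date") >= pl.date(2024, 8, 1))
      .group_by("country")
      .agg(pl.len().alias("events"))
      .sort("events", descending=True)
      .collect()
)
print(daily_events)
```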

Dave Vellante: Great. Thank you for that. You're right, we've seen this in the computer industry over decades: the prices drop, and people do more stuff. I want to come back to this idea of the future of the modern data platform and the TAM expansion opportunities. We've said that in order to justify the sophistication of today's integrated data platforms, we think they've got to add new functionality. We're showing here the addition of RAG, a harmonization layer, and then the multi-agent system we've been talking about for the last several episodes and months here on Breaking Analysis. These are capabilities that firms like Snowflake and Databricks are working on, or have an opportunity to partner on, to bring integrated simplicity to layers above the traditional workloads they may be overshooting at the bottom of the stack. I wonder, George Gilbert, if you can chime in and explain how you see the addition of those three layers, their value, and how they can potentially work together.

George Gilbert: Before I do that, I actually want to ask George another question. You have insight into how today's integrated vendors can improve in terms of speed and simplicity. Take Snowflake now, with their declarative data pipelines, with incremental update, low latency ingest and processing, and how that can directly feed, say, a dashboard, or, I don't even know if they have a feature store yet. Is that sort of end-to-end simplicity enough to move the bar up, to do things that the component parts or a single node are no longer good enough for? In other words, can they move the goalposts?

George Fraser: I think for a lot of customers, the simplicity of an integrated system will win the day. You have to remember that the most expensive thing in the world is headcount. You can say, I can put together DuckDB and this and that and the other thing and build a system that's more efficient mathematically. And you're right. But for any particular company, it doesn't take very many <inaudible> of work from an engineering team before you've undone all of the gains. So for a lot of customers, just opening the box and using what comes in the box is still going to be the right answer, simply because it's efficient from a headcount perspective, even if there is maybe a configuration out there that is more efficient from a <inaudible> perspective.

Dave Vellante: I would agree, by the way. I think given the complexities around governance, the silos of governance that we see coming, and the uncertainty in the marketplace, that integrated play is going to be appealing. However, there could be margin pressures, and I think we're seeing that in some respects. Go ahead, George.

George Gilbert: Yeah. I was going to say, along the lines of what you were saying, Dave, that the higher-end customers, who might be spending $10 million a year or more, could carve off some of these workloads for optimization purposes. But one other question, George. Databricks announced, but hasn't shipped yet, Lakeflow, which is their attempt to further simplify their pipelines. What's your opinion on where that will be when it lands, versus where Snowflake, for instance, has improved theirs with their declarative low latency pipelines?

George Fraser: Well, Lakeflow is a bit different. Snowflake's declarative low latency pipelines, like Dynamic Tables, are really a way to efficiently run transformation pipelines inside the warehouse, the type of things people do with DBT, and they might be a good materialization strategy for DBT, if you know what that means. Lakeflow is a competitor to FiveTran; it's a way of getting data from relational databases and a couple of other systems into Databricks. All of our destination <inaudible>, and Databricks is one of our partners. They do like to build connectors sometimes as a hobby. We find that building connectors is very difficult, especially covering all of the many scenarios that you encounter in reality. So I think FiveTran will continue to succeed, including with Databricks, because of the breadth of our connectors and the depth of our support for configuration scenarios of the sources. Sorry about that, little bubble there. But yeah, Lakeflow is a different animal.
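For readers who haven't seen the declarative pipelines being compared here: a Snowflake Dynamic Table declares a transformation plus a freshness target, and Snowflake keeps the result incrementally up to date. A minimal sketch via the Snowflake Python connector, with placeholder credentials and a hypothetical schema:

```python
import snowflake.connector

# All connection parameters below are placeholders.
conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="example_password",
    database="ANALYTICS",
)

# Declare the transformation once; Snowflake maintains it incrementally,
# refreshing within the stated target lag.
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
      TARGET_LAG = '5 minutes'
      WAREHOUSE = TRANSFORM_WH
      AS
      SELECT order_date, SUM(amount) AS revenue
      FROM raw.orders
      GROUP BY order_date
""")
```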

George Gilbert: Okay. I thought it was their attempt to build something end-to-end. I knew there were connectors in there. Okay.

Dave Vellante: All right. Alex, can you bring up that previous slide again? So again, the premise we're testing here, and George Gilbert and I have talked about this, is that increasingly there's an opportunity for companies building modern data platforms to incorporate not only RAG, which they're doing, and even some agentic capabilities, but also this harmonization layer; sometimes we call it the semantic layer. George, how do you see these three fitting into this picture?

George Gilbert: Well, to compare it to the steel analogy, the integrated steel mills couldn't move up market, or I should say up the stack. But here in the data platform, you can: if you want to make the data all speak the same language, in the BI world that would be the metrics and dimensions, but when you want to do it across your application estate, so that the data means the same thing to whatever analytics or applications are looking at it, that's a very complicated problem to solve. We've talked on the show to Benoit at Snowflake about this; that's a significant step. And we've had Molham Aref from Relational AI and Dave Duggal from Enterprise Web.

It's almost like a new database, that abstraction layer, so the data platforms can move up to that layer, and in doing so they provide integrated simplicity to application developers. That also enables RAG to work more meaningfully, because right now RAG is still trying to use an LLM to make sense of different chunks of data, but you also need a semantic layer, without getting into the details, for those two things to come together; that's what's called GraphRAG. And then the next layer above that is when LLMs can actually do things, invoke tools, and take action without having to be pre-programmed for every sequence of steps. That turns into agents, and you need a multi-agent framework to organize, essentially, an org chart of an army of agents. These are all layers that today's application platforms can grow into that, going back to our analogy, the steel mills could not. And this is what we'll be continuing to mine over the coming months and years as we talk about how the application platform evolves.

Dave Vellante: Yeah. And we've been talking about this; we have a picture of it. Let's revisit how we envision this modern data and application stack evolving to support what we're calling intelligent data apps. A couple of episodes back on Breaking Analysis, we laid out this vision of intelligent data apps and the stack around them. A key point here is that, as George mentioned, the missing link is that harmonization layer; again, semantic layer is sometimes how we refer to it, although that's maybe a misnomer. In the upper right is that to-be-created piece of real estate that represents a multi-agent platform. We've reported that firms like Salesforce and Microsoft and Palantir are working on or evolving toward incorporating some of this functionality, but they're confined to their respective application domains, whereas firms like, take a UiPath or a Celonis, and other emerging players we're going to talk about, have an opportunity to transcend that single application domain and build horizontal multi-agent capabilities that span application portfolios and unlock that trapped value. But George, we've talked about the new competitors these folks will encounter as they move up the stack. What are your thoughts on that?

George Gilbert: Yeah, it's not a matter of Snowflake and Databricks duking it out. As the platforms move up, they encounter a whole range of new players. And I think we have a chart that we pinched from Insight Capital.

Dave Vellante: Yeah, let's bring that up.

<inaudible> player-

Dave Vellante: If we can. It's an eye chart from Insight, but it shows some of those emerging players within just-

George Gilbert: On the agent-

Dave Vellante: The agentic space. The next one, Alex; just bring that eye chart up. Yeah, that one. We don't expect you to read this, but the point is there are a lot of potential players that can add value as partners or M&A targets. And George, you've done some research on this army of agents that is springing up in the market. Can you comment on what you've learned?

George Gilbert: Yeah. I would just say that for decades the definition of an application was the database and the data model; the processes and the workflow, which were the application logic; and then the presentation logic. And that formed an island of automation. The idea of redefining this with the harmonization layer is that you abstract all those islands, and then you can put an agent framework around it all, and that agent framework will allow all these persona-specific or functionally specialized agents to collaborate in a larger enterprise-wide context. That's the big challenge over the next five to 10 years.
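The "org chart of agents" George describes is an architectural idea rather than a shipping product, but a deliberately toy sketch shows its shape: a coordinator fans a task out to functionally specialized agents that would each sit on the harmonized data layer. Every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    handle: Callable[[str], str]  # the agent's specialized capability

def finance_agent(task: str) -> str:
    return f"[finance] margin impact of: {task}"

def supply_chain_agent(task: str) -> str:
    return f"[supply chain] inventory impact of: {task}"

class Coordinator:
    """The 'org chart' node: fans a task out to specialist agents."""

    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def run(self, task: str) -> list[str]:
        return [agent.handle(task) for agent in self.agents]

team = Coordinator([
    Agent("finance", finance_agent),
    Agent("supply_chain", supply_chain_agent),
])
print(team.run("launch product X in Q4"))
```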

Dave Vellante: Okay, let's wrap. George Fraser, we're going to give you the last word. Maybe you can comment on how today's discussion feeds into the idea that, as you mentioned, these modern data stack platforms are sticky, but there's some tension and some pressure, and some data we've uncovered here suggests that data lakes might transform things in a way that many people are not talking about. Specifically, how do you see the emergence of these multiple execution engines interacting with open data tools, and the changing share of data work that these platforms will do in the future? Give us your final thoughts on this narrative.

George Fraser: I think one of the non-obvious implications of the emergence of data lakes is that there will be more diversity of execution engines. We may see some workloads get pulled away from the integrated data platform, but at the same time, there's always new things to do, like many of the things that you just mentioned.

Dave Vellante: Great. All right, George, both Georges, thanks so much for joining us on the program today. Really appreciate it.

George Gilbert: Thanks, Dave.

George Fraser: Nice to be with you.

Dave Vellante: All right. And thanks to Alex Meyerson and Ken Schiffman on production; they do our podcasts. Kristen Martin and Cheryl Knight help get the word out on our social media channels and in our newsletters, and Rob Hof is our EIC over at siliconANGLE.com. Remember, all these episodes are available as podcasts; all you've got to do is search Breaking Analysis podcast wherever you listen. I publish each week on thecuberesearch.com and siliconANGLE.com. You can email me at david.vellante@siliconANGLE.com, DM me at dvellante, or comment on our LinkedIn posts. If you want to pitch us some ideas, let us know. And check out etr.ai; they've got great survey data, as we saw with the amazing progress FiveTran is making, and you can see timelines and trend data. It's incredible, awesome enterprise tech survey research. This is Dave Vellante for theCUBE Research Insights powered by ETR. Thanks for watching. We'll see you next time on Breaking Analysis.