Breaking Analysis: Moving beyond separation of compute & storage…Journey to the 6th data platform
Video Duration: 36:55

From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. We believe today's so-called modern data stack as currently envisioned will be challenged by emerging use cases and AI-infused apps that begin to represent the real world in real time at massive data scale. Now, to support these new applications, a change in the underlying data and data center architectures, we think, will be necessary, particularly for exabyte-scale workloads. Today's generally accepted state of the art, that is, separating compute from storage, has to evolve, in our view, to separate compute from data so that compute can operate on a unified view of coherent data. Moreover, our opinion is that AI will be used to enrich metadata to turn strings, i.e., ASCII code, files, et cetera, into things that represent real-world elements of a business. Hello and welcome to this week's theCUBE Research Insights powered by ETR. In this Breaking Analysis, George Gilbert and I continue our quest to more deeply understand the emergence of a sixth data platform that can support intelligent applications in real time, and to do so, we're very pleased to welcome two founders of VAST Data, CEO Renen Hallak and Jeff Denworth. Gentlemen, thanks for taking the time. Welcome. Thanks for having us. Thank you. Hey, by the way, congratulations on the recent news. In case you're in the audience and haven't heard, VAST just closed a modest $118 million financing round that included Fidelity at a valuation of $9 billion, which implies a very minor change to the cap table by my math, so well done, you. Thank you.

Okay. Let's start the conversation. We want to set a baseline on today's modern data platforms with some ETR data. Here we're showing data with net score or spending momentum on the vertical axis and presence in the data set, i.e., the N mentions in a survey of around 1,700 IT decision-makers, on the horizontal axis. Think of it as a proxy for, you know, market presence. We're plotting what we loosely consider the five prominent modern data platforms, including Snowflake, Databricks, Google BigQuery, AWS, and Microsoft, and we also plot the database king Oracle as a reference point. That red line at 40%, anything above that is considered a highly elevated net score. Now, it's important to point out that this is the database and data warehouse sector in the ETR taxonomy.

So there's a lot of stuff in there that's not representative of a modern analytics and data platform, for instance, Microsoft SQL Server. That's a limitation of the taxonomy that you should be aware of but allows us to look at the relative momentum, and also, we're not focusing on operational platforms like MongoDB at this point in time. So the more important point we want to share is shown in the bottom right corner of the chart, and that is a diagram of what looks like a shared-nothing architecture. Now, in a shared-nothing architecture, each node in the system operates independently without sharing memory or storage with the other nodes. Scale and flexibility are the benefits, but ensuring coherence and consistent performance across these nodes of course is challenging. The modern data stack was built on shared-nothing architectures. Oracle and SQL Server originally were not. George, can you just add in the salient points here on the attributes of the modern data platform, and then we'll get into it. Okay, really quickly, the modern data platform assumed a scale-out, shared-nothing infrastructure, but we also take it to stand for cloud-based software-as-a-service delivery, so it's managed. It's not Hadoop on prem, but then the pioneering vendors separated compute from storage to take advantage of that data center architecture, but typically, the compute and the storage are controlled by the same vendor, and we can't really separate data from compute until we have a lot of metadata that really puts all the intelligence about the data associated with the operational or analytic data itself, and that's what we're going to start to get into today. Okay, thank you, and let's do that now. Okay, so with that as background, we want to share a chart from VAST Data, which is shown right here. Guys, please share your point of view on the limitations of today's leading shared-nothing platforms, you know? The title of this slide is "Consistent writes are slow." Can you please add some color here? Sure, so, you know, the term of art in the market is shared nothing, talking about essentially a systems architecture that was first introduced to the world by Google in 2003. So it's about 20 years old, and that architecture, you know, as you and George just articulated, has the challenge that, you know, whereas the term is shared nothing, all of the nodes within a distributed system have to be kept in contact with each other. Typically for transactional I/O, where transactions have to be strictly ordered, that becomes a real challenge when these systems are also faced with scale, and so very rarely do you find systems that can scale to, you know, in today's terms, exabyte levels of capability but at the same time can deliver consistent performance as the systems grow and grow and grow just because you have, internally to these architectures, just a ton of east-west traffic, and so this is one of the major problems that we set out to solve in what we've been doing. Okay, so one of the challenges customers have with their data is they have, you know, objects, tables, files that use different formats, like we're showing here, and they get, you know, different metadata, and generally this creates stovepipes.
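As a quick aside before the next chart: here is a minimal sketch of why "consistent writes are slow" in the shared-nothing model Jeff describes. It assumes a simple hash-partitioned cluster with two-phase-commit-style coordination; the node names and functions are hypothetical and are not drawn from VAST's or any other vendor's code.

```python
# Hypothetical illustration: in a shared-nothing cluster each node privately
# owns a shard, so a strictly ordered multi-key write has to coordinate with
# every owning node before it can commit -- the east-west traffic problem.
from hashlib import blake2b

NODES = ["node-a", "node-b", "node-c", "node-d"]

def owner(key):
    """Hash-partition a key to the single node that owns it."""
    digest = int.from_bytes(blake2b(key.encode(), digest_size=8).digest(), "big")
    return NODES[digest % len(NODES)]

def consistent_write(txn):
    """A transaction spanning several keys must reach agreement (think 2PC)
    among all owning nodes; prepare/commit round trips grow with the cluster."""
    participants = sorted({owner(k) for k in txn})
    print(f"write touches {len(participants)} of {len(NODES)} nodes: {participants}")

consistent_write({"rider:42": b"...", "driver:7": b"...", "trip:991": b"..."})
```

The only point of the sketch is that the commit path widens as data spreads across more nodes, which is the coordination cost the conversation keeps returning to.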
So this chart depicts a big global namespace, what you call in this chart a DataSpace, and the infinity loop from edge to on-prem data centers to the cloud, and the way you guys position it is your data platform integrates unstructured data management services with a structured data environment in a way that you can turn unstructured data into queryable and actionable information, which is really important. The reason this is important is 'cause it allows the disparate elements shown on this chart, objects, tables, files, et cetera, to become those things that we talked about earlier. So the question, gentlemen, is why can't today's modern data platforms do this, and what's your point of view on the architecture needed to accomplish this? I think they weren't designed for it. When these modern data platforms were built, the biggest you could think of was, as you say, strings. It was numbers. It was rows. It was columns of a database. Today we want them to analyze not numerical pieces of information but analog pieces of information, pictures and video and sound and genomes and natural language. These are a lot larger data sets by nature, and we're requiring much faster access to these data sets because we're no longer talking about CPUs analyzing the data. It's GPUs and TPUs, and everything has changed in a way that requires us to break those fundamental trade-offs between price and performance and scale and resilience in a way that couldn't be done before and didn't need to be done before, but now it does. If you think about just the movement that's afoot now with generative AI technologies like deep learning, for the first time in IT history, we now have tools to actually make sense of unstructured data, and so this is driving the coexistence within the VAST Data Platform of structured and unstructured data stores because the thinking is once you have the tools in the form of GPUs and neural networks to go and actually understand this data, well, that has to be cataloged in a place where you can then go and inquire upon the data and interrogate it and, to your point, Dave, you know, turn that data into something that's actionable for a business, and so the unstructured data market is 20 times bigger than the structured data market that, you know, most BI tools have been built on for the last 20, 30 years, and we view AI as essentially this global opportunity in the technology space to see roughly a 20X increase in the size of the big data opportunity now that there's a new data type that's processable. So let me follow up on one thing, Dave. What does the pipeline look like for training large language models that's different from what we might have done with today's modern data stack, training deep learning models where you had sort of one model for each, you know, task that you had to train on? What does that look like in data set size and how you curate it and then how it gets maintained over time? In other words, one thing is you're talking about a scale that's different, and then the other thing seems to be the integration of this data and this constantly enriched metadata about it and trying to unify that. Can you elaborate on where today's stack falls down? I think if you think in terms of scale, you know, for example, if you take the average Snowflake capacity, we did some analysis of the capacity that they manage globally.
You divide it by the number of customers that they have, and, you know, you're somewhere between, depending upon when they're announcing, 50 to 200 terabytes of data per customer. Our customers on average manage over 10 petabytes of data. So you're talking about something that is, you know, at least 50 times greater in terms of the data payload size when you move from structured data to unstructured data. At the high end, we're working with some of the world's largest hyperscale internet companies, software providers who talk and think in terms of exabytes, and this is not all just databases that are being ETLed into a data lake. It's very, very rich data types that don't naturally conform to a data warehouse or some sort of BI tool, and so deep learning changes all of this, and everybody's conceptions of scale as well as real time really need to change. You know, if you go back to that earlier discussion we were having about shared-nothing systems, the fact that you can't scale and build transactional architectures has always led us to this state where you have databases that are designed for transactions, you have data warehouses that are designed for analytics, and you have data lakes that are designed for essentially cost savings, and that doesn't work in the modern era either because people want to infer and understand data in real time. If the data sets are so much bigger, then you need scalable transactional infrastructure, and that's why we designed the architecture that we did. Yeah, very good. So extending the original premise that we put forth right up front, we're kind of going back to the future into the shared-everything, scale-up architecture, and I want to explore that a little bit more. This is another chart from VAST's presentation. It depicts many nodes with shared memory, which are those kind of little squares inside of the big squares at the bottom with access from all these connections over a very high-speed network, and the cubes represent compute. So the compute has access to all the data in the pooled memory and storage tier, which is being continually enriched by some AI and metadata magic, which we're going to talk about, but Renen and Jeff, explain your perspective on why we need scale up and shared everything to accommodate exascale data apps going forward. And if I can just add to that, why is it possible now? Yeah, yeah, how is it possible? Right. Yeah. It's possible because of fast networks and because of new types of media that are both persistent and fast and accessible through those fast networks, and none of these things could have been done at this level before we started, and that's another reason why we didn't see them before. Why do we need them? It's because of the scale limitations that we're reaching. It's a short blanket in the shared-nothing space. You can do larger nodes, and then you risk longer recovery times when a node fails. They can take months, in which case, you can't have another node fail without losing access to information, or you can have smaller nodes and a lot more of them, but then from another direction, you're risking failure.

Because you have more nodes, statistically they're going to fail more often. From a performance perspective also, as you add more nodes into one of these shared-nothing clusters, you see a lot more chatter. Chatter grows quadratically, and so you start to exhibit diminishing returns in performance. All of that limits the scale and resilience and performance of these shared-nothing architectures. What we did when we disaggregated, we broke that trade-off. We broke it in a way that you can increase capacity and performance independently but also such that you can increase performance and resilience up to infinity, again, so long as the underlying network supports it. Got it. So go ahead. You wanted to add something, Jeff? The network is what allows for disaggregation or the decoupling of persistent memory from CPUs, but the second thing that had to happen is some invention, and what we had to build was a data structure that lived in that persistent memory that allowed for all the CPUs to access common parts of the namespace at any given time without talking to each other. So it's basically a very metadata-rich transactional object that lives at the storage layer that all the CPUs can see and talk to in real time while also preserving atomic consistency, and so once you've done that, you can think of this architecture not as a kind of like a classic MPP system that most of the shared-nothing market kind of uses as a term to describe what they're doing but rather more as a data center-scale SMP system, where you just have cores and one global volume that all the cores can talk to in a coherent manner. So it's basically like a data center-scale computer that we've built. And that network is, my understanding is either it's InfiniBand or Ethernet, and it's got your IP to make the magic on top, right? Yeah. Yeah, most of our customers are just using standard commodity Ethernet. Interesting. Would it be fair to say, Dave, I would just want to drill down on something, like, the first reaction one might have is, well, you know, we've had tens if not hundreds of billions of investment in the scale-out data center architecture. Is it that you're drafting off the essentially rapid replacement of big chunks of that data center infrastructure for the LLM build out, that you're running on a lot of the, well, this new super fast network with a dense topology, that you're running on that because it's being replaced so rapidly or being installed so rapidly at the hyperscalers and elsewhere? I think LLMs are the first piece of this new wave. It's going to span way beyond language models and text, but yes, there is a new data center being built in these new clouds, in enterprises in a way that didn't exist before, and this new data center, its architecture perfectly matches our software architecture in the way that it looks. You have GPU servers on one side. You have DPUs in them in order to enable the infrastructure layer to be very close to the application in a secure manner, and then you have a fast network and SSDs in enclosures on the other end, and that's the entire data center. You don't need anything beyond that in these modern locations, and we come in and provide that software infrastructure stack that leverages the most out of this new architecture. 
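Before the conversation turns to strings and things, a back-of-envelope illustration of the scaling argument made above: in a shared-nothing cluster any pair of nodes may end up chattering, so coordination paths grow roughly quadratically, while in a disaggregated, shared-everything design each stateless CPU coordinates only through the shared persistent-memory namespace, so paths grow roughly linearly. These are illustrative counts, not measurements of any system.

```python
# Illustrative counts only: how many coordination paths exist as you grow.
def shared_nothing_paths(n_nodes):
    """Any pair of nodes may need to talk to each other: grows ~n^2."""
    return n_nodes * (n_nodes - 1) // 2

def shared_everything_paths(n_cpus):
    """Each CPU talks only to the shared, coherent state: grows ~n."""
    return n_cpus

for n in (8, 64, 512):
    print(f"{n:>3} nodes: shared-nothing paths = {shared_nothing_paths(n):>6}, "
          f"shared-everything paths = {shared_everything_paths(n):>3}")
```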
Okay, I want to come back to this funny phrase that we used earlier, turning strings into things, and Jeff, you kind of alluded to this earlier, and you guys have been talking about it, that we envision a future where, you know, the AI ultimately becomes a system of agency and can take action, and this is why it's so important to speak about things and not strings. So in previous episodes, we've explored the idea of Uber for all, and we've had Uber on the program to explain how back in 2015, they had to go through these somewhat unnatural acts to create essentially a semantic layer that brought together transactional and analytic data, structured and unstructured, and turned those strings that databases understand, ASCII code, objects, files, et cetera, into things that are a digital representation of the real world, in Uber's case, riders, drivers, locations, destinations, prices, et cetera.
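For readers who want the "strings into things" idea in concrete terms, here is a tiny, hypothetical sketch: raw strings that today sit in separate stores are assembled into typed, linked entities that an application (or an AI agent) can reason about. The classes and field names are invented for illustration and are not Uber's or VAST's actual data model.

```python
# Hypothetical "strings into things": raw strings from separate stores become
# typed, linked entities that represent real-world elements of the business.
from dataclasses import dataclass

@dataclass
class Rider:
    rider_id: str
    name: str

@dataclass
class Driver:
    driver_id: str
    name: str
    location: tuple  # (lat, lon)

@dataclass
class Trip:
    trip_id: str
    rider: Rider          # a relationship, not just a foreign-key string
    driver: Driver
    destination: tuple
    price_usd: float

# The "strings" side: rows and keys as they might sit in separate systems today.
rider_row = {"rider_id": "r-42", "name": "Ada"}
driver_row = {"driver_id": "d-7", "name": "Grace", "lat": "42.36", "lon": "-71.06"}

# The "things" side: one coherent object the business actually reasons about.
trip = Trip(
    trip_id="t-991",
    rider=Rider(**rider_row),
    driver=Driver(driver_row["driver_id"], driver_row["name"],
                  (float(driver_row["lat"]), float(driver_row["lon"]))),
    destination=(42.35, -71.05),
    price_usd=23.50,
)
print(trip.rider.name, "->", trip.driver.name, "@", trip.price_usd)
```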

So guys, in this example, we have the data up top. We've got NFS and S3 or structured SQL in the form of files, objects and tables, which we can discuss in more detail, but our inference is that the metadata at the bottom level gets enriched by triggers and functions that apply AI and other logic over time, and this is how you progressively turn strings into things. Is our understanding correct, and can you elaborate please? I think it is. The way our system is built, it's all data-based. What do we mean by that? Data flows into the system, and then you run functions on that data, and those functions are very different than what used to be run on strings. These are inference functions on GPUs. They're training functions on GPUs. They enable you to understand what is within these things and then to ask intelligent questions on it because it's not just metadata that gets accumulated right next to the raw information. It's also queryable. All of this is new, and it brings, again, computers closer to the natural world rather than what was happening over the last two decades where we needed to adapt to the strings. Now computers are starting to understand our universe in a way that they couldn't before, and as you say, it will be actionable. But action is a byproduct of being able to make intelligent decisions. Well, I should say, you know, it should be a byproduct of that, and so this is why this architecture that we've built is really important, because what it allows for is the strings to be ingested in real time. As I mentioned, we built a distributed transactional architecture that, you know, typically you've never seen before because of the crosstalk problems associated with shared-nothing systems, but at the same time, if I can take that data and then query it and correlate it against all of the other data that I have in my system already, then what happens is I have real-time observability and real-time insight into flows that are running through the system in ways that you've never had before. Typically you'd have, like, an event bus that's capturing things and you'd have some other system that's used for analytics, and we're bringing this all together so that, you know, regardless of the data type, we can essentially start to kind of process and correlate data in real time. So George and guys, I think it's worth stopping for a second and taking an example that could be instructive. Take AWS. I mean, awesome, right? We're talking about a 90-plus billion dollar, you know, company that created the cloud. Its data stores, it's probably got, I dunno, 11, 12, 13 different database data stores, but they're very granular and, by design, the piece parts, but when you think about the metadata, there's, at least today anyway, not a unified way to get your hands around all the metadata. You've got DataZone, which might have the business metadata. Glue might have, you know, technical metadata. They use different data stores, and so that makes it challenging for customers to basically create this new world where you've got intelligent data apps that are taking action, as we've just been discussing. George, you and I have talked about this. Anything you would add to that? Yeah, I would.
For one, there seems to be a need to unify all the metadata or the intelligence about the data, but then maybe you can elaborate on your sort of roadmap for building the database functionality over time that's going to do what Jeff was talking about, which is observe changes streaming in and at the same time query historical context and then take action, what that might look like, and understanding that, as you've said, you built out the storage capabilities over time, and now you're building out the, you know, system-level and application-level functionality. Tell us, you know, what your vision for that looks like. Yeah, so in this new era, the database starts as metadata. In fact, you can think of the old data platforms as having only metadata because they didn't have the unstructured piece. They didn't have those things in them, but the first phase is, of course, to be able to query that metadata using standard query language, which is what we came out with earlier this year. The big advantage, of course, of building a new database on top of this new architecture is that you inherit that breaking of trade-offs between performance and scale and resilience. You can have a global database that is also consistent. You can have it extremely fast without needing to manage it, without needing a DBA to set views and indexes, and so what you get are the advantages of the architecture.
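Pulling the last two answers together, here is a small, self-contained sketch of the flow being described: data arrives, an enrichment function (standing in for a GPU inference job triggered on ingest) adds "thing-level" metadata, and that metadata is immediately queryable with plain SQL. It is mocked up with Python and sqlite3 purely for illustration; the table, column, and function names are hypothetical and are not the VAST catalog or API.

```python
# Illustrative pipeline: ingest -> enrichment function -> queryable metadata.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE element_metadata (
    path TEXT, size_bytes INTEGER, kind TEXT, customer TEXT, confidence REAL)""")

def classify(payload):
    # Stand-in for a GPU inference function triggered as data flows in.
    return {"kind": "invoice", "customer": "acme", "confidence": 0.97}

def on_ingest(path, payload):
    tags = classify(payload)            # the "strings become things" step
    db.execute("INSERT INTO element_metadata VALUES (?, ?, ?, ?, ?)",
               (path, len(payload), tags["kind"], tags["customer"], tags["confidence"]))

on_ingest("s3://bucket/scans/0001.png", b"\x89PNG...")

# No hand-tuned views or indexes in the model described above; just ask in SQL.
rows = db.execute("""SELECT customer, COUNT(*) AS n, SUM(size_bytes) AS total_bytes
                     FROM element_metadata
                     WHERE kind = 'invoice'
                     GROUP BY customer""").fetchall()
print(rows)                             # one row per customer, counts and byte totals
```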

As we go up the stack and add more and more functionality, we will be able to displace not just data warehouses but also transactional databases, and in fact, we find a trade-off there between row-based and column-based traditional databases. You don't need that when the underlying media is 100% random access. You don't need ETL functions in between to maintain multiple copies of the same information because you want to access it using different patterns. You can just, again, give different viewpoints or mount points into the same underlying data and allow a much simpler environment than was possible before. Interesting. So you've got a 20X increase in big data opportunity. There's a TAM increase there, as well as, you know, potentially eating away at some of the traditional methods. I want to talk more about data management and how data management connects to new data center architectures. We're showing here on this chart that shared persistent memory eliminates the historical trade-offs. We've touched a little bit upon that, but let's go deeper. So using shared persistent memory instead of slow storage for writes in combination with a single tier of, let's say, all flash storage brings super low latency for transactional writes and really high throughput for analytical reads. So this ostensibly eliminates the many-decades-long trade-off between latency and throughput. So guys, I wonder if you could, again, comment and add some color to that. And maybe one other thing before you start, Jeff, which was something we missed when we were first looking at Snowflake, which was, you know, they are now claiming to add unstructured data, just as Oracle did, you know, many, many years ago, but there was a cost problem, and the one thing, you know, that you've talked about is the scale issue, but I don't know if you guys have done any cost comparisons when you're at that 20X scale. Sure. Okay. So to start, in terms of the data flow, as data flows into the system, it hits a very, very low latency tier of persistent memory, and then as what we call the write buffer fills up, we take that data and then we migrate it down into an extremely small columnar object. If you're building a system in 2023, it doesn't make sense to accommodate spinning media, and if you have an all-flash data store, the next consideration is, well, why would you do things the way classic, you know, data lakes and data warehouses have been built, around streaming. With flash, you get high throughput with random access, and so we built a very, very small columnar object. It's, I don't know, about 4,000 times smaller than, let's say, an Iceberg row group, and that's what gets laid down onto SSDs that is then designed for extremely fine-grained selective query acceleration, and so you have both, and, you know, about 2% of the system is that write buffer.
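A toy sketch of the write path just outlined, before Jeff picks up with the read side: rows are acknowledged from a small, low-latency buffer and periodically flushed into small column-oriented chunks. The sizes, thresholds, and structures here are invented for illustration and are not VAST's on-media format.

```python
# Toy sketch only: rows are acknowledged from a small, low-latency write
# buffer and periodically flushed into small column-oriented chunks on flash.
from collections import defaultdict

class WritePath:
    def __init__(self, buffer_limit=4):
        self.buffer = []            # row-ordered, persistent-memory-like tier (the small write buffer)
        self.columnar_chunks = []   # small columnar objects laid down on flash
        self.buffer_limit = buffer_limit

    def write(self, row):
        self.buffer.append(row)     # low-latency acknowledgement happens here
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        columns = defaultdict(list)
        for row in self.buffer:     # pivot buffered rows into a tiny columnar chunk
            for key, value in row.items():
                columns[key].append(value)
        self.columnar_chunks.append(dict(columns))
        self.buffer.clear()

    def scan(self, column):
        """Reads see both tiers: flushed columnar chunks plus rows still in the buffer."""
        values = [v for chunk in self.columnar_chunks for v in chunk.get(column, [])]
        values += [row[column] for row in self.buffer if column in row]
        return values

wp = WritePath()
for i in range(6):
    wp.write({"id": i, "amount": i * 10})
print(wp.scan("amount"))            # [0, 10, 20, 30, 40, 50] -- four flushed, two still buffered
```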

The remaining 98% of the system is that columnar-optimized random-access media, and then when I put the two together, you can just run queries against it and you don't really care. So if some of the data that you're querying is still in the buffer, of course, you're reading on a row basis, but it's the fastest possible persistent memory that you're reading from, and so in this regard, even though it's not a data layout that's optimized for how you would think queries naturally should happen, it turns out that we can still respond to those data requests extremely fast thanks to persistent memory. Now, to the point I think you were getting to, George, around cost, we started with the idea that we could basically bring an end to the hard drive era, and every decision that we make from an architecture perspective presumes that we have random-access media as the basis for how we build infrastructure, but cost is a big problem, and flash still carries a premium over hard drive-based systems, and that's why most data lakes today still optimize for hard drives, but people weren't really thinking about all that could be done from an efficiency perspective to actually reconcile the cost of flash such that you can now use it for all of your data, and so there's a variety of different things that we do to bring a supernatural level of efficiency to flash infrastructure.
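The compression detail comes next in Jeff's answer; to see why a data-reduction advantage can flip the flash-versus-disk economics he is describing, here is back-of-envelope arithmetic with placeholder prices. The dollar figures and the disk-side reduction ratio are assumptions for illustration; only the roughly 4:1 reduction figure comes from the conversation that follows.

```python
# Hypothetical prices, not quotes: why better reduction can make flash the
# cheaper place to keep data even when raw flash costs more per terabyte.
hdd_cost_per_tb = 20.0     # placeholder $/raw TB for disk
flash_cost_per_tb = 60.0   # placeholder 3x premium for flash
hdd_reduction = 1.3        # assumed coarse reduction on streaming, disk-based systems
flash_reduction = 4.0      # "about 4:1" global, fine-grained reduction cited below

effective_hdd = hdd_cost_per_tb / hdd_reduction        # ~$15.38 per stored TB
effective_flash = flash_cost_per_tb / flash_reduction  # ~$15.00 per stored TB
print(f"effective $/TB stored -- disk: {effective_hdd:.2f}, flash: {effective_flash:.2f}")
# With these placeholder numbers the reduction delta outweighs the media premium.
```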

One of those is basically a new approach to compression which allows us to look globally across all the blocks in the system but compress at a level of granularity that goes down to just two bytes, and so it's the best possible way to find patterns within data on a global basis and is designed to be insensitive to noise within data so that we'd always find the best possible reductions. Typically our customers see about 4:1 data reduction, which turns out is greater than the delta between flash and hard drives in the marketplace today. So it's completely counterintuitive, but the only way to build a storage system today that is cheaper than a hard drive-based data store is to use flash because the way that you can manipulate data allows you to find a lot more patterns and reduce data down in ways that would just not be sensible to do with spinning media. Right. Okay, thank you for that explanation, and then, Ken, if you bring up the last slide and set of talking points, I want to just summarize and then, you know, give you guys the last word. You know, VAST, we see VAST as a canary in the coal mine, and what we mean by that is this new approach to data center architectures shines a light on the limitations of today's data platform. So of course, the big question over time is, you know, is VAST a replacement of or a complement to the sixth data platform? And the second part of that question is, okay, how are the existing data platforms going to respond? And we think the determining factor is going to be the degree to which companies like VAST can evolve their ecosystem of ISVs to build out the stack. Of course, the starting point, as we heard earlier, is really a full-blown, you know, analytic DBMS, but I'll turn it over to Renen, Jeff, George. You know, take your last shot. Maybe just Renen, since I'm always adding, you know, a question from the cheap seats before you get going, my question would be, you know, as you grow into a richer, you know, aspiring replacement to the full modern data stack, customers are still going to want to take their tools, their, you know, the pipelines that, you know, even though you bring data in without ETL, you got to still take raw data and turn it into refined data products, and, you know, those pipelines, the observability, the quality rules, they have tools they like. How do you expect or intend to accommodate that sort of migration over time? I think in the same way that we have done until today. We support standard protocols, we support standard formats, and we support multi-protocols and multi-formats. For example, if you look back to our storage days, you're able to write something via a file interface and then read it through an object. That allows you to continue working in the same way that you did but to move into the new world of cloud-native applications. The same is true at the database layer. Us providing the standard formats and protocols allows you to start using the VAST platform without needing to change your application, but over time, enables you to do things that you couldn't do before. Back to Dave's question about does it start with a DBMS, I think the last wave of data platforms did because it was focused on structured data.

I think for the next wave of data platforms, I like to use the analogy of the operating system. It's no longer just a data analytics machine. It's a platform that provides that infrastructure layer on top of new hardware and makes it easy for non-skilled enterprises and cloud providers to take advantage of these new AI abilities, and that's what we see in the future, and that's what we're trying to drive in the future. Jeff, I got to hear from you. You got to give us your last thoughts on all this discussion please. Yeah, I think we generally think that, you know, solving old problems isn't that fun, and so if you think about what's happening now, people are now trying to bring AI to their data, and there's a very specific stack of companies that are starting to emerge, and you talked a little bit about our funding. We went and did some analysis of the companies that have increased their valuation by over 100% this year, of real-size companies, so companies that are valued over $5 billion, and it's a very, very select number of organizations. At the compute layer, it's Nvidia. At the cloud layer, it's a new emerging cloud player called CoreWeave, which is hosting a lot of the biggest large language model training. In inference applications, it's us, it's Anthropic, and it's OpenAI, and that's it. So something is definitely happening, and we're trying to make sure that we can not only kind of satisfy all these new model builders but also take these same capabilities to the enterprise so that they can all benefit. Yeah, I dunno if it was your chart or somebody else's chart. I saw that recently. It's various bars, and most of 'em were under the zero line. So thank you for your last thoughts. George, I'll give you the final word. I think this was kind of enlightening for us to see something that shines a light from a perspective, you know, from a new perspective on the shared-nothing architecture, which we've grown over two decades to take for granted, and once you have to revisit that assumption, when you can think about scale-up being back instead of scale-out, then everything you build on top of that can be rethought, and so VAST is helping us question those assumptions, but I thought it was particularly interesting to see, you know, the idea that we need to unify files, objects, and tables as sort of one unified namespace, but that there's no real difference between data and metadata. It all kind of blurs as you turn these strings into things, and the one thing, you know, that we keep harping on is that ultimately, you know, when you turn strings into things, there is a semantic layer that has to represent these things richly and the relationships between them, and that's what's going to allow us to separate compute from data, you know, where you can have composable applications talking to a shared infrastructure, and it sounds like you're building towards that vision. I remember last June, Jeff, watching you. I think it was you and John Furrier talking at the Databricks event as part of our editorial, and you were saying, you know, "We're not really just a storage company," and I remember saying, "Hmm, what does that mean?" So we're starting to see that come into focus. Guys, it's so helpful to have folks like you that are articulate, technically oriented, but also business oriented to help our community understand what the future's going to look like, so really appreciate your time today. Thank you. Thank you. Thank you. And I want to thank Ken Shifman, who's flying solo today.
Alex Myerson's also on production and manages the podcast. Kristen Martin and Cheryl Knight, they help get the word out on social media and in our newsletters, and Rob Hof is our editor in chief over at siliconangle.com. Remember, all these episodes are available as podcasts. Wherever you listen, just search Breaking Analysis podcast. I publish each week on wikibon.com, which is being rebranded to thecuberesearch.com, and siliconangle.com, and you can email me at david.vellante@siliconangle.com or DM me @dvellante if you want to get in touch or comment on our LinkedIn posts, and do check out etr.ai. They get great survey data in the enterprise tech business. This is Dave Vellante for George Gilbert and theCUBE Research Insights powered by ETR. Thanks for watching, everybody, and we'll see you next time on Breaking Analysis. (upbeat music)