SiliconANGLE theCUBE
How @SnowflakeDB approaches open source...example of Apache Iceberg
Clip Duration 01:52 / August 10, 2022
Super Data Cloud | Supercloud22
Video Duration: 17:30

(electronic music) Welcome back to our studios in Palo Alto, California. My name is Dave Vellante, I'm here with John Furrier, who is taking a quick break. You know, one of the early examples that we used of so-called super cloud was Snowflake. We called it a super data cloud. We had, really, a lot of fun with that. And we've started to evolve our thinking. Years ago, we said that data was going to form in the cloud around industries and ecosystems. And Benoit Dageville is a many-time guest of theCube. He's the co-founder and president of products at Snowflake. Benoit, thanks for spending some time with us at Supercloud 22, good to see you. Thank you, thank you, Dave. So, you know, like I said, we've had some fun with this meme. But it really is, we heard on the previous panel, everybody's using Snowflake as an example. Somebody who builds on top of hyperscale infrastructure. You're not building your own data centers. And, so, are you building a super data cloud? We don't call it exactly that way. We don't like the super word, it's a bit dismissive. That's our term. About our friends, our cloud provider friends. But we call it a data cloud. And the vision, really, for the data cloud is, indeed, it's a cloud which overlays the hyperscaler clouds. But there is a big difference, right? There are several ways to do this super cloud, as you name it. The way we picked is to create one single system, and that's very important, right? There are several ways, right. You can instantiate your solution in every region of the cloud and, you know, potentially that region could be AWS, that region could be GCP. So, you are, indeed, a multi-cloud solution. But Snowflake, we did it differently.

We are really creating cloud regions, which are superimposed on top of the cloud providers' infrastructure regions. So, we are building our regions. But where it's very different is that each region of Snowflake is not one instantiation of our service. Our service is global, by nature. We can move data from one region to the other. When you land in Snowflake, you land into one region. But you can grow from there and you can, you know, exist in multiple clouds at the same time. And that's very important, right? It's not different instantiations of a system, it's one single instantiation which covers many cloud regions and many cloud providers. So, we used Snowflake as an example. And we're trying to understand what the salient aspects are of your data cloud, what we call super cloud. In fact, you've used the word instantiate. Kit Colbert, just earlier today, laid out, he said, there's sort of three levels. You can run it on one cloud and communicate with the other cloud, you can instantiate on the clouds, or you can have the same service running 24/7 across clouds, that's the hardest example. Yeah. The most mature. You just described, essentially, doing that. How do you enable that? What are the technical enablers? Yeah, so, as I said, first we start by building, you know, Snowflake regions. We have today 30 regions that span the world, so it's a worldwide system, with many regions. But all these regions are connected together. They are meshed together with our technology, we name it Snowgrid, and that means that, you know, an Azure region can talk to an AWS region, or GCP regions, and as a user of our cloud, you don't really see these regional differences, that regions are potentially in different clouds. When you use Snowflake, you can exist, your presence as an organization can be in several regions, several clouds if you want, both geographic regions and cloud providers. So, I can share data irrespective of the cloud. And I'm in the Snowflake data cloud, is that correct? I can do that today? Exactly, and that's very critical, right? What we wanted is to remove data silos. And when you instantiate a system in one single region, and that system is locked in that region, you cannot communicate with other parts of the world, you are locking data in one region. Right, and we didn't want to do that. We wanted data to be distributed the way the customer wants it to be distributed across the world. And potentially sharing data at world scale. Does that mean if I'm in one region and I want to run a query, if I'm in AWS in one region, and I want to run a query on data that happens to be in an Azure cloud, I can actually execute that? So, yes and no. It is very expensive to do that directly. Because, generally, if you want to join data which are in different regions and different clouds, it's going to be very expensive, because you need to move data every time you join it. So, the way we do it is that you replicate the subset of data that you want to access from one region into another region. So, you can create this data mesh, but data is replicated to make it very cheap and very performant too.
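To make that replication model a bit more concrete, here is a minimal sketch of what setting up a cross-cloud replica can look like with Snowflake's database replication commands, issued through the Python connector. The organization, account, and database names are illustrative, and the exact mechanics (replication groups, failover, and so on) vary with the Snowflake release; this is a sketch of the idea, not Snowflake's internal mechanism.

```python
# Minimal sketch of cross-region, cross-cloud database replication in Snowflake.
# Account identifiers, credentials, and the database name are illustrative.
import snowflake.connector

def replicate_sales_db():
    # Primary account, say on AWS US East: allow a secondary account to replicate.
    primary = snowflake.connector.connect(
        account="myorg-aws_east", user="admin", password="***"
    )
    primary.cursor().execute(
        "ALTER DATABASE sales ENABLE REPLICATION TO ACCOUNTS myorg.azure_west"
    )

    # Secondary account, say on Azure West Europe: create and refresh the replica.
    secondary = snowflake.connector.connect(
        account="myorg-azure_west", user="admin", password="***"
    )
    cur = secondary.cursor()
    cur.execute("CREATE DATABASE sales AS REPLICA OF myorg.aws_east.sales")
    cur.execute("ALTER DATABASE sales REFRESH")  # pulls changed data across clouds

if __name__ == "__main__":
    replicate_sales_db()
```

The refresh is what keeps the replicated subset local, which is the point Dageville makes about cost: joins run against nearby copies instead of moving data across clouds on every query.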
And is the Snowgrid, does that have the metadata intelligence to actually? >> Yes, yes. Can you describe that a little? Yeah, Snowgrid is both a way to exchange metadata. So, each region of Snowflake knows about all the other regions of Snowflake. Every time we create a new region, the metadata is distributed over our data cloud, so not only does each region know about all the other regions, it knows every organization that exists in our cloud, where this organization is, and where data can be replicated by this organization. And then, of course, it's also used as a way to exchange data, right? So, you can exchange data at scale, whatever the data size. And I was just receiving an email from one of our customers who moved more than four petabytes of data, cross-region, cross-cloud provider, in, you know, a few days. And it's a lot of data, so it takes some time to move. But they were able to do that online, completely online, and switch over to the other region, which is very important also. So, one of the hardest parts about super cloud that I'm still struggling through is the security model. Because you've got the cloud as your sort of first line of defense. And now we've got multiple clouds, with multiple first lines of defense, I've got a shared responsibility model across those clouds, I've got different tools in each of those clouds. Do you take care of that? Where do you pick up from the cloud providers? Do you abstract that security layer? Do you bring in partners? It's very complicated. No, this is a great question. Security has always been the most important aspect of Snowflake since day one, right? This is the question that every customer of ours has. You know, how can you guarantee the security of my data? And, so, we secure data really tightly in region. We have several layers of security. It starts by encrypting all data at rest. And that's very important. A lot of customers are not doing that, right? You hear of these attacks, for example, on cloud, where someone left their buckets exposed. And then, you know, you can access the data because it's not encrypted. So, we are encrypting everything at rest. We are encrypting everything in transit. So, a region is very secure. Now, you know, from one region, you never access data from another region in Snowflake. That's why, also, we replicate data. Now, the replication of that data across regions, or the metadata, for that matter, is really our least secure point, so Snowgrid ensures that everything is encrypted, we have multiple encryption keys, and they are stored in hardware security modules. So, we built Snowgrid such that it's secure and it allows very secure movement of data.
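The "bucket left unencrypted" failure mode Dageville alludes to is easy to check for on the customer side, even though Snowflake's own at-rest encryption happens well before data reaches object storage. A small illustrative sketch using boto3; the bucket name is made up, and this verifies only the default server-side encryption setting, one narrow slice of the story:

```python
# Check whether an S3 bucket enforces default server-side encryption, the kind of
# misconfiguration behind "someone left their bucket exposed". The bucket name is
# illustrative; requires AWS credentials with s3:GetEncryptionConfiguration.
import boto3
from botocore.exceptions import ClientError

def bucket_encrypted(bucket: str) -> bool:
    s3 = boto3.client("s3")
    try:
        cfg = s3.get_bucket_encryption(Bucket=bucket)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            return False  # no default encryption configured on this bucket
        raise
    rules = cfg["ServerSideEncryptionConfiguration"]["Rules"]
    algos = [r["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"] for r in rules]
    return any(a in ("AES256", "aws:kms") for a in algos)

if __name__ == "__main__":
    print(bucket_encrypted("my-analytics-bucket"))
```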
Okay, so, I know we're kind of getting into the technology here a lot today, but because super cloud is the future, we actually have to have an architectural foundation on which to build. So, you mentioned a bucket, like an S3 bucket. Okay, that's storage, but you're also, for instance, taking advantage of new semiconductor technology, like Graviton, as an example, that drives efficiency. You guys talk about how you pass that on to your customers, even if it means less revenue for you. So, awesome, we love that, you'll make it up in volume. And, so. >> Exactly. How do you deal with the lowest common denominator problem? I was talking to somebody the other day and this individual brought up what I thought was a really good point. What if one cloud, let's say AWS, has the best silicon, and can run the fastest, the least expensive, and at the lowest power, but another cloud provider hasn't caught up yet. How do you deal with that delta? Do you just take the best of each and try to respect that? >> No, it's a great question. I mean, of course, our software is abstracting all the cloud providers' infrastructure, so that when you run in one region, let's say AWS, or Azure, it doesn't make any difference, as far as the applications are concerned. And this abstraction, of course, is a lot of work. I mean, really, a lot of work. Because it needs to be secure, it needs to be performant, on every cloud, and it has to expose APIs which are uniform.
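None of those internals are public, but the shape of the problem is familiar: expose one uniform storage API and hide each provider's SDK, error types, and retry behavior behind it, which is exactly the difference Dageville goes on to describe with block storage. A purely illustrative sketch of that pattern; the class and method names are invented for this example and say nothing about how Snowflake actually structures this layer:

```python
# Illustrative only: a uniform blob-storage interface over two providers, with
# per-provider retry classification. Names are invented for this sketch.
import time
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """One API the rest of the system codes against, whatever the cloud."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def is_retryable(self, err: Exception) -> bool: ...

    def get_with_retry(self, key: str, attempts: int = 5) -> bytes:
        for i in range(attempts):
            try:
                return self.get(key)
            except Exception as err:   # each provider raises different error types
                if i == attempts - 1 or not self.is_retryable(err):
                    raise
                time.sleep(2 ** i)     # simple exponential backoff
        raise RuntimeError("unreachable")

class S3BlobStore(BlobStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3, self._bucket = boto3.client("s3"), bucket

    def get(self, key):
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    def put(self, key, data):
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def is_retryable(self, err):
        code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
        return code in ("SlowDown", "InternalError", "RequestTimeout")

class AzureBlobStore(BlobStore):
    def __init__(self, conn_str: str, container: str):
        from azure.storage.blob import BlobServiceClient
        self._container = BlobServiceClient.from_connection_string(
            conn_str).get_container_client(container)

    def get(self, key):
        return self._container.download_blob(key).readall()

    def put(self, key, data):
        self._container.upload_blob(key, data, overwrite=True)

    def is_retryable(self, err):
        return getattr(err, "status_code", None) in (429, 500, 503)
```

The is_retryable hook is the point: as he notes next, the errors and retry semantics differ enough per provider that they have to be classified per backend, even when the concept, block or object storage, is nominally the same.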

And, you know, cloud providers, even though they have potentially the same concept, let's say block storage, the APIs are completely different. The way these systems are secured is completely different. The errors that you can get, and the retry mechanism, are very different from one cloud to the other. The performance is also different. We discovered that when we started to port our software. And we had to completely rethink how to leverage block storage in one cloud versus another, just because of performance too. And, so, we had, for example, to stripe data. So, all this work is work that you don't need to do as an application, because our vision, really, is that applications which are running in our data cloud can be abstracted from these differences. And we provide all the services, all the workloads that these applications need. Whether it's transactional access to data, analytical access to data, managing logs, managing metrics, all of this is abstracted too, so that they are not tied to one particular service of one cloud. And distributing these applications across many regions, many clouds, is very seamless. So, Snowflake has built, your team has built a true abstraction layer across those clouds that's available today? It's actually shipping? Yes, and we are still developing it. You know, transactional, Unistore, as we call it, was announced at our last Summit. So, it is still, you know, a work in progress. >> You're not done yet. But that's the vision, right? And that's important, because we talk about the infrastructure, right. You mention a lot about storage and compute. But it's not only that, right. When you think about applications, they need to use a transactional database. They need to use an analytical system. They need to use machine learning. So, you need to provide, also, all these services which are consistent across all the cloud providers. So, let's talk developers. Because, you know, you think Snowpark, you guys announced a big application development push at the Snowflake Summit recently. And we have said that a criterion of super cloud is a superPaaS layer, people wince when I say that, but okay, we're just going to go with it. But the point is, it's a purpose-built application development layer, specific to your particular agenda, that supports your vision. >> Yes. Have you essentially built a purpose-built PaaS layer? Or do you just take them off the shelf, standard PaaS, and cobble it together? No, it's a custom build. Because, as you said, what exists in one cloud might not exist in another cloud provider, right. So, we had to build all these components that applications need. And that goes from machine learning, as I said, to transactional and analytical systems, the entire thing, so that it can run in isolation physically. And the objective is the developer experience will be identical across those clouds? Yes, developers don't need to worry about the cloud provider. And, actually, our system will have, we didn't talk about it, but a marketplace that we have, which allows, actually, to deliver. We're getting there. Yeah, okay. (both laughing) I won't divert. No, no, let's go there, because the other aspect of super cloud that we've talked about is the ecosystem. You have to enable an ecosystem to add incremental value, it's the power of many versus the capabilities of one. So, talk about the challenges of doing that. Not just the business challenges but, again, I'm interested in the technical and architectural challenges.
Yeah, yeah, so, it's really about, I mean, the way we enable our ecosystem and our partners to create value on top of our data cloud is via the marketplace, where you can put shared data on the marketplace, provide listings on this marketplace, which are data sets. But it goes way beyond data. It goes all the way to applications. So, you can think of it a little bit like the iPhone, all right. Your iPhone is great not so much because the hardware is great, or because of iOS, but because of all the applications that you have. And all these applications are not necessarily developed by Apple, basically. So, we are, it's the same model with our marketplace.
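The identical-developer-experience point from a moment earlier is easiest to see through Snowpark, which Vellante mentions above: the same dataframe-style code runs unchanged whether the account behind the session sits on AWS, Azure, or GCP. A minimal sketch, with illustrative connection parameters and table names:

```python
# Minimal Snowpark sketch: the same code regardless of which cloud hosts the account.
# Connection parameters and table/column names are illustrative.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

session = Session.builder.configs({
    "account": "myorg-azure_west",   # could equally be an AWS- or GCP-hosted account
    "user": "analyst",
    "password": "***",
    "warehouse": "ANALYTICS_WH",
    "database": "SALES",
    "schema": "PUBLIC",
}).create()

revenue = (
    session.table("ORDERS")
    .filter(col("ORDER_DATE") >= "2022-01-01")
    .group_by("REGION")
    .agg(sum_(col("AMOUNT")).alias("REVENUE"))
)
revenue.show()
session.close()
```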

We foresee an environment where providers and partners are going to build these applications. We call them native applications. And we are going to help them distribute these applications across clouds, everywhere in the world, potentially. And they don't need to worry about that. They don't need to worry about how these applications are going to be instantiated. We are going to help them to monetize these applications. So, that unlocks, you know, really, all the partner ecosystem that you have seen, you know, with something like the iPhone, right? It has created so many new companies that have developed these applications. Your detractors have criticized you for being a walled garden. I've actually used that term. I used terms like de facto standard, which are maybe less sensitive to you, but, nonetheless, we've seen de facto standards actually deliver value. I've talked to Frank Slootman about this, and he said, Dave, we deliver value, that's what we're all about. At the same time, he even said to me, and I want your thoughts on this, look, we have to embrace open source where it makes sense. You guys announced Apache Iceberg. So, what are your thoughts on that? Is that to enable a developer ecosystem? Why did you do Iceberg? Yeah, Iceberg is very important. So, just to give some context, Iceberg is an open table format. Right. It was first developed by Netflix, and Netflix open-sourced it in the Apache community. So, we embraced that open source standard because it's widely used by many companies. And, also, many companies have really invested a lot of effort in building big data, Hadoop solutions, or data lake solutions, and they want to use Snowflake. And they couldn't really use Snowflake, because all their data was in open formats. So, we are embracing Iceberg to help these companies move to the cloud.
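Because Iceberg is an open table format rather than a Snowflake-specific one, the same tables stay readable by engines outside Snowflake. A purely illustrative PySpark sketch of that "open" half of the story; the catalog name, warehouse path, and table are made up, and the Iceberg Spark runtime version must match your Spark version:

```python
# Read an Iceberg table from Spark: any Iceberg-aware engine can query the same
# table definition and data files. Catalog, path, and table names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-read")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

orders = spark.table("lake.sales.orders")
orders.filter("order_date >= '2022-01-01'").groupBy("region").count().show()
```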

But why we have been reluctant about direct access to data: direct access to data is a little bit of a problem for us. And the reason is, when you have direct access to data, now you have direct access to storage. Now you have to understand, for example, the specificity of one cloud versus the other. So, as soon as you start to have direct access to data, you lose your cloud-agnostic layer. You don't access data with APIs. When you have direct access to data, it's very hard to secure your data. Because you need to grant access, direct access, to tools which are not protected. And you see a lot of hacking of data because of that. So, direct access to data is not serving our customers well, and that's why we have been reluctant to do that. Because it is not cloud agnostic. You have to code for that, you need a lot of intelligence, which is why we want API access, we want open APIs. That's, I guess, the way we embrace openness: by open APIs, versus accessing data directly. iPhone. Yeah, yeah, iPhone, APIs, you know. We define a set of APIs because, you know, the implementation of the APIs can change, can improve. You can improve compression of data, for example. If you open direct access to data now, you cannot evolve. My point is, you made a promise of a governed, secure, data-sharing ecosystem. It works the same way, so that's the path that you've chosen. Benoit Dageville, thank you so much for coming on theCube and participating in Supercloud 22, really appreciate that. >> Thank you, Dave. It was a great pleasure. All right, keep it right there, we'll be right back with our next segment, right after this short break. (electronic music)