Transcript

John Furrier: Welcome back everyone for day two of theCUBE. We're live here in Atlanta for Supercomputing 2024, or SC24. I'm John Furrier with theCUBE, with Dave Vellante, Savannah Peterson, Chris McCall; the whole team is here. Wall-to-wall coverage with SiliconANGLE and theCUBE. We've got a great panel here talking about the AI factory. We've got some CUBE alumni here. We've got Arun, who's the SVP of Portfolio Management at Dell. Welcome back. Good to see you. And we've got Hassan, who's the Head of Software Products and Ecosystem at Broadcom, and Vasali, who's the Executive Officer of Global Ecosystem Partners at Denvr Dataworks, all returning to theCUBE. Great to see you guys. Thanks for coming on.

Thank you for having-

John Furrier: So AI factories are hot. First of all, Dell has an amazing trailer. They brought the factory. Their booth has got the great product, so congratulations. The machines look phenomenal. They look great. They've got all the great gear in them. I think it's going to be a top seller, so congratulations.

Thank you.

John Furrier: Broadcom, again, been doing the magic with the chips. Thanks for what you do, and we'll talk about that as well. And Denvr, you guys are doing the large-scale factory. So let's get into it. You've got a little prop here, which I love. You brought the factory model. The model home, is that what you call it? Tell us what you've got here.

Vasali: Yeah, I brought my private zone, the Denvr Dataworks Private Zone, which is a modular super cluster. It can house 128 nodes times eight GPUs, so more than 1,000 GPUs, in an extremely efficient footprint: less than 900 square feet, 1.5 megawatts of power, and water-free with liquid immersion cooling. We can also do liquid to the chip. And our goal here is to work in partnership with Dell and Broadcom to populate the GPU servers as well as the networking and storage racks. So what you see here is the liquid immersion tanks, and what you see here is the networking and storage racks, all of which are going to be populated with Dell and Broadcom networking, hopefully soon.

John Furrier: What I love about this wave we're in right now is that last year we saw the excitement and the confidence that AI and HPC are coming together. That's clear. Now you're starting to see the proof points: the hardware, the systems that are now out, and now the components and the footprint are here. So take us through the benefits of this. What's changed from last year to this year, and why is this on the roadmap? Why is this happening? Is it footprint? Is it all the efficiencies? What's the main driver for the new change of the factory?

Vasali: Yeah, and I think you set it out very nicely. Customers are looking for more open architectures, and they also want total cost of ownership savings when it comes to power, cooling, and the amount of space that is required. We all know that everybody's running out of data center power and space, and most AI workloads are very power-hungry. So how do you figure out how to deploy AI workloads in the fastest possible manner, at the lowest TCO, with the lowest power and cooling requirements? That is exactly why we designed our Denvr Dataworks Private Zone in partnership with Broadcom and Dell: so that we can give customers different choices and options as well as an open architecture. Liquid immersion cooling, as well as liquid-to-the-chip cooling, really results in efficient power usage as well as a compact footprint.
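A quick back-of-the-envelope check on the Private Zone figures above (a sketch; the node count, GPU count, power, and floor space come from the panel, while the derived per-GPU and per-square-foot numbers are our own arithmetic):

```python
# Back-of-the-envelope check on the Private Zone figures quoted above.
# From the panel: 128 nodes x 8 GPUs, 1.5 MW, under 900 square feet.
# The derived numbers below are our own arithmetic, not panel figures.

nodes = 128
gpus_per_node = 8
gpus = nodes * gpus_per_node              # 1,024 -> "more than 1,000 GPUs"

power_mw = 1.5                            # megawatts, as quoted
floor_sqft = 900                          # square feet, as quoted

watts_per_gpu = power_mw * 1e6 / gpus
density_kw_per_sqft = power_mw * 1e3 / floor_sqft

print(f"{gpus} GPUs at ~{watts_per_gpu:.0f} W per GPU, all-in")
print(f"~{density_kw_per_sqft:.1f} kW per square foot of floor space")
```

That works out to roughly 1.5 kW per GPU all-in and about 1.7 kW per square foot, which is the kind of density that immersion or direct-to-chip liquid cooling makes practical.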
Vasali: Our goal is to provide customers a modular data center campus. We can actually put four of these together to create a large data center campus in less than six months. So think about how fast we can help an enterprise deploy these AI workloads in a very limited time.

John Furrier: Yeah, what's exciting is that you mentioned AI workloads. The characteristics of those workloads today are demanding the new capabilities you're showing. Arun, Hassan, this is also an area you guys are working on. That's the other problem area: I need less power and more performance from the machines, and I've got energy to deal with. So space is limited, power and cooling are a huge constraint, and I want more horsepower. I want more speeds and feeds. I want price performance. Those are the three hardest problems that I would say summarize this. Could you guys talk about how that factors in? Because the workloads are hungry. They're coming on fast.

Arun: Yep. So let me tell you what we are doing at Dell. Clearly, if you look at rack power in the previous generations, they were 20 kilowatt racks; the average enterprise rack is 20 kilowatts. The first generation of AI with the XE9680 was a 50 kilowatt rack. Now with GB200 and the next generation, they'll be greater than 100 kilowatt racks. There is no way air cooling can do that level of cooling for 100 kilowatt racks. We need liquid cooling, direct-to-chip liquid cooling. We need open standards, which means that networking and servers all have to fit into a rack-scale infrastructure. So a data center like what she's showing there will be an amazing place for that rack-scale infrastructure to work. Efficient power, efficient performance. That's what we're seeing right now in the industry.

John Furrier: Solving those problems is key. Networking has come up a lot. Hassan, Broadcom, you guys have been nailing the efficiency side of it and also the energy requirements.

Hassan: Yeah, so John, what I'm holding up here is a Tomahawk 5. It's a 51-

John Furrier: Surprise prop. Hold on. Zoom in on that, Tony.

Hassan: There you go. So this is billions of transistors, 51.2 terabits. Dell has been a great partner, building systems based on this and being one of the first to market with it. Now, this is at least a generation ahead of anybody else in the industry. So think of it this way: when Arun's team is building systems based on this, they can replace six previous-generation switches with one. That's about a 75% savings in power, space, and cooling.
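To see why that radix jump collapses the fabric, here is a rough switch count for a non-blocking two-tier leaf/spine network at 400G per port (a sketch; the endpoint count is illustrative, this simple model shows roughly 4:1, and the 6:1 consolidation Hassan cites presumably also credits tier reduction in larger fabrics):

```python
# Why a higher-radix ASIC shrinks the fabric: switch count for a
# non-blocking two-tier leaf/spine network. Ports per ASIC follow from
# dividing switching capacity by a 400G port speed; the endpoint count
# is illustrative.
import math

def two_tier_switches(endpoints: int, ports: int) -> int:
    """Leaf/spine with half of each leaf's ports facing endpoints."""
    down = ports // 2
    leaves = math.ceil(endpoints / down)
    spines = math.ceil(leaves * down / ports)
    return leaves + spines

endpoints = 1024  # e.g. one 400G port per GPU node; illustrative
for tbps in (12.8, 25.6, 51.2):
    ports = int(tbps * 1000 // 400)
    count = two_tier_switches(endpoints, ports)
    print(f"{tbps:>5} Tb/s ASIC ({ports}x400G): {count} switches")
```

For these assumptions the fabric shrinks from 96 switches at 12.8 Tb/s to 24 at 51.2 Tb/s, before counting the power and optics saved per box.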
Hassan: Now, this also couples with the connectivity into the server: we have introduced a portfolio of AI NICs. That NIC is an 18 watt, 400 gig NIC called Thor 2. That's the lowest power in the industry. Then, to connect both of these, we have the best SerDes in the industry, so you can get four meter reach over copper. This means that when you are connecting within the rack and across racks, you don't need to use optics, which consume a lot of power. You save on power, you save on cost. And lastly, I would say we have other major investments. For example, we are investing in something called linear drive optics. These are optics where we remove the DSPs, so there are fewer electrical components: less power usage, lower cost, but also very reliable. And we are also showing at the show something called co-packaged optics. This is something we have invested in for multiple generations, but as systems become more dense, you have to put more optics in, and you can now co-package those optics and get up to 70% savings from a power perspective. So absolutely, this is top of mind for us when we are doing this work and when we are working with our partners like Dell and Denvr.

John Furrier: It's interesting you mention the components. I saw your booth; the fiber chips have all that innovation in them. It's like a super chip. And it takes so much complexity out from a component standpoint, but also energy, the efficiency. So this is the innovation. To me, this panel encapsulates the show, because you've got real engineering innovation going on, happening, the fruits coming off the tree in this wave right now, and then the workloads that need to run on it. So if you zoom out, you say, "Okay, the engineering is getting done. The workloads now have form factors." So the three problems, space, energy, and price performance, are all hitting. That's pretty much what's happening here. Now the workloads. What's going on in the workloads?

Vasali: I will start. We see some of the top use cases being training as a service, inference as a service, model as a service, and RAG as a service. And I think networking is extremely important in all of the use cases I just described. You don't want the network to ever get in the way of applications or workloads. And I feel like with AI, networking has become glamorous again. It's back to the nineties, when networking was just taking off. I've been in the industry for more than 27 years, we have worked with Broadcom at my previous employers, and I think it's just amazing to see how the whole cluster network and fabric need to come together to make sure there is low latency and super high throughput that really leverages the power of the compute available from all these powerful GPUs from NVIDIA and others.

John Furrier: You mentioned back to the nineties; that's come up a lot. And again, not to date myself, but I remember when open systems really kicked in: the OSI model, but really it was TCP/IP, and then all the standards below that kicked in. A real wave of innovation happened, and a whole other thing happened with it. What was really interesting was that new things kept coming out. More engineering, so you saw more revisions. And what was key was that everything worked downstream with each other. So you bought something earlier, and the next thing worked with it, but it got better. I won't say leapfrog, but it was an advancement in price performance. We're living that now. What you guys are doing, we're seeing it from last year to this year. Talk about that innovation, because there's real innovation going on. You mentioned networking, so everything's being innovated. So can you guys talk about the Dell-Broadcom piece? Because there'll be something next year.
You mentioned that you're ahead of the competition because of the engineering. So this is no-BS time. This is get down and dirty and get the engineering done.

Arun: I'll start, and then Hassan can talk. So we are working really closely with Broadcom, and I'll call out a few key tenets. The Dell-Broadcom partnership is about open systems. We want Ethernet as the technology of choice; we believe Ethernet will be the technology of choice for AI networking, with ever-increasing speeds and bandwidth. Right now we have 800 gig NICs, we have 800 gig switches, and we are getting to 1.6T. That's where we are working closely with Broadcom. We want to be time-to-market, first-to-market, but we want interoperability. You buy an asset today, you want that asset to live for 18 to 24 months, and when the next asset comes along, it has to be interoperable. Our technology innovation is in the latest and greatest: 1.6T and Tomahawk 6, which is coming out later on. We are working very closely with Broadcom to partner on that.

Hassan: Just to piggyback off Arun: and then there are the workloads. This workload is very unique; it has different characteristics. I think this is why Denvr had the vision to say, "Look, a regular cloud cannot just do GPU as a service as-is, because the workload is different in its characteristics. You have very few flows, and they are very large flows. If you're doing training, these jobs run for a very long time." And what we have heard from customers initially is, "Look, 57 to 60% of the time is spent in networking. I spent $2 billion on this infrastructure and I can't use it 60% of the time. I bought this brand new Ferrari and you're telling me to park it in the garage 20 out of 30 days."
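Putting that Ferrari complaint into numbers (the $2 billion and the roughly 60% figure come from the panel; the arithmetic and the halving scenario are ours):

```python
# The panel's numbers: ~60% of wall-clock lost to networking on a
# $2B cluster. The arithmetic and the what-if below are our own.
capex = 2_000_000_000       # "$2 billion on this infrastructure"
comm_fraction = 0.60        # "57 to 60% of the time is spent in networking"

print(f"Capital doing compute:         ${capex * (1 - comm_fraction) / 1e9:.1f}B")
print(f"Capital stalled on the fabric: ${capex * comm_fraction / 1e9:.1f}B")

# What better load balancing and congestion control buy: halving the
# communication share lifts compute from 40% to 70% of wall-clock.
speedup = (1 - comm_fraction / 2) / (1 - comm_fraction)
print(f"Speedup from halving comm time: {speedup:.2f}x")
```

Halving communication time is a 1.75x speedup on the same hardware, which is why the load balancing and congestion control Hassan describes next are worth so much.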
Hassan: But how do you solve these problems? How do you get advanced load balancing capabilities? How do you do congestion control? How do you recover from failures? Recovering from failures is a very important aspect, in addition to density and performance. What we have done, working with Dell and seeking feedback from people like Denvr, is incorporate these capabilities, which Dell has now introduced as systems, and we are in these conversations about how we take it to the next level. As you look a couple of years out, you'll see products come out that reflect that.

John Furrier: And then the openness is critical. Again, back to the customers who want to run the workloads, we're seeing this: "Okay, transformers came out. Thank you very much." That takes all of what I call the old AI algorithms and levels them up. And then after transformers hit, the architecture changed. We're seeing you guys do that, and now the software gets better. So you have new software, then the architecture changes, and then new software comes out again. So software's driving it, and anyone doing GenAI has to have the software asset driving everything. We know that; we've been talking about it on theCUBE all day long. But now when you get to the performance of the hardware, or the system: now it's a system. It's not just a box. This is a key point. So everyone wants to know, "Do I use liquid cooling here? Do I do direct-to-chip cooling there? Do I use air cooling here?" You don't have to have just one thing. You've got to make a selection. So the successful people are building the systems and then connecting them to the cloud.
So it's not about the cloud versus on-prem. It's: what's the system? We're in a systems revolution. Again, back to the nineties. It's not just networking, it's everything. So can you guys opine on that? Because I think this is a point everyone in the customer base we talk to is... not struggling with, but working hard to figure out: "What does my system architecture look like? Because I'm going to tie business value to it, and this isn't just email."

Yeah.

Arun: Yeah. Let me start. Great point. I'll tell you, we do a lot of the big deals right now. We were selling boxes 18 months ago, two years ago. Now the real conversation is: what is a rack-scalable system? How do the server, the storage, and the network all connect together and perform at peak capacity? Hassan is absolutely right: if you build a $4 million rack and you're sitting on a network that's not optimized, it doesn't work. So what customers are asking of us at Dell is, "Get me a rack-scalable system that is fully performance-tuned and optimized out of the factory, that I can plug into a modular data center like that, plug and play." Time to value is about optimizing the system, and that's what we are working on right now.

Hassan: This is where we work very closely with Dell in this partnership; Dell is extremely good at solutions. We know that when people are rolling this out, when Denvr is rolling this out, they don't want, "Here is a switch, here is a NIC, here is storage, here is some operating system." It's the entire thing. And what we have done, working with Dell, is make validated designs available for how you build the end-to-end system, completely validated from a hardware, software, and optics perspective, so it can just be consumed off the bat.

John Furrier: Go ahead.

Vasali: And I think that's very important. On top of what Dell and Broadcom provide, we create a full software AI stack, which is very, very important. Because again, we want to make sure that customers are benefiting from the open architecture, the SONiC networking operating system, and the other open architectures in the AI stack. So it starts with optimizing the data center for power, cooling, and space, then layering the networking and storage fabric on top of it, which is based on open standards, then doing platform orchestration, which is based on Kubernetes, et cetera, and then providing REST APIs for customers to bring their AI workloads to address their key business outcomes: employee productivity enhancement, manufacturing improvements, retail insights into what customers are buying, healthcare drug discovery acceleration, and whatnot. So it all goes together, from the software open architecture perspective, with optimized power, space, and cooling.
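What that top REST layer can look like from the customer's side, as a purely hypothetical sketch: the base URL, route, model name, and payload fields below are placeholders invented for illustration, not Denvr's actual API.

```python
# Hypothetical sketch of the customer-facing REST layer described above.
# Endpoint, route, model name, and payload shape are all illustrative
# placeholders, not Denvr's actual API.
import json
import urllib.request

BASE_URL = "https://api.example-ai-cloud.com/v1"   # placeholder URL

def generate(prompt: str, model: str = "open-llm-8b") -> str:
    """POST a completion request and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 256}).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]

if __name__ == "__main__":
    print(generate("Flag anomalies in yesterday's transactions."))
```

The point of the layering is that everything below this call, from SONiC on the fabric to Kubernetes orchestration, is invisible to the application developer.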
John Furrier: Yeah, what you guys are doing is so fascinating, because of the old expression: be careful what you ask for, you might get it. Well, guess what? We're getting it. It's party time in tech. I did a whole segment on our podcast about that. Every layer of the stack is happening. And you mentioned networking: it's not just a box, and servers are networking. It's networking in what you're building. So you have an opportunity to build the system that you want, and you can do it right. But it's not so much out of the box like the old days; it is being built. And I think that's the key. When we saw the IT build-out in that first wave, it was rack and stack. You have built it out here. It's a similar concept, but it's a system. Take us through how you did it, what some of the successes are, and how you make the choices. Do I use this for that? Because once you get the components, the networking from Broadcom, the Dell systems, how do you build it?

Vasali: Yeah, that's a great question. I think it is very, very important to keep in mind what customers are looking to do with their generative AI, and working backwards from that is always very helpful. We learn from our customers that they are looking for employee productivity enhancements, code generation, as I mentioned, and in financial services, risk management and fraud detection, things like that. So you take those business outcomes and map them to the key use cases, which are training as a service or inference as a service. Many customers would start from the open-source LLMs out there, and then they want to fine-tune and then add RAG, Retrieval-Augmented Generation, on top. We understood all of those requirements very clearly from our customers and partners, and we are making sure that what we are building at Denvr Dataworks with our Denvr cloud gives them agility, choice, and flexibility at the right scale to make it their own.
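The fine-tune-then-RAG pattern reduces to "retrieve context, then prompt the model with it." Here is a deliberately tiny sketch of that step (a toy bag-of-words retriever and invented documents stand in for the embedding model and vector database a real deployment would use):

```python
# Toy sketch of the retrieve-then-generate (RAG) step: a bag-of-words
# retriever stands in for a real embedding model plus vector database,
# and the documents and query are invented for illustration.
import math
from collections import Counter

DOCS = [
    "Q3 fraud losses fell 12% after the new transaction-scoring model.",
    "Code-generation assistants cut review turnaround by a third.",
    "Retail basket analysis shows strong accessory attach rates.",
]

def vec(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = vec(query)
    return sorted(DOCS, key=lambda d: cosine(q, vec(d)), reverse=True)[:k]

query = "How did fraud detection perform?"
context = retrieve(query)[0]
# The augmented prompt is what actually goes to the fine-tuned LLM.
print(f"Context: {context}\nQuestion: {query}")
```

Fine-tuning adapts the open-source model's weights once; retrieval injects fresh, customer-specific context on every request, which is why the two are usually combined.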
Vasali: And it was very, very important for us to create this optimized data center footprint, as well as the software AI stack I just talked about, so that it meets all of the business outcomes and use cases I mentioned. We also have a very talented team coming from enterprise and banking environments, a CEO who has done multiple startups before, and engineers and a leadership team with tremendous AI and data center backgrounds. People do matter: when you have a great idea, you want to get the right people to build the right product together.

John Furrier: It's a builder culture right now. And again, that's why I like that back-to-the-nineties comment. I want to ask you a specific question around disruption enablement. I used that word very intentionally: it's disrupting, but it's enabling. You saw that with TCP/IP and networking in the old days. All the new open standards created enablement, which created opportunities for entrepreneurs. Companies were born. New use cases. What are the disruptive enablers that you're seeing come out of this on your end as you roll out the capabilities? Is it faster software? Are people happier? Are people getting along? Are apps being built faster? What are some of the outcomes that you're seeing?

Vasali: And I think disruption is key. When ChatGPT came out, we didn't know the disruption it was going to drive; in two months, it had wildfire-like adoption. So we want to make sure that we are able to disrupt the AI industry by providing customers the right footprint. If you want to deploy an AI workload, you should not have to wait for two years, and with the lack of power, space, and cooling technologies available today, it could take two years. So the first disruption we wanted to make is: how do we create this minimal footprint with the largest power capacity possible, so that we can deploy a cluster in less than six months? That's the first disruption. The second is the whole open architecture we have created with Dell and Broadcom support, with SONiC. It's so sleek. When we are deploying the cluster, our entire networking is completely automated. It's literally a touch of a button, and the whole network gets provisioned, then the storage gets provisioned, then the platform orchestration gets provisioned.
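That one-button flow is, at heart, a strictly ordered pipeline: fabric, then storage, then platform. A toy sketch of the ordering (the step names and bodies are placeholders; the real automation drives SONiC, the storage fabric, and Kubernetes through their own APIs):

```python
# Toy sketch of the one-button provisioning order described above:
# network fabric first, then storage, then platform orchestration.
# Step bodies are placeholders for real SONiC / storage / Kubernetes calls.

def provision_network():
    print("fabric: SONiC configs pushed to leaf and spine switches")

def provision_storage():
    print("storage: volumes provisioned on the storage fabric")

def provision_platform():
    print("platform: Kubernetes orchestration layer bootstrapped")

# Strictly ordered: each layer depends on the one before it.
PIPELINE = [provision_network, provision_storage, provision_platform]

def deploy_cluster():
    for step in PIPELINE:
        step()

if __name__ == "__main__":
    deploy_cluster()
```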
Vasali: So being open while at the same time providing choice and flexibility, and providing the lowest TCO, is also very-

John Furrier: The flexibility: you can stand up infrastructure for them very fast. Arun, talk about where this goes, because you guys are supplying all the components, servers, everything, and Broadcom's got the networking; everything's working together. What does the portfolio look like? What are some of the choices people have, and how are they thinking through this right now?

Arun: I think it's a great question. As we go forward, there are many choices. When you think about rack-scale architecture, you have NVL72, NVL4: highly scalable systems, massive systems. Then there are people who want liquid cooling and 100 kilowatt racks, so there are liquid cooling choices. Then there's the choice of PCIe GPUs for inference, inference as a service; you don't need these rack-scale systems, you can have inference as a service. So what we are trying to do at Dell is offer customers the entire range, from extremely large-scale rack-scalable systems for training as a service all the way down to enterprises who just want a small-scale deployment, say SLMs, and we have to have solutions for all of it. Dell is trying to provide the gamut of choice across the entire AI infrastructure as we go forward.

John Furrier: So talk about the openness of this, because now the products are coming out, and you guys are innovating in all these areas, which frankly is going to accelerate more capability and performance at lower energy. What's next on the horizon? Obviously Ethernet's open, and that's getting faster. What are the hot areas, so to speak? We've got cooling. That's good. It's hot. That's a hot area too.

Hassan: Yeah. So as Arun highlighted, there are different classes of customers. We work very closely with the hyperscalers, and there is this race to singularity: how can I mimic the human brain? There's massive investment that goes into that. The size of these large language models keeps growing; every time you go to the next generation, it's 10X the compute. So when we look at this class of customers-

Thank you so much.

Hassan: You are looking at data centers that consume... Yes, power is a huge issue, like we discussed, but now you're talking about clusters that are not 32,000 nodes or a hundred thousand nodes; they're half a million, a million GPUs. And once you're doing this, you're not even in one data center. You're actually going across data centers that may be hundreds of kilometers apart. So how do you solve that problem? Now your supercomputer is spread across hundreds of kilometers. How do you deal with cooling? How do you deal with power? How do you deal with latency in those kinds of environments? And then for the other class of customers we have talked to, inference is becoming important as these new GPU vendors come out. We've talked a lot about scale-out for Ethernet, for example. What we are also now being asked is, "Look, there is this one domain, the scale-up domain in networking.
How can we move this to Ethernet?" And we believe that Ethernet can deliver the latency required for these use cases. It has the other capabilities too, from link-level retries to supporting small packets at very high throughput. We believe we can solve that problem as well, and that's something that we have-

John Furrier: Well, you guys are great. I wish I had more time; I wish I had an hour podcast on this one topic. But I want to wrap up and get your visions on AI factories on the record. This is basically the AI factory story playing out in front of us, because the end customers at the end of the day are building big apps, and they're going to have this new category of generative AI in them. They're going to be big workloads, and they're going to provide a lot of business value and societal value, whether it's climate change or business productivity. Because we're going to see that productivity wave coming. That's the killer app, productivity, and the economics are... We've been doing all kinds of forecasts; Dave's got trillions in his mind, and we're like, "Okay, how do you scope this?" But it's clear the magnitude is huge. So this is the AI factory. What's the vision? Where does this go next? Arun, we'll start with you.

Arun: I think what we're going to see is this AI factory concept going to 500 kilowatt racks. We're at 150 today; it's going to go to 500. Validated software, validated hardware, and interoperability across systems on open standards is where the industry is going. Go back to the nineties: open standards won the industry. I think that's what's going to happen here too. In the next three to five years, you're going to see massive rack-scale systems on open standards. That's what I see.

John Furrier: Hassan.

Hassan: I agree with Arun. We'll see the scale become bigger and bigger over the next four years, but we'll also see this go down into other verticals. We will see enterprise adopters. And from a networking perspective, we believe Ethernet will win. It's already on its way, and it can scale from the largest clusters on the planet to whatever optimizations are required for inference and other use cases.

John Furrier: Ethernet continues to thunder away with its value. Vasali, take us home. You've got the factory right there. Your vision of where this goes next.

Vasali: I think where this goes next is creating a highly scalable data center campus that can be deployed in less than six months, so that customers who have invested in these very expensive GPUs and the power can harness the benefits with their AI workloads and be competitive in the market. Their time to market improves, their revenue improves. So we are laser-focused on serving those customer requirements, partnering with Dell and Broadcom to create as small a footprint as possible, as modular as possible, and as efficient as possible, so that we can deploy a new data center for the customer, in this form factor, in the least possible amount of time.

John Furrier: Bigger, faster, more performance, less power usage. We're here in theCUBE factory doing our part, sending all the content out from Supercomputing. Of course, we're going to have our digital twin program, and we're going to have in-studio content continuing. <inaudible> we've been the digital twin, because you mentioned factory, of course. Great job. Thank you so much for coming on.

Thank you.

John Furrier: AI factories are here. They're real. And they will change the game, because the workloads are hungry and they're in demand.
Again, great time to be in tech. Thanks for watching.