Supercomputers keep getting faster. Just a few years ago it took teraflops — or trillions of floating point operations per second — to make the list of the world’s fastest computers. Now it takes exaflops, quintillions of operations per second. And now the Oak Ridge National Laboratory has switched on a machine that makes 1.1 exaflops of performance. It’s called Frontier. The Federal Drive with Tom Temin talked about Frontier with Oak Ridge distinguished scientist and Frontier project officer, Scott Atchley.
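For a sense of those prefixes, here is a quick sketch. Frontier's 1.1 exaflops figure comes from this article; the prefix values are standard SI, and the comparison at the end is just illustrative arithmetic.

```python
# Scale of the units mentioned above: FLOPS = floating point
# operations per second, with standard SI prefixes.
PREFIXES = {
    "teraflops": 10**12,  # trillions of operations per second
    "petaflops": 10**15,
    "exaflops":  10**18,  # quintillions of operations per second
}

frontier_flops = 1.1 * PREFIXES["exaflops"]  # Frontier, per the article

# How many teraflop-class machines would match Frontier? Roughly 1.1 million.
print(frontier_flops / PREFIXES["teraflops"])
```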
Tom Temin: Mr. Atchley, good to have you on.
Scott Atchley: Good morning, Tom, I appreciate you having me on.
Tom Temin: And just review for us some highlights about this super supercomputer. I guess it’s number one on the Top 500 list, making it the fastest in the world. Tell me how it supports Oak Ridge; what types of projects at Oak Ridge will this support? And maybe it’s networked into some of the other labs too, I imagine.
Scott Atchley: Yeah. So Oak Ridge has a leadership computing facility. This is one of two facilities within the Department of Energy that focus on what we call leadership computing. Leadership computing uses a large fraction of these big machines to solve problems at a scale that you just can’t run anywhere else. So the users that come to Oak Ridge and to Argonne have problems that require large resources, or maybe a large amount of memory. Definitely fast networks. They’re trying to improve the resolution of their simulation and modeling, or, as we’re seeing more and more, using machine learning or deep learning as part of artificial intelligence. And they just need more resources than they can get anywhere else in the world.
Tom Temin: And this machine is physically large, correct? How big is it? In terms of square footage?
Scott Atchley: Yes, it’s about 400 square meters, a little bit bigger than a basketball court. It is similar in size to our previous machines, but just much, much faster.
Tom Temin: And did contractors build this? Is it something that you designed at Oak Ridge? Or how does that work? How does it come to be?
Scott Atchley: So with these large systems within the Department of Energy, we have a rigorous procurement process. We will put out a request for proposals, we’ll get proposals from multiple vendors, we’ll do a technical review, and we then award one of those vendors the contract, and they will start working on the machine. Now, we tend to buy these multiple years in advance. We started deploying Frontier last year; the hardware came in around the September, October timeframe. We actually selected the vendor, Cray, back in 2018. That was to give them time: they had proposed new processors from AMD, and it gave them time to work out all of that technology, and it also gave us time to prepare the machine room. We had to bring in more power, we had to bring in more cooling. The floor in there would have collapsed under this new machine because it’s so heavy, so we actually had to tear out the old floor and build a new raised floor for Frontier to handle the weight. Frontier is made up of 74 cabinets. Each one of these cabinets is four feet by six feet, a little bit smaller than a pickup truck bed, but weighs as much as two F-150 pickups in that space. So very, very dense.
Tom Temin: Got it. And did the chip shortage and worldwide supply chain affect the delivery and ability to build this on time at all?
Scott Atchley: Oh, absolutely. We were in the preparation stage, and I went to visit the factory in May of last year. We kept asking them, are you having any supply chain issues? And they said, well, some, but not too bad. And when I got up there, they pulled me into a room and said, we are having some issues. Here’s 150 parts we can’t get. And you’re dealing with a system that has millions of parts, of many different types, not just millions of one part. And you only need to be short of one. It doesn’t have to be an expensive processor. It can be a $2 power chip or a 50 cent screw. Any one of those will stop you from getting your system. And so yeah, it was a huge issue. Fortunately, HPE had bought Cray in the interim, between when we awarded the contract and when they were building this system. HPE had very good supply chains; they were able to reach out to many, many different companies to try to source components. They pulled off a heroic job of getting us the stuff. It did delay us, probably about two months. But at that meeting in May, they told us it could delay us up to six months. So that’s how good of a job they did for us. We really appreciate the effort they put in.
Tom Temin: We’re speaking with Scott Atchley, he’s distinguished scientist and supercomputer Frontier project officer at the Oak Ridge National Laboratory. The processor chips, the AMDs, those are still manufactured in the United States, correct? And the memory is what is made overseas?
Scott Atchley: It’s a little bit of both. They’re designed in the U.S., but the leading chip fabrication facility, or fab as we just call it, is located in Taiwan: that’s TSMC. The other leading fabs are Samsung in South Korea and Intel in the U.S. Intel is starting to talk about doing fab services for other companies, but up until this point, they’ve only manufactured their own hardware. So whether it’s NVIDIA or AMD, all the leading-edge processors other than Intel’s go to TSMC. But interestingly, even right now, Intel is using TSMC for some of the components for the Aurora system at Argonne.
Tom Temin: Right. So that’s why we’re gonna vote pretty soon to subsidize them all?
Scott Atchley: We definitely want the capability to fab these in the U.S. for various reasons, you know, geopolitical reasons. And we also want that workforce in the U.S. So absolutely.
Tom Temin: And I think people may not realize that the chip itself represents a gigantic supply chain of equipment, gases and materials that enable its fabrication. You know, there’s a couple of billion dollars’ worth of investment just to make one wafer, I guess, and people may not realize how deeply this goes into the economy.
Scott Atchley: Oh, absolutely. It’s a huge amount, and there are ripple effects. We have some fabs here, but if you can bring more to the U.S., and particularly the leading-edge fabs, the ripple effects would be fantastic.
Tom Temin: And in planning the installation of a machine like this, what about the programs, the applications, the programming that has to go on it? Is there some long-term planning that people who want to use it eventually also have to do, so that their code will run the way they hope it will?
Scott Atchley: Absolutely. As soon as we select the vendor, we set up what we call a Center of Excellence. That is a team of scientists and developers from the lab, but also from the vendor integrator, in this case HPE, and their component supplier, AMD. We have selected, you know, 12 or 14 applications that we want them to start working on. Because these machines are very expensive; when you turn that machine on, you want to be able to do science on day one. So they start working on these applications and porting them to the new architecture. Then, as the previous generation chips become available, they start running on those. And when the early silicon for the final architecture becomes available, they start running there, and they start their final tuning and optimizing. This process starts as soon as we select that vendor.
Tom Temin: And so it’s not necessarily the case that a given set of code for an application or a simulation or a visualization will run optimally on the faster hardware; you need to tweak your software to get the most out of the new hardware?
Scott Atchley: Absolutely. Even if you’re buying from the same vendor. When we moved from Titan to Summit, which is our current production system, they both used NVIDIA GPUs, so the API didn’t change a whole lot, but the architecture of the GPUs changed quite a bit. You still have to adjust for the different ratios of memory capacity and memory bandwidth to the amount of processing power. A good part of the process is doing that optimization and tuning for the given architecture.
Tom Temin: That’s an interesting point about supercomputers. It’s much more like the beginning of computing, in the sense that you need to write carefully to the hardware, as opposed to most business computing today, where you’re just writing to an API. And you figure, for most business applications, even AI, that the hardware is fast enough for whatever translation layers sit in between to actually talk to the hardware.
Scott Atchley: Absolutely. We’re trying to eke out as much performance as we can while the applications are running. We don’t use virtualization and all these other techniques that you can use to increase the utilization of your hardware. We have high demand; there’s a competitive process to get access to the machine, and you get an allocation of time. So you want to make sure that time is as useful as possible. Think of it as a telescope: you’re a scientist studying the stars, and when your week comes up and you get to go to that telescope, it’s yours for that week, and you don’t want to waste your time by being inefficient. Same thing here. The users don’t have to be physically present, but they are able to remotely log into our system. When they’re on the machine, they want it to be as efficient as possible and get as much of that performance as they can.
Tom Temin: And what are the power requirements for a machine like this? Do you have to call up the Tennessee Valley Authority and say, hey, we’re going to turn it on?
Scott Atchley: That’s a great question. When we were doing some of our benchmark runs to help shake the system out, you’re running various applications, but the one that we use the most is HPL, or the High Performance LINPACK application. That’s the one used to rank the systems on the Top 500 list, but it’s also a fantastic tool to help you, you know, debug the machine and find the marginal hardware and replace it with better hardware. And so I was watching the power as our teams were submitting jobs using the whole machine, and you would see a spike from the baseline power to the maximum power, which was a 15 megawatt increase in five seconds. And you know, the job would run a little bit, and then you’d have a node crash, it would die, and they would do it again. So over and over, we were throwing 15 megawatts on the machine, and then it would, you know, finish or crash, and that would go away instantaneously. And I’m thinking, we’re going to get that phone call from TVA, and it’s not going to be a good one. It didn’t come. I actually know somebody who works at TVA, and I just called him up. I said, hey, by the way, we’re doing this; is this causing you guys any problems? He said, well, I don’t know, let me check with headquarters. He calls me back a couple hours later, just laughs, and says, no, we didn’t see a thing. I said, if you can’t see 15 megawatts coming and going in five seconds, you’ve got a lot of capacity. He says, yeah, we average about 24 gigawatts at any particular time. So yeah, that’s less than 1%. To us, it’s huge, but fortunately, we don’t cause the lights to flicker here or anywhere else nearby. So it’s all good.
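As a quick sanity check of that anecdote, assuming the round figures quoted (a 15 megawatt swing against TVA’s roughly 24 gigawatt average load), the spike really is a tiny fraction of grid capacity:

```python
# Back-of-the-envelope check of the power figures quoted above.
# Both numbers are round figures from the interview, not measured data.
spike_watts = 15e6  # Frontier's full-machine power swing, ~15 MW
tva_watts = 24e9    # TVA's stated average generation, ~24 GW

fraction = spike_watts / tva_watts
print(f"{fraction:.2%} of TVA's average load")  # about 0.06%, well under 1%
```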
Tom Temin: So plenty of juice left over for Dogpatch, you know, down there.
Scott Atchley: Absolutely. We’re not going to slow down anybody’s Fortnite game for sure.
Tom Temin: And just briefly, what is your job like day to day do you touch the machine and interact with it personally, are you just kind of more like looking at spreadsheets and power reports and schedules?
Scott Atchley: So unfortunately, I attend meetings; that seems to be my major contribution to the Department of Energy. The machine is still undergoing stand-up, and we probably have a couple months to go, maybe a little bit longer, as we test the system and make sure that it’s ready to put users on. I’m not part of that team, but I’m tracking what they do daily. Some of the meetings I attend are with our acceptance team, and also with the vendor, to make sure that we are addressing the issues we’re discovering, so that we can get it ready for users. After the machine goes into production, I don’t really need to get on it; at that point it’s dedicated to the users. We’re actually starting to think about its replacement. We have a mission needs statement in with DOE that says we’ll need a machine after Frontier, you know, five years from now, and we’re starting the process of thinking about the procurement of that machine. Our expectation is that we’ll put out a request for proposals sometime next year, and by the end of next year we’ll know the architecture that will replace Frontier.
Tom Temin: But we’re still a few years from zettaflop computers; we have to get to multiple exaflops at this point. Correct?
Scott Atchley: It’s becoming more difficult, right? So, three machines ago, back in the 2008 timeframe, we were right at the petaflops level, roughly two petaflops. Our next system, Titan, was deployed in about 2012; that was on the order of 20 petaflops. In 2017 or 2018, we deployed Summit, which is 200 petaflops; that’s still in production and will stay in production for a couple more years. So roughly an order of magnitude every five years, but that is becoming more difficult. You hear stories about the slowing of Moore’s law; you’ll hear people say the end of Moore’s law. That’s a little too pessimistic right now, but it is slowing, so it may take us a little bit longer to get those powers of 10. So we are definitely a few years away from looking at zettaflops.
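That deployment history can be tabulated as a sketch. Only Titan, Summit and Frontier are named in the interview; the circa-2008 petascale system was ORNL’s Jaguar, and the years and performance figures are the approximate round numbers quoted above.

```python
# ORNL's rough "order of magnitude every five years" trajectory,
# using the approximate figures from the interview (in FLOPS).
systems = [
    ("Jaguar",   2008, 2e15),    # ~2 petaflops (not named in the interview)
    ("Titan",    2012, 20e15),   # ~20 petaflops
    ("Summit",   2018, 200e15),  # ~200 petaflops
    ("Frontier", 2022, 1.1e18),  # 1.1 exaflops
]

for (name_a, yr_a, f_a), (name_b, yr_b, f_b) in zip(systems, systems[1:]):
    print(f"{name_a} -> {name_b}: {f_b / f_a:.1f}x in {yr_b - yr_a} years")
```

The step to Frontier is about 5.5x rather than a full 10x, which is consistent with his point that the powers of 10 are getting harder to reach.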
Tom Temin: Scott Atchley is distinguished scientist and supercomputer Frontier project officer at the Oak Ridge National Laboratory. Thanks so much for joining me.
Scott Atchley: Tom, thank you very much. It was a pleasure, and have a good day.