The Energy Department installs the latest in its fleet of supercomputers

It’s called Kestrel, but it’s not a falcon catching mice. It’s the newest Energy Department supercomputer. Kestrel just arrived at the National Renewable Energy Laboratory in Golden, Colorado. To look deeper into what it will do and to hear about some of Kestrel’s amazing statistics, Federal Drive with Tom Temin spoke with Program Manager Kristin Munch.

Interview transcript:

Tom Temin So tell us about Kestrel. First of all, who built it? Because these things are made from standard types of components, just a whole lot of them interconnected in a unique way. Tell us about the architecture of this computer.

Kristin Munch Kestrel is being built by Hewlett Packard Enterprise, and it is NREL’s third-generation [high-performance computing (HPC)] system, but it’s actually a pretty big step up for us. So we’re going from an eight petaFLOPS system on Eagle to a 44 petaFLOPS system on Kestrel, about a five and a half times increase in computing capability for us.

Tom Temin And you’re not shutting off the old one, that’ll still operate?

Kristin Munch It’ll operate for a little while to enable a transition.

Tom Temin Got it. So there’s no way of combining eight plus 44 permanently, and then you’ve got [52] petaFLOPS? It just doesn’t work that way?

Kristin Munch Usually they take up so much room that you kind of have to get the other one out of there.

Tom Temin And kind of ironically, this is for the Energy Department. You’re going to be looking at renewable energy, and we’ll get into the mission in a moment. Yet how do you power a thing like this?

Kristin Munch Well, we actually had to do a power upgrade into our data center for this. So we’re going to be going up to about a seven and a half megawatt data center. So we’re adding about four megawatts to our data center in order to power Kestrel. But we still have a little bit of room left there, so we’re not using that full seven and a half megawatts.

Tom Temin All right. Let’s talk about why Kestrel. What are the big challenges that the lab is working on right now?

Kristin Munch So the thing that’s unique about Kestrel is that it is the computing facility dedicated to the EERE mission, the Energy Efficiency and Renewable Energy Office. The research on it is done by researchers from almost all of the national labs, including NREL, and by industry and academia users. They do everything from fundamental materials science work for next-generation solar cells and carbon-neutral fuels to forecasting of solar and wind resources. They simulate offshore wind farms to try to figure out how to get the best performance out of them. And another big thing they do is run hundreds or even thousands of scenarios of the future grid to explore options for how to get to a renewable future in our power sector.

Tom Temin That’s really a big one, too, isn’t it? Because I think people have the sense that the grid is getting increasingly fragile and you have brownouts and blackouts. And we didn’t think of ourselves as a third world country. And so, I guess, one of the challenges is to stay not a third world country in terms of power.

Kristin Munch Exactly. So it’s not only like what renewable sources you add to the grid, it’s how you do it and when, and how do you make the grid resilient.

Tom Temin Grid resiliency, though, is important even with the power mix that we have now.

Kristin Munch Exactly.

Tom Temin And how does this operate for all of these different parties that wish to access the computer? Is it a timesharing, scheduled type of basis?

Kristin Munch Exactly. That’s actually a really good question because it’s kind of timely. We have our annual call going out in just a couple of weeks on May 10. So what happens is NREL, on behalf of EERE, runs an annual open call every spring and people apply. They’ll apply for time on Kestrel this next year, and they’re given time through an EERE approval process. And their time starts on Oct. 1 for one year. So it’s the fiscal year.

Tom Temin Got it. And what is a typical time unit for a machine like this? A problem I could come up with would take about one-tenth of one petaFLOP and it would be over in 4 seconds. Do some of these things take all night or maybe a whole day to run type of measure?

Kristin Munch Oh, yes. Even longer than that. So we’ll have jobs on the supercomputer that can run for several weeks even. And one of the big things about the architecture is it has to be capable of running these jobs for a very long time across many, many compute nodes and storing that data instantaneously to our parallel storage system. So, yeah, we have jobs that run a very long time, but we also have jobs that are shorter, and users run thousands or even millions of them.

Tom Temin So therefore, the people that are developing the programs that will run on it, the applications, have to do a lot of error correction and recovery, because you don’t want the thing hanging up in the middle of the night. And it’s a day later until someone realizes it’s hung up.

Kristin Munch Yeah, we have lots of different programs in place that can troubleshoot things like that. We also have a team of computational experts that are available to help with that at NREL. So we get involved with some of our users’ codes, making sure they’re running efficiently and they don’t have any problems.

Tom Temin We’re speaking with Kristin Munch. She’s laboratory program manager for advanced computing at the National Renewable Energy Laboratory in Colorado. And is this a fee-for-service type of thing? That is, is Kestrel paying for itself through fees from the users?

Kristin Munch Actually, Kestrel is purchased by EERE in order to enable EERE’s research. So the researchers themselves don’t have to pay to use Kestrel.

Tom Temin Wow. So it’s all funded by the government. You just have to have a worthy reason to be able to use Kestrel.

Kristin Munch Exactly. Very similar to the other supercomputers at the other national labs.

Tom Temin All right. And what’s the status of the machine now? Is it installed and debugged? And how do you know it’s ready to switch on?

Kristin Munch So it just arrived about a month ago. So we’re still in the middle of kind of bringing it up, powering it on, making sure all the components are working like they should. We’ll start a phase called acceptance testing in the next couple of weeks probably, and that lasts for a few months. So we’ll bring Kestrel up officially sometime this summer. That’s the first phase of Kestrel with the [Central Processing Unit (CPU)] nodes. We also have a second phase where we’re adding [Graphics Processing Unit (GPU)] nodes later in the fall.

Tom Temin And do you have certain programs that you know what the outcome should be and how long it should take as kind of indicators to run to test it with?

Kristin Munch Yes, we actually have a whole benchmarking team that’s running very specific benchmarks that represent all the codes that our users run on Kestrel to make sure everything’s working properly.

Tom Temin And because it’s made of so many, I guess, racks and each rack has lots of blades in it and so on, they fail from time to time. So there must be a staff around all the time ready to pop in a new blade or a whole new rack unit if necessary.

Kristin Munch Yes, exactly. So our Computational Science Center has an operations team that manages most of that, but we also have maintenance contracts with the vendors, and so they can send people in for certain types of issues as needed.

Tom Temin And by the way, how big is Kestrel? Is it like the size of a microbus or is it the size of a barn or what? What kind of square footage does it take?

Kristin Munch It’s taking up about 2,500 square feet or so. It’s about a quarter of our data center. If you can picture compute racks in a data center, it’s about three rows of compute racks. So a CPU row, a storage row and a GPU row.

Tom Temin And a generation ago, the same power would have been ten times as big, probably.

Kristin Munch One generation ago, yeah, it probably took about four rows. Really, the increase in compute capability doesn’t come from the number of nodes anymore. You kind of still need the same number of nodes, but they’re all much more powerful because of the processor technology.

Tom Temin Yeah, it’s down to the chip density. That’s really the big difference.

Kristin Munch Right.

Tom Temin And do people have to make sure that the programs they develop for it conform to the way in which it can be used the most efficiently? That is to say, just speaking as a non-computer scientist, I would say you don’t want to send a floating point type of problem down to an integer type of computer.

Kristin Munch Right. So most of our codes have already been running on Eagle and even the generation before. So it’s really a matter of making sure the codes run and are compiled for these particular processors. And we do get a lot of help from the actual processor vendors too, to make sure that happens. So hopefully there’s not as much work on the users running the codes, and we’re there to help them if there are any issues.

Tom Temin And federal offices often get new things, maybe new furniture, maybe a new copier. This is more like a big deal, isn’t it? Almost as if the Air Force was getting a new bomber. Correct?

Kristin Munch Yeah, exactly. It’s a big investment. And EERE is kind of making that investment in making sure that we have some dedicated compute resources to help us solve these problems.

Tom Temin By the way, what about battery technology? That seems to be the other grand challenge here besides the grid. Battery technology is key to almost everything in renewables for practical application. Is that part of the problem set?

Kristin Munch Yeah, we do have people who work on battery technologies from the vehicles office.

Tom Temin Wow. So when people at cookouts and stuff out there in Golden, Colorado have problems with their updates and stuff, do they come to you because they know you’ve got the biggest computer in the state?

Kristin Munch They can. They definitely can do that. We do have a lot of local universities that use the computer.

Tom Temin But I mean, do they ask, Kristin, hey, I’m having trouble with the software. If you can do things on Kestrel, you can probably fix my Mac.

Kristin Munch They might now. Now that I’m talking to you, I don’t know.

Copyright © 2024 Federal News Network. All rights reserved.
