Keeping supercomputers cool


Talk about infrastructure. The National Nuclear Security Administration has completed what might be called five of the world’s largest refrigerators. They’ll eventually keep the chill on some mighty supercomputers to be constructed nearby. It’s all taking place at the Los Alamos National Laboratory. With more of what’s going on, the program manager for advanced simulation and computing, Jim Lujan, joined Federal Drive with Tom Temin.

Interview transcript:

Tom Temin: Mr. Lujan. Good to have you on.

Jim Lujan: Thank you. I appreciate the opportunity.

Tom Temin: Now this picture that I’m looking at, five gigantic air conditioners, tell us what you’ve done so far here. There was a ribbon cutting for it.

Jim Lujan: Right. Lisa Gordon-Hagerty came out and was able to do the ribbon cutting ceremony on our new cooling project. What you can see from outside of the building are five large evaporative cooling towers. This is the first step in taking water temperature and starting to cool it down. And that’s the external piece of the cooling system. The internal piece, unfortunately, which is not visible, is a significant amount of water pipes and pumps to move all of that water, once it’s cooled, into the computing facility. These pipes are massive 36-inch pipes; they’re moving tens of thousands of gallons of water every minute as part of this cooling effort.

Tom Temin: Sounds like a good place to hide some leftovers, I guess, if you’re working there every day. What are these designed to cool eventually?

Jim Lujan: So this cooling is for cooling down our new computer systems that are going to be coming in over the next several years. The first of the computing systems that will take advantage of this new cooling is Crossroads. And that’s slated to come in towards the end of calendar year 2021. When we invest money in this large scale infrastructure, it’s to support not just one computer system, but multiple computer systems. And these are all going to be exascale class systems. Our computer room, just to kind of help you visualize, is 43,000 square feet. It’s about a yard shy of a football field, one giant room full of computing systems. So once those computers are all kind of going at maximum capability, they use a lot of electricity, which in turn generates a lot of heat. And this cooling capability will help dissipate that heat for us.

Tom Temin: And this exascale computing project is green-field, that is to say you’re not moving something out of a building and something new in? You’re starting, it sounds like, all brand new.

Jim Lujan: Well, so our computing facility has been in operation since 2001. The primary means of doing cooling was air cooling at the time, and we’ve transitioned now to warm water cooling because it is far more efficient at removing the heat, so it’s lower cost. We’re ultimately trying to reduce how much power we’re generating to provide cooling. And so the warm water cooling project was one of the first steps in improving our overall power efficiency.

Tom Temin: So the computers will go into the same places that are where all of these pipes are, and somehow they’ll connect in some manner, like a radiator.

Jim Lujan: Right. So our building is a three story building, so all of the pipes and pumps and actually the breaker boxes, the power, all of that power and cooling distribution is on the first floor. And then all of our computers are on the second floor. Now the room is not empty. We do run production for the NNSA, 7 by 24, and we have existing systems there. But we’re at the point where newer systems, as they come in, are going to generate additional load, and these cooling towers and pumps and heat exchangers are in anticipation of the increased demand. So we always have computing going on in there. But as we move forward in time, computers are getting larger, they’re getting more dense, they’re consuming more power in order to provide more cycles, and that translates into the need to do more cooling.

Tom Temin: As you bring in Crossroads, and then computer systems beyond that, and that are going to be exascale, maybe briefly tell us what exascale means. And more importantly, what you plan for all of this computing power. Why are you building it all in?

Jim Lujan: Good question. So exascale for NNSA and for the Department of Energy really is sort of that next big leap in computational power. In the late 2000s, the Roadrunner system at Los Alamos made a major breakthrough in computational power in providing roughly 100 times more compute capability than anything else out there, and it broke that barrier of 10 to the 15th computing cycles per second. So exascale is 1000 times that. So that will be another big leap in computing and, based on technologies, etc., that’s going to be occurring in the 2021-22 timeframe. So that next big leap will be able to provide NNSA with the ability to use the computer resource for simulation capability and for monitoring and understanding the nation’s nuclear stockpile. So part of the role of the computers overall in stockpile management is helping understand the complex physics and engineering that are involved in these systems. And since we signed the Nuclear Test Ban Treaty in the 90s, we need a way to be able to understand the complex physical phenomena within these systems without actually going out and doing a full scale test. So our computer simulations are the best way to do that. And as we move in time, in order to have better granularity, better fidelity in understanding how these physical phenomena work, it means more compute power. More compute power translates into bigger systems, faster systems, which then translates into needing to power and cool them. So what you saw, as far as that picture, was just one piece of the overall system in helping facilitate stockpile management for NNSA.
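For readers who want the scale in numbers, here is a minimal sketch of the arithmetic behind "exascale is 1000 times that." The two rates come from the interview; the one-hour workload is a hypothetical chosen for illustration, and the runtime assumes perfect scaling, which real simulations do not achieve.

```python
# Rates quoted in the interview: 10^15 (petascale) and 10^18 (exascale)
# operations per second.
PETASCALE_OPS_PER_SEC = 10**15
EXASCALE_OPS_PER_SEC = 10**18

# Exascale is 1000 times petascale.
speedup = EXASCALE_OPS_PER_SEC // PETASCALE_OPS_PER_SEC


def ideal_runtime_seconds(total_ops: int, ops_per_sec: int) -> float:
    """Runtime under the (unrealistic) assumption of perfect scaling."""
    return total_ops / ops_per_sec


# Hypothetical workload: one hour of work on a petascale machine.
workload_ops = 3_600 * PETASCALE_OPS_PER_SEC

petascale_seconds = ideal_runtime_seconds(workload_ops, PETASCALE_OPS_PER_SEC)
exascale_seconds = ideal_runtime_seconds(workload_ops, EXASCALE_OPS_PER_SEC)

print(f"speedup: {speedup}x")
print(f"petascale: {petascale_seconds:.0f} s, exascale: {exascale_seconds:.1f} s")
```

Under those assumptions, an hour-long petascale job would take a few seconds at exascale. In practice the extra capacity is more often spent on finer granularity and fidelity, as described above, than on shorter runs.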

Tom Temin: And as you move from 10 to the 15th to 10 to the 18th flops per second, are you able to have time to maybe rent out the supercomputer to other users or is this pretty much NNSA requirements, use it up all of the time that it’s available.

Jim Lujan: So this is an NNSA facility. These large systems do support the other NNSA laboratories. So Los Alamos has one large system now called Trinity that’s in production use, and that supports the Advanced Simulation and Computing program at Los Alamos, but also Sandia National Laboratories and Lawrence Livermore National Laboratory. Livermore has their large machine right now that they have installed and that is starting to provide cycles, and that is Sierra. And Sierra, again, provides cycles to the tri-lab community. And then if you go back to the sort of tick-tock model, Crossroads is the one coming in towards the latter part of 2021. And again, that’ll be a tri-lab machine. Now there are other computational resources in there that also support primarily Los Alamos’s ASC computing, NNSA computing requirements. Livermore has some, Sandia has some of these smaller systems. But yes, these cycles are exclusively for the use of stockpile management on behalf of the NNSA. We’re not a cloud computing provider for other organizations. We are essentially an internal cycle provider for NNSA.

Tom Temin: I guess they can look in lust and wish they had that kind of power. And with respect to the electricity, where does that come from? You can’t just plug it in like you plug in a lamp to the Los Alamos local grid.

Jim Lujan: Well, you’re partially right. So for Los Alamos, we do have a utility grid structure. We bring in power from three major electrical transmission lines into Los Alamos County. And we are one of the single largest consumers of electricity in the county. So we do have to plan long term. Just like we’ve done with cooling, we also have to do long term planning on power to make sure that the electrical transmission lines can support the amount of power draw that we’re doing. So these are long lead projects in trying to plan out our computing requirements along with power. So as an example for some of the power that we consume, Trinity, as I mentioned, is in operation today. Trinity, when it’s running at capacity, consumes about 10 megawatts of power. If you can imagine, about 1000 residential units, homes, are about a megawatt. So Los Alamos County is about 10 megawatts with just the residences and the businesses outside of the laboratory. So Trinity by itself is consuming as much electricity as the town of Los Alamos. So these are big deals in planning, and also complex. And so Crossroads is slated to be even larger. And so we have to plan on that power demand, as well as, like I said, the ability to cool it.
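The comparison above can be checked with a few lines of arithmetic. This is a rough sketch using only the spoken estimates from the interview (about 1000 homes per megawatt, about 10 megawatts for Trinity, about 10 megawatts for the county outside the laboratory); the function and variable names are illustrative, not anything the laboratory uses.

```python
# Approximate figures quoted in the interview.
HOMES_PER_MEGAWATT = 1_000   # "about 1000 residential units ... are about a megawatt"
TRINITY_MEGAWATTS = 10       # Trinity running at capacity
COUNTY_MEGAWATTS = 10        # Los Alamos County outside the laboratory


def equivalent_homes(megawatts: float, homes_per_mw: int = HOMES_PER_MEGAWATT) -> int:
    """Convert a power draw in megawatts to a rough number of homes."""
    return round(megawatts * homes_per_mw)


trinity_homes = equivalent_homes(TRINITY_MEGAWATTS)
ratio_to_county = TRINITY_MEGAWATTS / COUNTY_MEGAWATTS

print(f"Trinity draws about as much as {trinity_homes:,} homes")
print(f"Trinity load relative to the county: {ratio_to_county:.1f}x")
```

By these estimates, one machine at full load matches roughly 10,000 homes, which is why power has to be planned years ahead alongside cooling.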

Tom Temin: And how do you plan the budgeting for these types of efforts that are capital expenditures that take several years before you can flip them on, yet agencies live by year to year appropriations? How do you do that part of it?

Jim Lujan: Well, that’s exactly right. These are major projects within the Department of Energy. So there’s very formal planning that happens years in advance for, like, the capital for power and cooling. Sometimes that planning happens as much as seven years in advance. The planning for the computer systems, those formal projects, all start four to five years before that happens. There is a long term budget plan for capital large scale expenditures within the NNSA, and also for the computer systems. What we do is we start off with developing mission need: why do we need this capability? And then that mission need gets approved. And then we start going forward with initial project estimations. What are the costs, what are the risks associated with putting together a baseline schedule? Then forming a multiyear budget strategy in order to achieve these projects, and then finally putting together a formal project execution plan, and getting that verified, and then moving forward. So it truly is a multiyear plan and budget effort. And while there are the sort of annual, year to year budgets that we have to work with, part of that is putting in that multiyear plan. And sometimes we do have to adjust depending on what budget is approved in that particular year moving forward.

Tom Temin: But this sounds pretty exciting, knowing that this new exascale system is going to come online late in 2021. You’ll be there to maybe flip the switch.

Jim Lujan: I will be there to flip the switch. I’m very excited to have Crossroads be a part of this. And as we said just a bit ago, we’re already starting the planning for the machine that comes after Crossroads because, like I said, these are such long lead time items. So I’m very excited about what we can currently do with Trinity, what will be here pretty soon with Crossroads, and then, looking beyond Crossroads, the technologies and the power and cooling necessary to achieve that as well.

Tom Temin: Jim Lujan is program manager for advanced simulation and computing at the National Nuclear Security Administration. Thanks so much for joining me.

Jim Lujan: Thank you.
