Any way you look at it, the data and compute demands of the National Institutes of Health are dauntingly large. NIH is composed of 27 institutes or centers. These encompass some 110 laboratories. The agency’s distributed research network moves six petabytes of data daily. NIH also funds research at some 2,500 universities and medical centers, spending more than $40 billion each year.
From an information technology standpoint, NIH is all about big data and big scientific computing. Given the size and scope of data and applications, coupled with the worldwide collaborative nature of NIH activities, it’s no wonder commercial cloud computing has become a large and essential part of the agency’s IT approach.
Much of the cloud planning and contracting falls to NIH’s Center for Information Technology. Its chief information officer, Andrea Norris, laid out the agency’s approach when she spoke at Federal News Network’s Cloud Exchange.
“A lot of what we support are in big data, scientific data repositories,” Norris said. “That’s a big chunk of where our technology and data resources are.”
Moreover, that data, ultimately paid for with tax dollars, belongs to the public.
“A lot of our data is public access, public domain. We make it accessible to the general public, and especially, of course, to researchers around the world,” Norris said. “We are a big supporter of public access data. And for science, we believe open access is the way to accelerate discovery. And so we have very large programmatic and discipline-specific data repositories with computational tools and job aids, where researchers come to conduct their science.”
That, perhaps not surprisingly, makes the cloud a logical option for hosting all of the data and tools the institutes encompass. Norris explained that a cloud migration is exactly what the NIH Center for Information Technology has been carrying out for the last several years. Cloud adoption falls under a program called the STRIDES Initiative, which itself is a component of an agency-wide data science strategic plan.
STRIDES stands for Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability. Under STRIDES, NIH has established relationships with three major commercial cloud services providers: Google Cloud, Amazon Web Services and Microsoft Azure.
Contracts with the companies, which Norris termed relationships, have lowered costs for the individual institutes and for research grantees throughout academia, encouraging cloud adoption.
Norris said the arrangements have worked.
In just the last two years, STRIDES has resulted in more than 110 petabytes of scientific data — that’s 112,600 terabytes — moving into the cloud environments.
“Our funded researchers have used more than 100 million compute hours in the cloud,” Norris said. “We’ve trained more than 4,000 researchers in how to use the cloud.” Training efforts cover “everything from basics to advanced and, more importantly, how to do biomedical computing in the cloud. Really, how to do the kind of work we need to see our researchers do.”
“And we’re saving money because the discounts accrue to the collective research environment,” Norris added. “The more we move, the better the discounts are.”
Attractive as the pricing model is, the STRIDES program has also helped NIH advance the health mission itself. Norris cited the example of when the genetic sequence of the virus that causes COVID-19 was completed in January 2020. The arrangements let NIH instantly share the information worldwide, sparking the global research drive into vaccines.
Thus, the cloud is “literally changing the way that we do science,” Norris said.
From an architectural standpoint, Norris said, the default methodology for avoiding constant data download costs is to establish collaborative workspaces in the cloud that house analytics, utilities and visualization tools. That is, the approach brings the compute to the data, rather than the other way around.
“That’s the model we’re encouraging,” Norris said. “Some of the data sets are just too large; you can’t do it any other way.”
To show how the STRIDES Initiative and the shift toward cloud computing play out in a specific instance, Cloud Exchange also heard from Alastair Thomson, the CIO of the National Heart, Lung and Blood Institute.
Thomson said cloud-hosted research has been taking place at the NHLBI for more than three years. Typical, if on the larger side, is a genomic project for precision medicine involving 180,000 participants. The project population is notable for being one of the most racially diverse samples ever. The genomic data of the group, Thomson said, adds up to 3.8 petabytes.
The size makes the cloud a necessity.
“To work with that data, you have to use it in the cloud,” Thomson said. “Without the cloud, you’d be trying to download it. And over whatever broadband connection you have, 3.8 petabytes is going to take a while.”
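Thomson's point about download time can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name and the link speeds are assumptions chosen for the example, not figures from NHLBI, and it assumes decimal petabytes and a connection that sustains its full rated speed with no protocol overhead.

```python
def transfer_days(size_petabytes: float, link_gbps: float) -> float:
    """Days needed to move size_petabytes over a link sustaining link_gbps."""
    bits = size_petabytes * 1e15 * 8        # decimal petabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # gigabits per second -> bits per second
    return seconds / 86_400                 # seconds -> days


# Hypothetical sustained link speeds, in gigabits per second.
for gbps in (0.1, 1, 10):
    print(f"{gbps:>4} Gbps: {transfer_days(3.8, gbps):,.1f} days")
```

Under those assumptions, even a sustained 1 Gbps connection would need roughly 352 days to copy 3.8 petabytes, which is why the analysis is run where the data already sits.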
Thomson cited the STRIDES Initiative as crucial to the institute’s work. An environment called the NHLBI BioData Catalyst, he explained, makes data available to researchers via Amazon and Google. The Catalyst, Thomson said, conforms to Federal Information Security Modernization Act requirements and is authorized at the FedRAMP Moderate level. That makes it suitable for medical data subject to stringent privacy regulations.
With wide access to the data, he said, researchers have made discoveries in several areas essential to reducing heart disease. Because the data and analytic tools are hosted in the cloud, researchers can more easily run fine-grained queries. That capability has produced real results in understanding how sensitivities to environmental conditions and treatments vary among racial groups. In other words, the sample’s diversity can yield fresh understanding, thanks in part to cloud computing.
Fast cloud access to the genomic data, coupled with Web-interface search tools with semantic understanding, gives researchers the ability to do detailed studies far more quickly than in the past. For example, a researcher looking for people with a certain body mass index who have had a heart attack, along with a control group, can assemble the needed data instantly from a search.
“This is something that just setting up an environment to do this would take months for research,” Thomson said. Now, “you can just log in, and you can start doing it.”
Plus, there’s a positive budgetary impact from the cloud.
“You can imagine that computing across that kind of data takes extensive compute resources,” Thomson said. “The elasticity of the cloud lets us spin up a large cluster of compute nodes, compute on them, do the analysis, and then shut them down again — rather than having a massive data center that’s sitting near us [operating] 20% or 30% of the time.”
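The elasticity argument comes down to paying only for the hours a cluster is actually busy. The following is a minimal sketch of that arithmetic. The cluster size, the 720-hour month and the 25% utilization figure are hypothetical values picked to sit inside the range Thomson mentions, and the comparison deliberately ignores any difference in price per node-hour between a cloud provider and an owned data center.

```python
def node_hours_billed(nodes: int, hours_in_period: float,
                      utilization: float, elastic: bool) -> float:
    """Node-hours paid for over a period.

    An elastic cluster is billed only while it is running jobs;
    a fixed cluster is paid for around the clock regardless of load.
    """
    total = nodes * hours_in_period
    return total * utilization if elastic else total


# Hypothetical example: 100 nodes over a 720-hour month, busy 25% of the time.
fixed = node_hours_billed(100, 720, 0.25, elastic=False)
cloud = node_hours_billed(100, 720, 0.25, elastic=True)
print(f"fixed: {fixed:,.0f} node-hours, elastic: {cloud:,.0f} node-hours")
```

In this simplified model the elastic cluster is billed for a quarter of the node-hours of the always-on one, so cloud pricing could run several times higher per node-hour before the fixed data center came out ahead.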