Given its vast data stores, the National Institutes of Health is embracing cloud to make access easier and faster. We talk with data scientist Susan Gregurick, ...
Susan Gregurick, NIH’s associate director for data science and the director of its Office of Data Science Strategy, said the agency’s ongoing cloud interoperability platform initiative will create a “data mesh” that will make it easier for researchers to work across multiple data sources.
“Cloud has really democratized our access to data and really improved the way in which researchers can work with data,” Gregurick said during the Federal News Network Cloud Exchange 2022. “Even if it’s in the cloud, it doesn’t mean that it’s all there and accessible. There’s many different platforms within the cloud. But creating that sort of data mesh across the cloud is the next big challenge.”
NIH intends to make data sharing in the cloud easier in part through its Researcher Auth Service (RAS), a single sign-on capability that lets researchers access data platforms across the agency’s institutes.
“All of these are really important platforms, but they’re all silos in some way. Our goal is to break down that capability, to build interconnections where it makes sense,” Gregurick said.
Researchers can currently access at least 16 data platforms through the RAS authentication process. NIH piloted RAS for about two years, and it’s now in “production mode,” she said.
“It’s hard to tell you how many different ways in which researchers can access data, work with analytics in the cloud, but there are at least 20 or more cloud-based data platforms that house data. Much of it is controlled-access data because it’s coming from participants of studies. What we want to do is just to make it much easier. That once the researcher has the approval to get the data, that they can actually get the data through an authentication authorization process,” Gregurick said.
Easier access to NIH data lets researchers do more with the resources available to them. “Data access is a priority. Leveraging cloud infrastructure is another way to do that,” she said.
Gregurick described a scenario in which researchers aggregated data from three different platforms and conducted analysis on a single platform without ever having to copy or move the outside data.
“That is facilitated because all of this data is in the cloud. You can’t do that sort of data mesh or that interoperability as easily if you’re not on the cloud,” she said. “I’m sure that we could do it in a hybrid computing environment, but it’s so much more accessible and easier if you have a single sign-on way to authenticate across multiple platforms, authorize the data across multiple platforms and then a way to pull that data and aggregate it, and do analysis without ever having to copy it. The cloud is perfect for that.”
NIH’s cloud modernization efforts also will allow the agency to share data with researchers more quickly.
Requests from researchers to access NIH’s database of Genotypes and Phenotypes (dbGaP), for example, are routed through the agency’s Data Access Committee. It currently takes about two weeks, on average, for researchers to obtain data from NIH through this process.
“There’s a lot of inefficiencies there. There’s a lot of ways in which we can improve that through automation,” Gregurick said.
NIH expects efforts to automate aspects of the review process will provide researchers with “near real-time access” to data or will at least reduce the waiting period to 24 hours, she said. “The cloud will play an important role in that, because many of these data are now being housed in different types of data platforms on the cloud. Linking that automated process to the cloud is really the next big step.”
This automation work began in fiscal 2022 and will continue through fiscal 2023, she added.
The COVID-19 pandemic led to additional efforts by NIH to make data in the cloud more accessible.
The agency launched the COVID Rapid Acceleration of Diagnostics (RADx) Data Hub, a centralized, secure repository to store and search for vast amounts of de-identified data related to COVID-19 testing.
“This will be a sort of cloud-based way for researchers to engage with that data, bring their own data, if they’re interested, and really investigate all of the ways that COVID has spread across our country and what testing data can provide in terms of answering research questions,” Gregurick said.
NIH’s National Institute of Biomedical Imaging and Bioengineering is also funding the Medical Imaging and Data Resource Center (MIDRC), a high-quality repository of medical images related to COVID-19 and associated clinical data.
One of the repository’s goals is to develop medical image-based artificial intelligence that can detect, diagnose and monitor COVID-19.
“There’s a lot of really good data coming from CT scans, for example, from patients who have had COVID. Gathering all of those radiological data, harmonizing it to a very high level, making it available for researchers and making it artificial intelligence–friendly has been the goal of MIDRC,” Gregurick said.
As for future projects to facilitate data sharing, Gregurick said NIH considers ongoing feedback from researchers an essential part of understanding where the agency can improve its capabilities.
It hopes to develop feedback loops in a number of ways, including in person. In January, for instance, NIH held a workshop on data discovery for the research community.
“Research and working in the federal space really does hinge on that open communication with the research community,” Gregurick said. “We try to understand each other. We’re both trying to push the boundaries of science — us from the federal side, and of course, them from the research and scientific inquiry side. It’s been a wonderful partnership.”
Copyright © 2024 Federal News Network. All rights reserved.
Jory Heckman is a reporter at Federal News Network covering U.S. Postal Service, IRS, big data and technology issues.
Associate Director for Data Science, National Institutes of Health
Reporter, Federal News Network