Genomics and other health care areas produce large data sets, and to use them in research, scientists need access to the information. However, the data they’ve collected is often too big to copy or distribute.
But the National Institutes of Health may have a plan up its sleeve — establishing what it calls a data commons in the cloud. The agency enlisted researchers and industry to design the commons.
“The data commons is in response to the recognition that we are undergoing a tsunami of data, a lot of it driven by the Human Genome Project and the fact that it now costs [around] $800 to sequence a human genome,” Alastair Thomson, CIO of the National Heart, Lung and Blood Institute (NHLBI), said on Cloud Insights month. “So we have petabytes of data which is coming into the NIH for genomics, but also for imaging and a variety of other data.”
NIH has a budget of about $33 billion but gives 85 percent of it to grantees. Thomson said the agency’s collection of such a large amount of data is a byproduct of this.
“They are generating vast amounts of data, and we need to make it available so that researchers can use it and if it’s sitting in the data center, it’s not available,” he said on Federal Drive with Tom Temin. “So, the data commons is aimed at making that data accessible to them, allowing them to compute on it and conduct analyses in a way that has never been done before.”
The data that NHLBI has collected so far totals two petabytes, or just over 2.1 billion megabytes. That’s 100,000 whole genome sequences. But because the genome project is growing, Thomson said it’s likely to jump to 17 petabytes, or about 18.2 billion megabytes, and it’s impossible to download that much data. Even analyzing the data accurately would require a certain level of supercomputing.
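The unit conversions can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 petabyte = 1,024³ megabytes), which is an assumption and not something the article states; with decimal units the results would be exactly 2 billion and 17 billion megabytes. The per-genome figure is only a rough average, since the two petabytes also include imaging and other data.

```python
# Assumption: binary units, so 1 PB = 1024**3 MB.
# With decimal units this constant would be 1000**3 instead.
MB_PER_PB = 1024 ** 3


def petabytes_to_megabytes(petabytes):
    """Convert a size in petabytes to megabytes."""
    return petabytes * MB_PER_PB


current_mb = petabytes_to_megabytes(2)     # size of the collection today
projected_mb = petabytes_to_megabytes(17)  # projected size

# Rough average per genome in gigabytes, if all 2 PB were split
# across 100,000 whole genome sequences (1 PB = 1024**2 GB).
gb_per_genome = 2 * 1024 ** 2 / 100_000

print(f"2 PB  = {current_mb:,} MB")
print(f"17 PB = {projected_mb:,} MB")
print(f"About {gb_per_genome:.1f} GB per genome")
```

On these assumptions, two petabytes comes to roughly 2.1 billion megabytes and 17 petabytes to roughly 18.2 billion.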
“It’s just not practical to do that in your local research institutes or data centers,” he said. “The only practical way to do it is in the cloud.”
A large quantity of the data is focused on genomics, but it also includes over a million images from research studies and electronic health records.
To look deeper into the research and development of this new commons, NIH created a team that includes Thomson. He said the team also includes several other people from NHLBI and the National Human Genome Research Institute.
The genome project is being conducted to identify particular genetic variations that may be associated with diseases and other health problems, such as cancers or obesity. The health data comes from individuals who have granted NIH centers access to their records, in the hope that the genetic data can be used to test and find treatments.
“One of our challenges now, is how do we make this data available while we respect the wishes of those patients as to how the data is used,” Thomson said.
The initiative is funded through the NIH Common Fund, originally set up by Congress. Funding the research this way allows NIH to create more public-private partnerships. In other words, the team of NIH researchers can provide hands-on help to industry partners. Thomson said this allows the agency to be deeply involved in the process and not just throw money at it.
“We’re doing it in a way that allows us to tap into the innovation in a better way than you could with a contract, for instance,” he said. “If I’m being frank, the government isn’t particularly good at large IT projects. So we’re really a leading industry and the academic community drives us.”
Two researchers from Harvard University received awards to develop the platform for the new commons. NIH also granted eight more awards to other researchers and academic institutions across the country.
The goal of the projects for these awardees is to come up with an architecture that will support these large sets of data and make them more findable, accessible, interoperable and reusable. Each of the ten grantees will focus on these key capabilities in a variety of ways. But Thomson said their projects must make data available to the appropriate people and still work on cloud-agnostic architectures.
“What we do in the data commons has to be portable across all the major cloud providers,” he said.
Establishing this cloud platform will take more than just researchers, academia and non-profits. It will also draw on input and research from both the commercial and government sectors. The platform is intended to be one that others can even use for full commercial benefit, Thomson said.
“If they can use the data while respecting the consent with the patients and build products on it that make them money, that’s actually to the benefit of NIH, because we don’t have to pay as much to store the data or to compute on the data,” he said. “Plus, to [benefit] society in general, as those organizations are conducting research and putting knowledge back into the science community.”
NIH’s data commons will begin on a commercial cloud platform. Currently the three biggest cloud providers are Amazon Web Services, Microsoft Azure and Google Cloud.
Thomson said all three are represented on the teams granted awards for this project.