The National Cancer Institute is turning to the power of the cloud to help cure cancer.
The NCI will award several contracts under a broad agency announcement by the fall for vendors to demonstrate how they can make available 2.5 petabytes, or 2.5 million gigabytes, of data to the research community.
George Komatsoulis, the National Cancer Institute’s deputy director of the center for biomedical informatics and IT, and acting chief information officer, said that through the Cancer Genome Atlas project, NCI is trying to reduce the cost for medical research institutions to access the data and to make sharing and collaboration easier.
He said the Cancer Genome Atlas program will gather and analyze the data of about 11,000 people with cancerous tumors to determine if there are similarities in how different types of cancer, whether lung or breast or colon, grow, change and react to treatments.
“The only way we will be able to figure this out is to look at a large number of tumors, do sequencing of that patient’s DNA, as well as other characterizations of those cells, and that will give us an idea of how the various tumors in particular locations are related to each other, and we can correlate that with how well a patient responds to a particular treatment and really advance the ability to get better treatments for cancer,” Komatsoulis said. “What that means is we will generate a lot of data. By the end of this fiscal year, we will have generated 2.5 petabytes of data from this individual project. Well, 2.5 petabytes of data, just to store a copy of it and ignoring everything else, the analytics or anything, is a $2 million a year proposition, and it’s not a particularly efficient proposition if you have a couple of hundred research institutions who all want to keep a copy of this information.”
Additionally, downloading 2.5 petabytes of data across a 10 Gigabit Ethernet connection would take a minimum of 25 days, and that’s assuming no latency and no one else using the connection, Komatsoulis said.
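The arithmetic behind that estimate is easy to check. A minimal back-of-the-envelope sketch, assuming decimal units (1 petabyte = 10^15 bytes) and a perfectly saturated, lossless link:

```python
# Ideal transfer time for 2.5 PB over 10 Gigabit Ethernet.
# Real-world protocol overhead and contention would push this higher,
# which is consistent with the 25-day floor cited above.

DATA_BYTES = 2.5e15          # 2.5 petabytes
LINK_BITS_PER_SEC = 10e9     # 10 Gb/s line rate

seconds = DATA_BYTES * 8 / LINK_BITS_PER_SEC
days = seconds / 86400
print(f"{days:.1f} days")    # roughly 23 days at the theoretical maximum
```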
These two challenges — the size of the database and the cost to maintain and use it — forced the NCI to think differently about how it could make this data available.
“We concluded we needed a computing environment or a cloud that co-located the data, as well as computing resources, servers and the like, and had an application programming interface, so we could have secure access to the data and resources available,” Komatsoulis said. “So that smart graduate student doesn’t download the data. Rather, he or she writes a piece of software, uploads it to this cloud, runs the data and brings back the results.”
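The model Komatsoulis describes is sometimes called "moving the code to the data." The sketch below illustrates the idea; the `CloudWorkspace` class and its `submit` method are hypothetical stand-ins for whatever interface the pilot clouds ultimately expose, not an actual NCI API.

```python
# Hypothetical illustration of the compute-to-data model: the dataset
# stays server-side, and researchers upload analysis code instead of
# downloading petabytes of raw data.

class CloudWorkspace:
    """Holds a large dataset in the cloud; callers submit analysis code."""

    def __init__(self, dataset):
        self._dataset = dataset  # never leaves the cloud environment

    def submit(self, analysis_fn):
        # Run the researcher's code next to the data and return only
        # the small result, not the raw input.
        return analysis_fn(self._dataset)


# A researcher's analysis: count tumor records from a given site.
workspace = CloudWorkspace([
    {"site": "lung", "responded": True},
    {"site": "breast", "responded": False},
    {"site": "lung", "responded": False},
])
result = workspace.submit(
    lambda data: sum(1 for rec in data if rec["site"] == "lung")
)
print(result)  # prints 2
```

Only the aggregate crosses the wire, which is what makes the approach tractable at petabyte scale.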
Through this BAA, NCI will not try to build something itself. Instead, it will borrow a page from the Air Force, award pilots to three vendors and hold a bake-off.
Each winner will receive funds to build a prototype using similar data, and then scientists will vote on the best one.
“The project itself is set up into three phases: An initial phase of six months, a second phase of nine months, at which point the clouds are due to be delivered, and the final nine-month phase is evaluation phase where we will make it open to the research community and let them try things out and vote with their feet,” he said.
This project is the NCI’s first real venture into the public cloud. NCI has been using private clouds for some time, Komatsoulis said, including for the Cancer Genomics Hub, a repository for storing files from high-volume genome sequencing that runs on an infrastructure-as-a-service cloud instance.
The second private cloud NCI is using is for its clinical data management system, which is running on a software-as-a-service setup.
“We haven’t moved nearly as much into the public cloud space yet. This is one of the things we will continue examining over time, whether or not our various IT needs can be best done in the cloud, public or private, or in-house,” he said.
Komatsoulis said NCI has been buying email-as-a-service from its parent bureau, the National Institutes of Health, for many years.