Federal data scientists try to find essential truths in a big and messy sea of data

Tom Temin@tteminWFED

November 5, 2021 12:33 pm

The COVID-19 pandemic produced a bumper crop of almost everything it touched, including research. One authoritative database reports that 4% of all scientific research published last year had to do with COVID. And, you guessed it, that produced data. Lots and lots of data. Now the Pacific Northwest National Laboratory is trying to find meaning in a murky sea of data. With more on the project, data scientist Neeraj Kumar joined Federal Drive with Tom Temin.

Interview transcript:

Tom Temin: Mr. Kumar, good to have you on.

Neeraj Kumar: Thank you for the invitation.

Tom Temin: First of all, how did this project end up at the Pacific Northwest National Laboratory of all places?

Neeraj Kumar: Yeah, great question. As you know, when the pandemic started, we all started learning about what we can do through data science, through the knowledge that we already have about SARS-CoV-2 or related coronaviruses. And one of the things that we do at Pacific Northwest National Laboratory, along with other national laboratories, is that we have an [Energy Department]-based user facility, which can be used to achieve fundamental understanding of various things which may not be available in the private sector. So the way it works is we try to gain knowledge from the existing data: how can we use our artificial intelligence and computational modeling tools to understand the kind of protein structures, you know, involved in SARS-CoV-2 genomes? As you know, SARS-CoV-2 is made up of 27 different proteins. Understanding the function of those proteins, how the inhibitor binds, how to develop a sort of fundamental understanding of antiviral candidates, which can act as inhibitors for those proteins, was the first step for making progress in that direction. So that’s how we were able to bring the data science along with experimental validation projects to PNNL.

Tom Temin: And how did you obtain the data? Did you just simply get it from open sources and then ingest it into the lab?

Neeraj Kumar: Yes, we collected data from a public platform, and we also generated synthetic data using our artificial intelligence and machine learning tools. It’s more called generating … or similar 3D structures of [a] therapeutic candidate that can be screened against the given targeted protein, which is of interest. So to answer your question, we did gather data from a public platform, but we also generated data here at Pacific Northwest National Laboratory.

Tom Temin: And do you store it locally? Do you store it in your supercomputer? Or is all of this data in a commercial cloud?

Neeraj Kumar: We do have high-performance computing here at the laboratory. So we store all of our data in [a] supercomputer. And at the same time, we have DOE-based user facilities, which are like high-performance computing directly funded by DOE. And we use massive computing resources, not only to store data, but to run our models and perform physics-based modeling to understand that data.

Tom Temin: And you’re trying to learn more about the proteins and the meaning of all of this, and to get some answers about how it’s all constructed. You’re a data scientist, so how do you know what questions to apply to this data? Because those would seem to need to come from someone medical.

Neeraj Kumar: Right. On the project that we were working on, we were collaborating with experimentalists who were running experiments in the laboratory, as well as medicinal chemists who were working on the characterization of the protein or trying to predict the 3D structure of the protein in the lab. So in the beginning, we knew there were existing 3D structures, not for SARS-CoV-2 but for other SARS-CoV proteins, so we could utilize that information in order to predict the structure of critical candidates for SARS-CoV-2. And those were the questions we had in the beginning. How can we efficiently use our data science tools to predict the structure just from the sequence, you know, because at that time, in the beginning, we did not have a 3D structure that could be used as a protein target to design antiviral candidates, you know, at least computationally, as well as using some of our data science tools.

Tom Temin: We’re speaking with Neeraj Kumar. He’s a data scientist at the Pacific Northwest National Laboratory. Were you able to get answers in time? Because now the vaccines are out, and they are predicated on knowing something of the mechanics of all of this to be able to work.

Neeraj Kumar: Yes, we were able to get some fundamental understanding of the target protein that we were working on. One of our [goals], when we develop data science tools or modeling tools, is always to accelerate scientific discovery through fundamental and applied science that can help the public and private sectors utilize those tools. But at the same time, we are really trying to understand how that protein functions: what is its role within the genome? Or when it interacts with the other protein courier, how does it function? So that was our goal: to understand the function, structure and dynamics of those proteins while making antiviral candidates. So I think we got to the point where we were like, “Oh, we understand this particular protein target through this research.”

Tom Temin: And from a computational standpoint, was the data that you gathered mostly complete, and was it mostly in a format that was usable? From what I’ve read of the project, it was kind of messy, and you had to do some cleanup.

Neeraj Kumar: Data is always messy. You can’t use the raw data collected from a public platform or coming from an instrument or a measurement. You have to do data cleaning, data managing, in order to make it usable for the machine learning tools or modeling tools that you have been trying to train with that data. So there are various steps involved in cleaning the data in order to use it the way we would like to use it for the AI tools. So that’s the expensive step in data science. If you don’t understand what a data type means and how you can represent that data type and extract the knowledge from the data through machine learning, then you won’t be able to answer the question that you have in mind while building those tools or using that data from different platforms.
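
[Editor’s illustration: the cleaning steps Kumar describes can be pictured with a small Python sketch. This is not PNNL’s pipeline. The record layout (`name` and `sequence` fields), the sample values, and the function name `clean_records` are all hypothetical, chosen only to show typical steps: normalizing text, dropping incomplete or malformed rows, and removing duplicates.]

```python
def clean_records(raw_records):
    """Return cleaned copies of usable records and skip the rest."""
    seen_sequences = set()
    cleaned = []
    for record in raw_records:
        # Normalize: trim stray whitespace and use one letter case.
        name = (record.get("name") or "").strip()
        sequence = (record.get("sequence") or "").strip().upper()
        # Drop rows with a missing field or non-letter characters.
        if not name or not sequence.isalpha():
            continue
        # Drop rows whose sequence has already been kept once.
        if sequence in seen_sequences:
            continue
        seen_sequences.add(sequence)
        cleaned.append({"name": name, "sequence": sequence})
    return cleaned


# Hypothetical raw input with the usual problems mixed in.
raw = [
    {"name": "protein_a ", "sequence": " mkvlt "},     # untidy but usable
    {"name": "protein_a_copy", "sequence": "MKVLT"},   # duplicate sequence
    {"name": "protein_b", "sequence": None},           # missing value
    {"name": "protein_c", "sequence": "MK?LT"},        # malformed entry
    {"name": "protein_d", "sequence": "GASTR"},        # clean
]

cleaned = clean_records(raw)
print(cleaned)
```

[Of the five hypothetical rows, only two survive. Real scientific data needs far more domain-specific checks, which is why Kumar calls cleaning the expensive step.]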

Tom Temin: And what is the status of the project at this point, now that, again, we are fairly late in the game here with the vaccines, and the fight is over who takes them and who doesn’t?

Neeraj Kumar: So we are at a stage where we are working to really understand the fundamental aspects of the proteins involved in SARS-CoV-2 and their function. I believe we have made significant progress in understanding their function and how they interact with small-molecule candidates. But there are also other pieces that we are working on, such as their dynamics, using long-range molecular modeling simulations, which utilize the high-performance computing that we have available through the DOE platform and which may not be available on a public platform. So that’s one of our [goals]. And those simulation and modeling tools can always be used to build future data science tools that could be utilized when something like this happens in the future. So be prepared for the future. You never know what the surprises are. So that’s where we are.

Tom Temin: And it looks like you have learned a lot about the applications of artificial intelligence in this whole field.

Neeraj Kumar: Yes, the application of artificial intelligence is humongous in every aspect of medicinal chemistry and computational chemistry. It basically helps you to expedite day-to-day [jobs] while trying to understand protein structure and function or other sorts of things. It cannot solve everything, but it accelerates your job a little bit, in a way where you can make a decision on the fly or automate your tools so that you are not doing the routine work.

Tom Temin: Let’s hope if SARS-CoV No. 20 comes along, we’ll be prepared.

Neeraj Kumar: Absolutely. I think we would be better prepared if something like this comes in the future because of the lessons learned. We will have more data, very well curated data, to extract knowledge, to extract patterns, to build robust models to make decisions in terms of therapeutic design. Doing that not only computationally through data science tools, but also with validation through robotics systems in the experimental lab, would expedite things in the future.

Tom Temin: Neeraj Kumar is a data scientist at the Pacific Northwest National Laboratory. Thanks so much for joining me.

Neeraj Kumar: Thank you very much for having me.