Researchers at Georgetown University’s Center for Security and Emerging Technology have been gathering all the open source information they can on the coronavirus and the illness it causes. They’ve created a public repository, known as CORD-19, that even the White House and federal agencies are using to better understand what’s going on. For details, Federal Drive with Tom Temin turned to the center’s director of Data Science, Dewey Murdick.
Tom Temin: Dr. Murdick. Good to have you on.
Dr. Murdick: Thank you so much, Tom. It’s really a pleasure.
Tom Temin: Tell us, first of all, about the Center for Security and Emerging Technology.
Dr. Murdick: Yeah, the Center for Security and Emerging Technology here at Georgetown University was started in December 2018 as an effort to connect policymakers to high-quality analysis of emerging technologies and their security implications. The focus primarily was on artificial intelligence to begin with. We’re a nonpartisan group, and we’re supporting academic work. But even more importantly, we’re trying to prepare a generation of policymakers, analysts and diplomats that can wrestle with the future technology dilemmas in a data-informed way.
Tom Temin: Boy, we’ve really got a tiger by the tail now. Tell us about the repository for coronavirus and what is in it, and how do you get the material into it?
Dr. Murdick: Great. The COVID-19 Open Research Dataset, or CORD-19, is a resource that has over 29,000 scholarly articles, 13,000 full text articles and a lot of content related to COVID-19 and its related coronavirus group. This has really formed a partnership with an incredible team. I got a call at about 5 p.m. on Friday a couple weeks ago from the White House and they gave a request: Wouldn’t it be nice to have a repository that was publicly available so people could actually start using machine learning and artificial intelligence methods to help answer important questions about this very dynamic disease? And so I called up a number of people, people who I’ve known to be unselfish, people who have technical resources that are relevant, and together we scoured what was publicly available and put together a first collection that was released on Monday.
Tom Temin: This is all unstructured data, in other words?
Dr. Murdick: Well, normally it would be, but thanks to some work by the Allen Institute for Artificial Intelligence, they have a software package that has been able to take all this really unstructured data and provide a nice initial structuring of it so people can use the data to start testing and developing their algorithms. So it’s technically unstructured in the sense that there are lots of big sections of text. But it also has metadata, such as the title, who authored it and other things like that.
Tom Temin: And so when someone wanting insight into what is going on goes to the repository, what do they do other than simply search through what comes up?
Dr. Murdick: Yeah, so there are already available resources where you can just search for articles and humans can interact with them, read them and the like. Which is a fantastic resource and really should be used by people. The problem, however, is most of these data repositories aren’t set up for machines to work with. They don’t have the licensing rights, and the data can’t be distributed. So what we’ve done is we’ve taken all this data and put it into a unified format that takes many, many hours off the average data scientist’s time to start using the data. Usually, when people start playing with data, over half or even as much as 90% of their time goes to getting the data into a form that they can use. We’ve tried to take that cost away. And so now they can just start building algorithms to answer important questions like, how long does the virus last on cardboard? How does it transmit? How does it incubate? Using all the information that’s coming in very rapidly every day.
Tom Temin: And could someone also use this, for example, to ferret out false reports and misinformation? Because Lord knows the Internet is full of that when it comes to this particular virus.
Dr. Murdick: We’re primarily using scholarly literature. Clearly we know that not all scientific findings are right. They have been wrong. It’s a dynamic process of learning. So the cool thing about being able to use this data is you can check consistency against other reports. Oh, this report’s very different than what everyone else is saying. Either that might mean it’s wrong, or maybe it’s right, and we’ve discovered something new. So yes, those kinds of things can be discovered.
Tom Temin: But in other words, it doesn’t ingest just general news?
Dr. Murdick: Not right now. Many people have access to news, but this data is very hard to get. This is dealing with a lot of substantive content about what’s happening on the ground and in clinical research, trying to figure out how COVID-19 transmits. How long does it last from spit or something like that on a cardboard surface? Those kinds of questions are being actively pursued and answered regularly, and we’re hoping to be able to run over all the data that we’ve assembled so people can answer these questions more rapidly.
Tom Temin: I was going to ask, are you monitoring the uptake of the repository, and is it pretty active so far?
Dr. Murdick: Yes. Currently the data repository is hosted by the Allen Institute for Artificial Intelligence and also Kaggle. Kaggle is part of Google, and it’s a competition website. Kaggle has opened a competition to answer a bunch of questions, and that first round is due the 16th of March. So this will allow us to see how many people are trying to participate in the challenge.
Tom Temin: Sure, and a lot of the sources of material for the repository cost a lot of money normally, these scholarly journals and so forth. Were you able to get free access to it for this purpose?
Dr. Murdick: Yeah. So the first round of this data repository was based on pre-print archives, open access content and then NIH funded research that has been made publicly available, manuscripts from that research. Now, thankfully, there’s been a lot of willingness of publishers to share content that’s normally behind a paywall, data that’s not available without paying $35 an article or something like that. That content is now being incorporated, and in future versions we’ll be able to have more of that paid content.
Tom Temin: And what about the links coming up? You mentioned Microsoft Academic Graph, and I’m not even sure what that is. Dimensions, PubMed and Semantic Scholar services will be linked to the repository. What is that all about?
Dr. Murdick: So there are wonderful engines continually churning over the internet trying to find information about COVID-19. Microsoft Bing and their service, Microsoft Academic, provide a way to access all that content in an easy search interface. In partnering with Microsoft, we’ve been able to link all the content that’s part of CORD-19 to Microsoft’s mining services. So now we can bring all the additional information that Microsoft has invested time and energy to create into this new resource. So basically, it translates into the fact that now we have an even more comprehensive data set.
Tom Temin: And do you envision this being useful in future situations that might be of a similar nature? I mean, in many ways, the administration is looking to what happened with Ebola, even though that’s a different piece of science than coronavirus. But the response mechanisms are not all that different. Could this have future use?
Dr. Murdick: Yes, Tom, I think you’re right. This is a new, valuable resource that’s really the fruit of unselfish collaboration from Microsoft, Allen AI and the Chan Zuckerberg Initiative, using research from the National Library of Medicine, and this is great for COVID-19. Once this crisis has passed, we hope this project will inspire new ways for machine learning to advance scientific research. I really think that’s probably going to be the biggest impact: as we develop new methods to be able to answer questions rapidly, as other crises come up in the future, we’ll be able to move even faster.
Tom Temin: Dewey Murdick is director of data science at Georgetown University’s Center for Security and Emerging Technology. Thanks so much for joining me.
Dr. Murdick: Thank you, Tom. It’s been a real pleasure.