Ever since the first Microsoft Word macro attack, documents have been a source of malware delivery. Thirty years later it’s still a problem. Word documents, PDFs, photographs, spreadsheets: they all remain potent delivery mechanisms for hackers. The Defense Advanced Research Projects Agency, DARPA, has for several years run a program called SafeDocs, aimed at creating documents that do not become attack surfaces. The Federal Drive with Tom Temin got an update from DARPA program manager Sergey Bratus.
Tom Temin So what is the latest? Why are we still plagued with these things? And what does DARPA feel can be done to somehow make these safe to deliver and safe to receive and open?
Sergey Bratus So, in a word, ambiguity. Standards that describe our most prevalent kinds of documents are still written in English and are still interpreted by developers. And do you know what size those documents are? Your typical document format standard is over a thousand pages long. And that may be just for the core document, not the potential includes in the document.
Tom Temin So that’s not content. That is just the software descriptor around it so that the computer can interpret it.
Sergey Bratus The format descriptor. But you’re getting a little bit ahead of us because right now, if you want to write a piece of software to interpret PDF, you would download the PDF standard. It has very recently become available for free. Previously you had to buy it from the International Standards Organization. Guess how many people made the investment before they started out creating a parser? And then you would have to work your way through a thousand pages of this document. Now, most of these pages describe validity requirements. This is what makes the document valid. So you have to check all of those conditions on all of those thousand pages, and your interpretation of those conditions must be exactly the same as that of the other developers writing a competing product. Now, what’s the chance that a human programmer would never make a mistake and that two human programmers would never disagree?
Tom Temin About the same as the number who bought the paid version before there was a free one? Nobody.
Sergey Bratus Exactly. Exactly. So what the SafeDocs program set out to do is to create mechanized, machine readable, human intelligible descriptions of the data formats so that you could generate your data intake code, or as we would call it, your parser, from those definitions. And just a tidbit of information: over 80% of all known vulnerabilities are vulnerabilities in the code that interprets input data, that does the data intake. So that’s how often mistakes happen. With automatically generated code from an unambiguous description, this should not happen. And this is the kind of tools and theory and development methodology that DARPA set out to create.
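The idea of generating intake code from a machine-readable description can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy three-field header, the names RECORD_SPEC and make_parser, and the validity rules. It is not one of the SafeDocs data description languages, only a small instance of the same principle, in which the format is declared once as data and the parser is derived from that declaration.

```python
import struct

# Hypothetical toy header, declared as data instead of prose.
# Each entry: field name, struct layout code, validity check.
RECORD_SPEC = [
    ("magic", ">4s", lambda v: v == b"DOC1"),
    ("version", ">B", lambda v: v in (1, 2)),
    ("page_count", ">H", lambda v: 1 <= v <= 1000),
]


def make_parser(spec):
    """Derive a parse function from a declarative field list."""
    fields = [(name, struct.Struct(code), check) for name, code, check in spec]

    def parse(data: bytes) -> dict:
        offset, record = 0, {}
        for name, layout, check in fields:
            if offset + layout.size > len(data):
                raise ValueError(f"truncated input at field {name!r}")
            (value,) = layout.unpack_from(data, offset)
            if not check(value):
                raise ValueError(f"invalid value for field {name!r}: {value!r}")
            record[name] = value
            offset += layout.size
        # Reject leftovers so each input has exactly one accepted reading.
        if offset != len(data):
            raise ValueError("unexpected trailing bytes")
        return record

    return parse


parse_header = make_parser(RECORD_SPEC)
header = b"DOC1" + bytes([2]) + (3).to_bytes(2, "big")
print(parse_header(header))
```

Because every check lives in the declaration, two teams that generate their parsers from the same declaration cannot disagree about which inputs are valid.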
Tom Temin All right. And so the alternative, then, looks like what? You mentioned ambiguity at the outset. How does that come into this?
Sergey Bratus So imagine your typical PDF file. Now, hackers have a sport. They would create a PDF file that would render one way in one reader, say Adobe Acrobat, a different way in another reader, say Google Chrome, and yet another way in macOS’s Preview, and you’d be looking at the same file. And then the question is, what’s the real content there? Now think of your antivirus. What will that kind of software see when it looks at that file? Will it see what Acrobat sees, or what Preview sees, or what Chrome sees, or something completely different? Ambiguity is insecurity. Ambiguity is something that stands in the way of checking documents or even trusting documents. There was recent research from a university in Bochum, Germany, that showed that even for cryptographically signed PDF documents such as invoices, you may in fact tamper with the document so that you would receive what appears to be a trillion dollar invoice from a company for service, and it would appear to be correctly signed by them. And yet, of course, it would be fake. So again, whenever software disagrees about the interpretation of the document, which is really just bytes or a string of bits at the lowest level, there you have insecurity; there you have inability to check what’s really inside it. That is to say, inability to protect. So as we say on the SafeDocs program, you can’t defend what you can’t define.
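The disagreement between readers described above can be reproduced in miniature. The sketch below uses a made-up "key=value;key=value" invoice format, not PDF, and three hypothetical readers (reader_first_wins, reader_last_wins, reader_strict) that differ only in how they treat a repeated key, a case the imaginary specification never settled. All names and the format itself are assumptions for illustration.

```python
def parse_pairs(text):
    """Split 'k=v;k=v' text into a list of (key, value) pairs."""
    pairs = []
    for item in text.split(";"):
        if item:
            key, _, value = item.partition("=")
            pairs.append((key, value))
    return pairs


def reader_first_wins(text):
    """One plausible reading: the first occurrence of a key counts."""
    result = {}
    for key, value in parse_pairs(text):
        result.setdefault(key, value)
    return result


def reader_last_wins(text):
    """Another plausible reading: later occurrences overwrite earlier ones."""
    return dict(parse_pairs(text))


def reader_strict(text):
    """Unambiguous reading: a repeated key makes the document invalid."""
    result = {}
    for key, value in parse_pairs(text):
        if key in result:
            raise ValueError(f"duplicate key {key!r}")
        result[key] = value
    return result


invoice = "to=ACME;amount=100;amount=1000000"
print(reader_first_wins(invoice)["amount"])  # prints 100
print(reader_last_wins(invoice)["amount"])   # prints 1000000
```

A scanner built on the first reader and a viewer built on the second would each see a different invoice in the same bytes. The strict reader removes the gap by refusing the input, which is the kind of decision a disambiguated specification makes once for every implementation.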
Tom Temin We’re speaking with Dr. Sergey Bratus. He’s a program manager in the Information Innovation Office at DARPA. So then you have a way, again, of automating something that is unambiguous. And how does that work in a practical sense?
Sergey Bratus So in the practical sense, format experts, format creators, the people who create the standards and who maintain them, will use our tools and have used our tools to create the machine readable specification of the format. That way, any ambiguity that they might have would be worked out before the standard is published, and no ambiguity or mistakes in the implementation would find their way into the software that is supposed to be your shield against the malicious content. Our performers filed over a hundred disambiguation edits against the PDF 2.0 standard, which is the most recent international standard for PDF, and we are one of the most prolific reporters of such disambiguation edits. Most of them have been entered into the standard, and now developers have less room to disagree.
Tom Temin And walk us back, then. If you have an unambiguous document and it can be read the same way in Chrome and a PDF reader from Apple or whatever the case might be, how does that keep malware out? Or does it simply help the detectors know that there’s malware in there?
Sergey Bratus It helps the detectors. But a better question is, how does the malware hide in something like PDF? PDF was meant to be a format that is true to the printed page, and then that format grew to encompass forms. Forms needed some checking, so JavaScript was added to what the format can include. Let’s say your organization does not want you to fill out forms, does not want any JavaScript to touch your computer. How would it find and remove that JavaScript from a PDF document? Well, it must have an automatic piece of software that interprets the structure of the document, finds all the places where JavaScript is permitted and removes it from them, and then you get yourself a safe page that just renders and does nothing else. Now, there are about 12 to 15 ways to embed JavaScript in the structure of the PDF document. Any disagreement about where and what is embedded in the document is a way for the malware to hide whatever it might hide. It may be a malicious JavaScript; it may even be a social engineering trick of the kind, “Go there and type your password there.” So an unambiguous interpretation is a secure interpretation because it allows your guard software to act with full information on what’s in the document.
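The find-and-remove step described above can be sketched on a toy model. The code below does not parse PDF files. It assumes the document structure has already been interpreted into nested Python dictionaries and lists, and the sample tree is hand-built for illustration. The key names /JS, /S with the value /JavaScript, /OpenAction, /AA and /Names follow PDF naming, but the sketch checks only a few of the embedding locations mentioned in the interview, and strip_javascript is a hypothetical helper, not a SafeDocs tool.

```python
SCRIPT_KEYS = {"/JS", "/JavaScript"}


def is_script_action(value):
    """True for an action dictionary whose type is JavaScript."""
    return isinstance(value, dict) and value.get("/S") == "/JavaScript"


def strip_javascript(node, path="", removed=None):
    """Return (clean_copy, removed_paths) for a nested dict/list tree."""
    if removed is None:
        removed = []
    if isinstance(node, dict):
        clean = {}
        for key, value in node.items():
            here = f"{path}{key}"
            if key in SCRIPT_KEYS or is_script_action(value):
                removed.append(here)  # drop the script-bearing entry
                continue
            clean[key] = strip_javascript(value, here, removed)[0]
        return clean, removed
    if isinstance(node, list):
        clean = []
        for index, value in enumerate(node):
            here = f"{path}[{index}]"
            if is_script_action(value):
                removed.append(here)
                continue
            clean.append(strip_javascript(value, here, removed)[0])
        return clean, removed
    return node, removed  # leaf values pass through unchanged


# Hand-built sample tree (assumption: already parsed, unambiguously).
document = {
    "/Type": "/Catalog",
    "/OpenAction": {"/S": "/JavaScript", "/JS": "app.alert('hi')"},
    "/Names": {"/JavaScript": {"/Names": ["init", {"/S": "/JavaScript", "/JS": "init()"}]}},
    "/Pages": {"/Kids": [{"/Type": "/Page", "/AA": {"/O": {"/S": "/JavaScript", "/JS": "x()"}}}]},
}

clean, removed = strip_javascript(document)
print(removed)
```

The filter is only as good as the tree it walks. If the guard software and the viewer build different trees from the same bytes, a script can sit in a branch the filter never visits.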
Tom Temin And what is the status now of the DARPA implementation of these developer tools and what’s going to happen next?
Sergey Bratus Well, many of those tools have been released to GitHub. They are available to the public. Chief among those tools are data description languages, the means of creating an unambiguous description of a format and, we hope, the means with which all future data standards would be defined. We see a world in which, maybe ten years from now, any data format and any service that takes in data would be using these SafeDocs technologies. And then there are tools to examine the internals of documents, to trace the execution of programs that interpret documents, and to give transparency to the engineer, to the researcher, to the security professional as to what’s in the document, and to remove any ambiguity that might still exist there.
Tom Temin Got it. So ambiguity can be removed in existing documents with the right tools, and then you can scan them and you’ll know for sure.
Sergey Bratus That is correct. And then for the future, one would want to define, describe and implement new formats and new software that interprets those formats from the get-go using these tools, eliminating ambiguity and eliminating insecurity, eliminating any place for the malware or for the malicious or unintended payloads to hide.