At DARPA, long-running effort aims to help address malware hidden in digital documents

The typical user never sees the thousands of pages that detail how to write the code that lets a digital document accept content or provide other functionality.

Vanessa Roberts

October 2, 2023 8:54 am

Malware tucked away in digital documents — from word processing files and PDFs to presentations and spreadsheets — continues to wreak havoc.

To most users, such as agency employees who rely on digital files to do their jobs, a Microsoft Word doc is a few hundred or a few thousand words on a “page.” But behind that single document lies code that lets it take in content or data, display it and perform the other functions a user wants and needs, explains Sergey Bratus, program manager for the Safe Documents (SafeDocs) project at the Defense Advanced Research Projects Agency.

“Standards that describe our most prevalent kinds of documents are still written in English, are still interpreted by developers. And do you know what size those documents are? Your typical document format is over a thousand pages long. And that may be just for the poor document, not the potential includes [by the user] in the document,” Bratus told Tom Temin on the Federal Drive.

Getting to the root source of hidden payloads in digital documents

Since 2018, a team of researchers at DARPA has been working on the SafeDocs program. The effort has two goals: to create tools that help coders safely develop the code that describes and parses digital document formats, and to create tools that validate existing documents so that organizations can ensure the code beneath a document neither contains nor provides cover for malicious payloads.

“Over 80% of all known vulnerabilities are vulnerabilities in the code that interprets input data that does the data intake. So that’s how often mistakes happen with automatically generated code from an unambiguous description,” Bratus explained. “This should not happen. And this is the kind of tools and theory and development methodology that DARPA set out to create.”

To understand the complexity of the challenge and see why digital documents have been the target of attacks, take a look at this Library of Congress explainer created to help with preservation of digital documents: “Format Descriptions: Explanation of Terms.” The size of the possible attack surface is readily apparent.

Now, factor in the realization that cybersecurity tools such as firewalls, application proxies, antivirus scanners and the like do nothing to stop attacks using digital documents as cover, as Bratus notes on the DARPA website. He further points out that “attacker bypasses of such mitigations exploit incompleteness of the mitigations’ understanding of the data format to exploit the still-vulnerable targets.”

Creating a digital documents safety toolkit

To help organizations address the specific threat posed by digital documents, SafeDocs has developed what DARPA describes as “high assurance parsers for extant electronic data formats and novel methodologies for comprehending, simplifying and reducing these formats to their safe, unambiguous, verification-friendly subsets.”

So far, the SafeDocs team, working with partners in industry and the development community, has created 13 tools to help address document attacks. (Scroll to the bottom of the SafeDocs hub site to find them.)

The SafeDocs project set out to “create mechanized, machine-readable, human-intelligible descriptions of the data formats so that you could generate your data intake code,” Bratus said.
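To make the idea of generating intake code from a machine-readable description concrete, here is a minimal Python sketch. It is not one of the SafeDocs tools: the "TOYF" format, the `HEADER_SPEC` table and the `parse` function are all hypothetical, invented only to show a parser driven by a declarative description rather than by hand-written logic.

```python
import struct

# Hypothetical toy format (an assumption for illustration, not a real standard).
# The header is described as data: field name plus a struct type code.
HEADER_SPEC = [
    ("magic", "4s"),    # four raw bytes, must equal b"TOYF"
    ("version", "B"),   # one unsigned byte
    ("body_len", "H"),  # unsigned 16-bit length of the body
]


def parse(data: bytes) -> dict:
    """Parse a toy document, rejecting anything the description does not allow."""
    # Build one big-endian struct format string from the declarative spec.
    fmt = ">" + "".join(code for _, code in HEADER_SPEC)
    size = struct.calcsize(fmt)
    if len(data) < size:
        raise ValueError("truncated header")

    names = [name for name, _ in HEADER_SPEC]
    fields = dict(zip(names, struct.unpack(fmt, data[:size])))

    if fields["magic"] != b"TOYF":
        raise ValueError("bad magic")

    # No slack allowed: the body must be exactly as long as the header claims,
    # so there is no room for unaccounted-for bytes to hide in.
    body = data[size:]
    if len(body) != fields["body_len"]:
        raise ValueError("body length does not match header")

    fields["body"] = body
    return fields


if __name__ == "__main__":
    sample = b"TOYF" + bytes([1]) + (5).to_bytes(2, "big") + b"hello"
    print(parse(sample))
```

The design choice worth noting is that the parser accepts only input that matches the description exactly and raises an error on everything else, including trailing bytes.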

Currently, SafeDocs has made public four types of tools to help validate and parse digital document formats:

Three resources for PDFs
Three programmer resources for describing data formats and auto-generating parsing code
Four tools for understanding document collections and format rules
Three tools to understand behavior of existing parser code

The tools help take ambiguity out of the formatting process because ambiguity is what ultimately creates insecurity, Bratus said.

Take a PDF document, as an example. “There are about 12 to 15 ways to embed JavaScript in the structure of the PDF document. Any disagreement about where and what is embedded in the document is a way for the malware to hide whatever it might hide,” Bratus explained. “It may be malicious JavaScript. It may even be a social engineering trick of the kind, ‘Go there and type your password there.’ So an unambiguous interpretation is a secure interpretation because it allows your guard software to act with full information on what’s in the document.”
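The point about disagreement can be illustrated with a short Python sketch. It does not parse PDF; it uses a hypothetical `key=value;key=value` format, and the three function names are invented for this example. Two lenient parsers resolve a duplicated key differently, so a scanner and a viewer built on them would see different documents, while a strict parser refuses the ambiguous input outright.

```python
def _pairs(text: str):
    """Yield (key, value) pairs from a hypothetical 'k=v;k=v' toy format."""
    for pair in text.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            yield key, value


def parse_first_wins(text: str) -> dict:
    """Lenient parser: the first occurrence of a key is kept."""
    result = {}
    for key, value in _pairs(text):
        result.setdefault(key, value)
    return result


def parse_last_wins(text: str) -> dict:
    """Lenient parser: a later occurrence of a key overwrites earlier ones."""
    result = {}
    for key, value in _pairs(text):
        result[key] = value
    return result


def parse_strict(text: str) -> dict:
    """Unambiguous parser: a duplicated key is an error, not a judgment call."""
    result = {}
    for key, value in _pairs(text):
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


if __name__ == "__main__":
    doc = "title=report;action=none;action=run_script"
    print("scanner sees:", parse_first_wins(doc)["action"])
    print("viewer sees: ", parse_last_wins(doc)["action"])
    try:
        parse_strict(doc)
    except ValueError as err:
        print("strict parser rejects:", err)
```

In this toy case, a guard built on the first-wins parser would report a harmless document while a viewer built on the last-wins parser would act on the second value. That gap is the kind of hiding place Bratus describes.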

Listen to the full discussion between DARPA’s Sergey Bratus and the Federal Drive’s Tom Temin.
