Insight by Cloudera

Getting the most out of your cybersecurity data requires normalization and a tiered storage approach

Cybersecurity has become a data-driven activity. Predicting, responding to, mitigating and analyzing cybersecurity events after the fact all rely on having the applicable data available at the right time. That data originates in a variety of sources, including network device logs, cybersecurity applications, user traffic and third-party threat intelligence.

That variety presents two challenges for cybersecurity practitioners. One is that varying formats make data difficult to mine. The other is that storing data in multiple silos makes it more difficult to use than it has to be.

By unifying data into single stores or lakes, using open standards and applying tools to overcome format differences, organizations can speed security operations while also preserving what might be called the “long context” – the trends that operators can tease out of big data stores.

Carolyn Duby, principal solutions engineer and cybersecurity lead at Cloudera, put the problem this way: “Each company or organization has a plethora of security products. The problem is that each one is great at doing what it’s good at, but there’s no way to get the overall picture of what is going on, in all of these different products. They all have slightly different interfaces, they all present data in a slightly different way.”

The challenge becomes, Duby said, “How do you take all that data, crunch it up, and then put it into a manageable, scalable platform?”

She said a hybrid cloud-data center platform solves a couple of problems. Agencies are reluctant to put all of their data in a cloud because of egress charges. But by policy and practice, they want to minimize the capital investment of ever-expanding storage infrastructure. Therefore, a best practice, according to Duby, is to treat the elastic cloud and the data center as a single, tiered storage system. Store the “hot,” recent data you need for immediate use locally, in the data center.

“And then,” Duby said, “you can push it up to the cloud [to] keep a very long context.” She noted that many of the cybersecurity tools are simply not scalable enough to retain sufficient data for long context analysis.

“The hybrid configuration would kind of give you the best of both worlds. You’ll have your hottest, most recent data on premise. Then, with fast access, you put your data on your entire context up into the cloud,” Duby said. In subsequent trend or after-action analysis, the agency can move the compute resources to the data, rather than moving the data back down to the analytic application.
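As a rough illustration of the tiering idea Duby describes, the sketch below routes an event to an on-premises "hot" tier or a cloud "long context" tier based on its age. The 30-day window, the tier names and the choose_tier function are assumptions made for the example, not anything specified by Duby or Cloudera.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window for the on-premises "hot" tier.
# The 30-day value is an assumption for illustration only.
HOT_WINDOW = timedelta(days=30)


def choose_tier(event_time, now, hot_window=HOT_WINDOW):
    """Return "on_prem" for recent events and "cloud" for older ones."""
    age = now - event_time
    return "on_prem" if age <= hot_window else "cloud"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
recent = datetime(2024, 1, 20, tzinfo=timezone.utc)
old = datetime(2023, 6, 1, tzinfo=timezone.utc)

print(choose_tier(recent, now))  # on_prem
print(choose_tier(old, now))     # cloud
```

In practice the cutoff would be set by how far back day-to-day investigations typically reach, with everything older kept in the cloud for trend analysis.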

With respect to rendering data from multiple sources interoperable, Duby calls that a challenge of log ingestion. “When the logs come in, they come in a plethora of formats. Larger organizations, they have literally thousands of formats.”

She recommends a two-step approach to normalizing data. First, establish a common schema for the data. Then, “pull out the relevant bits from each of those formats, and then put it into the right part of the schema.”

Duby added, “So basically, we’ve got to take all of the information of the logs that are coming in, and normalize them into a schema using an ontology, which is really the data model underneath.” The point of this exercise is to render data discoverable, and, with a consistent format, useful to a given analytic tool. Data stripped of its proprietary format is also more sharable among agencies, she noted.
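The two-step approach can be sketched in a few lines of code. The example below is a minimal illustration, assuming a made-up four-field common schema and two invented log formats (a space-delimited firewall line and a JSON proxy record). The field names, formats and parser functions are all hypothetical and do not represent any particular product or Cloudera's own data model.

```python
import json

# Step 1: a hypothetical common schema. Every normalized event has these fields.
COMMON_FIELDS = ("timestamp", "source_ip", "action", "product")


def parse_firewall(line):
    # Assumed format: "<timestamp> <source_ip> <ALLOW|DENY>"
    timestamp, source_ip, verdict = line.split()
    return {
        "timestamp": timestamp,
        "source_ip": source_ip,
        "action": verdict.lower(),
        "product": "firewall",
    }


def parse_proxy(line):
    # Assumed format: a JSON object with keys "time", "client" and "result"
    record = json.loads(line)
    return {
        "timestamp": record["time"],
        "source_ip": record["client"],
        "action": record["result"].lower(),
        "product": "proxy",
    }


# Step 2: one parser per source format, each filling the same schema.
PARSERS = {"firewall": parse_firewall, "proxy": parse_proxy}


def normalize(product, line):
    """Pull the relevant fields out of a raw log line into the common schema."""
    event = PARSERS[product](line)
    missing = [field for field in COMMON_FIELDS if field not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return event


fw_event = normalize("firewall", "2024-01-31T12:00:00Z 10.0.0.5 DENY")
px_event = normalize(
    "proxy", '{"time": "2024-01-31T12:00:05Z", "client": "10.0.0.7", "result": "ALLOW"}'
)
print(fw_event)
print(px_event)
```

Once both events share the same field names, a single query or analytic tool can work across them without knowing which product produced each record.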

To protect their cybersecurity-related data activities, Duby said, agencies must establish security and governance policies for the security-related data itself.

“What we want to be able to do,” Duby said, “is create a security and governance layer that works across multiple parts of the workflow as well as [wherever] the data travels. For example, when we push it up to the cloud, we want to make sure that all of those security and governance policies travel with it.”


Each company or organization has a plethora of security products. They all present data in a slightly different way. We really want to be able to bring all of that data into a common location, and normalize it into a consistent format.


You can have a hybrid solution where you have your most recent data in a data lake on premise. Then you can push it up to the cloud [to] keep a very long context, because that is a major problem in cybersecurity defense. A lot of the tools that we're using are not able to store that long context of data.

Featured speakers

  • Carolyn Duby

    Principal Solutions Engineer, Cyber Security Lead, Cloudera

  • Tom Temin

    Host, The Federal Drive, Federal News Network
