Before jumping into AI, make sure your data is ready
January 6, 2020 1:48 pm
This content is sponsored by Arrow NetApp.
When it comes to data, it’s easy for federal managers to get caught up in the hype and put the cart before the horse. Everyone wants to start implementing AI, machine learning and robotic process automation as quickly as possible to cut costs and free workers from repetitive, mundane tasks. But most agencies are just getting started with laying the foundation that will make those tools actually useful. And that foundation is a data strategy.
“Accessing bad data will render every AI tool useless. You have to spend that time and effort in creating high quality normalized data sets,” Kirk Kern, Chief Technology Officer at NetApp, said during a Dec. 5 panel. “There’s a lot of iterative process that’s involved as you work through these technologies. That’s where I think there are some interesting technologies that actually help at the storage layer, at the data layer, so something as simple as copy data management can be extremely useful for an analyst, or an AI practitioner.”
A 2019 RAND Corporation study of AI technologies in DoD, The Department of Defense Posture for Artificial Intelligence, reinforces this point. The report finds that “Success in deep learning is predicated on the availability of large labeled data sets and significant computing power to train the models.” Making data useful is one of the main goals of the Office of Management and Budget’s new federal data strategy. In fact, one of the two main components of that strategy revolves around practices, which include helping agencies to:
Recognize and benefit from the value of the data by building a culture that values data and promotes public use;
Govern, manage and protect data; and
Promote efficient and appropriate data use.
Federal agencies are all starting from different places in this effort. For Kris Rowley, chief data officer at the General Services Administration, this starts with the creation of an enterprise data strategy.
“What you really need to get started is the upper level engagement around understanding what the needs, challenges and priorities are and an accessible route to get to those data practitioners to help proving how you solve it,” Rowley said during the panel. “As you build momentum and you prove value, then you can start to think more about automation and the technology aspects around it.”
But other agencies don’t yet have a full enough understanding of their data to start identifying the challenges and priorities.
“I think the biggest hurdle is getting away from paper,” Oki Mek, a senior advisor to the chief information officer at the Department of Health and Human Services, said during the panel. “I think you have to make sense of the data before you have a data strategy. And you can’t analyze paper.”
And making sense of the data also involves filtering out corrupt or inaccurate data from the overall data set.
“It’s foundational to have quality clean uniform data that you can leverage that’s in large quantity, so if you don’t have those in place, you’re not going to be able to trust the outcomes that you have with the AI,” David Maron, a statistician at the Department of Veterans Affairs, said during the panel.
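As a rough illustration of what filtering out corrupt or inaccurate records can look like in practice, here is a minimal Python sketch. The `Record` fields and the validity rules (a plausible age range, a two-letter state code) are assumptions made up for this example, not something the panelists described.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical fields, chosen only for illustration.
    record_id: str
    age: Optional[int]
    state: Optional[str]


def normalize(record: Record) -> Record:
    """Put fields into one uniform shape (trimmed ID, upper-case state)."""
    state = record.state.strip().upper() if record.state else None
    return Record(record.record_id.strip(), record.age, state)


def is_valid(record: Record) -> bool:
    """Assumed quality rules: ID present, plausible age, two-letter state."""
    return (
        bool(record.record_id)
        and record.age is not None
        and 0 <= record.age <= 120
        and record.state is not None
        and len(record.state) == 2
    )


def clean(records: List[Record]) -> List[Record]:
    """Normalize every record, then drop invalid ones and duplicate IDs."""
    seen = set()
    cleaned = []
    for raw in records:
        rec = normalize(raw)
        if not is_valid(rec) or rec.record_id in seen:
            continue
        seen.add(rec.record_id)
        cleaned.append(rec)
    return cleaned
```

In a real agency data set the rules would come from the data owners; the point of the sketch is only that normalization and validation happen before any model sees the data.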
It all comes back to having usable data, and being able to access and trust it.
“If you look at any successful AI project, it’s highly contingent on having access to massive amounts of data and information,” Kern said. “The interesting aspect of that, however, is that information can be housed in multiple forms: it can be structured, unstructured and even object technology now. So you have these three dissimilar technologies, all needing to come together in order to generate a successful return on your investment for an AI project.”
Kern said he sees many customers trying to consolidate those three kinds of data together into data lakes, or even data oceans. But then they run into the problem of compliance and restrictions. What happens when you have personally identifiable information, healthcare data, and maybe even classified data to deal with?
“It becomes a challenge from a metadata perspective in terms of access control and how do you manage that data in those lakes,” Kern said. “And so what we’re starting to see now is an emergence of a technology called a polystore. Polystores let you store all that information in its native format but then aggregate the metadata at a point that is searchable and accessible with the proper access controls.”
Using that concept, NetApp has developed a Data Pipeline for AI systems that not only aggregates data in its different forms but also unifies access to data at the edge, where it is produced or received; organizes and processes data in the core, or enterprise; and transports or manages data in the clouds. This gives government agencies an advantage in using AI to improve mission success, because the engineering behind data management at scope and scale is prebuilt. AI/ML resources then use the Data Pipeline to produce intelligent outcomes from the information.
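The polystore idea Kern describes, in which data stays in its native store while metadata is aggregated in one searchable, access-controlled place, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not NetApp's Data Pipeline or any real product's API: the `CatalogEntry` and `MetadataCatalog` names, the label-based access rule and the example locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    location: str                     # where the data lives, in its native store
    store_type: str                   # "structured", "unstructured" or "object"
    tags: FrozenSet[str]              # searchable keywords describing the data
    required_labels: FrozenSet[str]   # assumed labels a user must hold, e.g. "pii"


class MetadataCatalog:
    """Holds metadata only; the underlying data is never copied or moved."""

    def __init__(self) -> None:
        self._entries: List[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, tag: str, user_labels: AbstractSet[str]) -> List[CatalogEntry]:
        """Return entries matching the tag that the user is allowed to see."""
        return [
            e for e in self._entries
            if tag in e.tags and e.required_labels <= user_labels
        ]
```

A user holding no labels would see only unrestricted entries, while one holding the hypothetical `"pii"` label would also see entries that require it. A production system would enforce access in the underlying stores as well, not only in the catalog.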
Kern compared the current state of AI to that of cloud computing about a decade ago. Once the National Institute of Standards and Technology published its architectural framework for cloud computing, standardization and interoperability followed, until terms like infrastructure-as-a-service became mainstream. He said NIST recently put out a request for information to begin putting together a similar framework for AI.
“I would encourage everyone if you haven’t started to follow NIST, start to follow that AI framework, because that’s going to be pivotal on how we generate better return on investment for any AI projects that you might want to engage in in the future,” Kern said.