Planning for artificial intelligence should start with the use cases. How might it benefit the organization? Who will it affect? Where can we get some quick wins?
After that, think about the data strategy to train algorithms and support ongoing AI deployment. Without a solid data strategy, an AI project could fail because of erroneous output.
With respect to use cases, “the overall picture of artificial intelligence is there’s a lot of fear about how it’s going to replace jobs. I think that fear is largely unfounded,” said Frank La Vigne, data cloud services global leader at Red Hat, on Federal Monthly Insights – AI/ML Systems. “Just the arc of history shows that automation tends to bring about more jobs than it takes away. Ultimately, AI is really about augmenting human intelligence.”
With respect to data, “it’s the deluge of data which is the real problem,” said Michael Epley, Red Hat’s chief architect and security strategist. The deluge means “we have too much work for humans to do. That’s really where the value of AI comes in, where we can use these tools to automate the process of sifting through that data.” AI’s pattern recognition can pull out important data requiring human attention, while vastly expanding the amount of data contributing to human understanding.
“So, it opens up options rather than closes them down,” Epley added.
The oil of AI
You can think of data as the supply chain for artificial intelligence, Epley and La Vigne said.
As a raw material, “data in its unrefined state, like oil, really can’t do much,” La Vigne said. “It has to go through a process of refinement. Our refinement is called data engineering.”
He cited large language models used in generative AI. “They consume mass quantities of information. The old phrase of garbage in, garbage out still applies,” La Vigne said. Corrupted, irrelevant or biased data will give subpar results, he said, and at worst “you can get a really biased and ineffective model.”
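The refinement step La Vigne describes can be as simple as filtering out records that are corrupted or duplicated before they ever reach a training pipeline. A minimal sketch, with hypothetical field names:

```python
# Minimal data-refinement sketch: drop records that are empty or exact
# duplicates before training. Field names ("text") are illustrative only;
# real data engineering involves far more extensive cleaning and review.

def refine(records):
    """Return records with empty and duplicate entries filtered out."""
    seen = set()
    clean = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text:        # corrupted or empty record: discard
            continue
        if text in seen:    # exact duplicate: discard
            continue
        seen.add(text)
        clean.append(rec)
    return clean

raw = [
    {"text": "Valid training example."},
    {"text": ""},                         # corrupted: empty
    {"text": "Valid training example."},  # duplicate
    {"text": "Another valid example."},
]
print(len(refine(raw)))  # 2 records survive refinement
```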
As an element in a supply chain, data presents potential cybersecurity risks too, Epley said.
“As we consume data, process data and do all the data management necessary to enable our AI workflows, these supply chains are getting more complex and less understood,” Epley said. “That introduces opportunities for malicious actors to taint the data, maybe intentionally bias our results.” He noted the potential for hackers to try to alter training data in order to push results a certain way.
To keep data secure and to ensure appropriate data for training algorithms, La Vigne and Epley said, make sure you have a complete picture of the provenance of your data, as well as of the tools you apply to refine it.
“One big part of it is understanding where your data is coming from and who’s contributed that data,” Epley said. “The same can be said about all other parts of your supply chain, how you’re managing that data through your data management tools. Filtering, enhancing or cleaning that data is going to all impact that data quality.”
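One way to make that provenance concrete is to record a cryptographic fingerprint of each data artifact alongside its source and contributor, so downstream consumers can verify nothing in the supply chain was silently altered. A hedged sketch, with illustrative field names rather than any standard schema:

```python
# Provenance sketch: hash each dataset artifact and log where it came from,
# so later stages can detect tampering. Record fields are hypothetical.

import hashlib


def provenance_record(data: bytes, source: str, contributor: str) -> dict:
    """Build a provenance entry for a dataset artifact."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "source": source,
        "contributor": contributor,
    }


def verify(data: bytes, record: dict) -> bool:
    """Check that the data still matches its recorded fingerprint."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


data = b"label,text\n0,example row\n"
rec = provenance_record(data, source="agency-export", contributor="team-a")
print(verify(data, rec))                # True: data is untouched
print(verify(data + b"tampered", rec))  # False: data was altered
```

The same pattern extends to the refinement tools themselves: hashing tool versions and configurations gives visibility into every hand that touched the data.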
La Vigne said the need for the right data with the right provenance underscores the importance of people in AI deployment.
“Good provenance and good governance are 80% people and 20% technology,” he said. Visibility into provenance also favors open sources of tools and data, he added.
“Even though there are developers scattered throughout the world,” La Vigne said, “everything is out on the table, and each part can be inspected. There are many eyes on it. That’s far better than a closed-source system, where you really have no idea what hands have touched the code.” He said government agencies should have visibility all the way to source code of AI algorithms, particularly large language models.
“All of our engineering at Red Hat is based primarily on open source,” La Vigne added.
With use cases chosen and data curated and engineered, agencies can borrow techniques from DevSecOps and apply them to MLOps, or machine learning operations. Techniques include integration of control points in the pipeline, and version and configuration management.
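The control points mentioned above can be thought of as gates between pipeline stages, much like the automated checks DevSecOps places in a CI/CD pipeline. A minimal sketch, assuming invented stage names and checks:

```python
# MLOps control-point sketch: a gate runs a list of checks between pipeline
# stages and halts the pipeline if any fail. Stage names and checks are
# hypothetical examples, not a particular framework's API.

def control_point(name, checks, artifact):
    """Run each check against the artifact; raise if any check fails."""
    failures = [check.__name__ for check in checks if not check(artifact)]
    if failures:
        raise RuntimeError(f"{name} gate failed: {failures}")
    return artifact


def has_rows(dataset):
    return len(dataset) > 0


def schema_ok(dataset):
    return all({"label", "text"} <= rec.keys() for rec in dataset)


dataset = [{"label": 0, "text": "example"}]
validated = control_point("pre-training", [has_rows, schema_ok], dataset)
print(len(validated))  # 1 -- dataset passed both checks
```

Version and configuration management then pins which data, code and checks produced each model, so a run can be reproduced or audited later.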
The question to answer, Epley said, is: “How do we engineer these processes to be reliable, robust and repeatable?” He said that in producing AI and ML applications, agencies might find they need different data sets at different stages of development. For example, even though personally identifiable information (PII) might be required in a production deployment of an algorithm, training would, for regulatory and ethical reasons, use data with PII removed.
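That split can be implemented by deriving the training set from production records with PII fields stripped out. A simple sketch, assuming hypothetical field names; a real deployment would rely on a vetted de-identification tool rather than a hand-rolled filter:

```python
# PII-stripping sketch: derive training records from production records by
# dropping known PII fields. Field names are illustrative; production
# systems should use an approved de-identification process.

PII_FIELDS = {"name", "ssn", "email"}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record without PII fields."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


production_record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "email": "jane@example.gov",
    "claim_text": "sample claim narrative",
}
training_record = strip_pii(production_record)
print(sorted(training_record))  # ['claim_text'] -- only non-PII fields remain
```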
And, as with DevSecOps, it’s important to bring users in at each development increment to ensure the product is what they actually need.
La Vigne said, “I think it comes down to planning. There’s no substitute for good engineering. We can slap all the AI we want together. But unless you have a good plan, and a good execution of said plan, you can’t use AI to get away from good solid engineering foundations.”