Effective AI depends on an effective data ecosystem

Insight by Guidehouse

Tom Temin

October 27, 2023 12:09 pm

If the ’80s were the era of the database and the 2000s the era of the data lake, you could call right now the era of the data ecosystem. Two phenomena drive this. One, systems and applications generate more data than ever. Two, modern digital services, enhanced with artificial intelligence, use more data from more places than ever.

“Imagine the amount of data that’s generated in federal agencies where you have thousands of employees that are interacting with tens of thousands or hundreds of thousands of customers on a regular basis,” said Guidehouse Partner C. J. Donnelly. “It’s just incredible the amount of data that’s generated.”

The utility of combining data sources runs into laws and regulations designed to protect privacy and security, noted Guidehouse partner Brian Jones. He said these strictures are looser in people’s private lives than in their professional lives. The challenge for companies or federal agencies trying to provide rich digital services, then, becomes, “how do we put rules in place to be able to maintain privacy, but also leverage some of the exciting capabilities that these types of ecosystems present,” Jones said.

Everything from sports team preferences to shopping habits somehow gets ingested by systems people opt into privately, Jones said. The government sector lags in these capabilities, but deliberately so.

“That lag actually is intentional,” Jones said, “as we start to figure out what are those second- and third-order impacts of bringing in data that traditionally has been not integrated.”

Information, not hallucination

If data sources make up the foundation of a data ecosystem, the ecosystem becomes operational through automated workflows that match the needed data to an application while staying consistent with compliance requirements, Jones said.

Using the health domain as an example, Jones said, “Back in the day you used to walk into a doctor’s office and all your information sat in a manila folder on their desk. Now all of it’s in the electronic health record system.” But, he added, that’s not where it ends. The health data ecosystem may also include information from other practitioners, from surveys, and notably from people’s own smart watches and other personal peripherals.

“How do we start to bring all that to a workflow because that now becomes part of the decision-making process,” Jones said.

To help orchestrate it all, you need data scientists and engineers “who understand what’s important, what’s valuable, what’s not valuable, what you need to store, what you need to clean,” Donnelly said. “Because if you just try to ingest all the data that’s out there, it’s too much, it’s not valuable, it confuses the system and you’re not going to get the value in the results that you’re looking for.”

He said this is especially true of large language models, which generate new data that may or may not be useful.

Donnelly recommended asking, “What are these models creating? What data should they be using? What data are they allowed to be using? And how can you take all that information together so that you have a valuable large language model and not one that’s just hallucinating?”

Rather than hallucinate, agencies with well-designed data ecosystems can capitalize on what Guidehouse calls organizational intelligence.

“All of this data that’s been generated, collected and analyzed from an agency or from an organization creates a sort of intelligence that only that specific agency or organization has access to,” Donnelly said. “How can specific organizations use that to the benefit of their employees?”

Work smarter

He said Guidehouse has used a combination of automation and artificial intelligence to help one large agency reduce what he called “caseload complexity” for security clearance. The process pulls in data from lengthy forms such as the SF-86 to sort cases by complexity and expected time needed by adjudicators. Administrators can channel certain cases to more senior adjudicators and generally have the insight they need to distribute the workload more efficiently.
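
For readers who want a concrete picture, the triage step described above could be sketched roughly as follows. This is only an illustration, not Guidehouse's or any agency's actual method: the `Case` fields, the weights in `complexity_score` and the `senior_threshold` value are all hypothetical stand-ins for whatever a real system would derive from form data.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    foreign_contacts: int  # hypothetical field, not an actual form item
    residences: int        # hypothetical field
    flagged_items: int     # hypothetical field


def complexity_score(case: Case) -> int:
    # Made-up weights: a real system would calibrate these on past cases.
    return case.foreign_contacts * 2 + case.residences + case.flagged_items * 3


def route(cases: list[Case], senior_threshold: int = 10) -> dict[str, list[str]]:
    # Rank cases from most to least complex, then split into two queues.
    ranked = sorted(cases, key=complexity_score, reverse=True)
    queues: dict[str, list[str]] = {"senior": [], "standard": []}
    for case in ranked:
        queue = "senior" if complexity_score(case) >= senior_threshold else "standard"
        queues[queue].append(case.case_id)
    return queues


cases = [
    Case("A", foreign_contacts=0, residences=2, flagged_items=0),
    Case("B", foreign_contacts=3, residences=4, flagged_items=1),
    Case("C", foreign_contacts=1, residences=1, flagged_items=0),
]
queues = route(cases)
```

Here the most complex case lands in the senior queue and the rest are ordered by score in the standard queue, which is the kind of workload visibility the administrators are described as gaining.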

“The end result is adjudications happen faster, they’re better, and they’re more accurate,” Donnelly said. “Individuals doing them have a better experience because they eliminate a lot of the manual and tedious work. It’s a win-win for everyone – the agency, the individuals doing the adjudications and the people that are going through the process.”

This is a prime example of how the data ecosystem can support process improvements that augment what people do and thereby improve service to constituents, Jones said.

With respect to AI in general and large language models in particular, “what that’s really doing is bringing the experience and the knowledge of our workforce together over that data,” Jones said. “What we’re doing is creating almost a ‘driver assist’ over the data ecosystem.”

He cautioned agencies to take a crawl-walk-run approach to the use of generative AI and the data feeding these large language models.

“Crawl is looking at how we bring the data together and understand what data these programs are reaching into, and making sure that those data sources are credible,” Jones said, adding that the user can specify what sources a model can and cannot pull in. In the walk stage, the agency adds subject matter expertise, refines the data feeds and tests generated outcomes. In the run stage, the agency has operationalized use of the model, “trusting where the data is coming from, and trusting those algorithms behind the searches to give you the information you can then truly act on.”
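
The crawl-stage idea of specifying which sources a model can and cannot pull in might look something like the sketch below. It is a minimal illustration under stated assumptions: the `Document` type, the `filter_by_source` helper and the source names are hypothetical, and a production system would enforce such rules inside its retrieval layer rather than in a standalone function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # hypothetical label for where the text came from
    text: str


def filter_by_source(
    documents: list[Document],
    allowed_sources: set[str],
    blocked_sources: frozenset[str] = frozenset(),
) -> list[Document]:
    # Keep a document only if its source is explicitly allowed and not blocked.
    return [
        doc
        for doc in documents
        if doc.source in allowed_sources and doc.source not in blocked_sources
    ]


documents = [
    Document("agency_policy_manual", "Guidance on case handling."),
    Document("public_forum", "Unverified commentary."),
    Document("agency_case_archive", "Prior decisions."),
]
vetted = filter_by_source(
    documents,
    allowed_sources={"agency_policy_manual", "agency_case_archive"},
)
```

Only the vetted documents would then be passed to the model, so anything it generates can be traced back to sources the agency has already judged credible.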