USACE data scientists enabling AI, analytics across Army

To do advanced analytics or adopt AI capabilities, agencies first need their data in order. USACE is helping various Army programs do that.

August 14, 2024 3:03 pm



Federal agencies are all trying to figure out the best ways to incorporate generative artificial intelligence into their operations, because large language models (LLMs) are enabling federal employees to conduct back-of-house business processes faster and better. The Army is no different, which is why computer and data scientists at the U.S. Army Corps of Engineers’ (USACE) Engineer Research and Development Center (ERDC) are working to enable those AI capabilities across the Army.

“Large language models are really hot right now. And everyone wants them for a variety of purposes. So we have a group of people who are collecting regulations or guidelines, best practices within certain communities in our organization and putting them within an LLM, optimizing that LLM using those regulations, and then being able to ask that LLM questions to help teach or inform bodies of people,” Cody Salter, a research mechanical engineer at ERDC, said on Federal Monthly Insights — Unleashing Data Insights to Drive Government Innovation. “That’s really an incredibly common task that we see ourselves doing right now for a variety of different collaborators and customers, because that technology in and of itself, that LLM, is just really hot and widely publicized right now. So people are really able to latch on to that. They understand what it is and what it does and how it might be leveraged within their space, so we do a lot of that.”

LaKenya Walker, a computer scientist at ERDC, said there are a number of use cases ERDC is working on. On the civil works side of USACE, document summarization is a common ask. So is knowledge generation: using generative AI (genAI) to produce information for knowledge databases that support training and upskilling newer employees. That’s particularly important as agencies struggle with the loss of institutional knowledge while a significant share of the federal workforce ages toward retirement.

Meanwhile, on the military side, Walker said she sees a significant amount of interest in information summarization. Military systems capture an increasingly large amount of data, too much for human experts to understand and parse — so the Army is looking to AI to supplement them in that task.

Walker said there are two methods for preparing an LLM to work in a domain as specific as the Army’s data. First, they can use the domain-specific data to fine-tune the LLM, but that’s more laborious because all of that data has to be generated. Or second, they can layer retrieval-augmented generation (RAG) over the LLM to query specific knowledge or vector databases.

“So in the RAG, what you’re doing is actually taking your source data, embedding that data, or vectorizing the data and storing it in a vector database,” Walker said on The Federal Drive with Tom Temin. “So instead of your LLM using its training data that’s inside of its parameters, it looks to your vector database with your domain specific embeddings in it to make a judgment call, to give information back.”

So if the Army wants, for instance, to build a predictive model for the failure of a specific part on a vehicle, it would feed maintenance records into an embedding model, which vectorizes the data. Then, when someone queries the LLM, the query goes first to the RAG module, which pulls the relevant vectorized data. The module adds that context to the query, then feeds the combined prompt to the LLM for output.
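The flow Walker describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ERDC's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, `VectorStore` is a hypothetical in-memory substitute for a vector database, the maintenance records are invented, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory substitute for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, original text) pairs

    def add(self, text):
        self.items.append((embed(text), text))

    def search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(query, store, k=2):
    """Retrieve relevant records and prepend them to the query as context."""
    context = "\n".join(f"- {doc}" for doc in store.search(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Invented maintenance records, for illustration only.
records = [
    "Truck 12: alternator replaced after charging failure at 41000 miles",
    "Truck 07: routine oil change and tire rotation",
    "Truck 12: alternator belt worn, charging voltage low",
]

store = VectorStore()
for record in records:
    store.add(record)

# The augmented prompt is what would be sent to the LLM.
prompt = build_prompt("Why does the alternator on truck 12 keep failing?", store)
print(prompt)
```

In this sketch the unrelated oil-change record is left out of the prompt, so the model would answer from the two alternator records rather than from its training data alone.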

Walker said genAI capabilities also let users control how the LLM’s output is presented, so it arrives in the format that’s most useful or comprehensible to them. That could mean tailoring the output to a less technical audience by avoiding highly specific jargon. Walker said the Army particularly favors output in the form of web-based tools like dashboards.

But ERDC doesn’t involve itself in the output or visualizations that often. Instead, Walker and Salter said they focus more on getting their internal customers set up with the tools and infrastructure they need to do it themselves.

“Generally, we have been in the business of providing capabilities at the front end of the data science lifecycle, if you will: the data infrastructure, getting data in a format that’s conducive to doing analytics, to provide an environment for folks to come in to do the analytics and the visualizations to support their use cases,” Walker said. “Generally, we’re not the ones at the latter end of that life cycle pipeline where people are doing the visualizations. We allow users from different organizations coming in to do their analytics and perform their own type of visualization development.”

One challenge they run into is that their customers often need multiple types of data to reach the results they need. That can mean translating outdated code from legacy applications into formats that are proprietary or that support analytics more generally. Or it could mean repurposing older data based on the use case.

“The data source is really dependent upon the need, like the analytic need. So when we sit with a customer or a collaborator, we always want to start with a really basic understanding of ‘what is the problem that you’re trying to solve?’” Salter said. “And that is what’s going to then kind of dictate out to us the various data sources that might be required in order to answer that problem. And then, even furthermore, you may need it, but it may not be available to you. So what’s the need? And then what’s available to you to answer that question? And hopefully those two things align, because if they don’t, then there could be problems.”

But working on the front end, on the infrastructure and data formats, also allows ERDC to be more versatile. Salter said skills, tools and methods transfer across specific domains and disciplines.

“We really think that data problems have an agnostic nature to them, like a data problem is a data problem is a data problem, no matter what the kind of domain or application it is,” he said.