Study finds major shortcomings in Air Force processes to test AI technologies

The authors emphasized that the challenges in assessing whether an AI-enabled system will work as intended are not unique to the Air Force, but are common across federal agencies.

The Air Force has big ambitions for incorporating artificial intelligence into warfighting. But there’s one major problem: As of now, the service doesn’t have the processes or infrastructure to test and evaluate AI with anything close to the rigor it has long demanded of its less intelligent weapons systems.

That assessment is one of the key findings of a more than year-long study by the National Academies of Sciences, Engineering and Medicine (NAS), whose authors emphasized that the challenges in assessing whether an AI-enabled system will work as intended are not unique to the Air Force, but rather are common across federal agencies.

One of the biggest challenges the Air Force faces is that its test and evaluation infrastructure is designed to put physical weapons systems through their paces at defined points in time, before they’ve been fielded. In that process, once a bomber or fighter has been deemed suitable and effective for its missions, it’s turned over to the operational community.

But that’s simply not how AI works, said May Casterline, a co-chair of the committee that conducted the study at the Air Force’s request. Instead, testing and evaluation has to keep happening for as long as the system is being employed in the real world.

“AI constantly needs to be retrained with new data that it sees out in the field. It responds and changes to what you’re asking it to do by being retrained with new data in a continuous retraining cycle, and you really want that to happen at the pace at which things are changing on the ground,” Casterline, who is also a data scientist at Nvidia, said in an interview with Federal News Network. “You have to have a test infrastructure that can keep up with that pace, but the current approach is much more serial, with defined milestones that take longer timelines to execute. And you just will not be able to adapt to the changes in operations as fast as you would like using those mechanisms.”
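In practical terms, the cycle Casterline describes amounts to a loop of collecting field data, retraining, re-testing and only then redeploying. As a rough, hypothetical sketch only (the function names, thresholds and toy model below are invented for illustration and are not drawn from the study or from Air Force practice), such a loop might look like this:

```python
# Hypothetical sketch of a continuous retrain-and-retest cycle.
# collect_field_data, retrain, evaluate and ACCURACY_FLOOR are illustrative
# stand-ins; a real pipeline would plug in actual data feeds, models and metrics.

import random
from dataclasses import dataclass

ACCURACY_FLOOR = 0.90  # minimum score a retrained candidate must hit before redeployment


@dataclass
class Model:
    version: int
    accuracy: float


def collect_field_data(batch_size: int = 100) -> list[float]:
    """Stand-in for new operational data arriving from the field."""
    return [random.random() for _ in range(batch_size)]


def retrain(model: Model, data: list[float]) -> Model:
    """Stand-in for retraining: produces a new candidate model version."""
    new_accuracy = min(1.0, model.accuracy + random.uniform(-0.02, 0.03))
    return Model(version=model.version + 1, accuracy=new_accuracy)


def evaluate(model: Model, holdout: list[float]) -> float:
    """Stand-in for the T&E step: score the candidate against curated test data."""
    return model.accuracy  # a real evaluation would measure behavior on the holdout set


def continuous_te_cycle(cycles: int = 5) -> Model:
    deployed = Model(version=1, accuracy=0.92)
    for _ in range(cycles):
        field_data = collect_field_data()
        candidate = retrain(deployed, field_data)
        score = evaluate(candidate, holdout=field_data)
        if score >= ACCURACY_FLOOR:
            deployed = candidate  # redeploy only after passing the test gate
        # otherwise keep the currently fielded model and flag the failure for review
    return deployed


if __name__ == "__main__":
    print(continuous_te_cycle())
```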

The recognition of the need to treat AI differently for testing and evaluation purposes isn’t strictly new. The National Security Commission on Artificial Intelligence also cautioned in its final 2021 report that DoD would need to adapt its approaches “so that when AI systems are composed into systems-of-systems their interaction does not lead to unexpected negative outcomes.”

But the NAS study emphasizes that producing test results officials can feel confident about will require sweeping changes across the Air Force’s test and evaluation infrastructure, including more funding to model and simulate how algorithms might behave when they encounter new data, new ways to curate and collect that data into the T&E “pipeline,” a major emphasis on understanding human-machine interfaces in the military context, and a workforce that understands how to test and employ AI models.

And none of that is likely to happen unless the Air Force appoints a very senior official — a “champion,” as the study puts it — who has the authority to lead those changes. That person needs to be a general officer or senior executive, the NAS panel found.

“The original questions the Air Force asked us to study were fairly narrow, but it turns out that T&E is so pervasive that when you really start to pick apart the implications of testing, evaluation and operationalizing AI, you realize how many places within the department will have to get involved and change and modify,” Casterline said. “So you really need a single person who has the responsibilities, authorities and liabilities to add rigor to the T&E approaches across the department.”

The NAS study found the Air Force and the broader DoD can likely learn lessons from the commercial sector. After all, it’s not as though no one has tried to incorporate real-world operational data into a continuous testing and improvement cycle. Automakers, for example, are doing that right now for AI-enabled safety-critical systems, using an approach sometimes called “AIOps,” similar in concept to the DevSecOps philosophy used for continuous security improvements in software.
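As a purely illustrative sketch of that AIOps/DevSecOps analogy (the check names and thresholds below are invented, not taken from the study or from any automaker’s pipeline), an automated release gate for a retrained model might look roughly like this:

```python
# Hypothetical illustration of the AIOps analogy: just as DevSecOps pipelines gate
# code changes behind automated security checks, an AI T&E pipeline could gate each
# retrained model behind automated evaluation checks before it is fielded.

from typing import Callable

# Each check takes a dict of measured model metrics and returns pass/fail.
Check = Callable[[dict[str, float]], bool]

RELEASE_CHECKS: dict[str, Check] = {
    "accuracy_floor": lambda m: m["accuracy"] >= 0.90,
    "robustness_floor": lambda m: m["accuracy_under_noise"] >= 0.85,
    "latency_ceiling_ms": lambda m: m["p99_latency_ms"] <= 50,
}


def release_gate(metrics: dict[str, float]) -> bool:
    """Run every check; a candidate model ships only if all of them pass."""
    failures = [name for name, check in RELEASE_CHECKS.items() if not check(metrics)]
    for name in failures:
        print(f"blocked: {name} failed")
    return not failures


if __name__ == "__main__":
    candidate_metrics = {"accuracy": 0.93, "accuracy_under_noise": 0.88, "p99_latency_ms": 42}
    print("release approved" if release_gate(candidate_metrics) else "release blocked")
```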

But there are major differences in the Defense space that keep those approaches from mapping directly onto military systems, so to a large extent, the Air Force and the rest of DoD will need to reinvent their testing and evaluation infrastructure to deal with the AI challenge, the authors found.

“Commercial industry has solved a lot of the technology hurdles that are going to be required. There are core components and practices that can be looked at as parts of the blueprint, and there are examples of AIOps in industry,” Casterline said. “But there are areas where DoD-specific deployments start to break that model.”

For instance, the kinds of data an AI-enabled system might ingest in the real world are likely to come from parts of the globe where it’s not easy to simply upload them to a central location for further analysis, as a commercial company might.

“That breaks with sort of the traditional commercial model that backhauls everything to a cloud, perhaps, and does it all in house. The next place it gets challenging is that there are really extreme size, weight and power constraints that are very unique to DoD because of the ruggedization and environmental conditions. There are also a lot of security requirements. And there’s also a lot of bespoke phenomenology that has to be modeled to really create the simulation capability to deal with edge cases and retraining of these models for rare events. You won’t get that from commercial, but you can certainly look at them as an example and then have the department and the industrial base invest in those gaps to make that model work within the [defense] ecosystem,” Casterline said.
