Why Congress, DoD should focus on training data platforms to make AI tools more valuable

All artificial intelligence, whether it be for fighting coronavirus or fighting future drone swarms, currently depends on one thing: quality training data. Good data means the difference between missing a moving target and hitting it squarely between the eyes. And developing good data requires a training data platform (TDP), software designed to manage vast amounts of data so that it can be read by AI systems.

While data scientists across the government and private sector know this well, it is imperative that Congress and senior military leaders understand it, too, because collecting and preparing quality training data takes time and money. Allocating time and money, meanwhile, requires informed leaders.

The Pentagon’s Joint Artificial Intelligence Center (JAIC) is creating a platform that will provide Defense Department data scientists access to datasets, code libraries and other certified platforms to speed development and deployment of AI-enabled systems.

The National Security Commission on AI, meanwhile, has recommended that Congress establish a National AI Research Resource that would include a searchable collection of datasets available for the development of machine-learning models for national security solutions.


Both of these initiatives are critical if this country hopes to maintain its advantage over adversaries, particularly China, because AI will likely determine which country wins in the economic realm and the national security realm.

There are three basic components to AI as we know it today: algorithms, data and computational power.

Algorithms are largely in the public domain.

Computational power is ubiquitous thanks to cloud providers such as Amazon Web Services, which allow anyone with a credit card and an Internet connection to access massive collections of high-speed computers from anywhere in the world.

Data, particularly labeled data, is the most critical and most proprietary piece in AI systems.

There are many kinds of AI, but currently the most effective – what almost everyone is talking about when they say “AI” – is supervised learning. In supervised learning, networks of algorithms, written in massive blocks of computer code, are taught what patterns they should recognize, whether it be enemy camps in drone footage or signs that a truck is about to break down.

To teach the algorithms what to look for, data scientists feed them tens of thousands, even millions, of data points carefully labeled by humans. After seeing thousands of military encampments, for example, along with thousands of images that may look like military encampments but are not, the algorithms become expert at spotting the real thing faster and more accurately than humans.
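To make the idea concrete, here is a minimal sketch of supervised learning in Python. It is an illustration only: it uses a simple nearest-centroid classifier rather than the neural networks used in practice, and the two numeric features per image and the label names are hypothetical stand-ins for real image data.

```python
from collections import defaultdict


def train_centroids(examples):
    """Learn one average feature vector (centroid) per human-assigned label.

    examples: list of (features, label) pairs, where features is a list of floats.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new, unlabeled example."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled data: each image is reduced to two made-up features.
LABELED = [
    ([0.9, 0.8], "encampment"),
    ([0.8, 0.9], "encampment"),
    ([0.2, 0.1], "not_encampment"),
    ([0.1, 0.3], "not_encampment"),
]

MODEL = train_centroids(LABELED)
```

Calling `predict(MODEL, [0.85, 0.9])` classifies a new example by comparing it with what the labeled examples taught the model. The quality of that answer depends entirely on the quality of the labels in `LABELED`.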

Preparing data for machine intelligence requires a TDP built to keep thousands, tens of thousands or millions of data files organized with an intuitive interface that follows many of the conventions of consumer software. It coordinates access to that data by hundreds or thousands of human labelers.

But a good TDP does much more: It gives data scientists the ability to discover bias in datasets and correct it. For example, by monitoring imbalances in the dataset, it can alert data science teams to the need to collect more data on so-called corner cases, relatively rare situations that algorithms should nonetheless learn to recognize.
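The imbalance check described above can be sketched in a few lines of Python. This is an assumption about how such a check might work, not a description of any particular product; the function name, the label names and the 5 percent threshold are all hypothetical.

```python
from collections import Counter


def find_underrepresented(labels, min_share=0.05):
    """Return labels whose share of the dataset falls below min_share.

    labels: one label per data point. min_share is a hypothetical threshold.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: count / total
        for label, count in counts.items()
        if count / total < min_share
    }
```

Given 97 examples labeled `truck` and 3 labeled `truck_on_fire`, the function flags `truck_on_fire` as a corner case that needs more collected data.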

A good TDP itself learns what to look for in the data and pre-labels data so labelers need only verify accuracy, speeding the process. It allows for easy training of data labelers and provides quality control features that can identify labelers who are making errors and require more training. And a good TDP allows version control of datasets and creates an audit trail so that data science teams can roll back the dataset if its accuracy drifts, or spot where problematic changes occurred.
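One plausible way to implement the quality control step is to score each labeler against a small set of items whose correct labels have already been verified. The sketch below assumes that approach; the function names, the data layout and the 90 percent accuracy threshold are hypothetical.

```python
from collections import defaultdict


def labeler_accuracy(submissions, verified):
    """Score each labeler on the items that have a verified label.

    submissions: list of (labeler_id, item_id, label) tuples.
    verified: dict mapping item_id to the verified label.
    """
    correct = defaultdict(int)
    checked = defaultdict(int)
    for labeler, item, label in submissions:
        if item not in verified:
            continue  # no verified answer to compare against
        checked[labeler] += 1
        if label == verified[item]:
            correct[labeler] += 1
    return {labeler: correct[labeler] / checked[labeler] for labeler in checked}


def needs_retraining(submissions, verified, min_accuracy=0.9):
    """List labelers whose accuracy on verified items is below min_accuracy."""
    scores = labeler_accuracy(submissions, verified)
    return sorted(
        labeler for labeler, score in scores.items() if score < min_accuracy
    )
```

Labelers with no overlap with the verified set are left unscored rather than penalized, since there is nothing to judge them against.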

China will soon produce more data each year than any other country, according to international market research firm International Data Corp. Thanks to its “military-industrial fusion,” much of the data collected by pervasive commercial services is available to the country’s national security establishment.

The JAIC’s Joint Foundation Center is a good start toward countering that advantage. The NSCAI’s recommendation to create a National AI Research Resource would be an even bigger step.

But the national security establishment does not only need data that is searchable and accessible rather than siloed; it also needs to label that data appropriately. The JAIC knows how to do this, as do other discrete teams throughout the Defense Department. But the intelligence community remains married to sophisticated labeling protocols that are not machine readable and not useful for AI models.

There are roughly 18,000 analysts in the U.S. intelligence community, many of whom peruse carefully labeled data that has been collected for decades. But 18,000 analysts are not enough to capture the insights coming from all of the data being collected today.

Satellites capture images of every point on Earth daily. Thousands of manned surveillance flights and unmanned drones record images and video feeds from all over the world, some with a resolution as fine as a few centimeters. Fused with chat logs, phone intercepts, radio traffic and emails, this data can give the U.S. remarkable, near real-time visibility into what is happening in the world. AI tools now exist to scan all of that data and flag anomalies, narrowing the space for human analysts to focus on.

But in order to train AI systems to do the work, a subset of that data needs to be appropriately labeled – not for humans, but for machines. Already, the Defense Department is labeling drone footage for AI. Project Maven is the best-known effort.

But the intelligence community continues to work with electronic light tables to produce data that, while in many ways more sophisticated than standard AI data, is not consumable by AI systems. The national security establishment would benefit greatly from a machine-readable labeling protocol that would fit unobtrusively into the intelligence community’s current practice.

The pre-labeling feature of a good TDP could be adapted to take human-labeled data from the intelligence community’s electronic light tables and pre-label it for AI systems, making use of the decades of legacy data labeled for human analysts.

Quality labeled datasets are the key to the accuracy of AI systems. The national security establishment needs a uniform labeling process to ensure datasets meet quality standards that will make U.S. AI systems as accurate as possible. Congress should adopt the NSCAI’s recommendations for a National AI Research Resource in the 2020 National Defense Authorization Act and standardize data labeling across the U.S. government.

Manu Sharma is co-founder and CEO of Labelbox, an AI platform development company, and an aerospace engineer.