Soldiers have long talked to their weapons, mechanical things that become almost part of you. But talking to weapons will take on a whole new meaning under a project underway at the Army Combat Capabilities Development Command (DEVCOM). Researchers there are exploring how natural language processing can help soldiers interact with robotic and autonomous platforms expected to become more important in future combat situations. For how all this could work, Federal Drive with Tom Temin turned to DEVCOM researcher and computer scientist Felix Gervits.
Tom Temin: Mr. Gervits, good to have you on.
Felix Gervits: Thanks, Tom. Thanks for having me.
Tom Temin: So the goal is for what to happen here? Actually, what is the outcome that you’re working toward – let’s start there.
Felix Gervits: The ultimate goal of the research is we’re trying to support a concept called Heads Up Hands Free. And this is a concept in which speech kind of serves as the main interface in teams of soldiers and robots. So for example, instead of using a joystick to manually control a robot, a soldier can give it voice commands using what’s known as natural language. So basically, this means unrestricted communication, right? So soldiers can phrase their commands to the robot in any way. And they don’t have to memorize specific instructions, or read any kind of operating manual to learn how to interact with the robot. By freeing up the hands, this can greatly improve soldier safety, that’s the ultimate goal. It’s kind of like if you’re trying to cross the street, and you’re looking at your phone, it’s extremely distracting. So this is more so in combat environments. So we’re kind of trying to reduce the need for that and enable this kind of speech-based interface.
Tom Temin: Interesting. So is there anything commercially out there? We have all these voice recognition systems for different devices, from various big tech vendors. Is any of that technology applicable here? Or are you going in some totally new direction?
Felix Gervits: Yeah, that’s a good question that we actually get a lot. So we’re going in a different direction. Commercial technologies, some of what you might be familiar with, they’re more aimed at helping people to perform simple tasks. So things like setting your alarm clock or ordering stuff online. In these systems, communication is usually one-directional. So the person gives an instruction and the system carries it out. On the other hand, our system is designed to enable more back-and-forth dialogue between soldiers and robots. So specifically for collaborative applications. So this can support more complex kind of dialogue. So things like feedback and clarification requests and things like that, that the commercial systems are not designed to support.
Tom Temin: And what is the essential technological challenge then to overcome in devising this kind of – I guess it’s speech recognition or processing?
Felix Gervits: Yeah, I mean, there are quite a few. So one of the challenges is collecting naturalistic language data in military-relevant tasks, so that the system can actually use these data. So what the system actually does is it has to interpret the intent of a soldier. And to do this, the system has to be trained. So this means basically exposed to hundreds of example utterances that capture what someone said, and then map it to what they meant. So in other words, what the robot should do, like ask a question or perform a task or something like that. And since we’re interested in dialogue for military applications, we collected our data from people performing a search-and-rescue task. So as you might imagine, it’s quite time consuming and costly to collect and process such data so that the system can actually learn from it. So that’s one major challenge. In terms of the technical hurdles, I mean, the big one, like you said, is natural language understanding. So the robot needs to interpret the intent of a soldier’s commands, and then translate that to an action that it can perform. So you know, in the case that the command can’t be done, you know, it’s unactionable for whatever reason, then the system needs to explain that to the soldier, explain that it can’t do the action, and then perhaps give an explanation as to why. This can be straightforward for basic commands, so something like you tell it to move forward 5 feet – it can probably handle that pretty easily. But it can get complicated when the soldier refers to, for example, landmarks in the environment. So a command like, turn left at the second intersection, for example, is a pretty hard one. So we’ve overcome some of these challenges, but others remain active areas of research.
Tom Temin: And I imagine another challenge as this would be deployed in the future is that every platform is somewhat different. So if something has tracks, if something has walking legs, if something has small wheels, they can do different things in different terrains. And so there would have to be some sort of a device-specific processing system for each individual platform.
Felix Gervits: Yeah, that’s definitely the case. I mean, right now, we’re mainly focused on robots, although, as you said, there are different kinds of robots like humanoid robots that can perhaps grasp objects or autonomous vehicles that can drive in different kinds of terrain. So yeah, that is, again, one of the challenges, to kind of link the capabilities of the system to each particular platform that we use.
Tom Temin: We’re speaking with Dr. Felix Gervits, he’s a researcher and computer scientist at the Army Combat Capabilities Development Command. And I guess, different capabilities for sight, for example, that is machine vision, would come into this. Suppose you tell something to move 10 yards ahead, but there’s something in the way. And you would want to tell it, well, go around that rock and then keep going. For example, I’m just making this up. But that would be a lot of processing between that and just simply moving forward 10 yards.
Felix Gervits: Yeah, absolutely, that is definitely a challenge. And the robot will need to create some kind of map of its environment, which it can do through various sensors, and then integrate that into its kind of interaction model. In the ideal case, the robot will be able to know that there is some kind of obstacle ahead. And if you tell it to move forward, it might say that it can’t do that, because there’s something blocking it. And perhaps it can even come up with an alternative plan to get around that object. But I mean, this is very much work in progress. And it’s a very difficult challenge that nobody has really solved in the field.
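Editor's illustration: the behavior described above, where the robot checks its map before moving and explains why a command cannot be carried out, can be sketched in a few lines. This is a toy under stated assumptions, not the DEVCOM system: the grid map, the `check_move_forward` function, and the wording of the replies are all hypothetical, and "forward" is simplified to mean moving right along one row of the grid.

```python
def check_move_forward(grid, start, steps):
    """Decide whether a 'move forward' command is actionable.

    grid  : list of equal-length strings; '#' is an obstacle, '.' is free
            (a hypothetical stand-in for a sensor-built map)
    start : (row, col) of the robot
    steps : number of cells to move; 'forward' means increasing column
    Returns (actionable, spoken_reply).
    """
    row, col = start
    for i in range(1, steps + 1):
        target = col + i
        if target >= len(grid[row]):
            # The command runs off the known map.
            return False, f"I can't do that: my map ends {i - 1} step(s) ahead."
        if grid[row][target] == "#":
            # An obstacle lies on the path; say so instead of silently failing.
            return False, f"I can't do that: something is blocking me {i} step(s) ahead."
    return True, "Moving forward."


world = [
    "..#.",
    "....",
]
print(check_move_forward(world, (0, 0), 3))
print(check_move_forward(world, (1, 0), 3))
```

Proposing an alternative route around the obstacle, which the answer above describes as unsolved work in progress, is deliberately left out of the sketch.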
Tom Temin: Sure. So how do you go about the day-to-day work of solving it? That is to say, do you bring in actual soldiers and have them talk to these systems? I guess it’s sort of a breadboard-type of situation at this point. I mean, how do you actually conduct the research itself?
Felix Gervits: There are a lot of parts to the research, actually. We have done some work with soldiers in the past. We also recruit typical participants from college campuses and elsewhere. One part of the research is just collecting the data, like I said, to train the system. So we have people performing various kinds of tasks. In this case, it was a search and rescue task. And some of the participants, I think a whole 20% of them were actually soldiers. So we did get some of that data. And we’re basically learning how people talk to robots, how people instruct robots, and then training the system on that kind of data. So that’s one part of it. And then, another side is the actual implementation part, designing the algorithms that can support all this kind of dialogue processing. So natural language understanding, dialogue management, text-to-speech, and various other technologies have to all come together. So there’s a major kind of engineering challenge there as well.
Tom Temin: Plus, you also have the environment that you would typically be working in, which could be noisy, it could be a lot of spiky noise of weapons fire, and so forth. Very different from when someone’s in their kitchen, talking to one of those stupid hockey pucks.
Felix Gervits: Yeah, absolutely. So speech recognition is the first step in the process. That’s kind of critical, it enables the robot to receive the speech inputs and understand the intent. So in noisy environments, like you’d see in the military, this is definitely a challenging problem for any system. So, one way around this is potentially to use a headset to isolate speech, so that you’re kind of ignoring most of the noise, the external noise. Fortunately, this is not uncommon in military environments anyway, for people to use headsets. But then you also get other challenges – things like poor network access, bandwidth limitations, and in some cases, even contested networks. So often these systems have to make do without any kind of internet access, or with only limited access, which obviously creates additional challenges, given that some of the component technologies are cloud-based and rely on an active internet connection. Another challenge is also being able to stream video data. So this makes it difficult to establish a shared understanding of the environment, for example, if teammates are remotely separated, and need to exchange information. So this is why back-and-forth dialogue is so useful to kind of assist in this process.
Tom Temin: Is there any similar research going on in environments that are sort of like military, I’m thinking, say, airport ramps, and ground facilities near terminals, or inside of factories, or warehouses where there’s autonomous vehicles moving here and there? Is there anything from that end of industry that is remotely applicable here?
Felix Gervits: There is definitely a lot of work in the robotics space in these kind of industrial applications and airports, air control, things like this – kind of the unique attribute of our research is that we’re focused on this natural language interaction. So being able to interface with these machines through speech. So a lot of times you don’t see that. A lot of times you get more kind of scripted, how do you program a robot to perform this assembly task, which is not a trivial problem at all. But it’s a different problem space than how do you actually communicate with a robot and get it to understand your intentions and then carry them out? So in that sense, our work is more related to some stuff happening in academia at the various universities.
Tom Temin: And where are you in this research? I mean, could something be deployed in 10 years, five years, next week?
Felix Gervits: It’s hard to get a sense of the timescales here. We’re working on very preliminary kind of basic research, kind of developing the technologies that could enable such systems in the future.
Tom Temin: And just give us a brief description of the system that you’ve built so far that you’re working on here.
Felix Gervits: The system that we’re working on is called JUDI, or it stands for the Joint Understanding and Dialogue Interface, and it’s what’s generally called a dialogue system or a conversational agent. So the idea is that it can be embedded in various devices. The current application is, of course, robots. But it can apply to, you know, smart sensors and basically any kind of computer system. And the idea is that JUDI is designed to process spoken language instructions that are directed to a robot, derive the core kind of intent of the instruction from the words that were spoken, and then decide what the robot should do. So effectively it enables robots to talk and act by managing the communications between robots and soldiers. And JUDI uses a technique that we call intent retrieval. So this was originally developed by our collaborators at the USC Institute for Creative Technologies in California. And the brain of this intent retrieval component is what’s known as a statistical language classifier. So this is what receives the speech input and determines how to respond. So kind of a good analogy, a good way of thinking of the classifier is kind of like a language translation system, which converts from one language to another. So for example, something like a Spanish-to-English translation. But in this case, the idea is that we’re translating from the language of input commands to system responses. So as an example, you know, if the system gets a command from the soldier – I don’t know, like, turn around or something – the idea is that it finds the best translation to a corresponding response to that command, which in this case, might be the action to turn the robot around. But in other cases, it could be a clarification question or something else. So again, because the classifier is trained on the data from the search and rescue task that we performed, it’s able to process new commands that it hasn’t encountered before in the training data. And that’s kind of what makes it effective.
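Editor's illustration: the idea of "translating" a spoken command into a system response can be sketched with a toy retrieval step. This is not JUDI's classifier, whose internals are not described in the interview. As a stand-in, the sketch uses plain bag-of-words cosine similarity against a handful of example utterances. The training pairs, the response labels, the `retrieve_response` function, and the 0.3 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical training pairs: (what someone said, what the robot should do).
TRAINING = [
    ("turn around", "ACTION:rotate_180"),
    ("spin around and face the other way", "ACTION:rotate_180"),
    ("move forward five feet", "ACTION:move_forward"),
    ("go ahead a few feet", "ACTION:move_forward"),
    ("take a picture", "ACTION:take_photo"),
    ("send me a photo of what you see", "ACTION:take_photo"),
]

# Fallback response when nothing in the training data matches well enough.
CLARIFY = "ASK:please_rephrase"


def tokens(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_response(command, threshold=0.3):
    """Return the response paired with the most similar training utterance,
    or a clarification request when the best match is too weak."""
    query = tokens(command)
    best_score, best_response = 0.0, CLARIFY
    for utterance, response in TRAINING:
        score = cosine(query, tokens(utterance))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else CLARIFY


print(retrieve_response("please turn around now"))
print(retrieve_response("move forward 5 feet"))
print(retrieve_response("hello robot"))
```

Because the match is by similarity and not exact wording, a phrasing that never appears in the examples can still map to the right response, and a command with no good match produces a clarification request instead of a guess. A trained statistical classifier would generalize much further than this word-overlap stand-in.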
Tom Temin: And of course, again, that gets back to platform specific. You would have to have an instance of JUDI, depending on the capabilities and operations that the robotic platform can do in the first place.
Felix Gervits: So the platform itself is generally domain independent, in terms of the components and how they’re connected. What you do need for each different platform or different environment is perhaps different training sets, or you might need to build on top of the existing training data. So if you’re trying to get a robot to pick up objects in a kitchen or something, training it on a search and rescue task is obviously not going to allow it to do that. So you might need to expand your training domain a little bit more for that.
Tom Temin: And briefly, the search and rescue that you have been kind of using as the test situation, what does that mean in terms of a robot, because you think of search and rescue as soldiers going and grabbing somebody by the shoulders and getting them the heck out of there.
Felix Gervits: Yeah, I mean, so this is an example of a kind of dangerous environment, where you may not want to send people into. It could be a war zone, for example, or something, or some kind of, you can imagine a disaster relief scenario, where it could be radiation or other hazards. So sending a robot is ideal because it costs money, but it’s nowhere near the cost of a human life. So the robot can kind of search around the environment and communicate back to a remotely located commander. And they can kind of engage in this dialogue to kind of build up a shared understanding of the scene and help to maybe locate survivors or whatever targets it’s searching for.
Tom Temin: And just a personal question, you are relatively young, you have a pretty heavy duty academic background in some of the top schools – University of Pennsylvania, Tufts and so forth – what motivated you to want to work on military applications? And what’s it like working in the military situation?
Felix Gervits: I find it really interesting. To be honest, it’s very similar to what I was doing in academia, in terms of being funded by some kind of DoD organization in a university is not dissimilar from working for that organization specifically. I like that I’m focused on basic research problems that can help enable these technologies in the future.
Tom Temin: On some level you must feel an affinity for the military mission?
Felix Gervits: Oh, yeah, absolutely. I think it’s extremely rewarding, I think to be able to work in this problem space, and help solve problems that are of relevance to national security of the country. I take pride in that.
Tom Temin: Felix Gervits is a researcher and computer scientist at the Army Combat Capabilities Development Command (DEVCOM). Thanks so much for joining me.
Felix Gervits: Thanks, Tom. Thanks for having me.