Amazon Alexa: How developers use AI to help Alexa understand what you mean and not what you say

The developers and engineers on the Alexa Smart Home team use a variety of mechanisms, including AI and ML, to help Alexa better understand your voice requests.

How does Amazon help Alexa understand what people mean and not just what they say? That’s the subject of this week’s Dynamic Developer podcast. And, we couldn’t be talking about Alexa, smart home tech, and AI at a better time. During this week’s Amazon Devices event, the company made a host of smart home announcements, including a new batch of Echo smart speakers, which will include Amazon’s new custom AZ1 Neural Edge processor.

In August this year, I had a chance to speak with Evan Welbourne, senior manager of applied science for Alexa Smart Home at Amazon, about everything from how the company is using AI and ML to improve Alexa’s understanding of what people say, to Amazon’s approach to data privacy, the unique ways people are interacting with Alexa around COVID-19, and where he sees voice and smart tech heading in the future.

And if you’re interested in seeing what’s inside some previous Amazon Echo devices, check out my cracking open teardown of the original Echo and the Amazon Echo Show at CES 2019.

The following is a transcript of our conversation, edited for readability.

Bill Detwiler: So before we talk about maybe IoT, we talk about Alexa, and kind of what’s happening with the COVID pandemic, as people are working more from home, and as they may have questions that they’re asking about Alexa, about the pandemic, let’s talk about kind of just your role there at Amazon, and what you’re doing with Alexa, especially with AI and ML.

Evan Welbourne: Yeah. Absolutely. So I lead machine learning for Alexa Smart Home. And what that sort of means generally is that we try to find ways to use machine learning to make Smart Home more useful and easier to use for everybody that uses smart home. It’s always a challenge because we’ve got the early adopters who are tech savvy, they’ve been using smart home for years, and that’s kind of one customer segment. But we’ve also got the people who are brand new to smart home these days, people who have no background in smart home, they’re just unboxing their first light, they may not be that tech savvy. And so a lot of the work is kind of trying to support that end to end smart home experience from unboxing the light, to configuring it, setting it up, configuring your Smart Home groups and spaces, things like that.

And that embodies a number of different features, so there’s things like Alexa Hunches, which we launched a couple years ago and continue to refine. That’s where we’re kind of identifying anomalies in the home and letting the customer know about them, sort of checking that they’ve got their back door locked, that one time they forget it every few months. There’s Alexa Guard, which is another feature that’s about keeping their home safe. We built this algorithm that turns lights on and off in a home to make it look like they’re home when they’re away from home, kind of a fun one.

And then there’s this other feature, we might talk about it a little more later, that we’re calling “Did You Mean?” And it’s really not a named feature necessarily for customers, but it’s just something to help customers get through that basic experience of controlling their smart home after they’ve got it set up. And we’re trying there to help Alexa understand what the customer means, not just what they say.

Download: Cheat sheet: How to become an Alexa developer (free TechRepublic PDF)

Helping Alexa understand what people mean vs. what they say

Bill Detwiler: I think that’s really important because the promise of artificial intelligence and machine learning is really I think all about prediction. That’s really what you’re trying to get to, is let’s feed this a bunch of data. Let’s look for connections. And then how can we do something with that information? How can we make a prediction based on that information, or a reaction, or take some next action after something happens? And so the “Did You Mean?” feature I think helps conceptualize that for a lot of, I guess, consumers that may be thinking about, “Why do I need AI and ML? How is this really going to apply to me in the real world?” Maybe talk a little bit about how that kind of has been part of your thinking as you look at AI and machine learning. How can we use this to help consumers by using the predictive nature of these algorithms take some action?

Evan Welbourne: Yeah. Yes, it’s a great question. And I think “Did You Mean?” is a great example feature because it’s pretty simple. Right? It’s addressing problems that come up with the most basic control situation in the home using Alexa, like the customer may say, “Alexa, turn on the lamp.” And what often happens is that there’s not an exact match between the word lamp and all the stuff the customer has in their registry. There may be a reading light, or there may be a red bedroom lamp, or overhead lights. We’ve got to figure out what it is that the customer actually means, not just what they said.

Image: The 2020 Amazon Echo ($99.99). Credit: Amazon

And so the thing that’s so interesting about “Did You Mean?” is that it’s such a simple feature, and on the one hand, you can do that basic string match between lamp and reading lamp. That gets you sort of partway there. But if you think about all the different situations and all the different types of ambiguity that come up in that basic voice control scenario, you really start realizing you need that heavy hammer. You need that machine learning type of algorithm to just make that experience super simple and super natural for the customer.
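
To make the contrast concrete, here is a minimal hypothetical sketch of the “basic string match” half of the problem: resolving a spoken name like “bedroom lamp” against a customer’s device registry with an exact match first and a fuzzy fallback. This is not Amazon’s implementation; the device names and threshold are illustrative assumptions, and the ML approach Welbourne describes would layer semantic matching and context on top of something like this.

```python
# Illustrative sketch only -- not Amazon's "Did You Mean?" implementation.
# Resolve a spoken device name against a registry by exact match first,
# then by fuzzy string similarity, and defer to the customer when unsure.
from difflib import SequenceMatcher

def resolve_device(spoken_name, registry, threshold=0.6):
    spoken = spoken_name.lower().strip()
    # 1. Exact match handles the easy case ("reading light" == "reading light").
    for device in registry:
        if device.lower() == spoken:
            return device, 1.0
    # 2. Otherwise score every registered device by plain string similarity.
    scored = [(device, SequenceMatcher(None, spoken, device.lower()).ratio())
              for device in registry]
    best_device, best_score = max(scored, key=lambda pair: pair[1])
    # 3. Below the threshold, return no match so the assistant can ask
    #    "Did you mean ...?" instead of guessing.
    if best_score >= threshold:
        return best_device, best_score
    return None, best_score

# "bedroom lamp" isn't an exact match, but fuzzy matching resolves it.
print(resolve_device("bedroom lamp", ["reading light", "red bedroom lamp", "overhead lights"]))
```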

But just to illustrate, a couple examples, so some things I already mentioned, like there’s kind of the synonym scenario, where they say, “Lamp,” and their device is called light, or something like that. Okay, so that’s sort of them using a synonym that’s very common in natural speech. Another thing that can happen, because it’s a voice assistant, because Alexa is learning new languages all the time, constantly improving, is that there still are these speech recognition errors. Maybe somebody’s whispering on the other side of the room. We don’t get it quite right, or there’s a transcription error, and we’re just a little bit off. And so we’ve got to figure out: Okay, how do we resolve that to one of the customer’s devices and just turn it on for them?

And then, in regards to this sort of international expansion, which is sort of the big move in the last year: we’ve gone international, to many different languages and countries. We’ve got Smart Home in Japan, in India, Alexa speaking German, Italian, Japanese, Hindi, Portuguese, all these different languages. We’ve got to make sure that we can resolve that type of ambiguity in all of those different languages. And then importantly, across languages, that’s the new big challenge, is that in the US, there’s a lot of people speaking English and Spanish in the same home, and they may interchange English and Spanish in the same commands.

Or even more common, in India, people are naming their devices in English, but they refer to them in Hindi. So we’ve got to kind of match, do this code switching across languages to figure out what they mean. And then also, by the way, they’re speaking naturally, so they’re using synonyms. They’re specifying things somewhat ambiguously. We’ve got to kind of deal with all of that ambiguity to just make this really simple control process work in the best way. And so we’re trying to tie together a variety of sort of input features.

There’s the linguistic information about what they said and what devices they own. But importantly, the thing that’s really interesting about this feature and about this smart home space in general is that we’re also bringing in all of that context from the physical world, so behavioral data, or sort of environmental context. What room are they in when they’re speaking to Alexa? So really, by fusing all of that together, that’s how we get to that very simple experience where Alexa just understands what the customer means, and makes it really easy for them to just turn on that light, or whatever action it is that they want to take.
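
As a rough picture of what fusing those signals together could look like, here is a tiny hypothetical scoring function that combines name similarity with room and behavioral context. The features and weights are invented for the example; a production system would learn this combination from data rather than hand-tune it, and this is not Amazon’s code.

```python
# Illustrative sketch only: fusing linguistic similarity with physical context.
# The features, weights, and candidates are made-up assumptions; in practice
# these signals would feed a learned model, not hand-set weights.
def score_candidate(name_similarity, in_same_room, recently_used_at_this_hour):
    """Score one candidate device for a spoken request."""
    score = 0.6 * name_similarity               # how well the words match the device name
    score += 0.3 if in_same_room else 0.0       # room of the Echo that heard the request
    score += 0.1 if recently_used_at_this_hour else 0.0  # behavioral context
    return score

# "Turn on the light" said to the bedroom Echo: the bedroom lamp wins even though
# another device matches the word "light" more closely.
candidates = [
    ("bedroom lamp",      score_candidate(0.5, in_same_room=True,  recently_used_at_this_hour=True)),
    ("front porch light", score_candidate(0.7, in_same_room=False, recently_used_at_this_hour=False)),
]
print(max(candidates, key=lambda pair: pair[1]))
```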

SEE: 21 technical Alexa Skills IT pros should know (TechRepublic Premium)

Voice input presents unique challenges to developers and engineers

Bill Detwiler: As someone who’s used to writing code, who’s used to working on sort of engineering problems, how difficult is dealing with voice, with all its nuances and complexities that you just laid out, compared to other input forms? Flipping a switch, touching a button, typing something in and using predictive text, which we’ve all experienced with web searches. But voice seems to be, as I listen to you explain it, as I talk to other experts, as I talk to people that have been in this field for quite a number of years, a magnitude more difficult to deal with because of all those complexities, because of the different languages, because of the different contexts that people use when they just speak.

Evan Welbourne: Absolutely. A couple notes about that, so from my perspective, voice absolutely is kind of one of the big frontiers for machine learning. It’s a super hard problem. It’s really exciting, just in the last couple years, there’ve been these big advances in voice, speech recognition, natural language, using deep learning. And that’s made our lives a lot easier, and so there’s some kind of fundamental problems that feel a little bit more solved now. And you can kind of take some of these solutions out of the box and apply them here and there to get you 75% of the way to your solution.

But the thing that’s really interesting in this smart home space, and I think it’s true for most domains when you’re dealing with something like a voice assistant, most other application domains, that is, is that you need to kind of fuse that with other types of information. That context is really important for figuring out what it is the customer wants when they’re giving you these voice commands. We need to know where they are. We need to know what time of day it is. We need to know maybe what they usually do in this situation. And things like the angle of the sun are actually important for smart home, predicting whether they’re going to interact with a light or not, that sort of thing.

And so it’s just a really important, interesting new challenge that we’ve got this voice capability that’s getting better and better, but we’ve still got to kind of apply that type of new voice recognition and NLU technology to our particular application space. It gets us maybe 75% of the way there, but that last 25% is about figuring out: How do we do this in our situation? What does it mean to speak about smart home? It’s a little different than just plain English.

A really basic example, if you’re just talking about semantic similarity between what they say and what they mean, if you’re in this smart home space, they could say, “Family room,” and really, they mean living room. So there’s obvious similarity between family and living when you’re talking about a smart home. But if it’s just plain English, you’re just using an English natural language model, there’s not really that much similarity between the words family and living. So all of this kind of gets adapted to kind of the application space. I think that’s one of the most interesting challenges that my team faces.
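
To illustrate that kind of domain adaptation, here is a minimal hypothetical sketch of a smart-home alias layer applied before matching. The table is hand-written for the example; in practice, equivalences like “family room” and “living room” would be learned from how customers actually name and use their devices, not hard-coded.

```python
# Illustrative sketch only: a smart-home-specific alias layer applied before matching.
# A generic English model may not treat "family room" and "living room" as similar,
# so domain knowledge supplies the mapping. This table is hypothetical.
ROOM_ALIASES = {
    "family room": "living room",
    "lounge": "living room",
    "den": "living room",
    "washroom": "bathroom",
}

def normalize_room(spoken_room):
    """Map a spoken room name onto the canonical name used in the device registry."""
    spoken = spoken_room.lower().strip()
    return ROOM_ALIASES.get(spoken, spoken)

print(normalize_room("Family Room"))  # -> "living room"
```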

SEE: IT leader’s guide to the future of artificial intelligence (TechRepublic Premium)

Using AI and ML to help Alexa be more predictive

Bill Detwiler: As you look at sort of the smart home as a whole, speaking about that context, how difficult is it to get the information that you need to make the predictive decision? The AI and ML is only as good as the algorithms and the people that design it, and as the data we feed into it. Right? So how do you collect enough information when someone maybe only has one device in their home? It sounds like you get the most data, and would be able to predict the best result if you had a variety of devices, say devices that detect ambient light, devices that detect, like you said, temperature, devices that detect … Now you can pack sensors into maybe one device, and a lot of them do that.

But it does seem like you almost need multiple sensors, multiple input devices, that then kind of allow you to get enough information, like you laid out, to help people, to help the system make the right decision at that moment in time. Is that a fair statement? Is that a fair kind of assessment? Or no, we can really just do it with one kind of thing, we can get data from other sources, whether they’re other sources on the internet, or other sources in the area. Talk about that a little bit.

Image: Rohit Prasad, vice president and head scientist for Alexa Artificial Intelligence at Amazon, teaching Alexa. Credit: Amazon

Evan Welbourne: Yeah, absolutely. That’s a great question. I think that there’s kind of a few dimensions to the problem. I mean, one part, it’s absolutely true, there’s kind of a data sparsity problem for a lot of customers. They’ve got, especially brand new customers, they’ve got one light bulb. It’s not the same as if a customer’s got this full decked out house with light bulbs and thermostats, security, they’ve got everything. Well, that gives us some more information to go on, of course. But right, how do you deal with that data sparsity problem? Well, one of the things that if we think about the language example, if you just think about English. Well, you may not know very much about how I speak English, but you know other people who speak English. You know how English works, kind of. It’s a language, so it varies from person to person and region to region.

But there’s kind of a general understanding of how it should work. And what we’ve found interestingly about smart home, internet of things, is that metaphorically speaking, there’s kind of a language to devices and interaction with devices in the home as well. And that’s one of the greatest advantages we have, is trying to kind of, across many customers, we can understand something about the language of the home. So if someone for example, they’ve got two lights. They’ve got a front porch light and a bedroom lamp. Even just by the names of those devices, we know something about them, even if they’ve never used them before. We already know because we’ve seen many front porch lights. We’ve seen many bedroom lamps.

We know that probably the front porch light is more likely to just be left on overnight. It comes on at dusk, and then it stays on overnight. And the bedroom lamp, well, that’s going to be probably on in the evening for a little bit, like half an hour or an hour, and then it’ll turn off again. It’ll be off all night long. So there’s almost sort of a common sense, or you could think of it kind of like the language of the home. That’s something that is incredibly useful, and we see that’s true even in the face of this data sparsity problem, that it’s really valuable. And of course, as customers get more devices, we learn more about kind of their home setup, and that sort of helps us as well.
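
That “language of the home” idea can be pictured as a set of priors keyed on device names. Here is a tiny hypothetical sketch, with made-up probabilities standing in for patterns that would, in practice, be learned across many homes; it is not Amazon’s model.

```python
# Illustrative sketch only: priors about typical device behavior keyed on device name,
# standing in for patterns learned across many homes. The probabilities are invented.
NAME_PRIORS = {
    "front porch light": {"prob_on_overnight": 0.85},  # usually left on dusk to dawn
    "bedroom lamp":      {"prob_on_overnight": 0.05},  # usually off all night
}

def likely_overnight_state(device_name):
    """Guess a new device's overnight state from its name alone, before any usage data."""
    prior = NAME_PRIORS.get(device_name.lower(), {"prob_on_overnight": 0.5})
    return "on" if prior["prob_on_overnight"] > 0.5 else "off"

# A brand-new customer with zero usage history still gets a sensible default.
print(likely_overnight_state("Front Porch Light"))  # -> "on"
print(likely_overnight_state("Bedroom Lamp"))       # -> "off"
```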

SEE: Internet of Things policy (TechRepublic Premium)

Amazon Alexa and data privacy, transparency

Bill Detwiler: I’m someone who really, I actually have to admit, I have several smart home devices. I’m in tech, so I’ve got Amazon Echo devices. I’ve got August smart locks. I’ve got [an Apple HomePod]. I’ve got all kinds of … I’ve got Philips Hue light bulbs, just throwing out random names of devices that I kind of have around. Not really, don’t mean to be promoting any one sort of company. But what I’m curious about is, as we sort of bring the devices into our house, as we kind of use these devices, I’m a pretty privacy and security conscious kind of person. Right? But in order to have value, the system does have to learn some things about me.

Image: 2020 Amazon Echo Show 10 ($249.99). Credit: Amazon

It’s something that in order for me to get the most benefit out of it, it does kind of, like you said, need to know. Well, at what time of day do I usually ask it to do these things? And then, oh, it knows. Well, Bill likes to do this. So there’s always a concern about: Hey, how do we balance that convenience and that sort of the device’s ability to help us, versus, hey, I don’t necessarily want someone else maybe knowing when I leave my house? Because that could open me up to this, or I just don’t want everybody and their cousin to know what products that I buy, or what I like, or what events may be happening in my life that I may be sensitive about.

So my question to you is I guess not really from a … As someone who is working in this field on a daily basis, as someone who is sort of designing these systems and looking at this, where do you think … And people are already comfortable with giving up a lot of privacy. They have been for decades with the web, not just with IoT devices. But so we’ve already kind of, that Pandora’s box has already been opened. But my question is: As the devices are tied more and more into our lives, not just sort of what we do on our screens, but into our everyday lives, where do you see kind of the concerns of privacy being addressed in the system? Is it more about kind of encryption and ensuring sort of the right kind of data privacy regulations?

Is it more towards storing the information on the devices themselves, making the devices … And we’ve seen some manufacturers work that way, making the devices powerful enough, making the processing, the processors powerful enough to do the processing and to store the data on the device, so the data doesn’t have to be sent to the cloud and back. So my Echo, my Smart Lock, my HomePod, it may know kind of about me, but the cloud doesn’t, so there is a little bit of separation there. Just in general, how do you see that kind of shaping up in the next sort of few years?

Evan Welbourne: Yeah. It’s a good question. As you know, there’s these many different approaches, many different types of sort of technical approaches to dealing with privacy. Some of them have to do with sort of the rigor of encryption. Other things are more architectural, like edge devices and so on. I think, so a few notes, one, if we’re talking about sort of overall positioning on privacy and outlook, I’m probably not the best person to speak about that, for Amazon in any case. But I think a few things that are definitely true are … And also, I can’t comment on the future roadmap.

But we do put customer privacy kind of at the forefront of what we do. Customer trust and control and transparency in data are something that you’ll see right now in the Alexa app, and regardless of the architecture, that’s sort of how we address privacy going forward. If you think about, I think part of your question was kind of about the architecture. Are we doing this sort of edge computing? Do we see things going more in that direction? Or are we uploading things to the cloud? I can’t comment too much on that except to say that there’s affordances in each of these approaches. Right?

You sort of push the intelligence to the edge of the device. And you might get a latency benefit, as well as some of that privacy benefit. Or you can use other types of algorithms and privacy-preserving techniques in the cloud that may or may not even work on the device. So there’s kind of trade-offs. It’s a super interesting, challenging space. And for sure, we’re working on that as sort of a key problem going forward. But again, I can’t comment on specific roadmap items.

SEE: Artificial intelligence ethics policy (TechRepublic Premium)

COVID-19 had a significant effect on how people use Alexa

Bill Detwiler: Yeah. I completely understand. So let’s switch gears a little bit. Let’s talk a little bit about kind of COVID-19 and the pandemic, and how it’s affected people, how it’s affected them, now that they’re working remotely more, now they may be interacting more with their IoT devices than they had been just six months ago. Now that people are home more, they decide, you know what, I really want to get some smart devices and IoT devices, and add them in. So they may be doing that more.

What are some of the ways that you all have seen the pandemic maybe change how people are using the devices? What kind of, either in terms of just the frequency, or actually, are they using the devices to learn about COVID-19? What are some of the changes that you’ve seen over the last six months?

Image: TechRepublic cracking open the original Amazon Echo back in 2015. Credit: Bill Detwiler/TechRepublic

Evan Welbourne: Yeah, absolutely. So there’s a lot of changes. As you mentioned, people are interacting with devices more often. Some of the places, speaking broadly about Alexa, some of the places we see that most are in the sort of calls between family members. There’s I think twice as many calls made with Alexa now as there were at this time last year. There was definitely a spike there. People are using Alexa to stay more informed about COVID-19. Off the top of my head, I don’t know the statistics, but people are issuing many queries. They’re also issuing queries for things like recipes. People are staying home more often. They’re not eating out, so a lot more queries about recipes, streaming media use, all of that is increased for sure.

The other thing, if you’re talking about smart devices and internet of things, is that kind of as you would expect, people are interacting with their devices more often. Right? And we talked about these sort of typical patterns of uses. One of the things we would see, say at this time last year, is a pretty clear pattern of the 9:00 to 5:00 work schedule on weekdays. People would interact with devices mostly in the afternoon when they get home, and again in the morning when they wake up. But there would be this kind of spot in the middle of the day where not much happened. That was kind of the general trend in our traffic patterns, whereas weekends looked a little more balanced.

People were at home a little more often, turning on and off lights, using smart plugs, things like that. But now, it’s remarkable. Worldwide, 90% of our customers, their pattern, that general trend has shifted towards every day looking like weekends used to look. So they’re interacting with devices all the time. There’s not really as much of a predictable kind of a 9:00 to 5:00 pattern in there. And it shows up in our models as well, when we predict things like Did You Mean, or the Hunches feature, we’re using those behavioral patterns to try to figure out. Did they mean to leave the back door unlocked at this time? Or should it be locked? And that’s changed remarkably as people have started working from home, so we’ve had to get on top of it and update the models, and try to adapt to the changing situation.

Bill Detwiler: I think that’s a really excellent point because one of the criticisms that’s leveled at AI and machine learning a lot is that it has trouble dealing with black swan events. It has this wealth of knowledge about a specific problem. You’ve fed it all these data sets. But that one black swan event that wasn’t in the data set, it obviously couldn’t prepare for because it doesn’t understand that it exists. So how do we make, or is it possible to make, AI and ML algorithms and systems that can respond to those events? Or does it actually just take humans to still jump in, like I guess you’re describing your team did? We saw this event, so we have to adjust our patterns. We have to adjust the algorithms to this new normal.

Evan Welbourne: Yeah, absolutely. It’s a hard problem for sure, especially it’s like a critical challenge for artificially intelligent systems, I think, dealing with these black swan events. There’s no training data to support it. And for sure, COVID-19 is giving us one. We see others throughout the year. Another one, for example, that we saw a couple years ago was, I’m trying to remember, the McGregor versus Mayweather boxing match, which was this huge televised event in the summer. And all of a sudden, we had all this traffic for televisions and lights and so on. Everything was on that one night, and it was just this global anomaly, everything all at once.

And so there’s other smaller types of events that kind of cause the black swan as well. But I think the solution, interestingly to me, is that it’s partly machine learning and AI, plus building sort of systems that will adapt quickly and can be kind of run collaboratively with a team that’s more responsive perhaps to black swan events. But it’s also just in the way you present the feature to customers and the controls that you give them. You have to sort of design the user interaction in a way that is kind of robust to high uncertainty.

So even in a very simple way, this Did You Mean feature is like that. We don’t automatically take an action just because we have pretty high confidence we are right. We still ask them. We want to make sure that it is in fact what the customer intended. We don’t want to do anything to surprise them. In various ways, whenever we’re building some intelligent system for the smart home, we try to sort of consider that in designing the interaction, designing the experience for the customer, try to sort of understand that in advance, there’s going to be uncertainty. There’s going to be things we’re not sure about. So we want to have the customer in the loop sort of with us. It’s more of a collaboration than a, we’re going to do this thing automatically, and we hope you like it.

Bill Detwiler: I guess that’s one thing if it’s turning off a light bulb, but it is something different if you say you’re opening a garage door, or closing a garage door, or taking an action that has more serious consequences than say, maybe just turning on and off a light. So I guess you do have to use sort of the old trust, but verify, method of determining what the user really intends.

Evan Welbourne: Yeah, absolutely. With the Hunches experience, for example, we can lock doors with that. You don’t want to lock the door on somebody if they’re just out in their backyard. So that’s another one of these examples where we’re either going to send them a push notification saying, “We think maybe your back door should be locked.” You’ve got to go through and do it yourself to lock it, or otherwise there’s this voice experience. We still ask them, “Did you mean to lock? Or did you want to lock the back door?” And they’ve got to say, “Yes,” and explicitly confirm to take that action. So absolutely, that’s part of the design.
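
Pulling those two answers together, here is a minimal hypothetical sketch of that kind of confirmation logic: act automatically only at high confidence, ask “Did you mean…?” in the middle, and always require explicit confirmation for high-consequence actions like locks or garage doors. The thresholds, action names, and categories are invented for illustration and are not Amazon’s implementation.

```python
# Illustrative sketch only: keep the customer in the loop when the model is uncertain,
# and always confirm high-consequence actions like locking a door.
# Thresholds and the high-consequence set are assumptions for the example.
HIGH_CONSEQUENCE = {"lock_door", "unlock_door", "open_garage", "close_garage"}

def decide(action, confidence):
    if action in HIGH_CONSEQUENCE:
        # Never act silently on a lock or garage door; ask for explicit confirmation.
        return f"ask: Did you want me to {action.replace('_', ' ')}?"
    if confidence >= 0.9:
        return f"do: {action.replace('_', ' ')}"
    if confidence >= 0.6:
        return f"ask: Did you mean {action.replace('_', ' ')}?"
    return "ask: Sorry, which device did you mean?"

print(decide("turn_on_reading_light", 0.95))  # -> acts automatically
print(decide("turn_on_reading_light", 0.70))  # -> "Did you mean ...?"
print(decide("lock_door", 0.99))              # -> always confirms
```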

What’s the future of voice and smart home tech look like?

Bill Detwiler: So I’d love to wrap up by just getting a little bit of your sense about where we’re heading in terms of AI and machine learning in respect to smart home devices. What do you see kind of on the horizon? Not specifically about anything you’re working on there at Alexa, or Amazon’s working on, so you don’t have to reveal any kind of roadmap. But just thinking about in general, as someone who’s been in this field for a while now, you kind of start to see kind of trends. What has you excited about the possibilities for AI and machine learning with Smart Home?

Evan Welbourne: Yeah, yeah. I am quite excited. My whole career has been about smart devices and applying machine learning to smart devices, so I’m really excited about where we are now. I think probably the key takeaway, and sort of the key point for machine learning and AI applied in the space is that whether we’re kind of streamlining that basic experience, making it work for everybody, or whether we’re doing something a little more proactive for the customer, like Hunches, or the Guard feature, it’s sort of number one that technologies like voice are a great simplifier. We’ve already seen that with Alexa. It’s really what’s brought smart home to a much wider audience, and we’re trying to leverage that.

But even more so, it’s the application of machine learning to the physical world, to that context about sort of the behavioral data, the environmental context. That’s really what lets us operate intelligently in the physical world. And so it’s about kind of fusing more of that contextual information into the experience for the customer that’s going to kind of unlock the next wave of smart device experiences and sort of more proactive experiences, whether it’s sort of understanding your intentions in the home, even longer term intentions about goals you want to accomplish in the home beyond turning on and off lights.

You want to save money, or whatever it is, all of that’s going to come back to this kind of a little more physical context. Where’s the sun right now? What’s the weather like? All of this kind of information is super important in addition to that voice understanding.

Source: TechRepublic