Press "Enter" to skip to content

The best tools for developing voice user interfaces

Image: Adobe Stock/Erkan

A voice user interface, or VUI (pronounced VOO-ee), is a technology that allows people to interact with a computer or device using spoken commands. VUI technology is evolving much faster than its predecessors (think keyboards, mice and touchscreens). It’s estimated that 94 million people own a smart speaker in the U.S. alone, and anyone who has used a mobile phone or TV remote in the last five years knows standalone smart speakers aren’t the only place where voice user interfaces are prevalent.

A lot of this growth can be attributed to the technology itself. The artificial intelligence that powers the natural language understanding (NLU) behind the voice-powered experiences of giants like Apple, Amazon and Google is nothing short of amazing, but it’s not just the remarkable technology that is driving the growth.

Consider that we as human beings have been using spoken language for at least 200,000 years (by most accounts). There are more than 6,000 languages spoken today by people around the globe. When you combine this with the knowledge that on average people speak 125 to 300 words per minute (over three times faster than they type), it’s no wonder voice user interfaces are on the rise. In fact, you could reasonably argue that if this technology had existed when computers first became available, none of us might have bothered to learn to type at all. Humans are hardwired for VUI.

However, the technological advances required to power highly accurate voice user interfaces were not available when computers came on the scene. Growing up in the ’80s, being able to speak commands to a computer was the stuff of science fiction: the far-off future on the bridge of a starship, if you believed what you saw on television. So, in many ways it was science fiction writers and their imaginations that shaped the VUI of today.

That won’t be the case for the VUI of tomorrow. There is a whole generation of children now growing up alongside voice assistants. A generation of children who will never know of a world where this technology did not exist. That in itself is very powerful and will surely shape the technology with a lifetime of empirical and anecdotal evidence. But there is more to this story than just the notion that by the time a child uses a computer, they will also have a voice user interface at their beck and call.
Most children learn to speak well before they can read or write, which means, in many cases, the very first digital interaction a child has will be a voice-first experience.

The burgeoning voice user interface market

Back in 2018, Amazon launched its Echo Dot for Kids. Now in its fourth incarnation, the Echo Dot for Kids entered the market amid a growing realization that: A) younger children were using voice devices around the house, and B) the crop of devices on the market circa 2018 was built with adults, not children, in mind. With its Echo Dot for Kids, Amazon sought to address those concerns amid news headlines about children ordering toys via Alexa without parental permission and experts worrying that virtual assistants could teach children bad manners.

But pioneering a voice platform for children is not just about creating an experience that has more guardrails. It’s about curating that experience with content. With its Amazon Kids+ subscription, Amazon is working with partners to unlock the potential of this technology with very specific learning experiences tailored to kids as young as three years old.

Amazon must be onto something, as other large players in the natural language processing space have followed suit. Google, for example, threw its hat into the ring with a voice assistant aimed at children in 2020. Meanwhile, startups like MyBuddy.ai, which focus specifically on voice technology for children, are finding investors willing to fuel their journey as perceived disruptors. The potential benefit VUI holds for children, especially when it comes to educational outcomes both at home and in the classroom, is hard to ignore.

Software developers are quick to point out that best practices for developing voice experiences for children are still a mixed bag. There are obvious security and privacy concerns as well as technical and design hurdles. The problem is rooted in the fact that the underlying models, which power the leading voice tools on the market, were created by recording and analyzing the speech patterns of millions of adults.

Deciphering a child’s intent can be much more complex. There is an incredible amount of variance in children’s voices and speaking patterns. Children sometimes over-enunciate words, elongate syllables, skip words entirely or pause dramatically as they think aloud. As adults, we tend to adjust our speech patterns when speaking to a digital voice interface. Not so with children. Kids simply blurt out what they are thinking as it comes to them.

While some of these challenges may be technological, an experienced voice designer can address an overwhelming number of them with thoughtful planning and testing. And there is guidance out there if you’re willing to dig a little. PBS, Disney, Sesame Street and Cartoon Network have all built voice experiences targeted at children ages six and younger, and many of their development teams have shared learnings in podcasts, blogs and white papers. Amazon, for example, offers a free downloadable white paper titled “6 Tips for Building Stellar Kids Skills” that has great guidance. Perhaps even more impressive is the list of 12 design principles for voice published by the BBC design team, inspired by work they did on a branded voice experience for three- to seven-year-olds.

Leading the charge in the voice user interface space

One brand looking for ways to bring meaningful voice experiences to pre- and early readers is Noggin. Noggin (a part of Nickelodeon owned by ViacomCBS) recently launched an interactive voice-forward experience titled “feeling faces” in the Noggin app for iOS and Android. It’s a highly interactive experience, where a child gets to converse directly with Nick Jr.’s iconic “Paw Patrol” favorite Rubble. Described by Nick Jr. as a “gruff but lovable English Bulldog,” Rubble will demonstrate various “faces” within the app and ask children to shout out what emotion they think their favorite animated pup is feeling.

Image: Noggin
The Noggin Feeling Faces interactive voice experience in action on an iPad.

TechRepublic had the opportunity to sit down and discuss the project with Tim Adams, vice president of the emerging products group at ViacomCBS. His team is responsible for matching emerging technologies, like VUI, with Viacom’s brands, intellectual properties and, of course, the audience. Adams’ team supports a number of brands from MTV to Comedy Central. They’ve been involved in voice projects since Amazon opened Alexa up to third-party skills. But Noggin, with its preschool-aged audience, required something special.

According to Adams, they had a number of ideas. “You could use voice to sort of guide a narrative,” he said. “And we tried that, and it didn’t totally match…it wasn’t compelling because it didn’t feel that intimate or conversational.”

Then Adams and team ran across “Paw Patrol” and the work they were doing on “feeling faces.” “These were short-form [videos] where the characters were talking directly to the camera, and we said let’s do that!”

Once the idea was formed, the work went fast. Adams and his team retrofitted existing linear content to make it interactive with voice. They did lots of user testing, looking for ways the experience might fall down for this young audience. They got some good metrics—and more.

Adams went on to explain. “There are moments where he [the ‘Paw Patrol’ character] will ask ‘Let me see your funny face,’ and they [the kids] do it with total honesty…it’s not like this kind of robotic back and forth between the kid and the content. For them, it’s very very natural.”

Of course, engagement wasn’t the only priority.

“First and foremost, it has to be safe for kids,” Adams added. His team worked from a compliance and technology perspective to develop a solution that doesn’t send any voice or data to the cloud for processing. It’s an impressive feat considering how CPU-intensive natural language processing can be.

While Adams says this is just a pilot, the results look promising. When it launched in September 2021, the “feeling faces” content in the Noggin app was among the top performing.

One of the big takeaways Adams has for teams looking to replicate Noggin’s success in the voice arena is a design principle he calls creating “bumper lanes.” Adams and his team simply accepted that, because of technology limitations and the wide range of speech development among these kids, there will be times when the VUI won’t be able to correctly decode a child’s intent. For Adams, the key was to replace that frustrating moment with an enjoyable one that guides the child back onto the conversation map toward the ultimate goal.

“Like the bumper lanes at a bowling alley that are admittedly sort of fun when you bump into them,” Adams explained.

Developer VUI tools of the trade

While training voice models to successfully recognize inputs from younger users requires significantly more testing, the current crop of tools used to develop these experiences is largely the same set used for developing voice experiences for the general population. Those tools have matured greatly over the last five years, and there is no reason to think they won’t keep getting better. What that means is you no longer have to be a specialist to develop voice user interfaces. If you’re passionate about building meaningful voice-first experiences for kids, there are a number of tools and services you could get started with right away.

Alexa Skills Kit (ASK)

Amazon’s voice assistant was early on the scene and has a strong base to get you started. What’s more, the Alexa Skills Kit is an easy way to dip your toes into VUI development. With it, you can get up and running quickly, and if your requirements grow beyond what ASK can handle, you can use what you’ve learned to make the jump to some of the more specialized NLU and text-to-speech (TTS) Amazon Web Services like Lex and Polly.
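To get a feel for the programming model, here is a minimal sketch of a custom-skill request handler, assuming the Node.js flavor of the ASK SDK (ask-sdk-core). The “GuessEmotionIntent” intent and its “emotion” slot are hypothetical placeholders, not part of any published skill.

```typescript
// Minimal Alexa custom-skill handler sketch using the Node.js ASK SDK.
// "GuessEmotionIntent" and the "emotion" slot are hypothetical examples.
import {
  HandlerInput,
  RequestHandler,
  SkillBuilders,
  getIntentName,
  getRequestType,
  getSlotValue,
} from 'ask-sdk-core';
import { Response } from 'ask-sdk-model';

const GuessEmotionIntentHandler: RequestHandler = {
  canHandle(handlerInput: HandlerInput): boolean {
    // Only handle IntentRequests for our hypothetical intent.
    return getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && getIntentName(handlerInput.requestEnvelope) === 'GuessEmotionIntent';
  },
  handle(handlerInput: HandlerInput): Response {
    // Read the child's answer from the slot and respond with encouragement.
    const emotion = getSlotValue(handlerInput.requestEnvelope, 'emotion') ?? 'happy';
    return handlerInput.responseBuilder
      .speak(`You said ${emotion}. Great guess! Want to try another face?`)
      .reprompt('Want to try another face?')
      .getResponse();
  },
};

// Wire the handler into the skill's AWS Lambda entry point.
export const handler = SkillBuilders.custom()
  .addRequestHandlers(GuessEmotionIntentHandler)
  .lambda();
```

The same intent-and-slot model carries over if you later move to Lex, which makes ASK a reasonable on-ramp.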

Actions Builder (for Google Assistant)

Google Assistant is everywhere: smart speakers, remote controls, thermostats and, of course, our web browsers and phones. While Google’s Actions Builder has arguably a slightly steeper learning curve than the Alexa Skills Kit, Google’s codelabs offer free, hands-on, introductory and intermediate courses to get you up and running in no time.
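For comparison, here is a minimal webhook fulfillment sketch using the @assistant/conversation Node.js library that backs Actions Builder projects. The handler name “greet_child” is a hypothetical placeholder that would be referenced from a scene in the Actions Builder console.

```typescript
// Minimal Actions Builder webhook fulfillment sketch.
// The "greet_child" handler name is hypothetical; it would be mapped to a
// scene in the Actions Builder console.
import { conversation } from '@assistant/conversation';
import * as functions from 'firebase-functions';

const app = conversation();

app.handle('greet_child', (conv) => {
  // conv.add() appends a simple spoken/text response to this turn.
  conv.add('Hi there! Can you tell me how the puppy on screen is feeling?');
});

// Expose the app as an HTTPS Cloud Function that the Action calls.
export const ActionsOnGoogleFulfillment = functions.https.onRequest(app);
```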

Annyang

While Annyang only handles the speech recognition side of the equation, it does so with an open source, MIT-licensed JavaScript library that weighs in at under two kilobytes and runs entirely client side. This can be quite a boon when you are building an application for children and need to ensure no identifying information is stored or sent over the internet as a condition of the Children’s Online Privacy Protection Act.
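A quick sketch of annyang’s command-mapping style is below. The command phrases are hypothetical, and in a real project you would swap the console logging for your app’s UI logic; the library can also be loaded as a plain script tag that exposes a global annyang object.

```typescript
// Sketch of annyang's command mapping in the browser.
// The phrases below are hypothetical examples.
import annyang from 'annyang';

if (annyang) {
  annyang.addCommands({
    // ':emotion' captures a single spoken word into the callback argument.
    'the puppy is :emotion': (emotion: string) => {
      console.log(`Child guessed: ${emotion}`);
    },
    'show me another face': () => {
      console.log('Advance to the next face');
    },
  });

  // Start listening; the browser will prompt for microphone permission.
  annyang.start();
}
```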

Mycroft

This is another open source option. Unlike most of the other voice toolkits mentioned here, which are JavaScript-slanted, Mycroft is natively Python and is meant to be an entirely open source digital assistant. The entire stack can be deployed on your own custom hardware, making it a bit more vendor-agnostic than some of the other choices on the market.

Web Speech API

No discussion of NLU tools would be complete without a mention of the Web Speech API. Drafted by a W3C Community Group in 2012, this is a fairly comprehensive web-based solution. Unfortunately, as of 2021, it still does not have across-the-board browser support. Still, if you know your project is limited to certain versions of Chrome and/or Firefox, it’s a quick way to jump into VUI development.
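Here is a quick sketch of the recognition side of the API, assuming a browser that exposes the (often webkit-prefixed) SpeechRecognition constructor; the companion SpeechSynthesis interface handles text-to-speech.

```typescript
// Feature-detect the SpeechRecognition constructor (webkit-prefixed in Chrome).
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

if (SpeechRecognitionCtor) {
  const recognition = new SpeechRecognitionCtor();
  recognition.lang = 'en-US';
  recognition.interimResults = false;

  recognition.onresult = (event: any) => {
    // Log the top transcript for the most recent result.
    const latest = event.results[event.results.length - 1];
    console.log(`Heard: ${latest[0].transcript}`);
  };

  recognition.start();

  // Text-to-speech via the SpeechSynthesis half of the API.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance('What face is this?'));
}
```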

Final thoughts

It’s difficult to speculate what the VUI of tomorrow will look or sound like. All you have to do is watch the excerpt from last year’s Google I/O, where the company’s breakthrough voice technology personified the planet Pluto and later a paper airplane, to know that this field is headed into previously uncharted territory. What should be clear is that the users of tomorrow’s VUI are here today. The opportunity to invest in these users, our children, and the potential VUI holds for them is real, and it’s important we get it right.

Source: TechRepublic