Video

SoundHound: The Evolution of Voice-Enabled AI

Learn why it’s as important to adopt a voice-enabled strategy today as it was to adopt a mobile strategy back in 2007.

Engage 2017

Rainer Leeb, Senior Director of Growth @ SoundHound

Video Transcript:

All right. Hello, everybody. Welcome back. As before lunch– I hope you have all chilled from this very intense session with Tim. [LAUGHTER] Right? Felt a little bit like The Matrix meeting Blade Runner. So I’m going to cool it down. Everyone exhale. Take it easy. I’m going to talk about another AI topic today, which is voice-enabled AI. One of the things that Tim didn’t talk about, one of the things that computers don’t really do very well today, is understand what you are saying. So the question is, where are we on this path? We are getting better at this, and I wanted to give you a little update on where we are in the evolution of voice-enabled AI.

Back in 2007, if you were a website– that was your main product– and you did not have a mobile strategy, you’re probably lost by now. Because there were early adopters that bet on this technology, believed in it, moved forward, created tremendous value for their customers, and disrupted the marketplace. In 2017, if you don’t have a voice strategy right now, you’re not an early adopter. You’re likely to not create tremendous value, and to be disrupted by others. So ask yourself: do you, with your products, have a voice strategy yet? Do you have a plan? And the reason I’m saying this is, it’s happening fast. In Tim’s terms, we are already at the beginning of page 501, and it’s really happening very fast.

So in the ’80s, the way that we input things into devices was very different from what it is now and what it will be in the future. And we’re talking the very near future. In the ’80s, the first home computers were introduced– desktop computers. What you got was a keyboard to type and a mouse, which was already an incredible invention, but still fairly inefficient and clumsy. In the ’90s, computer manufacturers started working on touchscreens. Maybe you remember– Bill Gates tried early on to build a tablet with touch, kind of a hybrid model.
It didn’t work out that well, but somebody like Steve Jobs came along with a bigger vision of touching and swiping and built an iPad, and all of a sudden the tablet was here and was a huge success. So all it needs is the right human interface– a user experience that people love and that works for them. And today you see voice evolving as an input factor. And it’s happening fast. We know when Siri launched, people felt it was a bit clumsy at the beginning, but it’s getting much better, and it’s getting better every day. Google is coming out with products. Amazon is coming out with products. And all of them use voice as the way to interact with those devices. Why? Because not every device has a screen or needs a screen. But every device in the future will be connected.

And then there is us. You know, we are adults, and these technologies kind of get introduced to us when we’re already used to something else. So we feel like it’s not going to happen– this is really not working, or not working as we expected. But think about it. If you were a kid growing up in the early 2000s, you were growing up with touch and swipe devices. These kids didn’t get catalogs sent home to look at images of products. They grew up with these devices. And from the beginning they do this pinching, right? If you give them a catalog, they start pinching the photos, but the photos don’t move. Because they expect that. So today, for these kids, it’s all about touching. The touchscreen is now also on the computer. The mouse is gone. We still have the keyboard to input, but most of it is touching, tapping, swiping. But it’s still not very efficient, because– I mean, some people are really, really fast at typing when they write a text message. Like, crazy– [IMITATING RAPID TEXTING] And the kids are better. I’m clumsy. My thumbs are too big. I mistype. It’s crazy. Why can’t I just tell this thing what I mean, and it’s just going to send it?
If you were born in 2015, you’re going to grow up and it’s going to be very different for you, because everyone is going to have one of these devices. And you can talk to them. It’s amazing. And so they start to expect that. There’s the little thing, the little lights, the displays– great user experience, fun. You know, kids are curious, so they start talking: “Alexa.” Off it goes. Tell me a joke. Really funny. Love it. So they will expect that. One day, whatever product they use, they’re going to talk to it.

So this is what today’s familiar voice tech landscape looks like. Consumer devices made by technology companies, integrated into speakers, phones. But that’s not the end of it. Think self-driving cars, robots, simple things like fridges, washing machines. You didn’t think you needed to talk to them yet. We kind of find this awkward. But if you’re growing up like this little kid, you want the washing machine to understand what you are saying. You know? Turn it on. Medium temperature. Don’t boil the hell out of it.

So where are we in terms of consumer options? Today we estimate that by 2020, 50% of all searches– and that’s a lot; Google alone, we’re talking billions– will be input by voice. 19% of all online adults in the US that have a smartphone have used personal assistants. So I’m assuming everyone in the room has a smartphone. Who has used voice interaction through an assistant on their phone? So already quite a lot. 100% of the kids do. So this generation is going to grow up doing it. And what’s happening is there’s a rapid, rapid increase in voice search queries– 35 times over the last few years. It sounds like a long time, but in the time of human beings, as Tim said, it’s not even a letter on that page. And then there are all these IoT devices. Today we have about five billion. In 2020 it will be 21 billion. Not all of these are going to have a place to input stuff, but all of them need to be commanded.
So the most efficient way to do this is by just using your voice. Now, I want to add a little bit about what we do. I work for a company called SoundHound. It was founded in 2005 by Stanford graduates. We have a presence in the US and Canada, and we’re expanding in Asia. Our founders developed a voice-enabled AI platform that does natural language understanding combined with voice recognition– and I’ll get into that a little bit later. AI is very popular these days, so we’ve got a lot of funding: $120 million. If you look at the investors, those are mostly large companies that are strategic investors, because they bet on the future. It’s not so much venture capital. It’s more working with companies that have real products and putting smart voice-enabled AI into their products. And so our mission is– as you saw, 20 billion devices. All of them need to be Houndified. Like, voice-enabled.

But we also have consumer products. Why is this important? Because when you build a voice-enabled AI platform, you also need to understand how people are saying things, so you can understand what they are saying but also what they mean. And this helps us collect a lot of information and really understand the user. Because one of the most complex things that we see is– this is new to everyone. How do you integrate voice into your product from a UX perspective? You’re a product manager. How do I create this interaction? How does the user understand how to activate it? How do they say it? What to say? What can this thing even do, right? It’s still a big black box to us. So we first built a music-discovery app that actually understands when you sing and hum, and it will identify the song for you. And then we added, from a user interface perspective, voice commands so that you can navigate the app, ask questions, execute, and look for whatever you like.
And then on top of that, on this Houndify platform, we built a smart personal assistant called Hound, and we offer a variety of– over 125 domains that you can access, where you can do simple things like ask for the weather, ask for the football score, ask for restaurants near you, navigate, et cetera. And that offering is obviously growing as we are growing the partnerships on the developer platform, where people at companies like yours add data and voice-enabled AI to the platform. So I’m going to go into–

[BEGIN VIDEO PLAYBACK] Introducing SoundHound 8. We’ve taken the app you know and love and made it faster, smarter, and designed it around your music needs. [MUSIC PLAYING] Any song you discover will be kept in your history, and you can easily see where you heard a song, and mark your favorites. You can listen to full tracks for free using YouTube, and even connect to Apple Music or Spotify to build a playlist of your discoveries and unlock real-time, moving, live lyrics. [MUSIC – MARIAN HILL, “DOWN”] (SINGING) Are you down? Are you down? Are you d-d-down? Are you d-d-d-d-down? In addition to your discoveries and favorites playlists, we’ve curated top songs across genres to help you find your next favorite tunes. SoundHound 8 comes with “OK Hound,” a powerful and accurate way to find millions of songs hands-free using just your voice. On any screen just say, “OK Hound,” and request the song, artist, genre, and more. OK Hound, play “Talk to Me” by Marian Hill. Playing “Talk to Me” by Marian Hill. [MUSIC – MARIAN HILL, “TALK TO ME”] (SINGING) You know you could get this. You got to do– SoundHound 8. The fastest and most accurate hands-free music discovery tool. [END VIDEO PLAYBACK]

So this was to illustrate that you are able to integrate voice commands into an existing app. And it’s really cool, because now if I’m driving a car and I have my phone there, I can just command it around and tell it to play music.
Or if there’s a song playing on the radio, I can just ask the app what song is playing. No need to interfere or get interrupted. But, as I said, music is only one domain where we can use this. So we’re looking at much smarter assistants that can really execute across a variety of domains. And this is Hound.

[BEGIN DEMO] OK Hound, what’s a good restaurant in Campbell that’s not fast food, Thai, or pizza? Here are several restaurants with more than three stars near Campbell, sorted by rating, excluding fast food, Thai restaurants, or pizzerias. [DING] OK. Here is what I found. [BUZZER] Is the first one open? The first one is Walia Ethiopian Cuisine. Today it’s open from– Lefty’s A Taste of Boston on East Campbell Avenue in Campbell is open today. Does it have WiFi, and is it good for kids? It provides WiFi and is kid-friendly. Here’s what I found on the web for “does it have WiFi and is it good for kids.” How much would an Uber Select cost to get there? An Uber Select can take you from here to Walia Ethiopian Cuisine, located at 2208 Business Circle in San Jose, for about 37 to 48 US dollars. I found this on the web for– [END DEMO] [CHEERING]

So I hope you got it. If you have a conversation with somebody, you need to remember what they were saying, otherwise it’s not a conversational interface. It’s just: I give a command, you get something back. If it doesn’t remember what you said before, it’s out of context and you’ll get the wrong result. It’s not going to understand what’s happening. So there’s a lot going on beyond voice recognition– really being in the moment, taking the context of things that you have done before so that the device can help you, giving you the right answers, smart answers that are related, so you don’t have to repeat yourself all the time to give it the context because it just lost it. And that’s one way. The other way is, you want to say it naturally.
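To make the point about context concrete, here is a minimal sketch in Python of why a conversational interface has to keep state: a follow-up like "is the first one open?" only resolves against the previous result set. This is purely illustrative and not SoundHound's implementation; every name here (`Assistant`, the toy catalog) is an assumption for the example.

```python
class Assistant:
    """Toy conversational agent that remembers its last result set."""

    def __init__(self):
        self.last_results = []  # context carried between turns

    def search_restaurants(self, city, exclude=()):
        # Stand-in for a real search backend with two hard-coded entries.
        catalog = [
            {"name": "Walia Ethiopian Cuisine", "city": "Campbell",
             "cuisine": "ethiopian", "open": True},
            {"name": "Lefty's A Taste of Boston", "city": "Campbell",
             "cuisine": "american", "open": True},
        ]
        self.last_results = [r for r in catalog
                             if r["city"] == city and r["cuisine"] not in exclude]
        return self.last_results

    def follow_up_is_open(self, ordinal):
        # "Is the first one open?" resolves "the first one" from stored context.
        if ordinal >= len(self.last_results):
            return None  # no such result to refer back to
        r = self.last_results[ordinal]
        return f"{r['name']} is {'open' if r['open'] else 'closed'} today."


bot = Assistant()
bot.search_restaurants("Campbell", exclude=("thai", "pizza"))
print(bot.follow_up_is_open(0))  # refers back to the first result of the last query
```

Without `last_results`, the follow-up would have to restate the whole original query, which is exactly the failure mode described in the talk.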
So when you say it naturally, you need to take out the irrelevant words that we use, like “well.” It doesn’t really mean anything, but that’s how we speak, and so we need to understand all of these things. And it doesn’t always have to be perfect for us to understand. So we need to learn that. We need to learn how we speak, what to leave out, what to consider, and how to classify it. And this is where the Houndify platform is unique, in the sense of how we built Hound– I’ll kind of just showcase it here. I like this quote the best. And that’s really key to the user experience. If I talked to you and every time I forgot what you had just told me, you would be kind of pissed off, right?

So Hound is powered by the Houndify platform, and think about the platform: it’s a comprehensive solution that today you can integrate into any software you have, whether it’s self-driving cars, appliances, robots, IoT devices, apps, services. Whatever it is– it’s just a platform in the cloud that works through an API and gives you back the results that you need. It’s the world’s fastest ASR– speech recognition technology. Which, as we showed you in the video– you can speak normally and you see it transcribed in real time. But what it also has is the most advanced natural language understanding. So it’s not just recognizing what you’re saying– what you mean is what’s important. We can say stuff, but that doesn’t mean you understand it. So speech to meaning is the most important thing. If you combine the two, you get a much, much faster and much more intelligent way to have the conversation. And that’s what our founders have enabled.

So the secret sauce is this. If you look at the traditional way these ASRs work, you first translate speech into text: what is this person saying? Then it goes into another bucket that asks: what does this person mean? And then it gives you a result. Right?
In the meantime– it means it’s serial, so the next time you say something, it has already forgotten what you did before. And so you’ve got to repeat it over and over again, and it doesn’t sound like a very intelligent conversation, from a human perspective. So you’ve got to put the two together so they can learn, and it gives you a lot more options, and it’s faster.

So look at complex queries– something that is not easy even for humans to say, by the way, but you could. You’d probably want to do this in steps, but if you tried it once you’d see. Double negatives. Extremely complicated. But that’s sometimes how we talk, so we need to understand it. We need to understand that “show me” doesn’t mean anything– it’s really not relevant. But you need to understand: OK, it’s hotels. It’s in San Francisco. It’s for tomorrow. I only asked for a certain price. It has other factors, like I want it pet-friendly. I only want three or four stars, because I don’t want to stay in a skimpy hotel. Et cetera, et cetera. So there’s a lot to absorb and classify in real time, and to do it fast and give you the response that you want. Because you want something very, very specific.

And then, oh, if you have forgotten something– now imagine you wanted this sorted by price, but the system didn’t remember what you’d said. You would have to repeat the whole sentence and add “sort by lowest price.” Not a very fun experience. (SPEAKING RAPIDLY) [HEAVY BREATH] Not a good experience. I’m exhausted already from this. And then once I get a hotel at that price, I say, well, oh– I forgot to tell you, maybe I don’t need to check in tomorrow; I want to get a better price for today. So now you have to say the whole thing again. I’ll spare you. You get my point.
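The hotel example above can be sketched as slot extraction plus context merging: the long utterance is reduced to a structured query, and the short follow-up ("sort by lowest price") merges into the remembered query instead of forcing a repeat. The patterns below are a deliberately naive stand-in for a real speech-to-meaning model; every slot name is an assumption for illustration.

```python
import re


def parse_hotel_query(utterance, context=None):
    """Merge slots found in this utterance into the remembered query (context)."""
    query = dict(context or {})  # start from what was said before
    text = utterance.lower()
    # Filler like "show me" simply matches nothing and is ignored.
    if "san francisco" in text:
        query["city"] = "San Francisco"
    if "tomorrow" in text:
        query["checkin"] = "tomorrow"
    if "pet-friendly" in text:
        query["pets"] = True
    m = re.search(r"under \$?(\d+)", text)
    if m:
        query["max_price"] = int(m.group(1))
    if "lowest price" in text or "sort by price" in text:
        query["sort"] = "price_asc"
    return query


# First turn: the long, specific request.
q1 = parse_hotel_query(
    "show me pet-friendly hotels in San Francisco for tomorrow under $300")
# Follow-up turn: four words refine the whole remembered query.
q2 = parse_hotel_query("sort by lowest price", context=q1)
print(q2)
```

The point of the sketch is the `context` parameter: because the parsed slots survive between turns, the follow-up does not have to restate the city, date, price, or pet constraint.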
So with speech to meaning, what we’re able to achieve is higher and higher accuracy, and obviously higher and higher accuracy is needed to make this a good experience. If you use a product and half of the time it doesn’t understand what you’re saying, you get frustrated very quickly. So we get frustrated. How many of you have called Siri names? Yeah. It’s like, you know– but it’s like a little kid. It’s like a five-year-old. [MUMBLING] You know? Silly– you know? The good thing is– and going back to [? teamser– ?] this is going to learn very fast. And soon we won’t have to talk to it like a five-year-old anymore. We can have an intelligent conversation with a bot.

So what’s the industry landscape, and why is this so difficult? First of all, you don’t build speech recognition and natural language understanding overnight. There are very few companies out there that do this and have it at a technical level where it’s ready for prime time, as they say. So: high barriers to entry. It takes many years. Many years. Lots of data. And you need tools. If you want to integrate this, you have to be a developer. Like: I just want to sign up for this platform. I want to understand how this works. I want to integrate it into my app or my robot. And then I give it these commands so that it understands and can execute what it needs to execute. So it needs tools. But again, it’s a user experience. The better technology will win. It has to, because if it doesn’t understand what you’re saying or what you mean, you get wrong results. Customers are not going to be happy.

But then what you also want is, you don’t want to become dependent on the big guys. You have your own product, your own brand. Do you want your product to call Alexa to tell customers about your product, or do you want them to talk to you and your product? If I’m Mercedes and I put this in the car, I want the car to be talked to like it’s the car. It’s not Alexa or Siri.
And say, hey, drive faster. Do you say, “Siri, drive faster”? No. I don’t think so. You also want to learn from your data. So do all the big guys– Google wants to, everybody else does. But you want to make sure that you own the data from your customers, and the insights, and that you have access to it. So you can do that. But the ambition is bigger. We cannot build the world’s most intelligent voice-enabled AI platform on our own. We can build good voice recognition and good language understanding, but it’s bigger than that, because you can only do so much with the data you have. So the vision is, everyone’s got to contribute. It’s an open platform. And you can add, and you can add, and so it becomes bigger– larger than the sum of its parts. And you can share it. If you want to integrate it, or offer it as an API so that others can use your service with voice, you can do that. So it becomes way more intelligent, and the future is collective. It’s collaboration. This intelligence needs the most people contributing, because the more people contribute, the more diverse and the more intelligent the system will become. So I invite you to do that with us today. I’ll say thank you, and let’s connect on what we can create together. [APPLAUSE]

We’ve got a few minutes for questions.

You had a slide earlier which talked about two approaches– the [INAUDIBLE] plus ASR, and then the other [INAUDIBLE] that you’re following, which is speech to– so does it mean, fundamentally, if you’re building a bot that understands text and responds to text versus one that understands voice, you should follow two separate approaches, and that’s the best way to–

Well, what it means is– let’s think about it. You can go with technology that does voice recognition. It’s like dictation. So you get dictation, and it just basically transcribes what you’re saying. But it doesn’t classify anything, so it doesn’t really know what you mean.
So for that you need a separate platform, which is what the industry calls natural language understanding, NLU. That basically takes all the words, parses them, finds the keywords, classifies them, and says: well, this looks like it’s related to, let’s say, restaurants. And then it figures out somewhere in a database to pull up restaurants and gives them back to you. So you can go with two systems. The problem is, then you have sequential processing. And so it can’t remember, in that sense, what you did or what the context is. So it fails in many ways at keeping the context and allowing you to follow up and go back and forth the way a conversation does. That’s why the power is in combining the two, so that the system knows what you meant in your first query and you can do something with it later. You can’t do that with separate platforms. That’s what the understanding is.

For apps that are doing nothing with voice today, what does it look like as a natural progression– starting to use voice, and then using voice to control more of the app? What have you seen be successful in terms of the progression of getting into a voice-based app?

So I think the logical answer here is, it’s been most successful where you have no display, because you have no other choice. That way the users also understand– or you can train the users and say: here are the things you can do with this, and you can speak to it. When you’re integrated into something where you already have tapping, and that’s what users are used to, where I see it most successful is deep-diving quickly. Think of navigations that are five-step things– they need five taps until you get to what you want. Typical step processes in navigation funnels. If you want to get to the end state quickly, voice is much faster. Or it replaces long typing– voice is much faster.
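The NLU classification step described above — parse the words, spot the keywords, route to a domain — can be sketched as a toy keyword table. Production systems use trained models over far richer features; this table and its domains are purely illustrative.

```python
# Illustrative keyword table mapping vocabulary to domains.
DOMAIN_KEYWORDS = {
    "restaurants": {"restaurant", "pizza", "thai", "dinner", "eat"},
    "weather": {"weather", "rain", "temperature", "forecast"},
    "music": {"song", "play", "artist", "album"},
}


def classify_domain(utterance):
    """Route an utterance to the domain whose keywords it overlaps most."""
    words = set(utterance.lower().replace("?", "").split())
    scores = {domain: len(words & kw) for domain, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # With no keyword hits at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"


print(classify_domain("What's a good Thai restaurant near me?"))  # restaurants
```

Once the domain is known, the system can hand the query to that domain's data source — which is also why the "show me" filler words drop out: they match no domain vocabulary at all.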
So in music, for example, if I wanted to know my favorite artist’s birthday, or for that matter when the band was formed, you can either search for it on the web by typing text, or you can just say it– you are way faster producing those characters by saying them than by typing, unless you’re a world-class texter. So in those circumstances I have seen it most successful. And filtering. Think of drilling down. We talked to a bunch of folks here from e-commerce. If you’re looking for a particular item today, you start searching, and then you get this massive result set. And then you have to drill down by attributes. If I can say the whole thing in one sentence and drill down to, I want a red Tommy Bahama shirt under $100 in XL, et cetera– something like this– you can say it faster than you can navigate the website or the app.

Oh. Sorry. We’ll make sure we get you. Thank you.

Following up on that question, I guess– obviously you may have a little bit of a biased perspective, but for someone starting out, what would you recommend as the order of integrating into voice services like Google Assistant or Siri, in terms of prioritizing native speech recognition functionality in your app versus, like, the operating-system level?

Well, I think that today the landscape is very fragmented. So it’s still not an either/or. You probably bet on a few horses. And if you’re a consumer product in particular, you want to go where the traffic is. But you also want to satisfy your own customers’ needs and be ready for the future. So what is very common is, people get started building Alexa skills at the same time as they’re looking at integrating voice as their own interface with their customers in their apps. But, you know, who knows? It’s not clear. If you want to be in the car, which platform is going to win in the car? We’ll see how it plays out. So it’s like operating systems. Like I said– app developers. Back in the day, there were people actually developing BlackBerry apps. No one is developing BlackBerry apps anymore. So it’s a selection process. And then you focus on where you get the best traffic.

Hi. So you talked about learning from how people search, and I’m trying to– can you give a use case of– let’s say I list several different variables, but you might learn that one of them really is the most important based on my past history. Do you have any thoughts about that?

Sure. I mean, that’s where data and machine learning come into play. Personalization is obviously important.
But, you know, you are referring more to what I’m surfacing in terms of content that is relevant to you. It’s not going to change the meaning of the words that you are saying. That’s usually, in most cases, unambiguous. But it can be ambiguous in the sense that there are words that have the same meaning, in theory. So I need to understand the context in which you are saying it. That’s going to be more of a learning problem– the typical search problem that Google has when you search. But the more specific you are, the less ambiguous it will be. And that’s where the follow-up stuff is important.

So let me give you an example. People still– it’s crazy, right? I see the data, and people go in and they say: “Wal-Mart.” Like, OK. So they’re using voice as if it were a keyword search back in the day. That’s not the idea of natural language understanding. But that’s what’s happening. So we’ll learn– we’ll learn how to speak to it. It’s really funny, because I see these– in particular in Asia, they love these little robots. And they’re so cute. So people talk to them like they were little kids, like babies. It’s really funny. So I think we need to get used to it. We need to learn it. It’s learning the behavior and learning what systems can do, so we adjust a little bit what we’re saying. In today’s world, I think keyword search on Google is no longer dominated by single keywords. But it took a long time until people learned: I need to give this system a little bit more information so that it knows what I’m saying. So it will be similar.

So, two questions. One of them is around opt-in rates for microphones on iOS. What is your experience there? Is that a challenge? And what is that like on Android?
And the second question is around specific keywords– where the natural language processing starts, and you have a keyword that means something very specific to you and your business when your users search for it or speak it. So there’s this process of building lexicons: this particular keyword means something different to us than it does for everyone else. How do you manage that manual process?

Very good question. So the first question was opt-in rates for the microphone. They’re, like, over 99%, because the product doesn’t work if you don’t enable the microphone. So I think we’re very clear around that problem. I think the bigger question behind this is the perception people have that the microphone on their device is listening to them all the time. I think that’s the key. So Google Home and Alexa– they do. If you put an Echo next to your bedroom, Amazon is listening. You know that. And if you have the new device that also has a camera, it’s also looking at you. So you’ve got to feel comfortable with that if you use these products. The truth is, in today’s apps– at least the apps that we launch– the microphone is only listening when you activate it. So you either tap the microphone button or you use the wake word. We use “OK Hound” for our wake word, and then it starts listening. It doesn’t listen when the microphone is not activated. It’s not some NSA tool.

So your second question was custom commands– custom keywords, basically? In a third-party platform like ours, you build your private domain. And you train that domain for your customized use case. So if a word has a certain meaning in your use case, it will be trained, and the system will know, when a query is related to that domain, that it needs to execute or interpret it in that way. So that’s possible. Yes. All right?

Hey. You mentioned Google searching and how voice queries differ from a typical keyword search that you type in.
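The wake-word gating described above can be sketched in a few lines: audio is checked locally frame by frame, and nothing reaches the recognizer until the wake word is detected. The text-frame "detector" below is a stand-in for a real acoustic model, assumed purely for illustration.

```python
def wake_word_gate(frames, wake_word="ok hound"):
    """Yield only the frames that arrive after the wake word is heard."""
    listening = False
    for frame in frames:
        if not listening:
            # A real system matches audio features; we fake it with text frames.
            if wake_word in frame.lower():
                listening = True  # open the mic to the recognizer
            continue  # pre-wake audio is discarded, never sent anywhere
        yield frame


stream = ["background chatter", "ok hound", "play some jazz", "louder"]
print(list(wake_word_gate(stream)))  # ['play some jazz', 'louder']
```

The privacy property is in the `continue` branch: everything before the wake word is dropped on the device rather than streamed, which is the distinction the answer draws between an always-listening speaker and a wake-word-activated app.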
I’d be interested to hear how you think that will affect search engine marketing and search engine optimization in the long run.

That’s an excellent question. It will– very much so. More importantly, because the output now looks different. The challenge for the search engine market is going to be: if a lot of these searches move away from the web onto the mobile phone, and they’re now voice, the output is going to look different from a simple Google search. Like everything else, you’ll have data on what people are searching. What we see, though– particularly for app marketers, by the way– is that search engine marketing per se is actually going away, in the sense that Google is no longer allowing you to make judgments about which keyword is relevant for your product. They make that judgment. So keyword bidding is going to be finished very soon– it’s already going away. I think this week, actually, Google is turning off keyword targeting for app installs, because they decide which app users find most relevant for whatever keyword is being searched. At the end of the day, this is all machine learning. The advertiser no longer has to think about it. All they care about is: how much money are you willing to spend for an app download or a transaction? That’s it.

We have one more question. That was actually my question, so thank you. The second question is about sources. What do you do with sources? Are you going to say in your results, like, “according to–”? Because with a search result, you know which one you’re clicking– is it Yelp that recommended that restaurant, or is it whatever? So what about sources?

Also a very good question. So right now, we basically have domain providers for certain queries, like weather. And so there’s always a data source attached to the results. On our Hound app, for example, if you do a restaurant search, it will surface Yelp data, and it will specify that it’s Yelp data.
That’s how it works.
