Abstract

The use of voice search has increased significantly over the past few years with the rise of voice-enabled devices. Voice search, by construction, affords information about the user that is not available in conventional text search. Most notably, implicit information obtained from raw audio can be used to tailor the underlying content retrieval system to more closely match user preferences. To maximise utility, however, an optimal voice search system should be able to infer this information with minimal supervision and minimal user input. In this paper, we present a set of novel methods for inferring information about users of voice search without explicit enrolment, and demonstrate subsequent enhancements to personalisation. Further, we show how this work reduces computational cost by narrowing the set of possibilities considered by our natural-language understanding (NLU) system.

Introduction

With the rising popularity of virtual assistants such as Amazon’s Alexa, Apple’s Siri, and Google Assistant, the modern consumer has grown accustomed to using conversational services while moving around the home, accomplishing tasks that would otherwise require “hands-on” interaction with a device, such as typing a query into a search engine or changing a music playlist. In addition to making these tasks simpler, voice can provide additional contextual information about the speaker (age, gender, sentiment, etc.) that may then be used to further enhance the user experience.

In this paper, we describe a novel, efficient, and effective strategy for enhancing the discovery experience for voice remote users by attaching context to the string of text passed to backend search. The solution centres on personalisation at the level of an anonymous but individual user within a household. To present our strategy, we have broken the end-to-end solution into multiple parts. First, we obtain information about the speaker from acoustic features extracted from the audio. Next, we fold in metadata that we have crafted around the offered content; this component is crucial for inferring the relevance of content to the individual user and, within the scope of our research, was limited to age-based relevance. The third step is to use what we have gleaned from the previous steps in real time to optimise natural language understanding and backend search. In doing so, we close the loop of enhanced entertainment discovery by matching a user to a collection of relevant content, all in the absence of explicit user identification.
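To make this pipeline concrete, the sketch below walks through the three steps in Python under heavily simplified assumptions: the age-group classifier is replaced by a placeholder energy heuristic, the content catalogue is a hand-written list tagged with age relevance, and search is reduced to token matching. All names and schemas here are illustrative stand-ins rather than the components used in our system; the sketch only shows where the inferred speaker trait enters the flow.

```python
# A minimal, self-contained sketch of the three-step flow described above.
# Every component here (the feature heuristic, the catalogue schema, the
# token-matching "search") is a hypothetical stand-in, not production code.

from dataclasses import dataclass


@dataclass
class Utterance:
    audio: list          # raw audio samples from the voice remote
    transcript: str      # ASR output that is passed on to NLU/search


# Step 1: infer an anonymous speaker trait (a coarse age group) from the
# audio, with no explicit enrolment. A real system would use learned
# acoustic features; a crude energy heuristic stands in here.
def infer_age_group(audio: list) -> str:
    mean_energy = sum(abs(s) for s in audio) / max(len(audio), 1)
    return "child" if mean_energy > 0.5 else "adult"  # placeholder rule


# Step 2: metadata crafted offline around the offered content; as in the
# scope of the research, it is limited here to age-based relevance tags.
CATALOGUE = [
    {"title": "Cartoon Classics", "age_relevance": {"child"}},
    {"title": "Late Night News", "age_relevance": {"adult"}},
    {"title": "Nature Documentary", "age_relevance": {"child", "adult"}},
]


# Step 3: apply the inferred trait at query time, pruning candidates
# before ranking so that NLU/search considers fewer possibilities.
def search(utterance: Utterance) -> list:
    age_group = infer_age_group(utterance.audio)
    tokens = utterance.transcript.lower().split()
    return [
        item
        for item in CATALOGUE
        if age_group in item["age_relevance"]
        and any(tok in item["title"].lower() for tok in tokens)
    ]


if __name__ == "__main__":
    # High-energy audio stands in for a child's voice in this toy example.
    result = search(Utterance(audio=[0.9, 0.8], transcript="play cartoon"))
    print(result)  # [{'title': 'Cartoon Classics', 'age_relevance': {'child'}}]
```

The key point the sketch illustrates is that the inferred age group acts as a filter applied before ranking: it both personalises the results and shrinks the candidate set that NLU and backend search must consider, which is the source of the computational savings noted in the abstract.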

In what follows, we discuss the steps outlined above, including an overview of the dataset used during development, details of the underlying learning algorithms, and various measures of performance. We complement this work with results from case studies and explain the operational benefits of our approach. In closing, we summarise our work and highlight areas of future research.
