We describe an integrated platform that combines a search and recommendation system for digital media with a novel conversation interface, enabling users to perform a variety of tasks on digital content, and to retrieve information from its meta-content, through natural-language conversation.

The platform is built on a knowledge graph of millions of tagged entities, together with structured relationships and popularity values that are crawled and ingested from multiple sources and evolve continuously over time.

The voice system uses a unique extensible architecture that combines natural language processing (NLP) techniques with named entity recognition over the knowledge graph to determine both the intent and the entities in user queries.

Relationships between entities in the knowledge graph aid in identifying the right entities, thereby creating meaningful, contextual, and temporally relevant interpretations.


In the last couple of years, there has been rapid growth in the use of second-screen devices such as mobile phones and tablets to drive digital entertainment systems.

At the same time, the speech recognition community has delivered robust technology for converting speech audio to text.

The fusion of these two trends is making voice interfaces and natural language a viable and practical solution for several tasks in the living room, many of which would otherwise involve entering text on cumbersome, text-input-constrained devices such as TV remotes and mobile phones.

It is only a matter of time before voice interfaces become the default interface for several devices in the future digital home.

In this paper, we describe a natural-language interface for performing a variety of end-user tasks pertaining to digital media content, in particular video and music.

Examples of such tasks, or intents, include retrieving search results and getting personalized recommendations, issuing common TV-control commands, getting more information from the media meta-content and answering trivia questions about it, and checking the availability of content in a channel lineup or on-demand catalog.

Along with the rich set of query intents, the conversation system supports queries that involve entities spanning a comprehensive knowledge graph.

Examples of such queries are “Show me some old Bond movies without Roger Moore” or “Who played Tony Montana in Scarface?”

The problem of building conversational question answering (QA) systems has been an active topic in industry and academia for several years (1, 2).

A QA system aims to provide precise textual answers to users' specific natural-language queries, rather than the set of matching documents that a typical search engine returns.

Of late, many of these systems are ontology-based: the data in the knowledge base has a structured organization defined by an ontology (3, 4).

Users pose questions in natural language, and the system returns accurate answers directly after question analysis, information retrieval, and answer extraction.

An ontology knowledge base provides a convenient way to incorporate semantic understanding of user queries, but the natural-language query needs to be mapped to a query statement over the ontology.

Examples of such ontology-based QA systems are AquaLog (5), QASYO (6), and, more recently, Siri by Apple.

AquaLog is a semiautomatic QA system that combines natural-language processing, ontologies, logic, and information retrieval technologies. QASYO is a QA system built over Yago that integrates the ontology of WordNet with the facts derived from Wikipedia.

In all these systems, the input natural-language query is first translated to an intermediate representation compatible with the ontology and this intermediate query is then used to find the final results.

In the current work, we use an ontology based on the Rovi Knowledge Graph (RKG), which incorporates factual information about all ‘known’ or ‘named’ things.

These include movies and TV shows, music albums and songs, known people (such as actors, musicians, and celebrities), music bands, known companies and business establishments, places, and sports teams, tournaments, and players.

All the facts pertaining to these entities are crawled from multiple sites such as Wikipedia, Freebase, and many others, and are correlated to create a unique smart tag (with a unique identifier) that represents each entity in the RKG.

Two entities can stand in a relation, and the RKG contains multiple kinds of structured relationships, such as movie-director and team-player.

The relations between entities are likewise created by aggregating factual knowledge from several structured data sources, and are further augmented with unstructured relationships obtained through data-mining techniques such as analysis of the hyperlink structure within Wikipedia.
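
As a rough illustration of the kind of structure described above, the following Python sketch stores entities under unique identifiers and records typed relations between them. It is not the RKG's actual schema: the class names, the identifier format, and the relation label are assumptions made for this example, and the two sample entities are included only to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str    # hypothetical unique identifier (the "smart tag")
    name: str
    entity_type: str  # e.g. "movie" or "person"; the type labels are assumed


class KnowledgeGraph:
    """Toy in-memory graph: entities keyed by id, plus typed relations."""

    def __init__(self):
        self.entities = {}
        # (subject id, relation name) -> set of object ids
        self.relations = defaultdict(set)

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity

    def add_relation(self, subject_id, relation, object_id):
        self.relations[(subject_id, relation)].add(object_id)

    def related(self, subject_id, relation):
        """Return the sorted names of entities linked by the given relation."""
        object_ids = self.relations.get((subject_id, relation), set())
        return sorted(self.entities[o].name for o in object_ids)


kg = KnowledgeGraph()
kg.add_entity(Entity("e1", "Scarface", "movie"))
kg.add_entity(Entity("e2", "Brian De Palma", "person"))
kg.add_relation("e1", "movie-director", "e2")
directors = kg.related("e1", "movie-director")
```

Because facts refer only to identifiers, the display name of an entity can change or be localized without touching the stored relations.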

The facts of the RKG, represented via entity identifiers, are hence kept separate from a language-specific lexical ontology such as WordNet.

The lexical ontology is mainly used to understand the intent of the query through natural-language parsing and pattern-matching techniques, whereas named entity extraction is based on the RKG.

The intent and entities are then combined to retrieve the final answer to the user query.

Though intent and entity recognition are closely dependent on each other, the conceptual separation of these two components gives a more modular organization and also permits flexible extension of the conversation system.

For example, consider the queries “How long is Iron Man?” and “How long is San Mateo?” In the former query, the intent is “running time of movie”, whereas in the latter query, the intent is “distance”.

This intent ambiguity can be resolved with the help of the named entities. Conversely, consider the query “play lights”. While “lights” can refer to several concepts, the entity ambiguity is resolved by the intent “play”, which leads the system to prefer songs or movies.
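
The two directions of disambiguation can be sketched with a pair of lookup tables. This is a deliberately simplified illustration, not the system's actual method: the trigger phrase, the intent names, the entity-type labels, and the table contents are all assumptions introduced for the example.

```python
# Toy catalog: surface form -> candidate entity types (illustrative only).
CANDIDATE_TYPES = {
    "iron man": ["movie"],
    "san mateo": ["place"],
    "lights": ["song", "movie", "common_noun"],
}

# For an ambiguous trigger phrase, the intent depends on the entity type.
INTENT_BY_ENTITY_TYPE = {
    "how long is": {"movie": "running_time", "place": "distance"},
}

# For an unambiguous intent, only some entity types make sense.
TYPES_ACCEPTED_BY_INTENT = {
    "play": ["song", "movie"],
}


def resolve_intent(phrase, mention):
    """Use the entity's type to choose among the intents a phrase can express."""
    options = INTENT_BY_ENTITY_TYPE[phrase]
    for entity_type in CANDIDATE_TYPES[mention.lower()]:
        if entity_type in options:
            return options[entity_type]
    return None


def resolve_entity_types(intent, mention):
    """Use the intent to keep only the plausible readings of a mention."""
    accepted = TYPES_ACCEPTED_BY_INTENT[intent]
    return [t for t in CANDIDATE_TYPES[mention.lower()] if t in accepted]


movie_intent = resolve_intent("how long is", "Iron Man")
place_intent = resolve_intent("how long is", "San Mateo")
playable = resolve_entity_types("play", "lights")
```

In this sketch the entity type selects the intent for “how long is”, while the intent “play” filters the candidate readings of “lights” down to songs and movies.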

Another unique aspect of the proposed voice system is the seamless handling of session continuity, where entities as well as intent are spread across multiple queries.

For example, the user may start with a broad query such as “how about some good action movies” and then narrow the search with a follow-up query such as “any of these by Tom Cruise?” and so on.

The system is able to intelligently tie the context of the first query into the interpretation of the second query.
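
One simple way to picture this carry-over of context is to merge the constraints of a follow-up query into those of the previous one. The sketch below is an assumption-laden illustration rather than the system's actual mechanism: the `QueryFrame` structure, its field names, and the `continues_session` flag (assumed to be set by an upstream parser that detects references such as “these”) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryFrame:
    intent: Optional[str] = None          # None if the query states no new intent
    filters: dict = field(default_factory=dict)
    continues_session: bool = False       # assumed to be set by upstream parsing


def merge_with_context(previous, current):
    """Carry intent and filters over from the previous query when appropriate."""
    if previous is None or not current.continues_session:
        return current
    # Filters stated in the newer query override older ones on the same key.
    merged_filters = {**previous.filters, **current.filters}
    return QueryFrame(
        intent=current.intent or previous.intent,
        filters=merged_filters,
    )


# "how about some good action movies"
first = QueryFrame(intent="search", filters={"genre": "action", "type": "movie"})
# "any of these by Tom Cruise?"
follow_up = QueryFrame(filters={"actor": "Tom Cruise"}, continues_session=True)
combined = merge_with_context(first, follow_up)
```

Here the follow-up inherits the search intent and the action-movie constraints, and adds the actor constraint on top of them; a query that does not continue the session is returned unchanged.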