The production of media content across several languages and platforms is both time-consuming and complex. Microphones, sound booths and arrays of editing software are typically required to generate translated audio tracks.

This paper presents a one-stop solution that simplifies this workflow. With a particular focus on the translation of audio tracks contained in video files, it describes an innovative workflow that leverages commercialised Text-To-Speech voice synthesis and a prototypical system running in production.

This workflow bypasses the need for microphones, video or audio editing software and allows a single editor to generate multiple mixed-gender voice-overs.

A lightweight markup language is presented which helps editors to fine-tune synthetic voices. The balance between automation and editorial and linguistic quality is also examined.

The largely positive feedback received from journalists and audiences indicates that the prototype and its underlying language technology have the potential to become part of the multilingual video production process.


The plethora of digital platforms makes information available in a great number of languages, and audiences increasingly expect to be able to consume media in their own languages. International broadcasters and streaming services, in turn, increasingly reach out to their global audiences in multiple languages.

In global newsrooms, for example, multilingual journalists not only produce original news reports, but they also re-version existing video content into the language of their audiences.

In order to meet the growing demands from our audiences, innovative production workflows and new tools must be developed to assist language editors in the translation of video content.

Current translation workflows are complex, rely on a variety of resources and can be expensive and time-consuming. This paper presents a simplified process that introduces Text-To-Speech (TTS) voice synthesis and computer-assisted translation into the re-versioning of video content.

Most of us have come across one or both of these technologies in online machine translation, ‘digital personal assistants’, translation apps for smartphones, language learning tools and many other products. Their quality is now advanced enough to be trialled within the production process and gauged by producers and audiences.

This paper begins with a brief description of the typical re-versioning workflow to identify the steps that can be rationalised.

Then, with a focus on voice synthesis, this paper examines a prototypical system, developed for a pilot online service, which successfully integrated language technology. Particular attention is paid to certain linguistic aspects of voice synthesis and this paper examines how this prototype deals with differences in the quality of phonetic and prosodic voice performance.

A lightweight markup language is presented which has been designed and implemented for editors to fine-tune voices.

The paper also describes how the prototype handles the generation and balancing of audio tracks in this workflow, which no longer relies on any video editing software, studios or recording equipment.

The paper finally presents the user feedback received during the course of this pilot.

The prototype was tested by two user groups in Japanese and Russian: language journalists (the test users) who used the prototypical voice-over tool and the audiences who watched the videos online. They provided feedback on the quality and intelligibility of the voices and on how they were perceived in voice-over tracks of news videos. The paper closes with concluding remarks.


This paper exemplifies the new technology with the re-versioning of news video packages of the kind used during the pilot.

These packages are illustrated news reports (usually between 2 and 3 minutes long), each of which is a composition of interviews, vox pops, edited footage and a journalist’s introduction and links.

The audio contained in news packages is usually a mix of the natural sound track (e.g. crowd noise, soundbites) and a pre-recorded voice-over track which narrates the story. Depending on the video content, the voice-over track can contain several different voices.
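
To make the idea of a two-track mix concrete, the following is a minimal sketch of how a natural sound track and a voice-over track could be combined into a single audio track. It is an illustration only and does not describe how the prototype balances its tracks. The function name `mix_tracks`, the representation of audio as lists of float samples in the range -1.0 to 1.0, and the default gain values are all assumptions made for this example.

```python
from typing import List


def mix_tracks(
    natural: List[float],
    voice_over: List[float],
    natural_gain: float = 0.3,  # assumed value: natural sound kept in the background
    voice_gain: float = 1.0,    # assumed value: voice-over at full level
) -> List[float]:
    """Mix two mono tracks of float samples into one track.

    The shorter track is treated as silence once it ends, and the
    summed samples are clipped to the range [-1.0, 1.0].
    """
    length = max(len(natural), len(voice_over))
    mixed = []
    for i in range(length):
        n = natural[i] if i < len(natural) else 0.0
        v = voice_over[i] if i < len(voice_over) else 0.0
        sample = natural_gain * n + voice_gain * v
        # Hard clipping keeps the result within the valid sample range.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


if __name__ == "__main__":
    crowd_noise = [0.5, -0.5, 0.5, -0.5]  # hypothetical natural sound samples
    narration = [0.2, 0.4]                # hypothetical voice-over samples
    print(mix_tracks(crowd_noise, narration))
```

Lowering `natural_gain` relative to `voice_gain` keeps crowd noise and soundbites audible without masking the narration. A real system would more likely vary these gains over time rather than use fixed values.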