ABSTRACT

Latency remains one of the most significant factors (1) in the audience’s perception of quality in live-originated TV captions for the Deaf and Hard of Hearing.

Once all prepared script material has been shared between the programme production team and the captioners, pre-recorded video content remains a significant challenge – particularly ‘packages’ for transmission as part of a news broadcast. These video clips are usually published just prior to or even during their intended programme – providing little opportunity for thorough preparation.

This paper presents an automated solution based on cutting-edge developments in Automatic Speech Recognition research, the benefits of context-tuned models, and the practical application of Machine Learning across large corpora of data – namely many hours of accurately captioned broadcast news programmes.

The challenges of facilitating collaboration between academic partners, broadcasters and technology suppliers are explored, as are the technical approaches used to create the recognition and punctuation models, the testing and refinement required to transform raw automated transcription into broadcast captions, and methodologies for introducing the technology into a live production environment.

INTRODUCTION

Over the last 30 years the volumes of ‘SDH’ captioning (Subtitles for the Deaf and Hard of Hearing), both live and pre-recorded, have increased considerably across the globe. As coverage approaches 100% in certain markets, the attention of the regulatory bodies has shifted from volume to quality (2). A closed caption is fundamentally a short section of timed text, and three simple measures of its quality are:

its fidelity to the original spoken word
the textual accuracy of the transcript
the timeliness with which it is presented

For a pre-recorded programme it is possible to present fully accurate, perfectly timed verbatim captions; for the majority of productions there is sufficient time between the completion of editing and publication for the captioner to create and review the caption file to the required standard.

Live programming presents a considerable challenge; the caption data must be streamed in real time from the captioner to the point of insertion, and synchronisation methods such as timecode cannot be relied upon.

Whilst a sizeable proportion of a live programme may be spontaneous or ad-libbed, possibly 100% for sporting events, much live programming relies on a running order, sections of pre-scripted material, and prepared and pre-edited video ‘packages’.

Much effort has been made by live programme makers to ensure that teams producing captions have access to running orders and scripts in advance of broadcast – most of the major broadcasters in the UK, France and Australia provide access to such data for their captioners.

This data can be used directly (with suitable correction to phonetic spelling and grammar) or to prepare the captioning platform for any names and places likely to be mentioned on air. It is less common, however, for captioners to be given access to video content before broadcast. Where this access is granted, the dynamic nature of live production, especially in news, can mean that there is little time between completion of the clip and its inclusion within the broadcast.

Transcribing and preparing captions for this pre-recorded content within the time available is therefore a considerable challenge. Techniques based on live origination can be used to accelerate the production process (such as respeaking – a speaker-dependent ASR technique used to transcribe the spoken word into captions via dictation), but these take time and production effort, which may be needed elsewhere.

Ideally it should be possible to utilise an Automatic Speech Recognition (ASR) engine to generate the transcripts, but such engines typically produce transcripts that require more editing time to reach broadcast quality than manual origination would have taken.

Previous experimentation as part of the EU-Bridge Speech Technology consortium indicated that ASR engines trained on very specific domains, such as Broadcast Weather, could achieve very high levels of accuracy – typically over 90% (3). It was recognised that the broader News domain would reduce the overall accuracy of such an engine, but that it would prove usable if a target accuracy of at least 95% was achieved, or 90% accuracy with fewer than 10% of errors being omissions, for the clips processed, with system confidence scores being used to pre-filter the poor quality transcripts.

Scores lower than these would require so much manual review and editing that any time saved by using ASR would be lost; for this project scores were calculated using the approaches described in the section below.
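
The acceptance thresholds above can be illustrated with a short sketch. It is not the project's scoring implementation: the `ClipScore` fields, the definition of accuracy as one minus the word error rate, the reading of "fewer than 10% of errors being omissions" as a share of total errors, and the `min_confidence` value of 0.8 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipScore:
    """Hypothetical per-clip result; field names are illustrative only."""
    words: int         # words in the reference transcript
    errors: int        # all word errors, including omissions
    omissions: int     # errors that are omitted words
    confidence: float  # engine confidence for the clip, assumed in [0, 1]


def error_rate(score: ClipScore) -> float:
    """Word error rate; accuracy is assumed to be 1 minus this value."""
    if score.words == 0:
        return 1.0  # treat an empty reference as unusable
    return score.errors / score.words


def meets_target(score: ClipScore) -> bool:
    """Apply the two acceptance rules described in the text."""
    rate = error_rate(score)
    if rate <= 0.05:  # at least 95% accuracy
        return True
    if rate > 0.10:   # below 90% accuracy
        return False
    # Between 90% and 95%: accept only if omissions are a small share of errors.
    omission_share = score.omissions / score.errors
    return omission_share < 0.10


def prefilter(scores: List[ClipScore], min_confidence: float = 0.8) -> List[ClipScore]:
    """Drop clips whose confidence is too low (threshold is an assumption)."""
    return [s for s in scores if s.confidence >= min_confidence]
```

In this sketch a clip with 200 reference words and 16 errors (92% accuracy) is accepted when only one of those errors is an omission, and rejected when four of them are.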

This paper illustrates how such an automated ASR engine was devised and created with research partners, showing the benefits of context-tuned models and the practical application of Machine Learning across large corpora of data – namely many hours of accurately captioned broadcast news programmes. The challenges of introducing such a system into the live production environment are discussed, as well as the testing required to ensure that each iteration of the engine reaches the target accuracy levels. The paper concludes with a summary of the success so far and some considerations for future applications.