Broadcasters and other content creators are beginning to realise the transformative potential of AI for subtitling, captioning and transcription – especially in the workload-intensive era of OTT and VOD.
From the earliest days of Artificial Intelligence (AI) and Machine Learning (ML) entering the debate around broadcast workflows, there has been a recognition of its potential to enhance and automate the provision of access services. As defined by UK communications regulator Ofcom, access services constitute “additional facilities supplied by broadcasters that are designed to allow hearing and visually impaired consumers to gain access to TV content”.
In 2020, this translates to increased use of AI and ML to simplify the creation of subtitles and captions, as well as transcriptions of interviews and other content for internal use. And that’s not all – as Anupama Anantharaman, vice-president, product management at Interra Systems, remarks: “Today we see broadcasters using ML/AI-based platforms to automatically generate missing captions, check on caption and audio alignments, ensure the accuracy of captions, and maintain correct punctuations in captions.”
But while there is every expectation that these technologies will allow more content to be covered by access services, no one expects them to entirely remove the need for a manual human review stage any time soon. At present, accuracy for automatic speech recognition tends to be in the 85-90% region, while noisy environments and unrecognised accents or inflections can result in errors. Hence, a combination of AI/ML and manual review might well “always be the best approach,” suggests Anantharaman.
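To make the accuracy figure above concrete, automatic speech recognition quality is commonly scored as word error rate (WER): the number of word substitutions, deletions and insertions divided by the length of the reference transcript. A minimal, illustrative calculation (not tied to any vendor's tooling) might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein (edit) distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a nine-word caption gives a WER of about 11%,
# i.e. accuracy of roughly 89% - within the 85-90% range cited above.
wer = word_error_rate("the quick brown fox jumps over the lazy dog",
                      "the quick brown fox jumps over the crazy dog")
```

Noisy audio and unfamiliar accents push that error rate up, which is why a human review pass remains part of the workflow.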
Until such time as AI/ML can deliver close to 100% accuracy, it's also the most robust way to meet the legal requirements surrounding access services that exist in many countries. It's important to note that on-demand services are not currently covered by the same stipulations, although the UK is among those moving in that direction – with the Digital Economy Act 2017 paving the way for the Government to introduce measures improving the accessibility of on-demand services. In time, AI-based solutions will surely help broadcasters deal with the increased workload that these changes will generate.
Given AI’s obvious potential here, it’s not surprising that there has been a steady roll-out of software solutions over the past few years. Arguably one of the most innovative is the AI-driven transcription, subtitling and video editing platform Simon Says.
From its starting point as a cloud-based transcription website and set of applications, Simon Says has evolved to include a range of solutions geared towards transcription, translation and captioning. The platform integrates with all the major video editing applications – including Adobe Premiere Pro, Avid Media Composer, Apple Final Cut Pro X and DaVinci Resolve – and can fit into a host of post-production workflows. Among recent developments, a V2 update to the offline/non-cloud solution, Simon Says On-Prem, was announced at the start of February.
Shamir Allibhai, founder and CEO of Simon Says, points to four underlying AI technologies – speech recognition, auto punctuation, speaker identification and (where needed) translation – that make the solution applicable to a host of access and transcription services. “In terms of the core benefits, the focus is on the speed and efficiency of the work,” says Allibhai, noting the potential of AI to accelerate the completion of repetitive tasks that can often “slow down workflows”.
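Conceptually, those four technologies chain together into a single pass over the audio. The sketch below shows that flow; every function and field name here is hypothetical, standing in for real models, and is not Simon Says' actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    """Accumulates the output of each AI stage (illustrative structure only)."""
    text: str = ""
    speakers: list = field(default_factory=list)
    language: str = "en"

def recognise_speech(audio: bytes) -> Transcript:
    # Placeholder: a real system would run an ASR model over the audio here.
    return Transcript(text="hello and welcome to the programme")

def punctuate(t: Transcript) -> Transcript:
    # Placeholder: auto-punctuation restores capitalisation and sentence marks.
    t.text = t.text.capitalize() + "."
    return t

def identify_speakers(t: Transcript) -> Transcript:
    # Placeholder: speaker diarisation would attribute spans of text to voices.
    t.speakers = [("Speaker 1", t.text)]
    return t

def translate(t: Transcript, target: str) -> Transcript:
    # Placeholder: machine translation runs only "where needed".
    t.language = target
    return t

def caption_pipeline(audio: bytes, target_language=None) -> Transcript:
    """Speech recognition -> punctuation -> speaker ID -> optional translation."""
    t = identify_speakers(punctuate(recognise_speech(audio)))
    return translate(t, target_language) if target_language else t
```

The point of the chain is the one Allibhai makes: each stage automates a repetitive task that would otherwise be done by hand, so the human effort shifts to reviewing the combined output rather than producing it.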
The company has especially high hopes for AI becoming “a standard part of the workflow” for internal tasks like the automatic transcription of interviews and daily rushes. In terms of activities like translation and subtitling where accuracy needs to be as close to 100% as possible, “I think that it will be AI doing the first pass, after which humans are involved in reviewing and making changes. But in terms of what might be described as the groundwork, AI will absolutely be very important.”
Hybrid or automated
A combination of ML, auto time stamping and speech recognition technology underpins Interra Systems’ BATON Captions solution, which is designed to automate the entire captions workflow – from caption generation to QC, auto corrections, review and editing. According to Anantharaman, it is able to speed up caption creation and verification processes for both live and VOD content. Another strength is that “when content is delivered in multiple video quality levels within OTT video streams, the captions maintain a high quality.”
The end-result, she says, is that broadcasters can make more content "accessible to viewers with hearing impairments" more easily and affordably than ever.
With regard to overall adoption of automated access services, Anantharaman expects that current “manual or hybrid” approaches will persist for some time. After all, implementing a “completely new way of captioning and subtitling” can be complex as well as demanding of time and resources.
“Still, we do see this as a temporary issue,” says Anantharaman. “The industry is moving towards automating media functions to improve operational efficiency and optimise resources. OTT streaming has increased content volumes and captioning requirements for different languages. So in the next few years we predict that manual operations will be replaced by hybrid or completely automated, AI-based approaches.”
Although full automation for critical services is still a long way off, accuracy is now generally good enough to support slightly less exacting tasks such as transcription of content for internal use by producers and journalists. Indeed, there has recently been an abundance of new solutions in this area – including iconik’s 1-Click Video Transcription.
Parham Azimi, CEO of iconik, remarks: “Transcriptions can easily be added in iconik using AI to convert voice to text. The text is timestamped, attributed to speakers, and then becomes metadata for that video. This means you can locate a video even if you can only recall part of a spoken phrase.” Acknowledging that “speakers are not always clear” and this can cause challenges to AI, the solution also offers a “simple and intuitive” process for editing text or speaker attributions.
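The search behaviour Azimi describes – locating a video from a fragment of a spoken phrase – can be sketched as a lookup over timestamped, speaker-attributed segments. The structure and names below are illustrative only, not iconik's actual data model:

```python
# Each segment pairs a start time (in seconds) and a speaker with the
# recognised text, roughly as a speech-to-text service might return it.
segments = [
    {"start": 12.4, "speaker": "Interviewer",
     "text": "what drew you to documentary work"},
    {"start": 47.9, "speaker": "Guest",
     "text": "I always wanted to tell stories that matter"},
    {"start": 83.2, "speaker": "Guest",
     "text": "the archive footage changed everything"},
]

def find_phrase(segments, fragment):
    """Return (start_time, speaker) for every segment containing the fragment."""
    fragment = fragment.lower()
    return [(s["start"], s["speaker"]) for s in segments
            if fragment in s["text"].lower()]

# Recalling only part of a phrase is enough to jump to the right timecode.
hits = find_phrase(segments, "archive footage")
```

Because the text is metadata attached to the video, a partial match is enough to surface both the asset and the exact timecode where the phrase was spoken.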
Jonathan Morgan, CEO of Object Matrix, believes that whilst AI is “extremely powerful”, it is “not yet perfect and not ready for widespread adoption” for live production tasks, in particular. But the company – whose MatrixStore object storage solution integrates with AI solutions to enable AI metadata tagging – does see immediate potential for archive-related applications. “AI-generated speech-to-text can be of particular use on archived media since – [whether] standalone or combined with other AI-generated video analysis – they open up a wide range of search capabilities.”
Dejero, meanwhile, is exploring the use of AI for live content, especially for news organisations that are creating hybrid digital-traditional newsrooms and seeking "story production efficiencies".
A proof of concept (POC) that leverages AI and the cloud is ongoing, with Yvonne Monterroso, Dejero director of product management, indicating that increased content production and "resource constraints" will surely lead to more widespread use of AI. "Whilst there is a debate over whether it's currently good enough for closed captioning," she says, "it can certainly be used, for example, in generating transcripts of a press conference very quickly. Meanwhile, the quality of speech-to-text AI has increased dramatically, and will continue to improve."
Outlining its pathway to improving access services for the streaming era in 2017, Ofcom said there was now an opportunity to ensure that content providers "consider not just the quantity, but also the quality and usability of their access services." With their accuracy and capabilities improving all the time, there is little doubt that AI and ML – in conjunction with human review processes – will be integral to delivering more extensive, high-quality services.