The use of AI to automate the production of shot lists and use the expanded metadata to create an AI-assisted first cut is being explored by an IBC Accelerator project led by the Associated Press.
Every year Associated Press (AP) produces around 15,000 hours of live video news output from which a further 3,000 hours of clips are created for distribution. To do so, thousands of hours are spent manually shot-listing and manually transcribing that content. When there aren’t the resources to shot-list and transcribe everything fully, good content can get buried in the archive.
Sandy MacIntyre, AP’s VP News, describes the aim of the Accelerator he is leading as trying to better signpost good pictures and good sound-bites while “removing the ‘grunt work’ and liberating people to do more creative things with the editing and production time.”
Outside the current and unprecedented coronavirus pandemic, MacIntyre observes that for agencies such as AP, Reuters or AFP, politics is usually the biggest news genre, possibly accounting for as much as 30% of annual news output. Making political news gathering more efficient would therefore be scalable and transferable worldwide.
Accelerator title AI-automated video shot listing
Champions AP (Associated Press, project lead), Al Jazeera Media Networks, BBC
Participants Vidrovr, Metaliquid
The project is supported by the BBC and Al Jazeera and complements another Accelerator led by Al Jazeera that is looking at practical uses of AI and machine learning (ML) for compliance monitoring.
AP has made an archive of US political content available to the Accelerator participants, including footage of numerous candidate and presidential debates, rallies, press conferences and campaign trail events. A wide-ranging catalogue of attributes that the system needs to learn has been compiled, covering the faces of the main players as well as scene elements such as steps, stage and podium.
The aim is to automate the shot listing process and then construct foundational edits defined by “recipes”. For example: President Trump walking up to a podium; cut-away of supporters; cut-away of press cameras; soundbite; cut-away to crowd reaction; Trump exits stage; rope-line of glad-handing. The team will also experiment with sentiment analysis where appropriate: for example, to differentiate between supporters and protestors by analysing placards.
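One way to picture such a “recipe” is as an ordered list of shot types matched against AI-generated shot labels. This is a minimal illustrative sketch, not AP’s actual schema; all label and function names are assumptions.

```python
# Hypothetical edit "recipe": an ordered list of shot types the system
# tries to satisfy, in order, from a labelled shot list. Labels are
# illustrative, not a real production schema.
RALLY_RECIPE = [
    "principal_walks_to_podium",
    "cutaway_supporters",
    "cutaway_press_cameras",
    "soundbite",
    "cutaway_crowd_reaction",
    "principal_exits_stage",
    "rope_line",
]

def build_rough_cut(shots, recipe):
    """Pick the first shot matching each recipe step, preserving timeline order.

    `shots` is a list of dicts like {"label": ..., "start": s, "end": s},
    assumed sorted by start time. Recipe steps with no matching shot are
    simply skipped.
    """
    cut, cursor = [], 0.0
    for step in recipe:
        for shot in shots:
            if shot["label"] == step and shot["start"] >= cursor:
                cut.append(shot)
                cursor = shot["end"]  # only look forward from the last pick
                break
    return cut
```

A foundational edit built this way would still be reviewed by an editor; the point is to remove the assembly grunt work, not the editorial judgement.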
Current AI capabilities
Having monitored the emerging AI scene for some time, MacIntyre perceives that speech-to-text transcription is probably the best developed AI capability at present followed by facial recognition then object and voice recognition. Sentiment analysis is lagging.
However, improving transcription capabilities further could yield important incremental benefits. For events such as the Democratic Primary debates which, early on, involved nine or ten candidates, achieving frame-accurate identification of speakers would be invaluable.
Improved transcription would also make it easier to search for and publish soundbites containing keywords, and would feed a much more useful Edit Decision List (EDL) tool.
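The keyword-to-EDL step can be sketched in a few lines. This assumes a simple time-coded transcript format (segment start, end, speaker, text); the real transcript schema and any padding values would differ in practice.

```python
# Hedged sketch: find EDL-style (start, end) ranges for transcript
# segments mentioning a keyword. The transcript format here is an
# assumption, not a documented AP or vendor schema.

def find_soundbites(transcript, keyword, pad=1.0):
    """Return (start, end) ranges, in seconds, for segments containing `keyword`.

    `transcript` is a list of {"start": s, "end": s, "speaker": ..., "text": ...};
    `pad` widens each range slightly to give cleaner edit points.
    """
    hits = []
    for seg in transcript:
        if keyword.lower() in seg["text"].lower():
            hits.append((max(0.0, seg["start"] - pad), seg["end"] + pad))
    return hits
```

With frame-accurate speaker identification, the same lookup could be filtered by the `speaker` field to pull one candidate’s soundbites out of a ten-person debate.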
“The ambition is to get to a working prototype that is able to take raw video and process it in a way that will create close to a final edited product while removing a tonne of time and effort,” says MacIntyre. The content could also be better, because it should be quicker to search all the available footage and identify the most appropriate, relevant and newsworthy material.
Vidrovr is doing the heavy lifting on the training dataset. The New York company’s platform takes all of the data signals that come with a video – text, audio, visual, motion – to provide a detailed understanding of what is happening in each frame. This capability can be used on a live stream to alert a user when something they are interested in appears or to find the optimal clip when searching through an archive.
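The live-stream alerting pattern described above reduces to scanning per-frame metadata for labels on a watchlist. This is an illustrative sketch only; the frame format, field names and callback are assumptions, not Vidrovr’s API.

```python
# Hypothetical sketch of watchlist alerting over per-frame metadata.
# `frames` yields dicts like {"time": t, "labels": {...}} — an assumed
# shape, not a real vendor interface.

def watch_stream(frames, watchlist, alert):
    """Fire `alert(time, matched_labels)` whenever a watched label appears."""
    for frame in frames:
        matched = frame["labels"] & watchlist  # set intersection
        if matched:
            alert(frame["time"], matched)
```

The same scan run over an archive rather than a live feed is effectively the “find the optimal clip” search the article describes.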
Metaliquid is bringing voice recognition and a user-friendly interface to the Accelerator. Founded in Italy in 2016, Metaliquid offers a proprietary deep learning framework which can be delivered in cloud or on-premises. The company’s AI algorithms can extract descriptive time-coded metadata in real-time to identify and recognize thousands of different content attributes.
“What broadcasters need is a flexible and efficient feature-extraction tool, able to analyse a large amount of data and to react in real time. Video content analysis and the subsequent extraction of metadata can be used to improve search in archives and real-time footage, selecting clips of interest automatically and boosting content production,” explains Giulia Morra, Metaliquid’s US Country Manager.
Metaliquid services use REST APIs and output JSON files containing time-coded information on what is happening frame-by-frame. The solution can be integrated with any media asset management, production or workflow software package.
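Consuming that output on the integration side is straightforward. The JSON schema below is invented for illustration, since the article does not document Metaliquid’s actual payload; only the general shape (time-coded detections) comes from the source.

```python
import json

# SAMPLE uses an assumed schema — timecode, detection type, label — purely
# to illustrate parsing time-coded metadata. It is not Metaliquid's
# documented format.
SAMPLE = """
{
  "detections": [
    {"timecode": "00:00:01:00", "type": "face", "label": "Candidate A"},
    {"timecode": "00:00:03:12", "type": "object", "label": "podium"}
  ]
}
"""

def detections_by_type(payload, kind):
    """Return the labels of all detections of a given type."""
    data = json.loads(payload)
    return [d["label"] for d in data["detections"] if d["type"] == kind]
```

Because the output is plain JSON over REST, a MAM or workflow system only needs an HTTP client and a JSON parser to ingest it.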
MacIntyre believes the Accelerator is a great way to discover the current limitations of AI and machine learning, and to better understand what is required to teach an ML system how to improve. “It’ll be interesting to discover what it succeeds with, what it struggles with. For example, we know that when the principal in a video – Trump, Biden, whoever – is inside a big sea of faces it’s going to struggle to pick him up, even if it knows who he is. So we’ll have an opportunity to better understand the impact of resolution, pixelation and so on.”
Noting that every video-first news and media company is currently stretched thin, Joe Ellis, Co-founder & CEO of Vidrovr, says: “The beauty of the accelerator program is that you can get every stakeholder into a virtual room to work collaboratively and constructively to think about where we want to be as an industry in 12-18 months and then chart the roadmap to get there. It’s the perfect opportunity to tackle big hard problems to bring on real transformation.”
- Read more: Cracking the code – AI enters the frame