Artificial intelligence can have a democratising effect on content creation, removing a significant proportion of the cost and effort involved. This concept was discussed in detail by a panel of industry speakers introduced by Nick Lodge, director of Logical Media, at IBC2022.
“Video content is a very powerful source of information, essential for analysis,” said TVCONAL founder Masoumeh Izadi.
Her company has developed an AI- and machine learning-powered platform that rapidly analyses sports footage, with a current focus on cricket.
“Data plays a central role in sports,” she said. “Put together to create game semantics and metadata labels, and that would make your content searchable, customisable, and you can extract value and monetise it.”
TVCONAL has analysed 168 cricket matches over an eight-month period. The end result in each case is a content platform that can be searched for game events such as specific batting or bowling techniques.
Each match “needs to be sliced into what we call units of analysis, which is different in every sport,” said Izadi. “For example, this could be shots or pitches, or delivery or a stroke — depending on the sport… At the heart of this learning model is learning to slice.”
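To make the idea concrete, here is a toy sketch of slicing in Python; the label names, frame numbers and grouping rule are illustrative assumptions, not TVCONAL's actual model:

```python
def slice_into_units(frame_labels):
    """Group consecutive frames with the same label into units of analysis.

    frame_labels: list of (frame_index, label) pairs, e.g. per-frame
    classifier output such as "delivery" or "idle".
    Returns a list of (label, start_frame, end_frame) units.
    """
    units = []
    for frame, label in frame_labels:
        if units and units[-1][0] == label and units[-1][2] == frame - 1:
            # Extend the current unit by one frame.
            units[-1] = (label, units[-1][1], frame)
        else:
            # A new label (or a gap) starts a new unit.
            units.append((label, frame, frame))
    return units

# Hypothetical classifier output: frames 0-2 idle, 3-5 a delivery, 6-7 idle.
labels = [(0, "idle"), (1, "idle"), (2, "idle"),
          (3, "delivery"), (4, "delivery"), (5, "delivery"),
          (6, "idle"), (7, "idle")]
print(slice_into_units(labels))
# → [('idle', 0, 2), ('delivery', 3, 5), ('idle', 6, 7)]
```

In a real system the per-frame labels would come from a learned video model; the grouping step above only shows how frame-level predictions become searchable units.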
The platform uses “machine learning and computer vision” algorithms to recognise these game events based solely on the content in the video itself.
“The solution we are proposing is to use video analytics, which at the moment is very, very advanced, to the point you can understand and discover what is in the content. In sport content, this would mean identifying and locating different kinds of objects, being able to track those objects, detect players, the type of player, track their movements, etc - whether that’s their position or the key points of their body — just from the content of the video.”
AI recognises batting techniques based on the movements of the player, or a six by analysing when the ball crosses the boundary, with upwards of 95% accuracy. TVCONAL has developed the system for top-tier multi-camera cricket productions, basic three-camera shoots and single-camera recordings, where accuracy with a more limited analysis model can reach above 99%.
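A toy illustration of the kind of rule that can sit on top of ball tracking to flag a six; the coordinates, field centre and boundary radius here are hypothetical:

```python
import math

def detect_boundary_crossing(ball_track, centre, boundary_radius):
    """Return the first frame index at which the tracked ball reaches the
    boundary rope, or None if it never does.

    ball_track: list of (x, y) ball positions, one per frame, as produced
    by an object tracker. centre/boundary_radius describe the field.
    """
    for frame, (x, y) in enumerate(ball_track):
        if math.hypot(x - centre[0], y - centre[1]) >= boundary_radius:
            return frame
    return None

# Hypothetical track: ball travels outward from the pitch at (0, 0).
track = [(0, 0), (20, 5), (45, 10), (70, 12), (92, 15)]
print(detect_boundary_crossing(track, centre=(0, 0), boundary_radius=75))
# → 4 (the ball first clears the 75-unit boundary on frame 4)
```

The hard part in practice is the tracking itself; once positions exist as data, event rules like this are cheap to evaluate.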
Cost and time-saving technology
This form of AI-powered video analysis takes much of the cost and effort out of content tagging and categorisation.
“Even in top tier productions this is very time consuming, labour intensive, and prone to human error. Plus for archive content accumulated over the years, it is a nightmare to go through,” said Izadi.
There are numerous uses for this form of AI content analysis, and TVCONAL is currently focused on applications within the sport itself. It is in discussions with six professional cricket teams in Southeast Asia and 10 cricket centres of excellence in Malaysia about the use of its platform as a training tool.
Izadi calls it “a move to democratise sport in the digital age.”
“[AI technology can] give any sport team the privilege the big sport teams have. Saving cost and saving time from productions, empowering the production team in their operation, and unlocking their ability to produce more engaging, more interesting sport content.”
TVCONAL used more than “20,000 samples” to train its cricket machine learning algorithms, and is planning to branch out into other sports in future. “We’re looking into racquet sports, tennis and table tennis,” said Izadi.
Next-generation AI sports commentary
She also demonstrated experimental auto-generated commentary during IBC2022, building on the game event analysis with synthesised sports presenter-style speech to match the on-screen action.
However, the true cutting edge of speech synthesis was demonstrated by Kiyoshi Kurihara from NHK, the Japan Broadcasting Corporation.
NHK currently uses text-to-speech generation to accompany news reports on the NHK 1 TV channel, live sports commentary online and weather forecasts on local radio stations. It provides a professional line read through an “AI anchor”, but the input is typed or automatically generated rather than spoken by a person.
Kiyoshi Kurihara explained the process as breaking down the written words into graphemes, recognising their respective sound, or phoneme, and then converting these phonemes into a waveform. This is the audio clip of generated speech that can be broadcast over the airwaves or synced to a video.
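The grapheme-to-phoneme-to-waveform chain can be sketched as follows; the lookup tables below are invented stand-ins for the learned models a real system like NHK's would use:

```python
import math

# Toy grapheme-to-phoneme table (hypothetical; real systems learn this mapping).
G2P = {"h": "HH", "e": "EH", "l": "L", "o": "OW"}
# Toy phoneme-to-pitch table standing in for a neural waveform model.
PHONEME_HZ = {"HH": 180.0, "EH": 220.0, "L": 200.0, "OW": 240.0}

def text_to_waveform(text, sample_rate=8000, phoneme_ms=50):
    """Sketch of the TTS chain: graphemes -> phonemes -> waveform samples."""
    # Step 1: break the written word into phonemes via the lookup table.
    phonemes = [G2P[ch] for ch in text if ch in G2P]
    # Step 2: render each phoneme as a short sine burst (a crude stand-in
    # for deep-learning waveform synthesis).
    samples = []
    n = int(sample_rate * phoneme_ms / 1000)
    for ph in phonemes:
        hz = PHONEME_HZ[ph]
        samples.extend(math.sin(2 * math.pi * hz * i / sample_rate)
                       for i in range(n))
    return phonemes, samples

phonemes, wave = text_to_waveform("hello")
print(phonemes)   # → ['HH', 'EH', 'L', 'L', 'OW']
print(len(wave))  # → 2000 (5 phonemes x 400 samples each)
```

The sine bursts would sound nothing like speech; the point is only the shape of the pipeline, where each stage's output is the next stage's input.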
The AI model that can produce realistic line readings “requires 20 hours of speech data per person,” according to Kurihara. “It’s very hard work,” he added.
This is particularly true when using a more traditional method. “Training is difficult for two reasons. First, text-to-speech requires high quality speech [recordings], and this requires four people: an anchor, engineer, director and data annotator. Second, regarding quality it is important to produce high quality speech synthesis, because noise will also be re-generated,” he explained.
The “leading edge” component of NHK’s process removes much of this heavy workload. “This manual process can be eliminated,” said Kurihara, through a supervised learning approach.
The birth of an AI Anchor
NHK’s AI Anchor model is created using real speech from radio already broadcast. One difficulty here is that a radio programme may feature a mix of music and speech and, naturally, only the speech can be used to build the text-to-speech profile.
“We have developed a method to automatically retrieve one-sentence clips using state-of-the-art speech processing,” said Kurihara. The broadcast is broken down into small segments, cutting out music and other extraneous sections, which become the units used to train the AI voice synthesiser.
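As a rough sketch of the segmentation idea, assuming a simple per-frame energy measure (NHK's real method uses far more sophisticated speech processing than this energy gate):

```python
def extract_speech_segments(frame_energy, threshold=0.1, min_frames=3):
    """Keep contiguous runs of frames whose energy exceeds a threshold,
    a crude stand-in for speech/music discrimination. Runs shorter than
    min_frames are discarded as noise. Returns (start, end) frame ranges.
    """
    segments, start = [], None
    for i, e in enumerate(frame_energy):
        if e >= threshold and start is None:
            start = i                      # a candidate clip begins
        elif e < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))  # long enough: keep it
            start = None
    if start is not None and len(frame_energy) - start >= min_frames:
        segments.append((start, len(frame_energy)))
    return segments

# Hypothetical per-frame energies: music/silence low, speech bursts high.
energy = [0.02, 0.05, 0.4, 0.5, 0.45, 0.03, 0.3, 0.02, 0.35, 0.4, 0.5, 0.6]
print(extract_speech_segments(energy))
# → [(2, 5), (8, 12)]
```

The single high frame at index 6 is dropped by the minimum-length rule, which is the kind of filtering that keeps short stings and jingles out of the training set.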
“Waveform synthesis uses deep learning and converts phoneme to waveform,” explained Kurihara. And by automating some of the most challenging parts of the process, NHK is able to affordably and efficiently develop virtual radio and TV presenters.
Local radio provides a great example of how this can not just reduce costs, but increase the usefulness of the broadcast. “There are 54 [radio] stations in the local area and this was costly for them to provide local weather,” said Kurihara. NHK automatically generates scripts using weather report information for each local station, and then employs its TTS (text-to-speech) system to create ready-to-broadcast bespoke weather report audio.
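A minimal sketch of turning one weather feed record into a TTS-ready script line; the field names and wording are assumptions for illustration, not NHK's actual templates:

```python
# Hypothetical station records; real input would come from a weather feed.
stations = [
    {"station": "Sapporo", "condition": "snow", "high_c": -2, "low_c": -8},
    {"station": "Osaka", "condition": "clear", "high_c": 11, "low_c": 4},
]

def weather_script(report):
    """Render one station's feed record as a read-to-air script line
    that the text-to-speech stage can voice directly."""
    return (f"In {report['station']}, expect {report['condition']} today, "
            f"with a high of {report['high_c']} and a low of "
            f"{report['low_c']} degrees.")

for report in stations:
    print(weather_script(report))
```

Because the template runs per station, one data feed yields 54 bespoke bulletins with no extra human effort, which is the economic point Kurihara makes above.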
NHK has used similar techniques for mass sporting events too. “In 2018 and 2021 we provided live sports commentary using TTS during the Olympic and Paralympic Games and distributed them over the internet,” said Kurihara. The team used metadata from the official Olympic Data Feed to auto-generate a script for each feed.
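The same pattern can be sketched for live commentary, with invented records standing in for the Olympic Data Feed metadata:

```python
# Hypothetical event records standing in for official data-feed metadata.
events = [
    {"t": "00:12", "athlete": "A. Tanaka", "event": "takes the lead"},
    {"t": "00:47", "athlete": "M. Silva", "event": "sets a new season best"},
]

def commentary_line(ev):
    """Turn one metadata record into a timed line of commentary text,
    ready to be handed to the TTS stage."""
    return f"[{ev['t']}] {ev['athlete']} {ev['event']}!"

for ev in events:
    print(commentary_line(ev))
```

As each record arrives on the feed, its rendered line can be voiced immediately, which is what makes live, internet-distributed commentary feasible without a human commentator per stream.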
This echoes the metadata TVCONAL generates in its cricket footage analysis, demonstrating how AI technologies can often work hand-in-hand in this field.
Kiyoshi Kurihara of NHK and TVCONAL founder Masoumeh Izadi were speaking at an IBC2022 session titled Technical Papers: How AI is Advancing Media Production, introduced by Nick Lodge, director of Logical Media.