Technical paper: This paper introduces principles for simplification of learned models targeting improved transparency in implementing machine learning for video production and distribution applications.
Machine learning techniques for more efficient video compression and video enhancement have been developed thanks to breakthroughs in deep learning. These new techniques, considered an advanced form of Artificial Intelligence (AI), bring previously unforeseen capabilities.
However, they typically come in the form of resource-hungry black boxes: overly complex models with little transparency regarding their inner workings. Their behaviour can therefore be unpredictable, which makes them generally unreliable for large-scale use (e.g. in live broadcast).
The aim of this work is to understand and optimise learned models in video processing applications so that systems incorporating them can be used in a more trustworthy manner. In this context, the presented work introduces principles for simplification of learned models targeting improved transparency in implementing machine learning for video production and distribution applications.
These principles are demonstrated on video compression examples, showing how bitrate savings and reduced complexity can be achieved by simplifying relevant deep learning models.
Machine Learning (ML) has demonstrated superior performance compared to traditional methods when applied to a variety of challenging computer vision and image processing tasks. Methods based on Convolutional Neural Networks (CNNs) have been particularly successful in solving image classification and object detection problems, as well as regression problems including image segmentation, super-resolution and restoration (1).
When applied to visual data, CNNs identify patterns with global receptive fields that serve as powerful image descriptors. The deeper the network, the larger the receptive field, which in turn can lead to the network capturing more complex patterns from the input data. CNNs have set state-of-the-art results in large scale visual recognition challenges (2) and medical image analysis (3). The number of layers in CNN models that have set benchmarks in classification challenges has been continually increasing, with the VGG19 model (4) containing 19 layers and ResNet (5) containing over 100. These deep architectures act as robust feature extractors and can be used as pre-trained models in related problems, where the learned knowledge is transferred to a different ML model to raise accuracy on the task at hand. For visual content enhancement, some applied models, such as automatic image colourisation (6), use a pre-trained VGG19 model to improve the perceptual quality of their outputs, while others, such as image super-resolution (7), base their architecture on previous Deep Neural Network (DNN) approaches.
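The relationship between depth and receptive field noted above can be illustrated with a short calculation. The following is a minimal sketch, not taken from the cited works: it applies the standard recurrence for stacked convolution or pooling layers, in which each layer widens the receptive field by (kernel size - 1) multiplied by the product of the strides of all preceding layers. The function name `receptive_field` and the layer lists are hypothetical, chosen only for illustration, and dilation is assumed to be 1 throughout.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels, along one axis) of a
    stack of layers, each given as a (kernel_size, stride) tuple ordered
    from input to output. Assumes no dilation."""
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent feature positions
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # strides compound for later layers
    return rf


# Hypothetical stacks of 3x3, stride-1 convolutions of increasing depth.
for depth in (1, 2, 4, 8, 16):
    print(depth, receptive_field([(3, 1)] * depth))

# A hypothetical stack with a 2x2, stride-2 pooling layer in the middle.
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))
```

For plain 3x3, stride-1 convolutions the receptive field grows linearly with depth (1 + 2n pixels for n layers), while strided or pooling layers make it grow faster. This is one reason why deeper architectures can capture larger-scale patterns, at the cost of the added complexity discussed in this paper.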
These developments have led to considerable research efforts focused on ways to integrate ML solutions into next generation video coding schemes. Tools based on both CNNs and DNNs with fully connected layers are increasingly being deployed in various newly proposed video compression approaches (8-14).
The increasing demand for video distribution at better qualities and higher resolutions is constantly generating a need for even more efficient video compression. One of the biggest efforts in this field has been the development of the next-generation Versatile Video Coding (VVC) standard (15). VVC is a successor to the current state-of-the-art High Efficiency Video Coding (HEVC) standard (16) and aims to provide bitrate reductions of up to 50% for the same objective and subjective video quality. While investigating how to improve such video compression tools by using ML, it has become evident that the main drawback of applying DNNs is the sheer complexity of their various forms. Moreover, DNN solutions should not be blindly applied to production and distribution applications. The ways in which they manipulate input data need to be properly explained and understood in order to mitigate potential unexpected outcomes. Making DNNs transparent provides an opportunity to build trust in these methods, since we can see what is happening in the network.
To successfully design DNNs for practical applications, we are therefore researching how to address the complexity and transparency issues of DNNs. The approaches presented in this paper utilise specific forms of DNN interpretability which assist in the design of simplified ML video processing tools. As a result, the proposed methods yield low-complexity learned tools that retain the coding performance of non-interpreted ML techniques for video coding. Furthermore, the methods are transparent and allow instant verification of the obtained outputs. With the demonstrated approaches, we have developed and confirmed principles that can serve as guidelines for future ML implementations in video processing.