IBC2022 Tech Papers: An ultra-low bitrate video conferencing system with flexible virtual access patterns


IBC2022: This Technical Paper showcases an ultra-low bitrate video conferencing system.

Abstract

The demand for remote work and online entertainment is growing every year, placing increasing pressure on the bandwidth usage and quality of experience of applications such as video conferencing. Video codecs in traditional video conferencing systems typically utilise a block-based hybrid coding architecture, which often has sub-optimal rate-distortion performance and computational resource consumption in these scenarios. Moreover, at the low bit rates imposed by low-bandwidth networks, traditional codecs can deliver a very poor experience. In this paper, we propose an ultra-low bitrate video conferencing system with flexible virtual access patterns. Conventional video codecs are partially or fully replaced to achieve an ultra-low bitrate while ensuring a smooth communication experience. Furthermore, the three access patterns (face encoding, realistic avatar and virtual avatar) can be driven by either the video or the audio modality and generate videos in different domains, offering a possible future paradigm for video conferencing. The video captured by the camera is not necessarily transmitted, which protects the privacy of the users. Experiments demonstrate the excellent rate-distortion quality and real-time performance of the proposed system.

Introduction

Since the emergence of the Coronavirus pandemic in 2020, industry and academia have seen substantial growth in consumption and accelerated innovation, in areas ranging from remote collaboration to online entertainment. While the surge in demand for video conferencing and live entertainment presents massive opportunities, it also creates a heavy demand for bandwidth. Conventional video systems typically encode captured video with a block-based hybrid encoding scheme, which is versatile and stable. However, for some specific scenarios such as video conferencing or virtual avatars in live streaming, a block-based coding scheme is insufficient to reduce the semantic redundancy. We have the following findings and analyses:

In the video conferencing scenario, the video to be encoded mainly shows talking faces against a fixed background. General video encoding schemes, such as High Efficiency Video Coding (HEVC/H.265), Versatile Video Coding (VVC/H.266), and AV1, are designed for arbitrary videos and focus on recovering pixel-level fidelity. However, we argue that a general codec is sub-optimal in video conferencing scenarios for the following reasons. First, the background behind the faces in video conferencing is often static and remains unchanged, while the audience focuses on the face region during the conference. Second, human faces commonly share similar structures and semantic meanings (e.g., eyes, mouth, and nose), which provides the opportunity to recover face details from fewer semantic cues by learning priors from face data sets. Recently, deep learning methods have shown the capability to generate content from abridged information, which is promising for face video compression. These methods typically use sparse representations such as key points in place of some or all of the video frames, and use deep learning techniques to recover these frames before rendering. In industry, NVIDIA has also released platforms and suites for audio and video communication with Artificial Intelligence (AI), such as NVIDIA Maxine, whose AI video compression solution likewise extracts key points of the face at the sender, transmits them, and generates a reconstruction at the receiver with AI methods.
As the market for the animation industry has grown sharply, Virtual YouTubers (VTubers) have also grown rapidly. Many VTubers have shown their commercial value in the live streaming market and have a large number of fans on numerous social platforms such as YouTube, Niconico and Bilibili. In addition to live streaming, using virtual avatars to access video conferencing is becoming a new trend. Standardisation organisations have also paid attention to such trends. Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) has published the second version of the Multimodal Conversation (MPAI-MMC) standard, in which one use case is Avatar-Based Video conference (ABV). The video to be encoded in virtual video conferencing and live streaming is just a virtual avatar making slight movements in fixed scenes, which is similar to video conferencing except for the faces. Furthermore, the virtual avatar itself is driven by key points extracted from the human actor, which take up less data volume than encoded frames. Therefore, transferring key points and rendering them in real time on the client side may be an attractive solution to reduce bandwidth consumption.
With the rise of new concepts such as the metaverse, a realistic avatar is, besides face encoding and the virtual avatar, another possible replacement for the user's real face in future video conferences. In this scenario, the conference system allows the user to animate any predefined avatar or his/her own face image via his/her real-time facial dynamics, or even via his/her audio alone. To enable these access patterns, we propose an effective method for photo-realistic talking face rendering in a video conferencing system. Our proposed method utilises either the video captured by a camera or the audio signal recorded by a recorder to synthesise virtual avatar or realistic talking face videos. It is worth mentioning that either modality, visual or audio, is acceptable as input. With the proposed pipeline, one can join a conference with only his/her voice and transfer the synthesised videos, which removes the need for real-time camera recording and protects user privacy.
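
To make the bandwidth argument above concrete, the following minimal sketch packs one frame of key points into a binary payload and computes the resulting raw bitrate. It is an illustration only and not the system described in this paper: the number of key points (10), the 16-bit quantisation of normalised coordinates, the frame rate (25 fps) and the helper names pack_keypoints, unpack_keypoints and payload_bitrate_bps are all assumptions chosen for the example.

```python
import struct

COORD_MAX = 65535  # assumed 16-bit quantisation of each coordinate


def pack_keypoints(points):
    """Quantise normalised (x, y) pairs in [0, 1] and pack them as little-endian uint16."""
    values = []
    for x, y in points:
        for v in (x, y):
            clamped = min(max(v, 0.0), 1.0)  # keep coordinates inside [0, 1]
            values.append(round(clamped * COORD_MAX))
    return struct.pack(f"<{len(values)}H", *values)


def unpack_keypoints(payload):
    """Inverse of pack_keypoints: recover approximate normalised (x, y) pairs."""
    count = len(payload) // 2  # two bytes per quantised coordinate
    values = struct.unpack(f"<{count}H", payload)
    return [
        (values[i] / COORD_MAX, values[i + 1] / COORD_MAX)
        for i in range(0, count, 2)
    ]


def payload_bitrate_bps(payload_bytes, fps):
    """Raw bitrate in bits per second when one payload is sent per frame."""
    return payload_bytes * 8 * fps


# Hypothetical frame: 10 key points, sent at an assumed 25 frames per second.
points = [(i / 10, 1 - i / 10) for i in range(10)]
payload = pack_keypoints(points)
restored = unpack_keypoints(payload)
bitrate = payload_bitrate_bps(len(payload), fps=25)
print(len(payload), "bytes per frame,", bitrate, "bit/s before entropy coding")
```

Under these assumed settings each frame costs 40 bytes, or 8000 bit/s before any entropy coding or temporal prediction. A real system would also have to carry a reference image and possibly further motion parameters, so this figure only indicates the order of magnitude of the key-point stream itself.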

To overcome the sub-optimality of traditional encoding schemes in the above scenarios, and to explore new forms of entertainment, this paper proposes an ultra-low bitrate video conferencing system with various virtual access patterns. We combine the latest developments in face video compression, virtual avatar and realistic talking face generation methods with real-time communication (RTC) to provide a practical video conferencing system. The main contributions of this paper are as follows:

This paper combines face video compression, virtual avatar and realistic face rendering with RTC. To the best of our knowledge, this is a novel attempt to explore a new paradigm for video conferencing systems that has not been seen in prior studies.
The proposed video conferencing system provides acceptable results, better than those of conventional video codec schemes, under ultra-low bitrate constraints, meeting the needs of real-time communication in bandwidth-constrained network environments.
Our developed prototype system is ready for practical deployment and will soon be released on https://github.com/sjtu-medialab/virtualConference.

The remainder of this paper is organised as follows. We first discuss related work in the following section. Next, we present the architecture of the whole system and describe each of the modules in detail. Then, we conduct experiments to evaluate the performance of the system in terms of bitrate and latency. Finally, we summarise the entire work and discuss future directions for our video conferencing system.