In June 2018, Fraunhofer HHI together with Studio Babelsberg, ARRI, UFA, and Interlake founded the joint venture Volucap GmbH and opened a commercial volumetric video studio on the film campus of Potsdam Babelsberg. After a testing phase, commercial productions started in November 2018. The core technology for volumetric video production is 3D Human Body Reconstruction (3DHBR), developed by Fraunhofer HHI. This technology captures real persons with our novel volumetric capture system and creates naturally moving dynamic 3D models, which can then be observed from arbitrary viewpoints in a virtual or augmented environment. A large number of test productions and new customer requirements taught us several lessons during the first year of commercial activity. As a result, the processing workflow for volumetric video capture and production has evolved continuously, and new processing modules and workflow modifications have been introduced.
In this paper, some recent developments of the professional volumetric video production workflow are presented. After an overview of the capture system and the production workflow, several workflow enhancements are presented that result from the experience gathered during one year of commercial production.
Capture system overview
The capture system consists of an integrated multi-camera and lighting system for full 360-degree acquisition. A cylindrical studio with a diameter of 6 m and a height of 4 m has been set up. It is equipped with 32 cameras of 20 megapixels each, arranged in 16 stereo pairs. The system relies entirely on a vision-based stereo approach for multi-view 3D reconstruction and does not require separate 3D sensors. A total of 220 ARRI SkyPanels are mounted behind a diffusing fabric, which allows for an arbitrarily lit background and different lighting scenarios. This combination of integrated lighting and background is unique: all other currently existing volumetric video studios, such as the Mixed Reality Studio by Microsoft and the studio by 8i, rely on green screen and directed light from discrete directions.
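To make the rig geometry more concrete, the following Python sketch lays out 16 stereo pairs evenly around a circle matching the 6 m studio diameter (radius 3 m), with each pair looking at the centre. The single horizontal plane, the even angular spacing, the 0.3 m stereo baseline and the function name `stereo_rig_positions` are illustrative assumptions; the actual camera positions, heights and baselines of the studio are not specified here.

```python
import math

def stereo_rig_positions(num_pairs=16, radius_m=3.0, baseline_m=0.3):
    """Hypothetical floor-plan layout of a cylindrical stereo rig.

    Returns one (left_xy, right_xy, view_dir_xy) tuple per stereo pair.
    Pair centres are spaced evenly on a circle of radius radius_m; the two
    cameras of a pair sit baseline_m / 2 on either side of the centre along
    the circle's tangent, and both look towards the studio centre.
    baseline_m is an assumed value, not a property of the real studio.
    """
    pairs = []
    half = baseline_m / 2.0
    for k in range(num_pairs):
        theta = 2.0 * math.pi * k / num_pairs
        cx, cy = radius_m * math.cos(theta), radius_m * math.sin(theta)
        # Unit tangent at theta; for a camera looking inward with z up,
        # this is the camera's right-hand direction.
        tx, ty = -math.sin(theta), math.cos(theta)
        left = (cx - half * tx, cy - half * ty)
        right = (cx + half * tx, cy + half * ty)
        view_dir = (-math.cos(theta), -math.sin(theta))  # towards the centre
        pairs.append((left, right, view_dir))
    return pairs
```

With 16 pairs, adjacent pairs are 360° / 16 = 22.5° apart, and the 32 cameras together deliver 32 × 20 = 640 megapixels per captured time instant.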
To enable professional productions, a complete and automated processing workflow has been developed, which is depicted in the flow diagram in Figure 2. First, colour correction and adaptation are performed for all cameras so that the images of the entire multi-view camera system are consistent. Next, difference keying separates the foreground object from the background, which restricts further processing to the relevant image regions. Because all cameras are arranged in stereo pairs distributed evenly around the cylindrical setup, 3D information can be extracted more easily from each stereo pair along its viewing direction. For stereo matching, the IPSweep algorithm is applied. This iterative stereo approach compares projections of 3D patches from the left to the right image, using point transfer via homographic mapping.
The resulting depth information of all stereo pairs is fused into a common 3D point cloud per frame. Mesh post-processing then transforms the data into a common CGI (computer-generated imagery) format. As the resulting mesh per frame is still too complex, a mesh reduction is performed that takes the capabilities of the target device into account and adapts to sensitive regions of the model, such as the face. Meshes with 60k faces are used for desktop applications, whereas 20k faces are appropriate for mobile devices. After this step, a sequence of meshes and associated texture files is available in which each frame consists of an individual mesh with its own topology. This has drawbacks for temporal stability and for the properties of the associated texture files. Therefore, mesh registration is applied, which yields short sequences of registered meshes that share the same topology. To let users integrate volumetric video assets easily into the final AR or VR application, a novel mesh encoding scheme has been developed.
This scheme encodes the mesh, video, and audio independently using state-of-the-art encoders and multiplexes all tracks into a single MP4 file. On the application side, a plugin is available for Unity and for Unreal Engine that processes the MP4 file, decodes the elementary streams, and renders the volumetric asset in real time. The main advantage is a highly compressed bitstream that can be streamed directly from hard disk or over the network, e.g. using HTTP adaptive streaming. Unity and Unreal Engine are the two most popular real-time render engines. They provide a complete 3D scene development environment and a real-time renderer that supports most available AR and VR headsets and operating systems.
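The colour correction and adaptation step of the workflow can be illustrated with a simple model: each camera's colours are mapped onto those of a reference camera by a per-channel gain and offset, fitted by least squares on corresponding colour samples (for example, patches of a colour chart seen by both cameras). This affine model, the reference-camera setup and the function names are assumptions for illustration; the correction model actually used in the workflow is not stated here.

```python
import numpy as np

def fit_channel_gain_offset(src, ref):
    """Fit ref ~= gain * src + offset independently for each colour channel.

    src, ref: arrays of shape (N, 3) holding corresponding colour samples
    from the camera to correct and from the (assumed) reference camera.
    Returns (gains, offsets), each of shape (3,).
    """
    src = np.asarray(src, dtype=float)
    ref = np.asarray(ref, dtype=float)
    gains, offsets = np.empty(3), np.empty(3)
    for c in range(3):
        # Design matrix [x, 1] for the linear model y = g * x + o.
        A = np.stack([src[:, c], np.ones(len(src))], axis=1)
        (g, o), *_ = np.linalg.lstsq(A, ref[:, c], rcond=None)
        gains[c], offsets[c] = g, o
    return gains, offsets

def apply_gain_offset(image, gains, offsets):
    """Apply the per-channel correction to an (H, W, 3) image in [0, 1]."""
    return np.clip(image * gains + offsets, 0.0, 1.0)
```

A per-channel affine fit is cheap and well conditioned with a handful of chart patches; a full 3×3 colour matrix would be a natural extension if cross-channel effects matter.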
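Difference keying compares each captured frame with a clean background plate recorded without the actor. The sketch below marks a pixel as foreground when its largest per-channel difference exceeds a threshold, and computes the bounding box of the mask so that later stages could be restricted to that region. The threshold value, the function names and the absence of any noise or shadow handling are simplifying assumptions, not properties of the production keyer.

```python
import numpy as np

def difference_key(frame, background, threshold=0.08):
    """Boolean foreground mask from a frame and a clean background plate.

    frame, background: arrays of shape (H, W, 3) with values in [0, 1].
    threshold is an illustrative value.
    """
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff.max(axis=2) > threshold

def foreground_bbox(mask):
    """Return (top, bottom, left, right) of the mask, bottom/right exclusive.

    Returns None if the mask contains no foreground pixel.
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1
```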
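The point transfer via homographic mapping mentioned for IPSweep rests on the homography that a 3D plane induces between two calibrated views. For a plane $n^\top X = d$ in left-camera coordinates and a right camera related by $X_r = R X + t$, every point on the plane satisfies $X_r = (R + t n^\top / d) X$, so $H = K_r (R + t n^\top / d) K_l^{-1}$. The sketch below computes this homography and transfers a pixel from the left to the right image. It shows only this geometric building block; the patch parameterisation, matching score and iteration scheme of IPSweep itself are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def plane_homography(K_l, K_r, R, t, n, d):
    """Homography induced by the plane n^T X = d (X in left-camera coordinates).

    The right camera sees left-camera points as X_r = R @ X + t, so
    H = K_r (R + t n^T / d) K_l^{-1} maps left pixels to right pixels
    for points lying on the plane.
    """
    t = np.asarray(t, dtype=float).reshape(3, 1)
    n = np.asarray(n, dtype=float).reshape(1, 3)
    return K_r @ (R + (t @ n) / d) @ np.linalg.inv(K_l)

def transfer_point(H, uv):
    """Map pixel (u, v) from the left image into the right image."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]  # dehomogenise
```

For example, with identical intrinsics (focal length 800 px, principal point (320, 240)), a right camera shifted 0.1 m along x and a fronto-parallel plane at 2 m, the left pixel (400, 280) transfers to (360, 280), i.e. a disparity of 40 px.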
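Fusing the depth information of all stereo pairs into one point cloud per frame requires back-projecting each depth map into a common world frame. The sketch assumes pinhole intrinsics `K`, world-to-camera extrinsics $X_{cam} = R X_{world} + t$ and depth maps that store camera-space z; it fuses by plain concatenation, whereas a real pipeline may additionally filter outliers and merge overlapping points. All names are hypothetical.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Convert an (H, W) depth map into world-space points of shape (M, 3).

    Assumes X_cam = R @ X_world + t and that depth holds camera-space z.
    Pixels with depth <= 0 are treated as invalid and skipped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    valid = depth > 0
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)  # (3, M)
    cam = (np.linalg.inv(K) @ pix) * depth[valid]  # scale each viewing ray by z
    world = R.T @ (cam - np.asarray(t, dtype=float).reshape(3, 1))
    return world.T

def fuse_point_clouds(depth_maps, cameras):
    """Concatenate the back-projected points of all stereo pairs for one frame.

    cameras: list of (K, R, t) tuples, one per depth map.
    """
    clouds = [backproject_depth(d, K, R, t) for d, (K, R, t) in zip(depth_maps, cameras)]
    return np.concatenate(clouds, axis=0)
```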
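The device-dependent reduction targets (60k faces for desktop, 20k for mobile) can be combined with region-dependent detail. As one hypothetical scheme, the sketch below splits a target face count across labelled mesh regions in proportion to each region's current face count times an importance weight, caps any region at the faces it already has and redistributes the excess. The region labels, weights and capping rule are assumptions; the result is only a per-region budget that a decimation algorithm could consume, not the reduction algorithm used in the workflow.

```python
def allocate_face_budget(region_faces, region_weights, target_faces):
    """Split target_faces across regions proportionally to faces * weight.

    region_faces:   dict region -> number of faces before reduction
    region_weights: dict region -> positive importance weight (hypothetical)
    A region never receives more faces than it already has; budget freed by
    such capped regions is redistributed. Shares are floored to integers.
    """
    budget = {}
    remaining = dict(region_faces)
    left = target_faces
    while remaining:
        total = sum(remaining[r] * region_weights[r] for r in remaining)
        if total <= 0:  # only empty regions are left
            budget.update({r: 0 for r in remaining})
            break
        shares = {r: left * remaining[r] * region_weights[r] / total for r in remaining}
        capped = [r for r in remaining if shares[r] >= remaining[r]]
        if not capped:
            for r, share in shares.items():
                budget[r] = int(share)
            break
        for r in capped:
            budget[r] = remaining.pop(r)  # keep the region at full resolution
            left -= budget[r]
    return budget
```

For example, with the 20k-face mobile target, a face region of 20,000 faces weighted 4 and a body region of 180,000 faces weighted 1 receive 6,153 and 13,846 faces, so the face keeps about 31 % of its faces while the body keeps under 8 %.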
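Multiplexing independently encoded mesh, video and audio tracks essentially means interleaving their samples by presentation time so that a player can read them sequentially from disk or from the network. The sketch below shows only this interleaving idea; it does not write an MP4 file, whose box structure is defined by the ISO base media file format, and the `Sample` type and `interleave` function are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple

class Sample(NamedTuple):
    timestamp: float  # presentation time in seconds
    track: str        # e.g. "mesh", "video" or "audio"
    payload: bytes    # encoded access unit, opaque here

def interleave(*tracks: Iterable[Sample]) -> Iterator[Sample]:
    """Merge tracks that are each sorted by timestamp into one time-ordered stream.

    Like sorted(itertools.chain(...)), samples with equal timestamps keep the
    order in which their tracks were passed in.
    """
    return heapq.merge(*tracks, key=lambda s: s.timestamp)
```

Because each track is already time-ordered, `heapq.merge` produces the interleaved stream lazily without loading whole tracks into memory, which matches the goal of streaming the asset rather than decoding it up front.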