Volumetric video is regarded worldwide as the next important development step in media production. Especially in the context of rapidly evolving Virtual and Augmented Reality markets, volumetric video is becoming a key technology.

Fraunhofer HHI has developed a novel technology for volumetric video: 3D Human Body Reconstruction (3DHBR). The 3D Human Body Reconstruction technology captures real persons with our novel volumetric capture system and creates naturally moving dynamic 3D models, which can then be observed from arbitrary viewpoints in a virtual or augmented reality scene.

The capture system consists of an integrated multi-camera and lighting system for full 360-degree acquisition. A cylindrical studio with a diameter of 6 m has been developed; it houses 32 cameras with 20 megapixels each and 120 LED panels that allow an arbitrarily lit background. Hence, diffuse lighting and automatic keying are supported.
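To give a feel for the data volume implied by these figures, the following minimal sketch multiplies out the camera count and sensor resolution. Only the 32 cameras and 20 megapixels come from the text; the frame rate and bytes per pixel are not stated here and are passed in as purely illustrative assumptions, and the function names are hypothetical.

```python
# Figures stated in the text.
NUM_CAMERAS = 32
MEGAPIXELS_PER_CAMERA = 20


def megapixels_per_capture(num_cameras=NUM_CAMERAS,
                           mp_per_camera=MEGAPIXELS_PER_CAMERA):
    """Megapixels recorded by all cameras for one time instant."""
    return num_cameras * mp_per_camera


def raw_rate_gb_per_s(fps, bytes_per_pixel,
                      num_cameras=NUM_CAMERAS,
                      mp_per_camera=MEGAPIXELS_PER_CAMERA):
    """Uncompressed data rate in GB/s.

    fps and bytes_per_pixel are ASSUMPTIONS supplied by the caller;
    the text does not state the studio's frame rate or pixel format.
    """
    pixels = megapixels_per_capture(num_cameras, mp_per_camera) * 1_000_000
    return pixels * bytes_per_pixel * fps / 1e9


if __name__ == "__main__":
    print(megapixels_per_capture(), "megapixels per time instant")
    # Illustrative values only: 25 fps, 1 byte per pixel.
    print(raw_rate_gb_per_s(fps=25, bytes_per_pixel=1), "GB/s (assumed settings)")
```

Whatever the actual frame rate and pixel format are, the per-instant figure of 640 megapixels follows directly from the stated camera setup.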

Avoiding a green screen and providing diffuse lighting offers the best possible conditions for re-lighting the dynamic 3D models later, at the design stage of the VR experience.

In contrast to classical character animation, facial expressions and moving clothes are reconstructed with high geometric detail and texture quality. The complete workflow is fully automatic, requires about 12 hours of processing per minute of mesh sequence, and provides a high level of quality for immediate integration into virtual scenes.
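The stated processing cost of about 12 hours per minute of mesh sequence can be turned into a rough planning estimate. The sketch below is only that arithmetic: the 12-hour figure is from the text, while the function names and the frame rate used for the per-frame value are hypothetical assumptions.

```python
# Figure stated in the text: about 12 hours of processing
# per minute of captured mesh sequence.
HOURS_PER_CAPTURED_MINUTE = 12


def processing_hours(clip_seconds):
    """Approximate processing time in hours for a clip of the given length."""
    return HOURS_PER_CAPTURED_MINUTE * clip_seconds / 60


def slowdown_factor():
    """How many seconds of processing one second of footage requires."""
    return HOURS_PER_CAPTURED_MINUTE * 3600 / 60


def processing_seconds_per_frame(fps):
    """Approximate processing time per frame in seconds.

    fps is an ASSUMPTION supplied by the caller; the text does not
    state the frame rate of the mesh sequence.
    """
    return slowdown_factor() / fps


if __name__ == "__main__":
    print(processing_hours(90), "hours for a 90-second clip")
    print(slowdown_factor(), "times slower than real time")
    # Illustrative value only: 25 frames per second.
    print(processing_seconds_per_frame(25), "seconds per frame (assumed 25 fps)")
```

Under these assumptions, every second of footage costs roughly 12 minutes of processing, so clip length is the main driver of turnaround time.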

Meanwhile, a second, professional studio has been built on the film campus in Potsdam-Babelsberg. This studio is operated by VoluCap GmbH, a joint venture between Studio Babelsberg, ARRI, UFA, Interlake and Fraunhofer HHI.


Thanks to the availability of new head-mounted displays (HMDs) for virtual reality, such as Oculus Rift and HTC Vive, the creation of fully immersive environments has gained tremendous momentum.

In addition, new augmented reality glasses and mobile devices that allow for novel mixed reality experiences are reaching the market. With ARKit by Apple and ARCore for Android, mobile devices are capable of registering their environment and placing CGI objects at fixed positions in the viewing space.

Besides the entertainment industry, many other application domains see a lot of potential for immersive experiences based on virtual and augmented reality.

In the industrial sector, virtual prototyping, planning, and e-learning benefit significantly from this technology. VR and AR experiences in architecture, construction, chemistry, environmental studies, energy and edutainment offer new applications. Cultural heritage sites that have recently been destroyed can be experienced again.

Last but not least, therapy and rehabilitation are further important applications in which VR and AR may offer completely new approaches.

For all these application domains and new types of immersive experiences, a realistic and lively representation of human beings is desired. However, current character animation techniques do not offer the necessary level of realism.

The motion capture process is time-consuming and cannot represent all detailed motions of an actor, especially facial expressions and the motion of clothing. This can be achieved with a new technology called volumetric video. The main idea is to capture an actor with multiple cameras from all directions and to create a dynamic 3D model of that actor.

Several companies worldwide offer volumetric capture, such as Microsoft with its Mixed Reality Capture Studio, 8i, Uncorporeal Systems and 4D Views. Compared to these approaches, the presented capture and processing system for volumetric video differs in several key aspects, which are explained in the next sections.

Several research groups work on 3D reconstruction from multi-view video. One such work presents a spatio-temporal integration for surface reconstruction refinement. That approach is based on 68 cameras with 4 megapixels each and requires approx. 20 min/frame of processing time to produce a mesh with 3 million faces.

Robertini et al. present an approach focusing on surface detail refinement based on a prior mesh by maximizing photo-temporal consistency. Vlasic et al. present a dynamic shape capture pipeline using eight 1k cameras and a complex dynamic lighting system that allows for controllable light and acquisition at 240 frames/sec.

The high-quality processing requires 65 min/frame, and a GPU-based implementation with reduced quality achieves a processing time of 15 min/frame.

In the next section, the volumetric capture system is presented with its main feature of a combined capturing and lighting approach. After that, the underlying multi-view video processing workflow is presented. Finally, some results of the most recent productions are presented. The paper concludes with a summary.
