Following the proliferation of smart handheld devices, coupled with the launch of the HbbTV 2.0 specification, there is an emerging trend among broadcasting organisations to improve content accessibility for communities with disabilities. In this respect, sign language interpreters play a crucial role in making mainstream broadcast content accessible to the deaf community. Addressing the increasing demand placed on sign language interpreters and the current limitations on delivering sign language content alongside mainstream media, this paper proposes a low-cost 3D studio environment that enables photorealistic reconstruction of human avatars, capturing the interpreter while eliminating the background. The core novelty of the proposed approach lies in an information fusion framework for correlating data from multiple RGB-Depth (RGB-D) sensors. The performance of the proposed approach for the reconstruction of a sign language interpreter has been evaluated using two low-cost sensors, namely the Microsoft Kinect V2 and the Intel RealSense D435.


In recent years, audio-visual equipment has become pervasive, offering everyone the ability to consume media services through a range of smart devices including mobiles, tablets and large television sets. Approximately 5% of the global population suffers from disabling hearing loss. In 2006 the United Nations adopted the Convention on the Rights of Persons with Disabilities to explicitly state the need for providing inclusive services and products to persons with disabilities. In particular, the Convention recognises the right of persons with disabilities to take part on an equal basis with others in cultural life, and obliges states to take all appropriate measures to ensure this for persons with disabilities.

Following such recommendations, the Swiss public broadcaster SRG committed itself, via a contract with the national disability organisations, to increase the signing of TV programmes by more than 200%, to 1,000 hours of first-time signed programmes, not counting any reused content. The same trend has been observed in other European countries such as Germany (ZDF), Belgium (VRT) and the UK (BBC), while a shared agreement between disability organisations and the European Broadcasting Union (EBU) was released in 2017. In addition, with the expected launch of new technologies such as HbbTV 2.0 CSS, there is an emerging trend towards delivering accessible content across handheld devices, thus facilitating new services aimed at enhancing the experience of members of the deaf community. The delivery of existing services, however, relies on the traditional use of high-end studio environments and capture setups, with post-production tools used to synchronise the mainstream content with its sign language interpretation.

Among the several challenges to be considered is multi-view sensing for the robust aggregation of depth maps from multiple capture devices. Since single-view 3D reconstruction techniques can digitise an object from only one viewpoint, multiple RGB-Depth (RGB-D) sensors, such as the Microsoft Kinect and Intel's RealSense, are used instead. At present, the signing of these programmes requires additional studio resources to be organised: studio space, lighting, mixer operations and a streaming setup, as well as the organisational effort in terms of resources and manpower. This infrastructure and process are cost-intensive and can reach up to a thousand euros per hour of signing. The current way of producing signed content is via a dedicated studio at the broadcaster's premises, which includes professional lighting, a professional camera, a special background for the signer (to later create overlay effects via chroma keying) and screens that reproduce the original content to be signed along with the subtitles or scene description.

In contrast, the emergence of 3D representation techniques facilitates the development of a user-centric reconstruction framework based on low-cost sensors. This paper therefore addresses the need for a photorealistic reconstruction of a sign language interpreter that remains independent of the environment (thus eliminating the need for a high-cost studio setup) by presenting a multi-view reconstruction system that includes an information fusion framework to synchronise the data captured from multiple RGB-D sensors. To facilitate an objective assessment of the overall algorithm, three low-cost sensors are interfaced with it; only the aggregation stage of the data capture differs between sensors.
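To illustrate the kind of fusion step such a system involves, the sketch below back-projects each sensor's depth map into a 3D point cloud and maps all clouds into a common world frame using per-sensor extrinsic calibration. This is a minimal illustration under assumed conventions (pinhole intrinsics, 4x4 sensor-to-world transforms); the function names and calibration values are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (in metres) into camera-space 3D points
    using a pinhole camera model with focal lengths (fx, fy) and
    principal point (cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

def fuse_clouds(clouds, extrinsics):
    """Map each sensor's point cloud into a shared world frame and merge.

    extrinsics[i] is a hypothetical 4x4 sensor-to-world transform,
    assumed to come from an offline calibration step (e.g. registering
    the sensors against a common checkerboard target)."""
    fused = []
    for pts, T in zip(clouds, extrinsics):
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        fused.append((homo @ T.T)[:, :3])
    return np.vstack(fused)

# Example: fuse two synthetic clouds, the second sensor offset 0.5 m in x.
depth = np.ones((4, 4), dtype=float)
pts = depth_to_points(depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
T_a = np.eye(4)
T_b = np.eye(4)
T_b[0, 3] = 0.5
merged = fuse_clouds([pts, pts], [T_a, T_b])
```

In a real multi-sensor rig, this geometric registration would be combined with temporal synchronisation of the captures and an outlier-robust aggregation of overlapping depth measurements, which is where the paper's information fusion framework comes in.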
