Streaming audiovisual virtual reality (VR) content by brute force (i.e., streaming the entire 360° panorama) requires significant bandwidth and provides mediocre quality.
Several viewport-adaptive streaming schemes have been proposed to alleviate this; “Tiled VR streaming” is one such method. A major factor determining the QoE of any viewport-adaptive streaming technology is latency, and there are many different types of latency that contribute to the overall experience.
We explain these latencies and describe what we did to reduce the overall latency as perceived by the end user. We present experimental results showing that Tiled VR streaming over a commercial Content Delivery Network (CDN) provides high QoE, and explain how CDN and streaming-protocol optimisations contribute to this QoE.
VR is gaining in popularity; new Head-Mounted Devices (HMDs) are announced almost weekly, and a significant market size is predicted for the coming decade.
The study published by Citi Research is one example; it predicts that [the] “VR/AR market could grow to $2.16 trillion by 2035 as different industries and applications adopt and make use of the technology”.
Other research, e.g. by Piper Jaffray, points in the same direction. This paper focuses on 360VR: an immersive audiovisual experience that is usually consumed using an HMD.
The 360VR market will only take off if high-quality content can be streamed; YouTube would not be nearly as successful if content needed to be downloaded first. Unfortunately, streaming the entire 360° panorama takes (and wastes!) enormous amounts of bandwidth. A user typically sees only 1/8th of the panorama in the HMD, yet the brute-force method that most major streaming platforms use streams the entire sphere: eight times as many pixels as are consumed.
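The 1/8th figure above follows from simple field-of-view arithmetic. As a back-of-the-envelope illustration (the 90°×90° field of view is an assumed typical HMD value, not a measurement from this paper):

```python
# Rough estimate of how much of a 360-degree panorama an HMD user
# actually sees at any moment (illustrative numbers only).

SPHERE_H_DEG, SPHERE_V_DEG = 360, 180  # full equirectangular panorama
FOV_H_DEG, FOV_V_DEG = 90, 90          # assumed typical HMD field of view

visible_fraction = (FOV_H_DEG / SPHERE_H_DEG) * (FOV_V_DEG / SPHERE_V_DEG)
waste_factor = 1 / visible_fraction

print(f"Visible fraction of panorama: {visible_fraction}")    # 0.125, i.e. 1/8th
print(f"Pixels streamed per pixel seen: {waste_factor:.0f}")  # 8
```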
Various techniques for viewport-adaptive streaming achieve higher efficiency by sending the part of the video that the user sees in high quality, while delivering the rest of the sphere in much lower quality. Some of these methods encode many different viewports (30 or more) as independent streams, and then switch between them as the user’s head turns. Kuzyakov and Pio describe the basic principles in a clear way.
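In such a scheme, the client simply picks whichever pre-encoded stream is angularly closest to the user’s current gaze. A minimal sketch of that selection logic, assuming a hypothetical layout of 30 viewport centres (not Kuzyakov and Pio’s actual configuration):

```python
import math

# Assume 30 viewport-specific streams whose centres are spread over
# the sphere: 10 yaw positions x 3 pitch positions, in degrees.
# (Hypothetical layout for illustration.)
VIEWPORTS = [(yaw, pitch)
             for yaw in range(0, 360, 36)
             for pitch in (-45, 0, 45)]

def angular_distance(a, b):
    """Great-circle angle in degrees between two (yaw, pitch) gazes."""
    y1, p1, y2, p2 = map(math.radians, (*a, *b))
    cos_d = (math.sin(p1) * math.sin(p2)
             + math.cos(p1) * math.cos(p2) * math.cos(y1 - y2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

def best_stream(gaze):
    """Index of the pre-encoded stream closest to the current gaze."""
    return min(range(len(VIEWPORTS)),
               key=lambda i: angular_distance(VIEWPORTS[i], gaze))

# A gaze at yaw 40, pitch 10 snaps to the stream centred at (36, 0):
print(VIEWPORTS[best_stream((40, 10))])
```

Every head turn that crosses a viewport boundary forces a full stream switch, which is exactly the cost discussed next.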
The challenge with this approach is switching fast enough when the user’s head turns. Swapping out one viewport (read: bitstream) for a completely new one takes time and invalidates the decoder buffer, causing the bitrate to spike at the cost of both QoE and efficiency.
We employ a different method: tiled streaming. Our method relies on streaming a low-resolution base layer that covers the entire sphere, plus a selection of high-resolution tiles that cover only the current viewport. We use network-optimised protocols and “Adaptive Switching Latency” client-side logic to allow very rapid switching when the user’s head moves. Note that we use the words “latency” and “delay” interchangeably in this paper.
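The client-side logic itself is described later in the paper; to fix ideas, a minimal sketch of viewport-driven tile selection on an equirectangular tile grid might look as follows. The grid size and field of view are assumptions for illustration, not the authors’ actual parameters:

```python
import math

# Hypothetical tiling parameters (not taken from the paper).
GRID_COLS, GRID_ROWS = 8, 4  # assumed 8x4 tiling of the panorama
FOV_H, FOV_V = 90, 90        # assumed viewport size in degrees

def tiles_for_viewport(yaw_deg: float, pitch_deg: float) -> set:
    """Return (col, row) indices of high-res tiles covering the viewport.

    Longitude wraps around the sphere; latitude is clamped at the poles.
    All other tiles are served only by the low-resolution base layer.
    """
    tile_w, tile_h = 360 / GRID_COLS, 180 / GRID_ROWS
    # Horizontal extent: walk degree by degree so wraparound stays simple.
    left, right = yaw_deg - FOV_H / 2, yaw_deg + FOV_H / 2
    cols = {int((lon % 360) // tile_w)
            for lon in range(math.floor(left), math.ceil(right) + 1)}
    # Vertical extent: clamp to the panorama's top and bottom edges.
    top = max(pitch_deg - FOV_V / 2 + 90, 0)
    bottom = min(pitch_deg + FOV_V / 2 + 90, 180 - 1e-6)
    rows = range(int(top // tile_h), int(bottom // tile_h) + 1)
    return {(c, r) for c in cols for r in rows}

# Looking roughly straight ahead needs only a subset of the grid:
needed = tiles_for_viewport(yaw_deg=10, pitch_deg=0)
print(f"{len(needed)} of {GRID_COLS * GRID_ROWS} tiles in high resolution")
```

On head motion, the set of requested high-resolution tiles changes, but the always-present base layer guarantees that something is on screen while the new tiles arrive; how quickly that switch completes is the subject of the latency analysis in this paper.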