We are in a new golden era of content creation, writes Gregory Shiff. The explosion of streaming services has brought an unprecedented volume of new and amazing media. Production, post-production, visual effects, animation, finishing: everyone is booked solid with work. The expectations for this content are higher than ever, with new, technically challenging formats becoming the norm rather than the exception. Even in 2021, working with native 8K video or high frame rate 4K video (60 frames per second or higher) is no joke.

During post, storage and workstation performance can be huge bottlenecks. These bottlenecks can be particularly problematic for “hero” seats that work with uncompressed media in real time. Remote Direct Memory Access (RDMA), an “old” / “new” (what do those words mean anymore…) technology, improves storage and workstation performance simultaneously for systems handling the most demanding content. This article will examine using RDMA for NFS storage traffic over Ethernet.

NVIDIA

Why NFS? Linux is the operating system of choice for media professionals working with applications that support the most challenging media. Even if applications have Windows or macOS variants, the Linux version is the one used at the truly high end. The native way for a Linux computer to access network storage is NFS, and in particular NFS over TCP.

This article is already going down a rabbit hole of acronyms, so let us pause for a moment. I imagine that most people reading know about NFS (and SMB) and TCP (and UDP), but for readers who are not familiar, NFS stands for Network File System. As said, NFS is how Linux systems talk to network storage (there are other ways, but mostly, it is NFS). NFS traffic sits on top of other lower-level network protocols, in particular Transmission Control Protocol, TCP (or User Datagram Protocol, UDP, but mostly it is TCP). TCP does a great job of handling things like packet loss on congested networks, but that comes with performance implications. Back to RDMA.

RDMA is a protocol that allows a client system to copy data from a storage server’s memory directly into that client’s memory. The client system bypasses many of the buffering layers inherent to TCP. This direct communication improves storage throughput and reduces latency in moving data between server and client. It also reduces CPU load on both the client and the storage server.

Dell Technologies

RDMA was developed in the 1990s to support high performance compute workloads running over InfiniBand networks. In the 2000s, two methods of running RDMA over Ethernet networks were developed: iWARP and RoCE. iWARP uses TCP for RDMA communications, and RoCE (in its routable v2 form) uses UDP. Each approach has benefits and drawbacks. iWARP’s reliance on TCP offers greater flexibility in network design but suffers from many of the performance drawbacks of native TCP communications. RoCE reduces CPU overhead compared to iWARP but requires a lossless network. Given that we are looking for the maximum storage performance with the lowest CPU load, RoCE is the clear winner.

Put that all together, and you can run NFS traffic over RDMA leveraging RoCE. If your client workstation, network and storage all support NFSoRDMA, you can massively boost performance simply by mounting the network storage with a few different commands. The performance gains of RDMA are impressive. All other things being equal, RDMA can be twice as performant as TCP (with a similar drop in workstation CPU utilization).
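
As a rough illustration of what mounting the storage differently can look like on a Linux client, here is a minimal Python sketch that only builds the mount command and does not run it. It assumes the standard Linux NFS mount options proto=rdma and port=20049 (the registered NFS over RDMA port). The server name, export path and mount point are hypothetical placeholders, and your storage may need a different port or additional options.

```python
import shlex


def build_nfs_mount_command(server, export, mountpoint,
                            use_rdma=True, rdma_port=20049):
    """Return the argument list for an NFS mount over RDMA or TCP.

    Assumption: the server listens for NFS over RDMA on rdma_port
    (20049 is the registered default; check your storage documentation).
    """
    if use_rdma:
        options = ["proto=rdma", f"port={rdma_port}"]
    else:
        options = ["proto=tcp"]
    return [
        "mount", "-t", "nfs",
        "-o", ",".join(options),
        f"{server}:{export}",
        mountpoint,
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    rdma_cmd = build_nfs_mount_command(
        "storage.example.com", "/ifs/media", "/mnt/media")
    tcp_cmd = build_nfs_mount_command(
        "storage.example.com", "/ifs/media", "/mnt/media", use_rdma=False)
    # Print the commands rather than running them; mounting needs root.
    print(shlex.join(rdma_cmd))
    print(shlex.join(tcp_cmd))
```

The only difference between the two commands is the option string, which is why switching a workstation between TCP and RDMA for a comparison is quick once the network and storage are configured for it.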

RDMA Testing

Read and write comparison for 50 MB frames (simulating a 4K DCI resolution image sequence) and 190 MB frames (simulating an 8K DCI resolution image sequence).

Let us look at some real-world examples in media creation. First up, 8K uncompressed. Uncompressed video puts less strain on the workstation (no real-time decompression), but file sizes and bandwidth requirements are huge. In the testing for this article, an 8K DPX image sequence was put on the Dell EMC PowerScale network storage. As an image sequence, each frame of video is a separate file. At 8K resolution, each file is approximately 190 MB. Sustaining playback at 24 frames per second requires about 4.5 GB/s. Long story short, the image sequence would not play with the storage mounted over TCP. Mounting the exact same storage using RDMA was a night and day difference: 8K video at 24 frames per second over the network!
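
When comparing the same storage mounted both ways, it helps to confirm which transport a given mount is actually using. The following is a minimal sketch, assuming a Linux client where each line of /proc/mounts has the usual "device mountpoint fstype options" layout and NFS mounts list a proto= option. The sample text is made up for illustration and is not output from the testing described here.

```python
def nfs_transport(mounts_line):
    """Return the proto= value for an NFS mount line, or None otherwise."""
    fields = mounts_line.split()
    # Expect at least: device, mountpoint, fstype, options.
    if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
        return None
    for option in fields[3].split(","):
        if option.startswith("proto="):
            return option.split("=", 1)[1]
    return None


def transports_by_mountpoint(mounts_text):
    """Map each NFS mount point to its transport (for example tcp or rdma)."""
    result = {}
    for line in mounts_text.splitlines():
        transport = nfs_transport(line)
        if transport is not None:
            result[line.split()[1]] = transport
    return result


# Illustrative sample only; hostnames and paths are hypothetical.
SAMPLE_MOUNTS = """\
storage.example.com:/ifs/media /mnt/media nfs rw,relatime,vers=3,proto=rdma,port=20049 0 0
storage.example.com:/ifs/scratch /mnt/scratch nfs rw,relatime,vers=3,proto=tcp 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    for mountpoint, transport in transports_by_mountpoint(SAMPLE_MOUNTS).items():
        print(f"{mountpoint}: {transport}")
```

On a real workstation, the same function could be fed the contents of /proc/mounts instead of the sample text.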

Now let us look at workstation performance. To be fair, uncompressed 8K video is unwieldy to store or work with. The number of facilities truly working in uncompressed 8K is small. 6K PIZ compressed OpenEXR is a more common format. OpenEXR is another image sequence format (file per frame) and PIZ compression is lossless, retaining full image fidelity. The PIZ compressed image sequence I used had frames between 80 MB and 110 MB each. Sustaining 24 frames per second required around 2.7 GB/s. This bandwidth is less than uncompressed 8K but still substantial. However, the real challenge is that the workstation needs to decompress each frame as it is being read. Playback dropped frames with the network storage mounted using TCP. The combined CPU cycles required to read each 6K frame from network storage and decode it were too much. RDMA was the key for this kind of playback. Remounting the storage using RDMA enabled smooth playback of this OpenEXR 6K PIZ image sequence over the network.
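
The bandwidth figures for image sequences follow from simple arithmetic: frame size multiplied by frame rate. Here is a small sketch of that calculation, assuming decimal units (1 GB = 1000 MB) and the approximate frame sizes quoted in this article. The function name is my own, and real-world requirements will vary with headers, file system overhead and frame size variation.

```python
def required_throughput_gb_per_s(frame_size_mb, frames_per_second):
    """Sustained read rate in GB/s needed to play a file-per-frame sequence.

    Assumes decimal units: 1 GB = 1000 MB.
    """
    return frame_size_mb * frames_per_second / 1000.0


if __name__ == "__main__":
    fps = 24

    # Uncompressed 8K DPX, roughly 190 MB per frame.
    dpx_8k = required_throughput_gb_per_s(190, fps)
    print(f"8K DPX at {fps} fps: {dpx_8k:.2f} GB/s")

    # 6K PIZ compressed OpenEXR, roughly 80 MB to 110 MB per frame.
    exr_low = required_throughput_gb_per_s(80, fps)
    exr_high = required_throughput_gb_per_s(110, fps)
    print(f"6K PIZ EXR at {fps} fps: {exr_low:.2f} to {exr_high:.2f} GB/s")
```

With these inputs the 8K case works out to 4.56 GB/s, in line with the roughly 4.5 GB/s quoted for the DPX sequence. The 6K case ranges from 1.92 to 2.64 GB/s, so the figure of around 2.7 GB/s sits at the top of that range, where the largest frames set the requirement.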

TCP versus RDMA dropped frames

Comparing dropped frames with TCP and RDMA using Sony XAVC, ProRes 422HQ and DPX at 4K resolution in Autodesk Flame 2022.

Going a little deeper with workstation performance, let us look at other common video formats: Sony XAVC and Apple ProRes 422HQ at full 4K DCI resolution and 59.94 frames per second. The application I used for playback reports dropped frames separately for the video disk, the GPU, and the broadcast output. With the file system mounted using either TCP or RDMA, the video disk never dropped a frame. The storage was plenty fast, as were the beefy NVIDIA RTX GPUs. With the file system mounted using TCP, the broadcast output dropped thousands of frames because the workstation could not keep up. RDMA was a different story: smooth broadcast output and essentially no dropped frames. In this case, it was all about the CPU cycles freed up by RDMA.

That was a lot of information in 958 words, so let me put it plainly: NFS over RDMA will play a vital role for creative companies working with 8K or high frame rate 4K video. If you want to dig deeper into my testing and results, please click here.

By Gregory Shiff, Principal Solutions Architect, Media & Entertainment, Dell Technologies