IBC2023: This Technical Paper proposes a new approach to detect EFs for the broadcast network SDN controller within 500ms, thus allowing the SDN controller to re-route the EFs and reduce packet loss.

Abstract

Keeping broadcast IP network latency low is critical to maintaining the immersive viewing experience, especially when delivering high quality broadcast media over the Internet or broadcast IP datacentres. The network and resource requirements of heavy-hitting broadcast media flows, with their high datarates and temporal longevity, clash with the needs of latency sensitive short data flows. This leads to switch buffer overload and network congestion, resulting in dropped packets and increased latency due to TCP-RTOs (Transmission Control Protocol Retransmission Time-Outs). Within broadcast datacentres the media flows often fall under the elephant flow (EF) classification, with the short flows being classified as mice flows (MFs). Rapid and early detection of EFs allows the SDN controller to re-route them and reduce their impact on the MFs within the broadcast IP network. This reduces packet loss so that TCP-RTOs are not triggered, keeping latency low and improving the immersive viewing experience. Although EF detection has been researched extensively, this paper proposes a new approach to detect EFs for the broadcast network SDN controller within 500ms, thus allowing the SDN controller to re-route the EFs and reduce packet loss. This method uses machine learning with an ensemble of LSTM (Long Short-Term Memory) neural networks, with each LSTM being a different length so the ensemble can capture the non-linear characteristics of the varying flow sizes. The ensemble LSTM outputs are then concatenated and further processed by a neural network. Training is achieved by back-propagating through the neural network and then through each LSTM, resulting in greater EF detection accuracy at inference for the broadcast IP network.
Our approach was tested on industry standard datasets and, unlike other approaches, achieved EF detection in under 500ms without relying on statistical information provided by network switches, thus further reducing latency and improving the immersive viewing experience.
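
The forward pass of such an ensemble can be sketched as below. This is a minimal, untrained illustration and not the paper's implementation: it assumes that "different length" means each LSTM reads a different number of trailing time steps, that the flow has already been reduced to a sequence of normalised values between 0 and 1, and that a single logistic unit stands in for the final neural network. The names `TinyLSTM` and `EnsembleEFClassifier`, the window lengths, and the hidden size are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM over a scalar input sequence (hypothetical helper)."""

    def __init__(self, hidden, rng):
        self.hidden = hidden
        # One row per gate unit: [input weight, hidden weights..., bias].
        # Rows are ordered as input, forget, candidate, output gates.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + 2)]
                  for _ in range(4 * hidden)]

    def forward(self, seq):
        n = self.hidden
        h = [0.0] * n
        c = [0.0] * n
        for x in seq:
            pre = [row[0] * x
                   + sum(wh * hj for wh, hj in zip(row[1:-1], h))
                   + row[-1]
                   for row in self.w]
            i = [sigmoid(v) for v in pre[0:n]]
            f = [sigmoid(v) for v in pre[n:2 * n]]
            g = [math.tanh(v) for v in pre[2 * n:3 * n]]
            o = [sigmoid(v) for v in pre[3 * n:4 * n]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # final hidden state summarises the window


class EnsembleEFClassifier:
    """Several LSTMs with different window lengths feeding one output unit."""

    def __init__(self, window_lengths=(5, 20, 50), hidden=4, seed=0):
        rng = random.Random(seed)  # random, untrained weights (assumption)
        self.window_lengths = window_lengths
        self.lstms = [TinyLSTM(hidden, rng) for _ in window_lengths]
        n_features = hidden * len(window_lengths)
        self.head_w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
        self.head_b = 0.0

    def features(self, values):
        # Each LSTM sees only its own trailing window; outputs are concatenated.
        out = []
        for length, lstm in zip(self.window_lengths, self.lstms):
            out.extend(lstm.forward(values[-length:]))
        return out

    def predict_proba(self, values):
        z = sum(w * f for w, f in zip(self.head_w, self.features(values)))
        return sigmoid(z + self.head_b)  # score for the EF class


model = EnsembleEFClassifier()
score = model.predict_proba([0.5] * 50)
```

Because the weights are random, `score` carries no meaning here; training would back-propagate a classification loss through the output unit and into every LSTM, as the abstract describes.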

Introduction

As Internet Protocol (IP) packets are distributed asynchronously and randomly, there will be statistical peaks and troughs in the number of packets in the network at any time. While IP packet loss is inherent within networks, Transmission Control Protocol (TCP) provides reliable IP packet delivery. It does this at the expense of latency, however, as it is connection-oriented between the client and server and relies on resend strategies to account for lost packets [2]. A client initiating a connection will wait for the server to acknowledge the request, after which the client sends the IP packets associated with the media being delivered. When all the data is sent, the client will close the connection.

TCP is used extensively for streaming video, audio, and metadata to viewers on their smart TVs and mobile devices. Consequently, the prevalence of TCP flows continues to grow significantly, resulting in a greater number of EFs that need dynamic re-routing to reduce the risk of latency for viewers, especially when viewers exchange social media messages. Furthermore, as broadcast IP network infrastructures increase in complexity, which is inevitable if IP is going to deliver the flexibility it promises, the prevalence of TCP cannot be ignored, especially when broadcasters integrate uncompressed UDP streams, such as ST2110, with compressed TCP video and audio streams.

Elephant TCP flows tend to be temporally long and fill up buffers within switches due to their high and steady data rates, especially when egress ports are heavily subscribed. If most of the buffers are dedicated to these ‘high steady states’, this can lead to packet loss for short packet bursts, resulting in increased latency and a poor user experience. Hash based ECMP (Equal Cost Multipathing) is often employed in networks to spread flows across equal-cost shortest paths, as it is simple to implement and doesn’t require per-flow information from the switches. However, ECMP is unable to differentiate between MFs and EFs and suffers from hash collisions, sometimes resulting in multiple EFs being sent across the same link, thus further exacerbating buffer overflow and packet loss [37]. Therefore, there is a need to remove congestion on heavily subscribed egress ports and reduce the risk of holding back MFs that are short-lived and time-sensitive [1]. To resolve this, Liu [6] proposed a load balancing mechanism based on SDNs for routing EFs by gaining the topology and status of the entire network. They then split and send EFs through multiple paths based on the parameters of the states of the links. However, for SDN re-routing to be effective, EF detection must be as fast as possible; this is what our work proposes. Unlike other methods that require many seconds of TCP flow data to establish an EF, our method achieves EF detection in less than 500ms. This reduces the risk of buffer overflow and hence packet drop, resulting in an improved immersive viewing experience.
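
The ECMP collision problem can be illustrated with a short sketch. It is not taken from the paper: the function `ecmp_path`, the addresses, and the port numbers are hypothetical, and SHA-256 stands in for whatever hardware hash a real switch applies to the flow 5-tuple.

```python
import hashlib
from collections import Counter


def ecmp_path(five_tuple, num_paths):
    """Pick an equal-cost path index from a hash of the flow's 5-tuple.

    SHA-256 is a stand-in; real switches typically use simpler
    hardware hash functions.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# Five hypothetical elephant flows sharing four equal-cost paths.
elephant_flows = [
    (f"10.0.0.{i}", "10.0.1.1", 40000 + i, 443, "tcp") for i in range(5)
]
paths_in_use = Counter(ecmp_path(flow, num_paths=4) for flow in elephant_flows)

# The hash sees neither flow size nor link load, so with more elephants
# than paths at least one path must carry two or more of them.
busiest_path, flows_on_it = paths_in_use.most_common(1)[0]
print(f"path {busiest_path} carries {flows_on_it} elephant flows")
```

Every packet of a flow hashes to the same path, which keeps packets in order but also means a collision between two EFs persists for as long as both flows live, unless a controller intervenes.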

Data centre measurements [22], [23] have shown that 80% of the total flows within the network are less than a few milliseconds long and less than 10KB in size. The majority of traffic volume is represented in the top 10% of large flows (EFs), and any significant bandwidth traffic (e.g. greater than 1MBps) is often considered an EF [26]. Any competition between MFs and EFs for network resources often results in MFs being starved of bandwidth, which often leads to dropped packets and increased latency [25]. Furthermore, re-routing the EFs to allow MFs greater bandwidth can potentially improve the network throughput [24]. The SDN controller does not need to process all EFs, only those that significantly impact the network performance. Inefficient management will fill network buffers with EFs, thus leading to queuing delays and dropped packets. Consequently, rapid EF detection is essential to reducing network congestion [27].

Figure 1 demonstrates how EF detection can be used to stop EF and MF conflicts by dynamically re-routing EFs. However, early EF detection is essential to reduce the risk of the switch buffers overflowing. Buffers that overflow will drop packets which in turn will lead to massively increased latency in TCP flows.

EF detection might appear relatively trivial, as a network operator could argue that any TCP flow shorter than a threshold of 250ms (for example) is an MF, and anything longer is an EF. However, only the EFs that significantly impact network performance need to be re-routed, that is, EFs that last longer than 10s [28], and waiting 10s to detect an EF is not viable in real world networks. Our proposal instead classifies a TCP flow as either an MF or an EF in under 500ms.
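
The trade-off in a pure duration threshold can be made concrete with a small sketch. The helper `naive_classify` and the example flow durations are hypothetical; only the 250ms and 10s thresholds come from the text above.

```python
def naive_classify(flow_duration_ms, threshold_ms):
    """Return (label, decision_time_ms) for a duration-threshold rule.

    Hypothetical helper: a flow that ends before the threshold is an MF,
    otherwise it is an EF, but only once the threshold has elapsed.
    """
    if flow_duration_ms < threshold_ms:
        return "MF", flow_duration_ms  # known only when the flow ends
    return "EF", threshold_ms          # known only after waiting threshold_ms


long_flow_ms = 60_000  # hypothetical 60 s media flow

# A 250 ms threshold decides quickly, but it also labels a 1 s flow as an EF.
label_fast, wait_fast = naive_classify(long_flow_ms, threshold_ms=250)
brief_label, _ = naive_classify(1_000, threshold_ms=250)

# A 10 s threshold isolates the impactful EFs but delays the decision by 10 s.
label_slow, wait_slow = naive_classify(long_flow_ms, threshold_ms=10_000)
```

A predictive classifier aims to get the selectivity of the 10s rule at roughly the decision delay of the short threshold.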

Several EF detection techniques have been previously proposed [24], [25], [29]–[35]. However, they rely on short flow thresholds in the switch, which can lead to high rates of false positives and negatives. Some methods require periodic extraction of the flow statistics [24], [25], [33], [34] from the network switches to the SDN controller, which in itself may increase network traffic, thus contributing to congestion. It also significantly increases flow detection and re-routing latency.

Therefore, we propose a more nuanced data driven approach with the following key contributions:

  • The tokenisation of the TCP data streams into 10ms bins enables machine learning approaches to model the continuous flow data.

  • The introduction of a data driven temporal prediction model, using an ensemble of LSTM (Long Short-Term Memory) layers to capture both short-term and long-term temporal information about the data flow and classify a TCP stream as either an MF or an EF, with low computational overhead.

  • Extensive testing of the proposed method on the industry standard CAIDA dataset [8].
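
The first contribution, tokenising a TCP stream into 10ms bins, might look like the sketch below. The text here does not say what each token holds, so summing the bytes seen in each bin is an assumption, as are the function name `tokenise_flow`, the packet tuple layout, and the example packets. The 500ms horizon matches the detection window quoted above.

```python
def tokenise_flow(packets, bin_ms=10, horizon_ms=500):
    """Sum packet sizes into fixed-width time bins.

    packets: iterable of (timestamp_ms, size_bytes), with timestamps
    measured from the first packet of the flow (assumed layout).
    Returns a list of horizon_ms // bin_ms byte counts.
    """
    n_bins = horizon_ms // bin_ms
    bins = [0] * n_bins
    for timestamp_ms, size_bytes in packets:
        index = int(timestamp_ms // bin_ms)
        if 0 <= index < n_bins:  # ignore packets outside the horizon
            bins[index] += size_bytes
    return bins


# Hypothetical packets: two in the first bin, one in the second,
# one in the last bin, and one beyond the 500 ms horizon.
example_packets = [(0, 1500), (4, 1500), (12, 500), (499, 100), (600, 1500)]
tokens = tokenise_flow(example_packets)
```

The result is a fixed-length sequence of 50 values per flow, which is the kind of regular input a sequence model such as an LSTM can consume.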
