Downtime is something no broadcaster wants and the ability to rapidly switch over to a back-up system is imperative. Given recent high-profile outages, is it time to rethink disaster recovery strategies?

The fire alarm that shut down systems at Red Bee Media last September was a disaster that highlighted the benefits of a robust recovery policy.

While BBC channels were switched from White City to Salford (Red Bee’s other UK transmission centre) with minimal disruption, Channel 4 and its sister channels continued to experience difficulties after the event.


Observers in the industry, while not wishing to comment directly on that particular incident, spoke of the necessity for a disaster recovery strategy that enables broadcasters to maintain all services despite outages.

While Channel 4 is likely to have suffered from lost advertising revenue and reputational damage, facilities provider Red Bee Media has also been the subject of some viewers’ displeasure.

Some took to social media to complain, with many visually or hearing-impaired viewers questioning why audio description and subtitles were seemingly not prioritised, as some channels returned to air without them.

Ofcom’s ire
Some weeks after the outage, UK communications regulator Ofcom issued a statement specifically about the provision of access services, criticising Channel 4 for “not having a strong backup plan in place” and telling the broadcaster that it shouldn’t have taken several weeks to fix the problem.

It said: “After a long outage, subtitles have now been restored on many Channel 4 programmes. However, signing and audio description are still not available on the broadcaster’s channels.

“We remain deeply concerned about the scale of the technical failures experienced by Channel 4 and the length of time taken to fix them. These problems have caused deep upset and frustration among people who are deaf, hard of hearing, blind or partially sighted.

“Channel 4 did not have strong backup measures in place, and it should not have taken several weeks to provide a clear, public plan and timeline for fixing the problem.

“We expect Channel 4 to meet – or exceed – the timings it has set for restoring all its subtitling and other access services.

“When this is done, Ofcom will review the equipment and facilities that Channel 4 had – and now has – in place, so that lessons can be learned.

“We will consider what action might be required to make sure broadcasters do not find themselves in this situation again, and that subtitles, signing and audio description remain reliable even when problems occur with the infrastructure used to provide them.”

“Some broadcasters are better prepared than others,” says Michael Rebel, Director, Solution Architecture, Imagine Communications. “When a channel is forced off air, there are two consequences for the broadcaster. First, there is the direct loss of having no income because you can’t transmit any commercials. The second is harder to value, but potentially more devastating: the risk that viewers may go and discover other channels and other content.

“The adequacy of a disaster recovery plan is essentially a business decision: How long do you dare risk the loss of income and brand image?”

Red Bee Media says that since the incident, it has worked closely with its customers, landlord and business partners to fully restore all operations and systems.

Steve Nylund, CEO, Red Bee Media, said: “We have also invested a lot of time and effort to fully understand the cause of the incident, as well as evaluating all options to make sure it doesn’t happen again.

“The services most affected were operating on older technology platforms in the Broadcast Centre in London. Red Bee’s new hybrid-cloud platform with software defined workflows is inherently more resilient as it doesn’t rely on specific hardware instances in a single location.

“Customers that are already on this platform experienced minimal disruption as a result of the incident. There are multiple customers who are about to be onboarded in the near future and it is our ambition to migrate all customers onto these new operating models as soon as current contracts and other commercial considerations allow.

“Our customers have shown incredible support and cooperation when it comes to restoring services, which we are very thankful for. We are also very proud of the fantastic commitment and proactivity our staff has shown, both when it comes to the immediate response and ongoing efforts to restore services.”

Red Bee also said that later this year it wants to share more about what the company has learned from this incident.

Why have DR anyway?

There are two main reasons why you need a disaster recovery plan. The first is technical failure, which might be anything from a power outage or the failure of a key piece of equipment to a natural disaster.

The second reason is restrictions on staff working. That might be because of a fire in the building, disruption to travel and traffic around the site, or the need to protect the team by minimising its exposure to viral transmission. This has been thrown into sharp relief by Covid-19.

“Broadcasters’ DR plans are more advanced than they were, but maybe not yet fully adequate,” says Ciarán Doran, Director of Marketing, Rohde & Schwarz. “Most DR systems require access to a back-up site, but Covid showed us that we also need to be prepared for situations where you cannot physically access your systems.”

Media organisations understandably make business decisions that balance the amount of downtime and loss disasters may cause against the expense of running the most secure recovery models.


“Generally, these top-level disaster recovery plans are adequate but they’re not perfect,” says Rick Young, SVP, Head of Global Products, LTN Global. “With natural disasters growing increasingly common and catastrophic, on-premise models will be severely tested.”

DR plans can range from a complete replication of studio and equipment in another location to cloud-based instances ready to spin up. The adequacy of each approach varies widely depending on how it is measured, whether that’s cost effectiveness, robustness or ease of activation.


You get what you pay for

Large broadcasters typically have some sort of backup option to deal with each type of disaster. They may have a primary and secondary output for distributing signals, a temporary backup infrastructure for use only during the hurricane season (in the US) or a hybrid (physical and virtual) DR system to deal with a physical location becoming inaccessible.

“Disaster recovery plans using hardware-intensive systems are, by and large, a luxury that only large broadcasters with deep pockets can afford,” says Srinivasan KA, Co-founder, Amagi. “They run their backup as a primary option once a month or once a quarter to test the system. But this is an expensive endeavour. Most broadcasters only use DR for their very profitable channels. With natural disasters becoming a more frequent occurrence, and with the unpredictability of system failures, there is definitely a need for broadcasters to rethink their DR strategies.”

Rethinking strategies

For on-premises playout and content management, there are several scenarios with different factors to consider.

“For high-value channels — in which failure can cause significant loss of revenue — broadcasters should consider a traditional hot/hot system with two servers at separate facilities running parallel with duplicate content and playlists,” says Young. “This is the safest option, with virtually imperceptible switchover if the first server fails, but it’s also the most expensive option, requiring two sets of hardware, bandwidth and separate facilities.”

A slightly cheaper option, according to Young, is running a ‘warm’ standby server ready to go. It may not have all the content or run completely in sync, and someone will have to manually trigger the switch, but it’s less expensive. Broadcasters will traditionally measure downtime in minutes in this model.

“For lower-value channels, the cheapest option is to identify an existing server that someone could switch to provide content and playout, but this takes a lot of time, measuring in hours instead of minutes.”
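The trade-off Young describes, between cost and tolerable downtime, can be sketched as a simple selection rule. The following Python is only an illustration, not any vendor's method: the tier names follow the options above, but the downtime figures and cost ranks are hypothetical placeholders chosen to reflect the orders of magnitude quoted (near-instant, minutes, hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    typical_downtime_s: float  # hypothetical, order of magnitude only
    cost_rank: int             # 1 = cheapest; hypothetical ranking


# Placeholder figures: illustrative assumptions, not measured values.
TIERS = [
    Tier("hot/hot", 1, 3),
    Tier("warm standby", 10 * 60, 2),
    Tier("repurposed server", 4 * 3600, 1),
]


def cheapest_tier(max_downtime_s: float) -> Tier:
    """Return the cheapest tier whose typical downtime fits the budget."""
    candidates = [t for t in TIERS if t.typical_downtime_s <= max_downtime_s]
    if not candidates:
        raise ValueError("no tier meets the downtime budget")
    return min(candidates, key=lambda t: t.cost_rank)
```

Under these assumed figures, a channel that can tolerate half an hour off air would land on the warm standby, while one that can tolerate only a few seconds would need the hot/hot pair.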

If a DR centre is going to be of any value, it must be at some distance from the primary playout. The implied criticism of the Red Bee Media/C4 scenario is that this was not the case.

“There is no point having a disaster recovery centre that might be evacuated by the same ruptured gas main, for example,” says Rebel. “Given the geographic diversity, then you have to design systems that will maintain the two sites in synchronisation, for content and for playlists. So, you must choose an automation and channel platform that provides mirroring intrinsically within the system.”

Automated failover is an option, but sometimes a broadcaster will need a more manual approach in the event of a disaster. Doran says there are many examples “where automated systems take the wrong decision, so it’s a mental challenge to leave such a critical decision of whether one playout chain stays on air or switches to another”.
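One common way to reconcile the two approaches is to let automation recommend a switch while a person confirms it. The Python below is a minimal sketch of that idea under stated assumptions: the class name, the three-failure threshold and the health-check inputs are all hypothetical and do not describe any particular automation product.

```python
from dataclasses import dataclass


@dataclass
class FailoverGuard:
    """Recommends a switch only after repeated failed health checks,
    and leaves the final decision to an operator."""

    threshold: int = 3             # hypothetical: consecutive failures required
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> None:
        # A single healthy check resets the count, so a brief glitch
        # does not trigger a recommendation.
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def switch_recommended(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def should_switch(self, operator_confirmed: bool) -> bool:
        # The system only recommends; a person makes the call.
        return self.switch_recommended() and operator_confirmed
```

Requiring several consecutive failures guards against the automated system acting on a momentary fault, and the confirmation flag keeps the on-air decision with the operator.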

Cloud control

The dependence on communication and the need for geographic diversity make the cloud the logical choice for DR playout. Indeed, backing up playout in the cloud is seen as the optimum strategy by tech vendors talking to IBC365, and not just for VOD channels.

“Linear playout now should be in the cloud, it’s literally the best place for it,” says Adam Leah, Creative Director, nxtedition. “Not just for the elasticity and scalability but also robustness. By providing a distributed process in the cloud you can mitigate the risk. The tricky part comes around live broadcasts, as this is where we tend to find latency, increased cost and security issues.”

A cloud DR strategy has many advantages. Among them, it allows for control of playout from literally anywhere with an internet connection.

“Should the primary centre need to be evacuated, then a channel controller can simply pick up a laptop and work from wherever they can get online,” says Rebel. “There is no reason why you cannot control a premium channel from a nearby Starbucks if that is the quickest way to get back on air.”

There are also bold claims for cloud’s cost-effectiveness, which is said to offer an alternative to the all-or-nothing back-up systems of old.

“DR systems are now much more affordable, software defined and cloud connectable and it is no longer cost prohibitive to have DR in place for many more channels than just premium channels,” says Doran. “It is also possible, right now, to set up DR facilities that are remotely or cloud operated – for all or only parts of the workflow.”

This flexibility is inherent in the building blocks of cloud services. For example, playout software can remain dormant in the cloud until the moment you need it. “Should the worst happen all you need is the time to spool it up and you can be on air virtually immediately,” says Rebel. “With a cloud model where you pay only for the processing you need when you need it, this is very much a cost-effective solution.”

Srinivasan agrees: “Cloud can be your insurance where you only pay a fraction of the cost of the primary infrastructure for DR and run it only when it’s needed.”

Cloud also gives broadcasters the option to pick the kind of DR plan that best suits their needs, whether they are a billion-dollar TV network, a midsize broadcaster or a small, niche channel.

“Larger networks can opt for a 24/7 disaster recovery option known as a ‘Hot DR’,” Srinivasan explains. “But then, there are also ‘Warm DR’ and ‘Cold DR’ options. With Warm DR, broadcasters can have content prepped and ready to go from the cloud, but not start the playout on the channel until disaster strikes. With Cold DR, there is no playout that is run from the cloud. Instead, evergreen content is stored in the cloud with a playlist. In the event of a disaster, this evergreen content starts playing out until the playout problem is resolved - an ideal solution for niche content owners.”

Technically, this is all standard stuff. Channel playout engines are now fully implemented software platforms that use microservices and a modular architecture, so they can operate equally well in the cloud or the machine room.


“That also gives the reassurance that, should you need to go to DR, the playout operations and user interfaces will look and operate exactly the same, with no risk through unfamiliarity,” Rebel says. “And security is excellent: AWS is used by the US Intelligence Community. If five nines [99.999%] are the gold standard for conventional playout the SLAs from the major cloud providers offer nine nines.”
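For readers unfamiliar with the "nines" shorthand, the arithmetic is simple: an availability of n nines leaves a fraction 10^-n of the year as permitted downtime. The short Python sketch below works that out. It is arithmetic on the figures as quoted, not a statement about any provider's actual SLA terms, and the function name is our own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Permitted downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability
```

On this basis, five nines allows a little over five minutes of downtime a year, while nine nines would allow roughly three hundredths of a second.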

Accordingly, broadcasters are now seeing cloud-based disaster recovery as the route to wider implementation of a cloud strategy.

“Broadcasters who are still thinking about transitioning their workflows to the cloud, can wet their toes by transitioning their DR to the cloud first,” says Srinivasan. “It is a low-cost and low-risk way of experimenting with the cloud, before deep diving into end-to-end cloud-based media management.”


By nature of being more distributed, cloud and IP-based systems can offer more stability, but it’s not absolute. An issue with Google’s servers brought down many of the company’s cloud services, leaving millions of users without access to their data last December. In June, a problem at cloud computing provider Fastly exposed the fragility of the internet when prominent sites including HBO Max, Hulu, Vimeo, Amazon, Twitter and Spotify were disrupted.

Would cloud have helped?

So, would a cloud DR have prevented the downtime experienced by Channel 4?

“Nothing is absolute, but a well-architected cloud-based model with redundant routes and automatic failover would likely prevent a similar scenario,” says Young.

“If you are replicating your playout system to one in the cloud, that should certainly reduce the outage time,” Doran says. “Just as with any crisis management situation where you work out everything that could go wrong and then do everything in your power to make sure it doesn’t, the best DR plan is to ensure your primary master control playout system is bang up to date and doesn’t malfunction.”

Nxtedition’s Leah says: “Risk can never be mitigated completely but it can be managed.” He calls for a new approach across the board. “We need to change minds. The old way of doing things is not the optimal solution available. By using microservices both on premise and in the cloud there are some remarkably efficient ways of replicating and distributing the content to playout. The industry needs to experience what that is like to fully embrace it.”

As the pandemic has shown, the ability to operate remotely is increasingly vital. As Doran outlines: “Covid highlighted the need to move to IP and software-defined infrastructures so that broadcasters can take advantage of the flexibility they offer. The more flexibility you can build into your system the more it can withstand a variety of disaster scenarios.”

The argument is that by moving workflows to the cloud, broadcasters can have a distributed infrastructure, with nothing on premise, while accessing everything remotely.

“If you do not need to cram staff together, then why would you?” insists Rebel. “Ultimately, the goal will be to decentralise all playout, allowing operations from any remote location to suit the channel and its staff, reducing the environmental impact of people travelling to work.


“Any DR centre for a major channel – terrestrial or cloud – should be capable of taking over as close to instantly as the business demands. That might be a few frames, or it might be a few minutes. Anything longer means that you are not really recovering from the disaster.”

Perhaps the ideal disaster-proofing strategy would be to opt for a hybrid approach. Broadcasters using traditional infrastructure could mix on-premises and cloud-based DR mechanisms.

“Those who are already operating on the cloud could choose to invest in different regions of the same cloud service provider or choose multiple cloud service providers such as AWS and Google Cloud for running their DR systems,” suggests Srinivasan. “In the eventuality of one service provider facing an outage, the other would automatically fill the gap. This distributed infrastructure could prove to be the best strategy to mitigate the impact of a disaster.”
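In its simplest form, the idea of one region or provider filling the gap for another amounts to an ordered preference list with a health check. As a rough Python sketch, with the site labels and the health-check function invented purely for illustration:

```python
from typing import Callable, Optional, Sequence


def select_active_site(
    sites: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy site in order of preference, or None."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


# Hypothetical labels; a real deployment would use actual regions or providers.
PREFERENCE = ["provider-a-region-1", "provider-a-region-2", "provider-b"]
```

A real system would also have to keep content and playlists synchronised across the sites, which this sketch does not attempt.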

This article was updated on 24 Jan to include a statement from Red Bee Media