
Building Resilient IT Infrastructure – Lessons Learnt from the OVH Data Center Fire

A major incident at an OVH data center in France caused massive downtime and potential data loss for many customers, according to reports in Data Center World.

“On Wednesday, 10 March 2021 a fire broke out in a room at the SBG-2 OVHcloud data center in Strasbourg, France. The fire reportedly destroyed SBG-2 and damaged four of 12 rooms in the adjacent SBG-1 data center. Two adjacent data centers, SBG-3 and SBG-4, were not damaged but were shut down during the event, requiring a massive, time-consuming reboot of all their systems.”

According to reports, even 14 days after the event, OVH is still struggling to bring customer services back online. This may be due to the volume of impacted hardware and the amount of data involved.

Today, together with our CTO Daniel Ananthan, who is a specialist in data center design, I want to explore this incident so that our readers can learn from it.

As more applications and databases move to the cloud and organizations adopt a cloud-first approach, businesses are becoming more dependent on cloud service providers (CSPs). CSPs are becoming a critical element in business continuity. Even the most mature CSPs, such as AWS and Azure, have experienced large service outages.

What should CIOs do to minimize the impact of such outages?

Reviewing service level agreements (SLAs) and business impact assessments will be an important activity for CIOs going forward. If you depend on cloud service providers to host business-critical systems, you should review the following elements of the service levels:

  1. Level of redundancy offered – Server / pod level, data center level, and network redundancy.
  2. Measuring the delivered service level against the agreed SLA – CIOs should have a strong focus on monitoring service availability.
  3. Having backups – While most CSPs offer backups, it is always safer to keep your own copy, including for SaaS services such as Office 365. Many backup platforms now provide native support for cloud service providers. Activities such as backup and replication are often the customer's responsibility under the shared responsibility model of the cloud.
  4. Studying and minimizing application dependencies – Many applications still have a monolithic architecture and depend on specific resources such as VMs. If applications can be deployed using a decoupled, failure-tolerant, cloud-native approach, the impact of events such as this one can be minimized. Technologies such as Kubernetes and microservices help here.
  5. Evaluating the cybersecurity controls – We have observed a growing risk of malware, including ransomware, which has resulted in many incidents of business loss. Going forward, CIOs and CSOs will have to evaluate the service providers' information security controls against such attacks.
  6. Conducting tabletop disaster recovery drills and calculating the recovery point and recovery time objectives (RPO and RTO) using the available tools and automation systems.
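
As a rough illustration of points 2 and 6 above, the sketch below shows how measured availability could be compared with an agreed SLA, and how the recovery point and recovery time achieved in a drill could be calculated from timestamps. The function names, the 30-day period, the 99.9% SLA figure and the sample outage durations are all assumptions made for this example; they are not taken from any provider's contract or tooling.

```python
from datetime import datetime, timedelta
from typing import List


def measured_availability(period: timedelta, outages: List[timedelta]) -> float:
    """Availability over the period, as a percentage (hypothetical helper)."""
    downtime = sum(outages, timedelta())
    return 100.0 * (1 - downtime / period)


def allowed_downtime(period: timedelta, sla_percent: float) -> timedelta:
    """Downtime budget implied by an SLA percentage over the period."""
    return period * (1 - sla_percent / 100.0)


def achieved_rpo(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time between the last usable backup and the failure."""
    return failure - last_good_backup


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Recovery time: how long the service was down before it was restored."""
    return service_restored - failure


# Assumed sample data: a 30-day month with two outages.
period = timedelta(days=30)
outages = [timedelta(hours=4), timedelta(minutes=30)]
agreed_sla = 99.9  # assumed contractual figure, in percent

availability = measured_availability(period, outages)
sla_met = availability >= agreed_sla

# Assumed drill timeline: nightly backup at 01:00, failure at 09:00,
# service restored at 15:00 on the same day.
backup_at = datetime(2021, 3, 10, 1, 0)
failed_at = datetime(2021, 3, 10, 9, 0)
restored_at = datetime(2021, 3, 10, 15, 0)

rpo = achieved_rpo(backup_at, failed_at)
rto = achieved_rto(failed_at, restored_at)

print(f"Measured availability: {availability:.3f}% (SLA met: {sla_met})")
print(f"Downtime budget at {agreed_sla}%: {allowed_downtime(period, agreed_sla)}")
print(f"Achieved RPO: {rpo}, achieved RTO: {rto}")
```

Comparing the achieved figures from each drill with the targets agreed with the business is what shows whether the backup frequency or the recovery procedure needs to change.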

 

That said, many organizations depend on their own data centers for their services. They run business-critical data centers and might not be able to move to a cloud-native architecture in the near future. How can customers who run on-prem data centers or use colocation facilities minimize the impact of such data center outages?

Our CTO Daniel Ananthan highlights a few important points.

  1.   An early detection and suppression system is a crucial component that most people miss during design.
  2.   During the design of the data center's physical infrastructure, all criticality and risk analysis criteria should be considered.
  3.   Electrical power distribution and cooling are another aspect to examine, as they can be a cause of fire.
  4.   Power capacity planning and a predictive alert system via Data Center Infrastructure Management (DCIM) are another key component of data center operations. Equipment is evolving toward a smaller footprint with more processing power, which increases power demand.
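
To make the idea of a predictive power alert in point 4 more concrete, here is a minimal sketch of one possible approach: fit a straight line through recent daily power readings for a rack and estimate how many days remain before the load crosses a planning threshold. The function name, the 80% threshold and the sample readings are assumptions for illustration only; a real DCIM platform would use its own data model and forecasting method.

```python
from typing import List, Optional


def days_until_threshold(
    daily_kw: List[float],
    capacity_kw: float,
    threshold: float = 0.8,  # assumed planning limit: 80% of rated capacity
) -> Optional[float]:
    """Estimate days until load reaches threshold * capacity.

    Uses an ordinary least-squares line through one reading per day.
    Returns 0.0 if the latest reading is already at or above the limit,
    and None if the trend is flat or falling.
    """
    n = len(daily_kw)
    if n < 2:
        raise ValueError("need at least two daily readings")

    # Least-squares slope in kW per day, with x = 0, 1, ..., n - 1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_kw) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_kw))
    slope = sxy / sxx

    limit = threshold * capacity_kw
    current = daily_kw[-1]
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


# Assumed sample: a rack rated at 10 kW whose load grows by 0.5 kW per day.
readings = [5.0, 5.5, 6.0, 6.5]
remaining = days_until_threshold(readings, capacity_kw=10.0)
if remaining is not None:
    print(f"Rack reaches the planning limit in about {remaining:.1f} days")
```

A straight-line fit is deliberately simple. It is enough to trigger an early conversation about capacity, but it will not capture seasonal or step changes in load.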

VS ONE is helping a number of clients optimize their data centers through our data center advisory services. This involves multiple assignments, including analyzing the existing power and cooling arrangements, identifying data center hotspots, and analyzing fire controls, including fire alarms and fire suppression solutions.

ABOUT THE AUTHOR

Sadeepa Palliyaguru

Sadeepa is the Director / Chief Innovation Officer of VS ONE, where he also specializes in Public Cloud Solutions and Information Security. He has over 10 years of experience in Software Development and Strategic Consulting and is also skilled in Technical Communication and Enterprise Systems. As a Solutions Architect, he has been instrumental in the design and development of multiple cloud and software-defined infrastructure solutions in Sri Lanka and overseas.