Business Continuity Management – Mission Control Facilities (Data Centres, Airports)


Data centres have evolved significantly since the introduction of mainframe computers in 1945, giving rise to several types: Enterprise Data Centers, Multi-Tenant/Colocation Data Centers, Cloud Data Centers, Edge/Micro Data Centers, Hyperscale Data Centers, and Telecom Data Centers. Over the past four to five decades, the digital economy has grown exponentially, making data centres pivotal components of the digital ecosystem. The reliability, resiliency, and restorability of the utility infrastructure supporting data centres have drawn the attention of stakeholders, designers, construction service providers, and facilities management teams. In response to evolving business requirements, operational teams have refined their techniques and procedures, particularly after significant events such as the dot-com bust of 2000 and the financial crisis of 2008.
Business continuity and disaster recovery are essential organisational procedures that are designed, assessed, and implemented for mission-critical facilities such as data centres and airports. Power interruptions, cooling and water system failures, and human error are the predominant causes of operational failures. Facilities Management reinforces disaster recovery and business continuity plans within the Service Level Agreement by addressing safety, legal and regulatory compliance, utility systems, and workforce challenges.

Business continuity strategies

Business Continuity Management encompasses any one, or a combination, of the following strategies:
▪ Active/Backup Model – Maintaining an active backup site to ensure the continuation of all mission-critical activities.
▪ Active Split Operations Model – Delegating the operations of an affected site to multiple remote active sites.
▪ Alternate Site Model – Regularly alternating between primary sites.
▪ Contingency Model – Arranging the necessary resources at the location in case of breakdowns.
In every Business Continuity Model, the ‘Maximum Tolerable Period of Disruption’ (MTPD) ranges from a few minutes to a couple of days annually. The organisation establishes ‘Minimum Business Continuity Objectives’ (MBCO) for each mission-critical asset operating in stand-alone status.

Data centre operations depend on business-critical utilities and resources: Electrical Power Distribution, Uninterruptible Power Supply (UPS), Battery Banks, Cabling, Cooling Systems, Water Management, Fire Alarm and Suppression Systems, Security, Surveillance and Access Controls, Suppliers, Specialist Service Partners, and Support Manpower. All of these require ongoing assessment, upgrades, and validation of risk mitigation strategies.

Business continuity management process flow

1. Program management –
The design basis for the electrical power distribution in a data centre is established to maintain the desired levels of system availability and reliability. Service level agreements with service providers are designed to reflect the key objectives of business continuity, such as the Minimum Business Continuity Objectives (MBCO), Maximum Tolerable Period of Disruption (MTPD), and Recovery Time Objective (RTO). Generally, Service Level Agreements stipulate a minimum availability of 99.982% for a Tier-3 site and 99.995% for a Tier-4 site. A specialised team must assess, prepare for, respond to, and manage natural or man-made disasters and system breakdowns. This team coordinates logistics for both internal and external support, prepares budget estimates, and oversees essential crisis management actions.
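To make the availability figures concrete, the following minimal sketch converts an availability percentage into the annual downtime it permits. The function name and the 365-day year are assumptions made for illustration; an actual SLA may define the measurement window differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Annual downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


# Availability targets quoted in the text above
for tier, target in (("Tier-3", 99.982), ("Tier-4", 99.995)):
    print(f"{tier}: {allowed_downtime_minutes(target):.1f} minutes per year")
```

Under these assumptions, 99.982% corresponds to roughly 94.6 minutes of downtime per year and 99.995% to roughly 26.3 minutes, which gives a sense of how little room the SLA leaves for unplanned outages.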
2. Risk and business impact assessment –
o Safety risk
▪ An assessment of safety risks associated with the electrical power distribution and cooling systems must include comprehensive electrical load flow analyses and short-circuit studies. The evaluation should cover thermal anomalies at electrical nodes, cable degradation, switchgear malfunctions, bypassed or malfunctioning safety interlocks, nuisance tripping, unsealed openings that allow rodents into switchboards, and inadequate as-built documentation of the power network. Furthermore, a systematic, integrated testing program must verify the reliability of the interconnected fire alarm, fire suppression, access control, electrical, and ventilation systems.
o Non-compliance and nonconformity risk
▪ Risk and business impact analysis requires sufficient construction design details, records of non-compliance and nonconformity with electrical codes and regulatory standards, clearances from local government authorities, as-built system drawings, and walk-through observations.
o Operation risk
▪ Documentation – Inadequate or missing design and construction details, operating procedures (SOP, MOP, EOP), and troubleshooting charts.
▪ A yearly system testing program will pinpoint risks to the supply of clean, dependable power and uncover opportunities for cost-effective risk management solutions.
▪ Identify the ‘Single Points of Failure’ within the power distribution network and cooling systems, particularly potential failures attributable to human error and loss of standby redundancy.
▪ Failure Mode and Effects Analysis (FMEA) evaluation for equipment, components, and technology upgrades.
o Environmental risk
▪ Identify and assess potential environmental hazards, such as
• Flooding of all or part of the site
• Fire, or failure to maintain the fire suppression system
• Overfilling of fuel or containment storage tanks, leading to spillages
• Untreated or partially treated sewage water
• Vandalism
• Pandemics, and
• Water and air contamination.
o Suppliers and support network risk
▪ Identify and prioritise spare components and equipment based on
• Frequency of failures
• Operational criticality of spare components or equipment
• Cost impact
• Environmental impact
• Expected useful service life of the component or equipment
▪ Identify dependencies on support resources such as suppliers, outsourced workforce, and other elements.
▪ Establish response-time and resolution-time SLAs with suppliers and support teams.
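The FMEA evaluation listed under operation risk is commonly scored with a Risk Priority Number (RPN), the product of severity, occurrence, and detection ratings. The sketch below ranks failure modes by RPN. The 1 to 10 scales are a common convention rather than something this article specifies, and the register entries and their scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    item: str
    mode: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (hard to detect)

    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical register entries, for illustration only
register = [
    FailureMode("CRAC unit", "compressor trip", 6, 5, 3),
    FailureMode("Static transfer switch", "fails to transfer", 10, 2, 7),
    FailureMode("UPS", "battery string open circuit", 9, 3, 6),
]

for fm in rank_by_rpn(register):
    print(f"{fm.rpn():4d}  {fm.item}: {fm.mode}")
```

Ranking by RPN is only a starting point for prioritisation: a failure mode with a modest RPN but a very high severity score may still deserve attention first.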
3. Obsolescence management –
Assess the service life of equipment (Transformers, Diesel Engine Generators, UPS, Battery banks, Switchboards, Static Transfer Switches, Circuit Breakers, Power Cables, Central Chilling plant, Computer Room Air Conditioners, Water Plant, Lifts)
o Condition assessment
▪ Periodic condition assessment will include tests to identify hot spots and insulation degradation, load flow and short-circuit analyses, and grounding system tests.
▪ Partial discharge test of VRLA battery bank(s) with a variable load bank.
▪ Vibration and noise analysis of rotating equipment.
▪ Electromagnetic field tests, acoustic emission tests, and air and water infiltration tests for building structures and water piping networks.
o Repairability and replaceability of equipment
▪ Documentation – manufacturer’s manual for diagnostics, disassembly instructions, and repair tips
▪ Modularity and accessibility – modularity of components and ease of disassembly
▪ Spare parts – availability, costs, standardisation
▪ Software – open-source compatibility, upgrade version
▪ Frequency of failures
▪ Non-compliance with legal or regulatory guidelines
o Business impact analysis will include loss of redundancy and the minimum level of service acceptable to the business.
4. Business continuity action plan –
• Resource planning must encompass support from the in-house team, service providers, and material suppliers.
• The facility’s support network should involve government authorities and specialists who can offer guidance and logistics in the event of a disaster.
• A team comprising both in-house and outsourced personnel should possess the requisite knowledge of environmental regulations and expertise in safety, health, and the subject at hand. A Responsible, Accountable, Consulted, and Informed (RACI) matrix must be established.
• The financial impact of risk mitigation measures should be evaluated against the business impact of each disaster recovery scenario.
• The in-house team must be assessed and trained to mobilise support during a crisis. The crisis call tree should include property stakeholders, business owners, and on-site senior management.
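A RACI matrix is easier to keep consistent when it is checked mechanically. The sketch below applies two widely used conventions, exactly one Accountable role and at least one Responsible role per task. These rules are assumptions, not requirements stated in this article, and the tasks shown are hypothetical.

```python
VALID_CODES = {"R", "A", "C", "I"}


def raci_problems(matrix):
    """Return a list of problems found in a RACI matrix.

    `matrix` maps each task to a dict of {role: code}, where code is
    one of R (Responsible), A (Accountable), C (Consulted), I (Informed).
    """
    problems = []
    for task, assignments in matrix.items():
        codes = list(assignments.values())
        unknown = sorted(set(codes) - VALID_CODES)
        if unknown:
            problems.append(f"{task}: unknown code(s) {unknown}")
        if codes.count("A") != 1:
            problems.append(
                f"{task}: expected exactly one Accountable, found {codes.count('A')}"
            )
        if "R" not in codes:
            problems.append(f"{task}: no Responsible role assigned")
    return problems


# Hypothetical tasks; the second one contains a deliberate error.
matrix = {
    "Generator failure response": {
        "Facility and Operation Manager": "A",
        "Technicians": "R",
        "SHEQ members": "C",
        "Contract Manager": "I",
    },
    "Fuel spill containment": {
        "Facility and Operation Manager": "A",
        "SHEQ members": "A",  # two Accountable roles: flagged below
        "Technicians": "R",
    },
}

for problem in raci_problems(matrix):
    print(problem)
```

Run against this sample, the check reports only the second task, because it names two Accountable roles.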
The business continuity plan of action for the data centre utility and support system must include the following –
– Addressing concerns around safety and security systems based on risk findings.
– Protection system coordination and harmonics treatment
– Legal and regulatory compliance and documentation, including construction design details.
– Capacity management of critical equipment and systems
– Managing standby redundancy of equipment and systems
– Performing predictive and proactive maintenance
– Repairing, replacing, or upgrading systems to enhance reliability
– Keeping a Failure Reporting, Analysis, and Corrective Action System (FRACAS) in place
– Developing training programs for the in-house and outsourced workforce, whether engaged full-time or on call-out.
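Failure records collected through FRACAS can feed simple reliability metrics. The sketch below derives mean time between failures (MTBF), mean time to repair (MTTR), and the resulting availability, MTBF / (MTBF + MTTR), from a list of repair durations. The function name, the one-year observation window, and the sample figures are all hypothetical.

```python
def reliability_summary(observation_hours, repair_hours):
    """Compute (MTBF, MTTR, availability) for one asset.

    observation_hours: length of the observation window in hours
    repair_hours: repair duration in hours for each recorded failure
    """
    failures = len(repair_hours)
    if failures == 0:
        raise ValueError("at least one failure record is required")
    downtime = sum(repair_hours)
    uptime = observation_hours - downtime
    if uptime <= 0:
        raise ValueError("downtime must be shorter than the observation window")
    mtbf = uptime / failures    # mean operating time between failures
    mttr = downtime / failures  # mean time to restore service
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical figures: one year of operation with three recorded failures
mtbf, mttr, availability = reliability_summary(8760, [2.0, 0.5, 1.5])
print(f"MTBF {mtbf:.1f} h, MTTR {mttr:.2f} h, availability {availability:.4%}")
```

Comparing the computed availability with the figure stipulated in the SLA shows whether the corrective actions logged in FRACAS are keeping the asset within target.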
5. Competency and training program for support workforce –
o Competency assessment must include
▪ Contract Manager
▪ Facility and Operation Manager
▪ Engineers and Technical Supervisors
▪ Technicians
▪ SHEQ members
o Skill requirements
▪ Knowledge and experience must match the operational requirements.
o The training program must include
▪ Safety risk management
▪ Environment impact management
▪ Data Centre design objectives
▪ SOP, MOP, EOP
▪ Practices
The number of Full-Time Employees (FTE) must match the workload and criticality of the Data Centre.
6. Review and validate –
A desktop review of the Business Continuity Plan must be supported by historical breakdown data, manufacturers’ equipment guidelines, legal and regulatory compliance documentation, and an annual comprehensive testing program that demonstrates alignment with the business objectives. Key performance indicators for service providers must be established to meet the Minimum Business Continuity Objectives (MBCO), Maximum Tolerable Period of Disruption (MTPD), and Recovery Time Objective (RTO).
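One way to turn historical breakdown data into service-provider KPIs is to compare each incident's recovery time with the RTO and MTPD. The sketch below is a minimal illustration: the KPI names, the thresholds, and the incident durations are hypothetical, and it assumes the RTO is set no higher than the MTPD.

```python
def kpi_report(recovery_minutes, rto_minutes, mtpd_minutes):
    """Summarise incident recovery times against RTO and MTPD targets."""
    if rto_minutes > mtpd_minutes:
        raise ValueError("RTO should not exceed MTPD")
    if not recovery_minutes:
        raise ValueError("at least one incident is required")
    within_rto = sum(1 for m in recovery_minutes if m <= rto_minutes)
    mtpd_breaches = sum(1 for m in recovery_minutes if m > mtpd_minutes)
    return {
        "rto_compliance_pct": 100.0 * within_rto / len(recovery_minutes),
        "mtpd_breaches": mtpd_breaches,
        "worst_recovery_minutes": max(recovery_minutes),
    }


# Hypothetical incident log: recovery time in minutes for each breakdown
report = kpi_report([10, 25, 90, 300], rto_minutes=30, mtpd_minutes=240)
print(report)
```

A report like this can be reviewed alongside the annual testing program, so that any MTPD breach prompts a revision of the plan rather than passing unnoticed.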
