
Integrated System Test of Data Centre

Maintaining a reliable, high-quality data centre operation requires a strategic plan for keeping its systems and sub-systems in working order at every level. The operations and maintenance team must consist of competent personnel who understand the types and root causes of outages, the criticality of their impact, and the cost-effectiveness of the solutions. An annual test program is an effective tool for checking that the team understands the design intent and end-use requirements, and for identifying known and unknown gaps in the existing system. A planned test program covering the integrated systems and sub-systems of the data centre infrastructure provides information on functionality, performance efficiency, and deviations from the intended outcomes under emergency conditions.

This write-up highlights the common outages encountered in data centres and the preparations needed to respond to emergencies.
Study reports on data centre outages rank the most common causes, in decreasing order of frequency, as follows:

1. Power interruptions
2. Cooling and ventilation systems failure
3. Human errors

Critical equipment and systems failures can have serious consequences, ranging from risks to life and safety to increased operational costs. To address these concerns, the integrated systems test program is developed to evaluate equipment functionality, assess equipment conditions, and provide training to operating personnel for effective emergency response.
Outages in these broad categories are classified below to highlight their impacts and associated costs.

Outages Impact Common Causes Annual Integrated System Tests Recommended Condition

The table above is a reference point for the data centre operations and maintenance team. Proactive measures such as annual Integrated Systems Testing help ensure system reliability. The functional test script must be developed to map gaps between design intent and operational intent. Each issue identified needs thorough analysis and deliberation among stakeholders to establish its life-safety, technical, and cost impacts.

Integrated System Testing (IST) is a crucial process for ensuring the smooth functioning of a data centre. It must be carefully planned and documented, taking into account the design and installed logic for operating under different conditions such as ‘NORMAL’, ‘EMERGENCY’, and ‘BREAKDOWN’. Load banks, testing equipment, and trained personnel are essential for successful integrated testing. In addition, Fire and Life safety systems and sub-systems must be included in the test program.
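
As an illustration of how a functional test script can record design intent against observed behaviour for each operating condition, here is a minimal Python sketch. The class name `IstStep`, the helper `gap_report`, and the sample systems and responses are all hypothetical and are not taken from any specific test program.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Operating conditions named in the test plan."""
    NORMAL = "NORMAL"
    EMERGENCY = "EMERGENCY"
    BREAKDOWN = "BREAKDOWN"


@dataclass
class IstStep:
    """One line of a hypothetical functional test script."""
    system: str         # system or sub-system under test
    mode: Mode          # operating condition being simulated
    expected: str       # response required by the design logic
    observed: str = ""  # response recorded during the test (empty if not run)

    def has_gap(self) -> bool:
        # A gap is any step that was not run or whose observed
        # response differs from the design intent.
        return self.observed != self.expected


def gap_report(steps):
    """Return the steps whose observed response deviates from design intent."""
    return [step for step in steps if step.has_gap()]


# Illustrative data only.
steps = [
    IstStep("UPS", Mode.EMERGENCY, "transfer to battery", "transfer to battery"),
    IstStep("Generator", Mode.EMERGENCY, "start and accept load", "failed to start"),
    IstStep("Cooling unit", Mode.BREAKDOWN, "standby unit starts"),
]

for step in gap_report(steps):
    observed = step.observed or "not tested"
    print(f"{step.system} [{step.mode.value}]: expected '{step.expected}', observed '{observed}'")
```

Each reported gap would then go to the stakeholders for the life-safety, technical, and cost review described earlier.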

The outcome of an IST helps engineers plan emergency responses to various breakdown scenarios in advance. Based on performance assessments of critical equipment, projections of power, cooling, ventilation, water, and other essential resources can be made. ISTs performed under experienced guidance also establish the reliability of critical environment performance. Finally, the outcome feeds a business risk acknowledgement, which gives key management members advance notice for future change management.

Are you looking to take your Telecom business to the next level? One of the best ways to do that is by ensuring your captive Data Centres are reliable and operating efficiently. Let’s work together to assess and improve their performance so you can focus on growing your business.

Case Study: Data Centre Performance Assessment

Objectives:

Assess and improve captive Data Centres’ reliability and operational efficiency for Telecom business.

Terms of assessment service:
• One hundred thirty-eight telecommunication data centres located across geographical regions were to be assessed by a third-party subject matter expert team.
• Make a physical assessment of Data Centre building infrastructures. Perform Condition Tests and evaluation of significant equipment to establish end-of-life management against baseline life expectancy.
• Perform a reliability study of the building’s critical systems comprising major equipment.
• Establish Reliability-centered Maintenance procedures to plan, prioritise maintenance, and roll out investment grade improvements.
• Assess compliance gaps with governing standards EN 50600, ANSI/TIA-942 and NFPA 76 for the telecommunication infrastructures.
• Assess the skill gap in the Operating team and chart out an up-skilling program.
• Present improvement program to stakeholders, oversee implementation of approved projects and perform post-implementation assessment.

Challenges:
Poor Competency Proficiency Levels:
o Lack of adequate knowledge and understanding of best practices and global standards.
Poor Performance Indicators:
o Many buildings are more than 25 years old and house legacy infrastructure designs. Key Performance Indicators for energy, utilisation, and finance were well below the industry benchmark.
Inadequate or no governing standards:
o Localised, bespoke solutions had been adopted to improve the financial, energy, and utilisation KPIs.
Frequent breakdowns and low reliability:
o The operations history of the past five years recorded fire mishaps, power interruptions, equipment and component breakdowns, and operator errors.
Inadequate reporting framework:
o The sustainability reporting framework was applied differently across the data centres.

Approach to Performance Assessment:
• Assess Competency Proficiency Levels of key Managerial and Engineering positions and draw up a roadmap for improvement.
• Carry out a walk-through survey and gap assessment, and set target KPIs.
• Perform Condition Assessment of building fabric, Mechanical, Electrical, Plumbing, and Fire Alarm and Suppression systems.
• Perform Indoor Environment Quality checks.
• Conduct a Reliability study of Critical building systems.
• Establish a standardised sustainability reporting framework for all data centres across business units.
• Develop an investment grade improvement program for the Data centres aligned with sustainability principles.
• Carry out a post-implementation follow-up review of the corrective and improvement program.
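
A reliability study of critical building systems is often built on a reliability block diagram, in which blocks in series multiply their availabilities and redundant blocks in parallel multiply their unavailabilities. The sketch below shows that arithmetic only; it is not the method used in this assessment. It assumes independent failures and that any single redundant block can carry the full load, and the power path and all availability figures are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All blocks must work, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one block must work, so the unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical power path: a utility feed backed by a generator,
# then two redundant UPS modules, then a single distribution board.
# Treating the generator as a plain parallel block is a simplification.
source = parallel(0.999, 0.98)
ups = parallel(0.995, 0.995)
board = 0.9999

power_path = series(source, ups, board)
print(f"Power path availability: {power_path:.6f}")
```

In this example the single distribution board dominates the result, which is the kind of weak point such a study is meant to expose before investment priorities are set.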

Outcome of Performance Assessment:
Competency Management:
o A comprehensive competency management program helped up-skill the Managers and Engineers engaged in Operations and Maintenance services.
Business Risk Acknowledgement:
o Stakeholders of the Data Centre were presented with a comprehensive report on the Condition and Performance of the Data Centre infrastructure that highlights gaps compared to established industry best practices and standards.
o Risk and reliability assessment of the Data Centre presented scope for improvements.
KPI improvement:
o Standardised KPIs aligned with sustainability principles were set out for all Data Centres.
Information sharing:
o The sustainability reporting framework adopted across all data centres improved transparency and data-driven decision-making processes.
Performance Improvement:
o Overall improvement in the performance of the Data Centres resulted in improved cost efficiency.

Key Performance Indicators for Data Centre Operations

Data centres are the backbone of today’s digital landscape, serving as vital hubs propelled by Industry 4.0 technologies, facilitating seamless communication, robust analytics, intelligent controls, and secure data storage. These centres are meticulously designed and constructed, varying in criticality and availability levels to ensure optimal efficiency in supporting business operations.

Data centres’ operational and maintenance facets revolve around a Reliability-centred Maintenance program meticulously tailored to meet stakeholder expectations and requirements. Establishing a comprehensive maintenance policy and strategy tailored to the unique needs of a data centre involves the creation of service-level agreements with a proficient and dedicated team.

Key Performance Indicators (KPIs) play a pivotal role in keeping data centres functioning seamlessly and aligned with statutory, regulatory, and business imperatives. These KPIs must reflect the organisation’s strategic objectives and be easy for the operational team to adopt. They need to be easy to track and control and, most importantly, they need to drive sustainable improvements in the following critical parameters.

 

  1. Compliances

    a. Statutory and Regulatory Requirements

Detail the specific regulations and standards the data centre must comply with (e.g., GDPR, Privacy & Security, ISO, and ESG standards), and measure compliance levels against each code separately.

    b. Standard Operating Procedures

Carry out regular audits or checks to ensure adherence to SOPs, and identify areas for improvement and training needs.

    c. Business Goals

Define and quantify how data centre activities contribute to achieving broader business objectives. Track progress against these goals.

  2. Infrastructure Sustainability

    a. Water Efficiency

Implement water usage monitoring systems to track and optimise water consumption.

    b. Energy Efficiency

Monitor and improve Power Usage Effectiveness (PUE) by optimising cooling systems, hardware efficiency, and renewable energy usage.

    c. Percentage Share of Green Energy

Set targets to increase the proportion of energy drawn from renewable sources and evaluate progress regularly.

    d. Waste Reusability/Recyclability

Establish a waste management program, track the volume of waste recycled or reused, and set goals for improvement.

    e. System Uptime

Measure and report on uptime metrics, identifying root causes for downtime to minimise disruptions.

    f. Cost Efficiency

Conduct regular cost-benefit analyses to identify opportunities for cost reduction without compromising performance.

    g. Space Efficiency

Utilise data centre space optimally and consider metrics like space utilisation percentage or rack occupancy rates.

    h. System Utilisation

Monitor server utilisation rates and allocate resources efficiently to meet demand without underutilising or overprovisioning.

    i. Demand Forecasting

Use historical data and predictive analytics to enhance accuracy in forecasting demand for computing resources.

    j. Reliability, Availability, and Maintainability

Implement preventive maintenance schedules and track metrics such as mean time between failures (MTBF), mean time to repair (MTTR), and overall system availability.
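
Two of the indicators above have widely used definitions that are easy to compute once the underlying readings are available: PUE is total facility energy divided by IT equipment energy, and steady-state availability is MTBF / (MTBF + MTTR). The Python sketch below applies both. The function names and all input figures are hypothetical and serve only to show the arithmetic.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1 - avail) * hours_per_year


# Illustrative readings only.
monthly_pue = pue(total_facility_kwh=1_500_000, it_equipment_kwh=1_000_000)
system_availability = availability(mtbf_hours=4_380, mttr_hours=4)
downtime = annual_downtime_hours(system_availability)

print(f"PUE: {monthly_pue:.2f}")
print(f"Availability: {system_availability:.5f}")
print(f"Expected downtime: {downtime:.1f} hours per year")
```

A PUE closer to 1.0 means less energy is spent on cooling and other overheads, and converting availability into hours of downtime per year tends to make the figure easier to compare against a service-level target.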

 

  3. Finance

    a. Operating Cost Efficiency

Break down operational costs and track efficiency improvements over time, aiming to reduce costs per transaction or unit.

    b. Construction, Refurbishment Cost Efficiency

Evaluate the effectiveness of capital expenditure on infrastructure upgrades or new construction projects.

  4. Health and Safety

    a. Life and Fire Safety

Conduct regular fire safety drills and inspections, ensuring compliance and readiness for emergencies.

    b. Health and Wellness of Operating Team Members

Implement wellness programs, conduct health assessments, and gather feedback to improve the well-being of the operational staff.

  5. Security

    a. Physical Security Breaches

Strengthen physical security measures and track incidents or breaches to identify weak points for improvement.

    b. Cyber Security Breaches

Regularly test and update cybersecurity measures, track incidents, and assess the severity and impact of breaches to fortify defences.

By expanding and refining these KPIs, data centre operators can effectively measure and improve various aspects of their operations, ensuring alignment with organisational objectives, compliance with regulations, sustainability, cost-effectiveness, safety, and security. Regularly reviewing these metrics allows for adjustments and enhancements to drive continuous improvement in modern-day data centre operations.