Can Maintenance Make Data Centers Less Reliable?

Is preventive maintenance on data center equipment not really that preventive after all? The article "Can Maintenance Make Data Centers Less Reliable?" reports that, with human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.

The article "Is Maintenance Making Your Facility Less Reliable?" reports that Fairfax, whose firm has conducted in-depth analyses of failure rates in data centers, says too much maintenance can disrupt configurations that are optimal for reliable operation. The purpose of maintenance should be to find defects and remove them. But maintenance can introduce new defects. And whenever a piece of equipment is undergoing maintenance, your data center is less reliable.
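
That last point can be made concrete with a little redundancy arithmetic. The sketch below is only an illustration: it assumes two identical units in parallel (say, two UPS modules) that fail independently, and an availability of 0.99 per unit. Neither the figure nor the function name comes from the article.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of a group of redundant units where any single unit suffices.

    Assumes the units fail independently, which real facilities only approximate.
    """
    if units < 1:
        raise ValueError("need at least one unit")
    return 1.0 - (1.0 - unit_availability) ** units


a = 0.99  # assumed, illustrative per-unit availability

normal = parallel_availability(a, 2)              # both units in service
during_maintenance = parallel_availability(a, 1)  # one unit taken out for maintenance

print(f"both units in service:        {normal:.4f}")
print(f"one unit out for maintenance: {during_maintenance:.4f}")
```

Under these assumptions, the redundant pair drops from four nines to the availability of a single unit for as long as the maintenance lasts.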

Maintenance is a very lucrative business, so many companies want to keep selling their maintenance plans. Guidance from equipment vendors sometimes slips into FUD (fear, uncertainty, doubt) rather than sound methodology. To overcome this preventive maintenance threat, we must attack false learning. More is not always better. People respond to component failures even when the system as a whole was not threatened. Human error is a leading cause of downtime.

It isn’t just human error: the very act of performing intrusive tasks under the theory of “preventive maintenance” can greatly reduce the reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, the US FAA, and later the USAF in the 1970s.

On the other hand, if you don’t do any maintenance and testing, you are just piling up all the things you didn’t take time to figure out until some critical time later. You can’t just blithely assume that things are always going to work as they are supposed to work. There is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking.

Only deploy stable, tried and tested versions of software and operating systems. You plan, install and test your setup before it enters production. You make sure that you can survive whatever you throw at it, including errors and incidents. You then figure out how much downtime you are allowed to have according to the SLA. You then divide this number into equal-sized maintenance windows together with the customer. And then you adhere to these windows!

At all other times you keep your hands off the systems. Period. Plan your activities for the next scheduled window. Do not test and rehearse failures on production systems! Common sense will easily yield 99.9%. Careful planning and execution should be able to yield 99.99%.
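
To get a feel for what those percentages mean as a downtime budget, here is a minimal sketch of the arithmetic. It assumes a 365-day year and, purely as an example, twelve equal windows per year; the function names are made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def window_length_minutes(availability_percent: float, windows_per_year: int) -> float:
    """Length of each window if the whole budget is split into equal windows."""
    if windows_per_year < 1:
        raise ValueError("need at least one window per year")
    return downtime_budget_minutes(availability_percent) / windows_per_year


for target in (99.9, 99.99):
    budget = downtime_budget_minutes(target)
    per_window = window_length_minutes(target, 12)  # 12 windows is an assumed example
    print(f"{target}%: {budget:.1f} min/year, {per_window:.1f} min per window")
```

At 99.9% the yearly budget is about 525.6 minutes (8.76 hours); at 99.99% it shrinks to roughly 52.6 minutes. Note that this sketch spends the whole budget on planned windows, so in practice you would probably want to hold some of it back for unplanned incidents.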

Check also Dilbert: Datacenter


  1. counter strike global offensive says:

    good post, keep it up

  2. Tomi Engdahl says:

    Cable spaghetti can make maintenance a nightmare and thus make the installation less reliable:

    3 intriguing cable spaghetti ‘before and after’ scenarios

    As every seasoned network cabling installer and/or system administrator knows, “cable spaghetti syndrome” in wiring closets and around patch panels is a real common thing…and a real ugly thing. And yet, the solutions to this phenomenon are invariably just as elegant and intriguing as the starting scenarios are chaotic and unsettling…

  3. Tomi Engdahl says:

    Best practices for deploying high power to IT racks

    A new white paper from Raritan addresses considerations surrounding the deployment of high power to IT equipment racks.

    The paper contends that, with average rack power consumption still increasing, the deployment of high power to racks is becoming more of a necessity for data center managers. Increased efficiency means more power is available for servers to support data center growth.

  4. Tomi Engdahl says:

    In-house IT: poor value for money claims new survey

    CIOs increasingly think that on-premise IT systems are a waste of resources and cloud is the future

    According to a new survey from Savvis, three out of five IT and business managers believe owning and operating in-house data centres will drive computing costs upwards and waste resources.

    In addition, more than half of all CIOs believe they have wasted money on IT purchases; indeed, a whopping 66 percent of US respondents have a purchase they regret making.

    The dissatisfaction with in-house IT is leading to an increased take-up in cloud according to the research, which was carried out by Vanson Bourne.

    Eighty five percent of companies are now using some form of cloud service.

    Most organisations are moving to private cloud: 42 percent, compared with 22 percent using public cloud.

  5. Intel Embraces Oil Immersion Cooling For Servers « Tomi Engdahl’s ePanorama blog says:

    [...] can be messy to maintain. A little mineral oil spreads a long way (ie., it’s messy). If you plan to minimize the needed hardware maintenance and keep spare clothes when working with servers, the messiness might to be a very big issue [...]

  6. Tomi Engdahl says:

    Advantages of infrared scanning for data centers

    “A thorough infrared scan analysis can indicate hot spots and anomalies in electrical equipment that might compromise data center network reliability caused by high heat.”

    Koty then listed the following three ways IR scanning can support data center uptime.

    1. Detecting worn bearings, which may indicate above-normal heat in electrical systems
    2. Early-stage detection of irregularities in a data center’s support infrastructure
    3. Non-invasively detecting problems hidden from the naked eye

  7. Tomi Engdahl says:

    If server capacity is oversized relative to the real need, it causes unnecessary energy consumption and maintenance costs and leads to excessive investment in equipment. If, however, capacity is inadequate, at worst it can slow down or crash critical network services and result in losses.

    Server capacity testing is currently often lacking, because a separate test has to be drawn up for each service. Preparing the tests is hard work and requires a lot of expertise. As a result, testing is often done only at commissioning, and then merely to confirm that performance is sufficient.

    However, software updates and changes in how the service is used, among other things, affect the system’s ability to keep serving its users. That is why we so often run into situations where a network service is down or has slowed down considerably.


  8. kasyno says:

    As a consequence of you My partner and i repay my good friend in instances associated with lager. We started out along with your pet that will not find weblog in which ruin us. He or she dispatched us often the handle of your respective website. Very well, My spouse and i lost…

  9. Tomi Engdahl says:

    It takes all sorts to build a cloud
    The magic of teamwork

    The warning came through loud and clear in our recent Regcast, Future-Proofing the Data Centre: if you want to build a private cloud, your teams must work together.

    That, HP’s David Chalmers told us, means creating a service delivery team: some of you from the server team need to work with a small group from the storage team, and a group from the storage team is going to be hanging out with the networks team, and so on.

    For next-generation data centre projects, you need dedicated cross-functional teams with complementary specialisations, aiming for a single service-delivery goal.

    The benefits to users of coherent project management may be obvious, but as anyone who has tried to do it knows, it is not easy.

    When we build cross-functional teams, establishing a common goal is the easy bit. What follows will determine whether that goal is realised effectively.

  10. Tomi Engdahl says:

    ARM servers: From li’l Acorns big data center disruptions grow
    Shuttleworth says ‘vast tracts’ of legacy apps ‘just don’t matter’

    The ARM collective doesn’t just want to get into the data center. It wants to utterly transform it and help companies “manage down the legacy” of existing systems

    How a data center is like a disk drive

    Masters also said that the shift towards hyperscale computing is forcing the change in the data center. Big apps require massively scalable, cut-down systems where the redundancy is in the software and in the quantity of hardware, not in any particular server that is equipped with all kinds of redundancies, because this cuts overall acquisition and operation costs.

    The companies that Red Hat is talking to in the hyperscale data center racket are looking at “fail in place” scenarios in which they treat a data center like a disk drive, and while they admire the compact and tuned nature of SoC server nodes with integrated switch fabrics, they are not even thinking about things at the rack level anymore, but at the data center level.

    With a fail-in-place data center, you load it up with a few tens of thousands of nodes that have networking on each server node, pipe in external networking and power, and you never do maintenance on it. If a server fails, you mark it as bad, like a bad block on a disk drive, and you just leave it in there and let the network heal around it.
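
The "bad block" idea can be sketched in a few lines. This is a toy model only: the class, the node names, and the modulo placement rule are all invented for illustration and do not describe any real hyperscale scheduler.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FailInPlacePool:
    """Toy fail-in-place pool: failed nodes are marked bad and skipped, never repaired."""

    nodes: list[str]
    bad: set[str] = field(default_factory=set)

    def mark_bad(self, node: str) -> None:
        """Record a failed node, like marking a bad block on a disk."""
        if node not in self.nodes:
            raise KeyError(node)
        self.bad.add(node)

    def healthy(self) -> list[str]:
        """Nodes still in service, in their original order."""
        return [n for n in self.nodes if n not in self.bad]

    def place(self, job_id: int) -> str:
        """Pick a healthy node for a job with simple modulo placement."""
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy nodes left")
        return candidates[job_id % len(candidates)]


pool = FailInPlacePool(nodes=[f"node-{i}" for i in range(5)])
pool.mark_bad("node-2")  # the failed server stays in the rack, it is just never used again
assignments = [pool.place(job) for job in range(8)]
print(assignments)
```

The failed node is never touched by a technician; work simply flows around it, and capacity shrinks a little with every failure.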

  11. says:

    Hello, just wanted to tell you, I liked this article. It was inspiring. Keep on posting!

  12. Tomi Engdahl says:

    Study: Downtime for U.S. data centers costs $7900 per minute

    A study recently conducted by Ponemon Institute and sponsored by Emerson Network Power (ENP) shows that on average, an unplanned data center outage costs more than $7,900 per minute. That number is a 41-percent increase over the $5,600-per-minute quantification put on downtime from Ponemon’s similar 2010 study. “Data center downtime proves to remain a costly line item for organizations,” ENP said when announcing the study’s results.
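
The two per-minute figures can be checked with a quick calculation. The sketch below uses only the numbers quoted above; the one-hour outage is a hypothetical duration chosen for illustration, and since the study says "more than" $7,900, the result is a lower bound on the average.

```python
COST_2010_PER_MIN = 5_600  # USD per minute, from the 2010 study
COST_NEW_PER_MIN = 7_900   # USD per minute, from the newer study


def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


def outage_cost(minutes: int, cost_per_minute: int = COST_NEW_PER_MIN) -> int:
    """Average cost in USD of an unplanned outage of the given length."""
    return minutes * cost_per_minute


print(f"increase: {percent_increase(COST_2010_PER_MIN, COST_NEW_PER_MIN):.0f}%")
print(f"hypothetical one-hour outage: ${outage_cost(60):,}")
```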

  13. Tomi Engdahl says:

    Fat-fingered admin downs entire Joyent data center
    Cloud operator now home to most mortified sysadmin in the USA

    Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center’s compute assets.

    The cloud provider began reporting “transient availability issues” for its US-East-1 data center at around six-thirty in the evening, East Coast time.

    “Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted,” the company said.

    The problems were mostly fixed an hour or so later.

    The cause of the outage was that an admin was using a tool to remotely update the software on some new servers in Joyent’s data center and, when trying to reboot them, accidentally rebooted all of the servers in the facility.

    “The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter,” Joyent wrote.
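
One common way to blunt this kind of typo is to make a tool refuse suspiciously large target sets unless the operator confirms explicitly. The sketch below is hypothetical: the function, the 25% threshold, and the host names are assumptions made for illustration and say nothing about how Joyent's tooling works or how the company addressed the problem.

```python
def resolve_reboot_targets(
    requested: set[str],
    inventory: set[str],
    *,
    max_fraction: float = 0.25,  # assumed threshold, tune to taste
    confirm_all: bool = False,
) -> set[str]:
    """Return the hosts to reboot, or raise if the request looks like a mistake."""
    if not requested:
        raise ValueError("empty target list")
    unknown = requested - inventory
    if unknown:
        raise ValueError(f"unknown hosts: {sorted(unknown)}")
    fraction = len(requested) / len(inventory)
    if fraction > max_fraction and not confirm_all:
        raise PermissionError(
            f"refusing to reboot {len(requested)} of {len(inventory)} hosts "
            "without explicit confirmation"
        )
    return requested


inventory = {f"cn{i:02d}" for i in range(10)}  # made-up compute node names
new_hosts = {"cn08", "cn09"}

print(sorted(resolve_reboot_targets(new_hosts, inventory)))  # small, intended set passes

try:
    resolve_reboot_targets(inventory, inventory)  # the "mistyped" request for everything
except PermissionError as err:
    print(f"blocked: {err}")
```

A guard like this does not remove human error, but it turns a silent facility-wide reboot into a question the operator has to answer first.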

