Is preventive maintenance on data center equipment not really that preventive after all? Can Maintenance Make Data Centers Less Reliable? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.
The article Is Maintenance Making Your Facility Less Reliable? reports that Fairfax, whose firm has conducted in-depth analyses of failure rates in data centers, says too much maintenance can be disruptive to optimal configurations for reliable operations. The purpose of maintenance should be to find defects and remove them. But maintenance can introduce new defects. And whenever a piece of equipment is undergoing maintenance, your data center is less reliable.
Maintenance is a very lucrative business, so many companies want to keep selling their maintenance plans. Guidance from equipment vendors sometimes slips into FUD (fear, uncertainty and doubt) rather than sound methodology. To overcome this preventive maintenance threat, we must attack false learning. More is not always better. People respond to component failures, even if a system was not threatened. Human error is a leading cause of downtime.
It isn’t just human error: the very act of performing intrusive tasks under the theory of “preventative maintenance” can greatly reduce reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, US FAA, and later the USAF in the 1970s.
On the other hand, if you don’t do any maintenance and testing, you are just piling up all the things you didn’t take time to figure out until some critical time later. You can’t just blithely assume that things are always going to work as they are supposed to work. There is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking.
Only deploy stable, tried and tested versions of software and operating systems. You plan, install and test your setup before it enters production. You make sure that you can survive whatever you throw at it, including errors and incidents. You then figure out how much downtime you are allowed to have according to the SLA. You then divide this number into equal-sized maintenance windows together with the customer. And then you adhere to these windows!
Other times you keep your hands off the systems. Period. Plan your activities for the next scheduled closing window. Do not test and rehearse failures on production systems! Common sense will easily yield 99.9%. Careful planning and execution should be able to yield 99.99%.
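To make the arithmetic behind those numbers concrete, here is a minimal Python sketch of turning an SLA availability target into a yearly downtime budget and splitting it into equal maintenance windows. The function names and the choice of four windows per year are assumptions made for illustration, and the sketch assumes the whole allowed downtime is spent on planned windows, which leaves no reserve for unplanned outages.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Yearly downtime allowed by an availability target given in percent."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def window_length_hours(availability_pct: float, windows_per_year: int) -> float:
    """Length of each window when the budget is split into equal parts."""
    if windows_per_year <= 0:
        raise ValueError("windows_per_year must be positive")
    return downtime_budget_hours(availability_pct) / windows_per_year


if __name__ == "__main__":
    # Hypothetical example: four windows per year agreed with the customer.
    for target in (99.9, 99.99):
        budget = downtime_budget_hours(target)
        window = window_length_hours(target, 4)
        print(f"{target}%: {budget:.2f} h/year, {window * 60:.0f} min per window")
```

At 99.9% the budget is roughly 8.76 hours per year, and at 99.99% it shrinks to under an hour, which is why the higher target demands much more careful planning of each window.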
Check also Dilbert: Datacenter