If you work in Computer Science, you are doubtless familiar with Murphy's Law, which states that whatever can go wrong will eventually go wrong. Faith and prayer rarely help. And this is not just theoretical: stories of tech disasters in the NonStop Community prove the point again and again.
Over the many years since Tandem was born in the 1970s, the community has learned that data is your most important asset, and without it you have no business at all. The story I really love is of the Tandem disk packs, from the days when disks were configured like top-loading washing machines. An earthquake shook these disk packs so much that one of them fell on its side. The client called Support in a panic, but fortunately the system carried on running as normal. A couple of engineers had to come round to set the unit back upright, but everything just kept on running! It could have been a much worse situation, but even with the best-laid plans something unusual can happen.
There have been countless stories over the years, across the broader Computer Science industry, of data incidents with profound consequences. However, while accidents and disasters are inevitable, there are many steps professionals can take to minimise damage and avoid costly downtime.
#1. Even the best hardware will eventually fail
All hardware has an expected life span, and it pays to have a recycle plan in place. Whether it is power supplies, disks, memory or processors, upgrading old hardware in a managed way may prove better than having to react to an unforeseen crisis in the middle of the night. Try to resist the temptation to make the hardware last as long as possible. It’s not a marathon, nor is it a sprint. Manage your hardware so that it’s not managing you.
Hardware is now available under plans that include regular replacement. HPE has a plan of this kind, GreenLake, that makes this easier. You’re effectively renting the hardware as and when you need it.
#2. Human error, caused by well-meaning employees, vendors or clients
Whoops, did I just press the wrong button? Even with the best will in the world, humans make mistakes, and it’s a good idea to be able to recover from them. It could be something as simple as deleting the wrong file, dragging and dropping a file into the wrong directory, or even removing the wrong disk. We have good procedures, but they can go wrong. Even big, robust companies like Amazon have had their moment in the spotlight, taking popular websites down with them.
#3. You can’t have too many copies of your data
NTI helps our clients to preserve their most sensitive data by ensuring it’s replicated to backup systems. These can be cold, warm or hot backups. They can also be shared production installations with workloads spread across multiple servers and sites. We also help share NonStop data with other enterprise servers, whether for dashboard analytics, for automated system monitoring, or for feeding key client systems.
#4. Rehearse your recovery plan
Whatever you have as your recovery plan, resist the temptation to file it away in a secure drawer, never to be seen again. The recovery plan works best if it’s familiar to those who may need to use it in anger one day. If you’ve seen the TV series Chernobyl, you can see exactly how not to run a disaster recovery plan. It’s a classic.
I hope to bring you more interesting stories from NTI in the coming months.