Isn’t it funny how most crises don’t arise from just one bad event? Most arise from a long history of small, seemingly good decisions that weaken what used to be a resilient system. While everyone else walks around congratulating each other on cost and time savings, a small few are trying desperately to raise alarms. Those few get cast as naysayers, the enemies of progress. Or, as in my case, they specifically ask to no longer be responsible for the resulting mess.
All of these small choices set the stage for how an organization can respond to a large system failure. This was visible in large-scale disasters like Hurricane Katrina’s decimation of New Orleans, and in small-scale disasters like a system backplane and drive failure in our production AS/400. This should have been an “Oh Shit” moment where I trigger the failover system, call a couple of people, and bring in the support consultants. Instead, it became an “OH MY EFFING LORD” moment: drop to your knees and pray to whatever deity you hold dear that you can bring this poor sweet child back to life. Rather than write a long, tumultuous diatribe about all the choices that should have been made differently, let’s just examine how a few simple process changes could have prevented all of this.
Have at least two subject matter experts on staff (or on retainer) who are familiar with your technology infrastructure, applications, and expectations. My organization failed here in two ways: SMEs were not made directly responsible for these systems, and those who were made responsible did not heed the warnings; and only one SME was left on staff with no assistance (guess who earned this rite of passage).
Listen to those in the know. I worked at McDonald’s to help put myself through my associate degree, and they had a simple management catchphrase that got drilled into every new manager’s head: Put your aces in their places. Put your key SMEs where they bring the most benefit and your shift will practically run itself. This carries over into any business situation - the people in the know should be listened to, acknowledged, and empowered to make the right decisions when needed. Management should support their decisions because good, responsible employees actually want things to run smoothly. No one wants to work almost 20 straight hours or earn all of their overtime in two days.
Build your key systems with the expectation that something will fail. Redundancy is key - two of everything is the least you need. Dual servers, dual power supplies, redundant storage, multiple backup paths, and if possible real-time replication. Yes, this is expensive, but it is well worth the investment. I would hate to hazard a guess at how much money the company lost in unproductive time. Had that money been invested up front, this would have been avoided.
Finally, encourage your SMEs to build and document themselves out of a job, lest they be stuck with it. No one wants to babysit an ancient, flaky system. Document everything down to the most mundane process, and document it in such a way that it can be handed to someone off the street to perform. This holds three benefits: your level 1 support staff can take responsibility for the mundane, your SMEs can use the freed time for improvements to the system or themselves, and your investment in both groups will net a greater return.
If you are that SME, especially the one who doesn’t really want that responsibility anymore, take my advice. Don’t shirk your responsibility to raise your concerns with the business. Get the investment to build up the system’s reliability, even if it comes in small spurts. Document the living shit out of everything and keep it in a well-known place. Put a consultant on retainer for the system. If you can’t go on vacation or turn off your phone/email without worry, then you aren’t done yet…or you need anxiety meds. Liquor helps there.