Words can hardly describe the collective shock, horror, and outrage that so many of us feel about the terrorist attacks on the World Trade Center and Pentagon on Tuesday, September 11, 2001. From a personal perspective, they affected me deeply - I never thought I would be calling and emailing so many people, just wondering whether they were still alive. But, for so many IT professionals working in financial services, telecommunications, and other mission-critical IT industries, having survived the wreckage was just the beginning - the task of getting their company's critical infrastructure back up and running became their top responsibility.
I would like to take a moment to describe a small sliver of a silver lining that lay within these dark and horrible events. I do not want to minimize the lost lives and the terrible realities of the upcoming days, but I believe that now, more than ever, information security has a crucial role to play in assuring that the critical infrastructure that runs so much of our nation remains safe.
Y2K and the post-mainframe hangover
The monumental Y2K efforts were a unique opportunity for information security professionals to work with their business partners to solve a critical problem. It was well-defined, and also well-funded. Information security is now presented with a similar opportunity, this time around disaster recovery capabilities, a business-critical need made clear by the WTC attacks. In addition to improving the defensive posture of the organization, we can help improve disaster recovery capabilities and integrate security and other key risk professionals throughout the organization.
Many people continue to believe, as I do, that the true lasting benefit of Y2K was that it forced us to inventory the assets that run our critical business processes. The largest component of the multibillion-dollar Y2K remediation effort was inventory - literally, walking through the vast data center caverns, trying to figure out what each server actually did, and then determining what risk it posed to the enterprise.
Many information security professionals recognized immediately how critical this inventory effort was, and in many cases, became key stakeholders in Y2K remediation. Why? You cannot protect things you don't know about. Because of the rapid growth and decentralized nature of IT in the post-mainframe era, this was the only way for security managers to re-establish security controls in their existing business systems.
Loss of disaster recovery and repeatable builds
In the last five years, there have been too many examples of high-profile outages, caused by operational mistakes, inadequate planning, or security breaches. They underscore the fact that small changes can result in massive business interruption and long, protracted remediation efforts. Clearly, modern computing equipment is operationally dangerous, both to the operator and to the enterprise.
In modern production IT environments, these problems are perpetually compounded by multiple people having access and change authority, imperfect "document before you deploy" practices, infrastructure stuck in perpetual "break/fix" cycles, imperfect change control, inconsistent provisioning and installation, as well as a never-ending barrage of necessary patches and system changes that need to be applied to "keep systems secure."
Consequently, IT data centers must constantly apply changes, with little ability to undo them, and even less ability to remember what they did. What results is the loss of the ability to repeat those changes (i.e., "repeatable builds"), degrading, if not totally destroying, the ability to recover from disasters. Maybe it's no wonder that IT seems to be in quite a bit of pain these days.
To help explain and measure the consequences of these symptoms, Dr. Spafford and I have created the IT Safety Index. It measures the ability of IT organizations to recover from security breaches and outages. Where does your IT organization fit?
Level 0: Can your IT organization name all the critical business processes it is responsible for?
Level 1: Can you inventory all the assets that run each of those business processes?
Level 2: Can you repeatably build each of those assets?
Level 3: Can you detect changes in your IT production environment?
Level 4: Can you get early warnings and indicators of threats?
Level 5: Can you expand capacity efficiently and move business processes portably?
The Y2K inventory effort helped many organizations get to Level 1. The next step is to regain the ability to repeatably build and recover from disasters, which is Level 2. In other words, if someone carried off a server along with all of its backup tapes, would your organization be able to recreate it?
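As a rough illustration, the index can be treated as a ladder: an organization sits at the highest level for which it can answer "yes" to that question and to every question below it. The sketch below assumes that cumulative reading, which the index itself does not spell out; the function name `safety_level` and the list-of-booleans input are hypothetical choices made for this example.

```python
from typing import Optional, Sequence

# The six questions of the IT Safety Index, paraphrased, indexed by level.
LEVEL_QUESTIONS = [
    "Can you name all the critical business processes you are responsible for?",
    "Can you inventory all the assets that run each of those processes?",
    "Can you repeatably build each of those assets?",
    "Can you detect changes in your production environment?",
    "Can you get early warnings and indicators of threats?",
    "Can you expand capacity efficiently and business processes portably?",
]


def safety_level(answers: Sequence[bool]) -> Optional[int]:
    """Return the highest level reached, or None if Level 0 is not met.

    Assumption: levels are cumulative, so the first "no" caps the score
    even if later questions are answered "yes".
    """
    level = None
    for index, answered_yes in enumerate(answers):
        if not answered_yes:
            break
        level = index
    return level


if __name__ == "__main__":
    # An organization that finished its Y2K inventory but cannot rebuild servers.
    answers = [True, True, False, True, False, False]
    level = safety_level(answers)
    print(f"Current level: {level}")
    if level is None:
        print(f"Next question to answer: {LEVEL_QUESTIONS[0]}")
    elif level + 1 < len(LEVEL_QUESTIONS):
        print(f"Next question to answer: {LEVEL_QUESTIONS[level + 1]}")
```

Under this reading, being able to detect changes (Level 3) does not count until repeatable builds (Level 2) are in place, which matches the order in which the levels are presented.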
A call to action: Improve disaster recovery capabilities
One of the responsibilities of information security is to ensure the organization has the ability to recover from security breaches. In many ways, this parallels the responsibilities of business continuity and disaster recovery. Furthermore, those functions have security implications, and are hindered by the same forces that impede security remediation: the loss of infrastructure control, such as inaccurate inventories and the loss of repeatable builds.
I believe many stories will emerge of disaster recovery plans that failed after the WTC attacks. Usually, when a failover site cannot effectively take over the production site's functions, the fault lies with improper testing, or more commonly, imperfect change controls and the resulting configuration drift. In other words, changes are silently made in the production data centers, and the changes never migrate to the failover data center.
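One simple way to make this kind of drift visible is to fingerprint the same directory tree on the production server and on its failover counterpart, then compare the two results. The sketch below is a minimal illustration under stated assumptions, not a description of any particular product: the function names `build_manifest` and `find_drift` are hypothetical, and it assumes the relevant configuration lives in ordinary files that can be read and hashed.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    base = Path(root)
    manifest: Dict[str, str] = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest


def find_drift(production: Dict[str, str],
               failover: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two manifests and report where the failover site has drifted."""
    prod_paths = set(production)
    fail_paths = set(failover)
    return {
        # Present in production, never copied to the failover site.
        "missing_on_failover": sorted(prod_paths - fail_paths),
        # Present only on the failover site.
        "extra_on_failover": sorted(fail_paths - prod_paths),
        # Present on both sites, but with different contents.
        "content_differs": sorted(
            p for p in prod_paths & fail_paths if production[p] != failover[p]
        ),
    }


if __name__ == "__main__":
    # Hand-written manifests standing in for two real servers.
    production = {"etc/app.conf": "aaa", "etc/hosts": "bbb", "bin/job.sh": "ccc"}
    failover = {"etc/app.conf": "aaa", "etc/hosts": "zzz"}
    for kind, paths in find_drift(production, failover).items():
        print(kind, paths)
```

In practice, `build_manifest` would be run on each site and the two manifests brought together for comparison. A non-empty report is exactly the silent change described above, found before the failover site is needed instead of after.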
This motivates the following call to action:
Find out if your organization has a disaster recovery or business continuity plan. If so, who owns it? In many organizations, these functions reside in two places: one in the IT organization, and one under the COO.
Find them, and tell them how you can help. Why will they want to talk to you? Because you can help, and provide resources! Show them how you share similar needs for adequately tested and defined disaster recovery capabilities. Explain how inadequate controls lead to configuration and integrity drift, which in turn leads to failover sites going live, only to find that the business functions don't all work correctly.
Having trouble getting their attention? In many ways, you are both risk professionals trying to achieve similar missions. Not only can you help provide resources, but as a group, you can also lobby the CIO far more effectively - instead of being steamrollered, approaching them with two other organizational peers can help swing the argument your way.
Share data and resources, and create an ongoing plan. Once you create a relationship, start identifying deficiencies or things that can be improved. Fix them, but don't stop there. Meet regularly, and continue to improve disaster recovery capabilities while also increasing the organization's defensive posture.
As I write this, there are many heroic efforts happening in IT and information security that may never be fully appreciated. But one thing is clear. The need for disaster recovery and business continuity will not be dismissed lightly, and these functions hold a key piece of the puzzle for infrastructure security. If we partner with them to solve some of their problems, as we did with Y2K, we will be one step closer to regaining infrastructure control. And that is a truly worthy goal, even if only recognized after a disaster that has truly incalculable costs.