Have you ever had this happen to you?
Project Killer Kumquat is finally going to deliver the set of features that’s going to allow us to catch up to the competition. More than 300 developers have been working on this project for nine months. It’s been a death march for them.
This is one of those damned date-driven projects where senior management made some promise to Wall Street and customers that we were going to ship this week.
The developers were over two months late delivering their code. But instead of doing what rational people would do, the business just said, “That’s okay. We’ll just cut the time dedicated to the downstream tasks, like QA and Production Deployment.”
QA and Production Deployment. I’m the QA Manager. Between us and the deployment team, it’s like being stuck between the truck and the loading dock. It sucks.
29 hours ago, the developers checked in all their code, and we started the QA testing. Not only did things not go as planned, we now have a potential catastrophe on our hands. This was supposed to be a damned four-hour deployment, and we’re 29 hours in, with no end in sight.
I look blearily at the clock that says it’s 3am, and I regret the decision I made twelve hours ago not to cancel this whole damned release and initiate a rollback. Now, it’s too late. We’re in so deep that we’ll be lucky if we have everything running by the time the East Coast customers start trying to access the systems in three hours.
I just knew something really bad was going to happen when the deployment team kept saying, “I just need another hour”, and I had already given them five hours. At some point, we should just put down the shovel and step away from the hole.
Now it’s pretty clear what happened. And upon some reflection, and after taking a 15 minute walk outside to clear my head, I’m starting to think that this is what happened to us in our last release, too. (But nowhere near as painful…)
28 hours ago, when we started testing, my team started finding failures left and right. Which is what we expected, given all the corners that were cut by the developers because of deadlines. But, for some of these issues, it took us hours to figure out whether it was a problem with the code, or something wrong with the QA environment, like an incorrectly configured OS, library, or database, or some variance between what we’re using and what Dev used.
And so, being the heroes that we are, once my team started finding the errors, we bent over backwards to fix them. We changed mount points, modified configuration settings, changed file permissions, modified database stored procedures, added user accounts, and so on…
The problem is, none of those changes were systematically replicated downstream to production.
In fact, our problem right now is that my team is so tired from 28 hours of firefighting, they can’t remember what they did to get things running. (Jeez. I’m looking at one of my guys trying to figure out what he had written on his hand eight hours ago to figure out what he did, but it’s long since faded.)
And so now, we’re repeating the whole firefight again, but this time in production. And frankly, we’re now screwing up more stuff than we’re actually fixing.
But, actually, that’s not the worst part. Some stuff is breaking because this happened in our last release, and all *those* changes weren’t systematically replicated into our Dev and QA environments!
Lesson: Preproduction changes must be captured and systematically replicated to downstream systems (e.g., Production), as well as queued up to be replicated to upstream systems for the next release (e.g., Dev, Integration Test, etc.)
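One way to make that lesson concrete is to snapshot each environment’s configuration and diff the snapshots before promoting a release, so drift between QA and Production shows up in a report instead of at 3am. Here’s a minimal sketch in Python — the directory layout, manifest fields, and function names are illustrative assumptions, not a description of how Tripwire itself works:

```python
import hashlib
import os
import stat


def snapshot(root):
    """Walk a config directory and record each file's SHA-256 hash and permission bits.

    Returns a manifest: {relative_path: {"sha256": ..., "mode": ...}}.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            manifest[rel] = {"sha256": digest, "mode": oct(mode)}
    return manifest


def drift(expected, actual):
    """Compare two manifests and report paths that were added, removed, or changed."""
    report = {}
    for rel in set(expected) | set(actual):
        if rel not in actual:
            report[rel] = "removed"
        elif rel not in expected:
            report[rel] = "added"
        elif expected[rel] != actual[rel]:
            report[rel] = "changed"
    return report
```

Run `snapshot()` against QA after the firefighting is done, run it again against Production, and `drift()` hands you the exact list of files whose contents or permissions differ — the same list my guy was trying to read off the back of his hand.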
This is one of my favorite uses of Tripwire: controlling preproduction environments so that we can move releases into production faster than ever, without introducing chaos and disruption to the production environment. I’ll write more about this later.
Questions or comments? Feel free to send me a note on Twitter! I’m @RealGeneKim.