A couple of years ago, we saw some spectacular security events involving DevOps and Kubernetes, in which the managing team simply redeployed containers whenever one crashed. It turned out that many organizations were doing the same thing and, worse, they weren't talking about it because they knew it was not a long-term solution to the underlying crashes.
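To make "just redeploy it" concrete: in Kubernetes, a pod that keeps crashing racks up a restart count, which is visible in the standard pod status schema. Here's a minimal, illustrative sketch that flags pods with suspiciously high restart counts from the JSON that `kubectl get pods -o json` emits. The field names (`containerStatuses`, `restartCount`) are from the real Kubernetes pod status; the threshold, function name, and sample data are assumptions for illustration.

```python
import json

# Illustrative threshold; pick one that fits your environment.
RESTART_THRESHOLD = 5

def flag_crash_loopers(pods_json: str, threshold: int = RESTART_THRESHOLD):
    """Return (namespace, pod, restarts) tuples for containers at or past the threshold.

    Input is the JSON string produced by `kubectl get pods -o json`.
    """
    pods = json.loads(pods_json)["items"]
    flagged = []
    for pod in pods:
        meta = pod["metadata"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            if status.get("restartCount", 0) >= threshold:
                flagged.append((meta["namespace"], meta["name"], status["restartCount"]))
    return flagged

# Hypothetical sample shaped like `kubectl get pods --all-namespaces -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "prod", "name": "billing-7f9c"},
     "status": {"containerStatuses": [{"restartCount": 42}]}},
    {"metadata": {"namespace": "prod", "name": "web-5d2a"},
     "status": {"containerStatuses": [{"restartCount": 0}]}},
]})

print(flag_crash_loopers(sample))  # the billing pod gets flagged; the healthy one doesn't
```

In a real cluster you'd feed this the output of `kubectl get pods --all-namespaces -o json`. The point isn't the script; it's that anything it flags belongs on a fix-the-code backlog, not in a restart loop.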
At the time, I thought these high-profile lessons would teach people to actually fix code issues, even in high-throughput Agile-meets-DevOps shops. I admit I also laughed, yet again, at The Phoenix Project's expense, thinking, "Yeah, but the CISO isn't the one holding things up by demanding that it get fixed!" I wish that joke would get old, but it won't.
It seems I was wrong. Or at least partially so. Some—perhaps many—shops seem to have missed the message that these issues sent. Anecdotally, we heard this was still happening, so we started asking questions. Yep. There are shops out there (since we started with ‘anecdotally,’ I won’t pretend to put numbers around it) that still take an “If it isn’t causing disruption in core systems, don’t fix it” mentality.
Ouch. We’ll be revisiting some of those in the future, I’m sure—when it’s listed as a vulnerability or breach somewhere.
I'm not a security professional. But because the vast majority of enterprises don't have dedicated security staff, and even more don't have anything close to enough, I have done security as part of almost every job I've taken, because it was what most needed doing. Many, if not most, of you are in the same boat.
That said, I can tell when you are blatantly asking for a security incident. And running code known to be faulty or misconfigured to the point that it repeatedly crashes is deliberately asking for one. "But it isn't exposed to the Internet!" is only a valid rebuttal until an attacker gets inside the perimeter and uses your fault for something like privilege escalation. Then no one cares whether it was connected to the Internet; they want to know how the attacker did it. And that's on you.
So for those organizations in which speed has overridden common sense, let’s be very clear and to the point.
It is a best practice not to run crash-level faulty software.
Since you apparently don't have one, write a policy that explains this and lays out a process and/or timeline for fixing these problems. Then drive it home: make certain it is known that the policy will be enforced. For the huge majority of my readers who don't have the level of authority that "create and promulgate a policy" implies, convince those who do. Point out the issues to them in writing. Being notified in writing of faulty practices that could bite the company in the rear tends to snap even the most unfriendly management into CYA mode and get something done about it.
When we all watched those spectacular failures back in the day, I assumed it was growing pains: you can do so much more with Agile, DevOps, cloud, and/or Kubernetes that this type of thing was bound to crop up and, eventually, get addressed. I still believe that. You all are still testing the boundaries, pushing the limits, and making all of IT look good. It's just going to hurt sometimes getting there. Fix the apps, make the policy, keep kicking butt, and know you are killing it as an industry. (If you aren't, stop laughing at my blanket compliment and fix yourself.)