As tempting as building your own DevOps incident communication tool sounds, you should think twice.
My colleagues and I were recently on the phone with a company seeking an IT incident communications and collaboration solution (a.k.a. IT alerting) to eliminate the time wasted on inefficient communications between teams when dealing with major IT issues. We were talking to Jeff, the director of IT operations, and Alec, the incident manager, about how our customers benefit from using IT service alerting solutions to automate communication to their Dev and Ops teams, stakeholders and impacted customers every time a major IT incident is identified.
We stressed that communication automation would not only save them a significant amount of time but also provide a consistent, predictable and repeatable process with timely and relevant communications to the different groups of people. We then moved to the ROI discussion, where we were easily able to quantify the hard-dollar gains they would realize. The company estimated that unplanned downtime of its most mission-critical business application costs the organization an average of $1,700 per minute during business hours, and half of that when issues occur at night. Depending on the nature of the incident, they need to get anywhere between five and 14 people from several teams on a restoration call or into a war room, and it takes them at least 20 minutes to get the right people.
Right there, by automating the communication and the escalation, they could reduce that time to five minutes. Those 15 minutes saved, at $1,700 per minute, are worth at least $25,500 to the business per incident during business hours.
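The arithmetic behind that figure can be sketched in a few lines. The per-minute cost and the 20-minute and five-minute assembly times come from the conversation above; the function name is made up for illustration, and the night-time result is simply derived from the "half of that" rate rather than a number the company quoted.

```python
# Figures quoted by the company; the night rate is "half of that".
COST_PER_MINUTE_DAY = 1_700                      # USD per minute, business hours
COST_PER_MINUTE_NIGHT = COST_PER_MINUTE_DAY / 2  # USD per minute, at night


def savings_per_incident(cost_per_minute, minutes_before=20, minutes_after=5):
    """Downtime cost avoided by assembling the right people faster."""
    return (minutes_before - minutes_after) * cost_per_minute


day = savings_per_incident(COST_PER_MINUTE_DAY)
night = savings_per_incident(COST_PER_MINUTE_NIGHT)
print(f"Business hours: ${day:,.0f} saved per incident")
print(f"Night:          ${night:,.0f} saved per incident")
```

Because the 20 minutes is a lower bound on how long it takes today, the business-hours result is a lower bound too.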
We were 20 minutes into the call when the conversation took a very interesting turn. Based on the cost savings we had calculated, Jeff and Alec started thinking they might just be able to build their own communication automation solution in-house. “I’m pretty sure we can find a few developers and a DBA for a few weeks and get this done,” Jeff said.
An IT communication automation solution provides the ability to reach out to the right on-call people via different communication channels (voice, text, SMS, mobile app) until they reply. If they don’t, automated escalation kicks in and the next person is contacted. Once they acknowledge they’ve received the notification, they can jump on the incident restoration call in one click. The solution also enables the incident manager to notify hundreds or thousands of impacted customers within a few minutes so frustrated users don’t start calling the help desk.
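To make the escalation behavior concrete, here is a minimal sketch of the loop described above. It is not how any particular product works: `page_on_call`, `notify` and `fake_notify` are hypothetical names, and `notify` stands in for whatever real delivery integration (voice, SMS, push) a team would have to build or buy.

```python
from typing import Callable, Optional, Sequence

# Channel names taken from the article; everything else is hypothetical.
CHANNELS = ("mobile app", "sms", "voice")


def page_on_call(
    escalation_chain: Sequence[str],
    notify: Callable[[str, str], bool],
    channels: Sequence[str] = CHANNELS,
    attempts_per_person: int = 2,
) -> Optional[str]:
    """Return the first person who acknowledges, or None if nobody does.

    `notify(person, channel)` is a placeholder for a real delivery
    integration; it must return True when the person acknowledges.
    """
    for person in escalation_chain:
        for _ in range(attempts_per_person):
            for channel in channels:
                if notify(person, channel):
                    return person
        # No acknowledgement from this person: escalate to the next one.
    return None


# Toy stand-in: Alice never answers, Bob only answers voice calls.
def fake_notify(person: str, channel: str) -> bool:
    return person == "bob" and channel == "voice"


print(page_on_call(["alice", "bob", "carol"], fake_notify))
```

Even this toy version hints at the real work: timeouts, retries, per-person channel preferences and delivery failures all have to be handled before such a loop can be trusted at 2 a.m.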
“That’s a possibility, of course,” I said, “but do you really have an idea of what it would take to design the solution, build it, maintain it and support it when it is being used in production?”
Jeff had raised a very good point, indeed. Why not build your own solution when you have software development resources available internally? The conversation had shifted from talking about our out-of-the-box solution to digging into what it would take for DevOps to build their own.
What Does It Really Take to Build an IT Communication Automation Solution?
The problems Jeff and Alec raised are real, and they are big enough that many IT departments decide to go down the path of building and deploying their own internal IT alerting solution. This may be the right way to go, but you need to think through the development effort involved and, more importantly, the ongoing support and maintenance of this new DevOps tool.
For some small shops, it may seem like a good idea, but building a reliable and economically viable IT alerting solution in-house requires more than just funding a development team for a few years and negotiating a good plan with telecom providers for SMS, voice and conference calls in the United States and in the countries where your IT people might be.
You will need to consider the cost to develop, host, support and maintain the solution.
To develop the solution, you’ll need to assemble a team composed of a software architect, a few developers, a mobile app developer, a DBA and a UX designer. Also, make sure they stick around for a few years so they can support and maintain their custom code and add new features as the tool is adopted across the enterprise. You surely want to avoid turnover here. The great benefit of such a custom-built solution is that it should address your specific needs, whereas a solution from a vendor may only accommodate 90 percent of those.
These seven developers will cost you about $1 million U.S., and this is not including resources for QA and UAT, the infrastructure guys and a technical writer. That’s another few hundred thousand dollars per year, at least, until the tool is ready for prime time!
In addition, you will need to reserve some operational budget to cover the cost associated with hosting, operating, monitoring, supporting and maintaining the solution:
- Hosting the tool. Account for the hardware, the databases and the software you’ll need to build, test and run the solution in the production environment. If you want the solution to be highly available, you will also need to consider active-active clusters and a disaster recovery site. That’s another few thousand dollars U.S. every year.
- Monitoring the tool. This new IT alerting tool most likely will be flagged as a critical application in your IT ops toolset and will require monitoring by the NOCs along with a runbook and an escalation procedure in case of a problem. Isn’t this a catch-22? When something goes wrong with your brand new IT alerting tool, you won’t be able to use it to reach out to the on-call people. Make sure you have a special emergency runbook for when this occurs. If you include technical maintenance, bug fixing and QA, that may add up to half of a full-time employee, or roughly $75,000 per year.
- Integrating the tool. It is most likely that you will want to integrate your IT alerting tool with other IT infrastructure and IT service management (ITSM) tools you are currently using so that communications can be triggered automatically as soon as a critical incident is detected. You may need to find a few API developers to make this happen. Remember, those integration connectors all must be maintained and tested with every new release of these IT operations management and ITSM tools.
- Communicating with the tool. Besides voice, text and SMS, if mobile app push notification is a requirement then you need to build a mobile app so people can update their schedule and availability from their smartphone. This is a great feature that also will need to be developed, hosted, supported and maintained.
- Running the tool. You will need to budget for all your communication costs, whether for voice, internal voice, SMS, international SMS or collaboration tools such as conference calls. This can become significantly expensive, especially if you have distributed teams across continents. Jeff said the company had on average two critical incidents per week, with an average of 10 people on each call who would spend an average of three hours per conference call. That’s almost 200,000 minutes per year for conference calls only, not including SMS and voice. Prices range from a few pennies per minute in the United States to a few dimes in other countries.
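The conference-call estimate in the last item is easy to reproduce. The incident frequency, headcount and call length below are Jeff's figures; the 52-week year and the two per-minute rates are assumptions picked only to illustrate the "few pennies" to "few dimes" range, not quotes from any telecom provider.

```python
WEEKS_PER_YEAR = 52  # assumption: incidents occur year-round


def conference_minutes_per_year(incidents_per_week, people_per_call, hours_per_call):
    """Total participant-minutes spent on conference bridges in a year."""
    return incidents_per_week * WEEKS_PER_YEAR * people_per_call * hours_per_call * 60


minutes = conference_minutes_per_year(
    incidents_per_week=2, people_per_call=10, hours_per_call=3
)
print(f"{minutes:,} conference minutes per year")

# Hypothetical per-minute rates, only to show how the bill scales.
for label, rate in [("3 cents/min", 0.03), ("20 cents/min", 0.20)]:
    print(f"{label}: ${minutes * rate:,.0f} per year")
```

The result is 187,200 minutes, which is where "almost 200,000" comes from. The spread between the two rates shows why the location of your teams matters as much as the number of incidents.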
What’s the Alternative to Building It?
There are several IT alerting solutions available, and they differ along several lines:
- Some are designed for small local teams, others for broader and more distributed IT organizations.
- Some are built for IT operations teams, others for the service desk.
- Some are on-premises, others are cloud-based.
- Some only cover the USA, others offer worldwide coverage.
- Some are more expensive than others.
- Some are subscription-based, others offer perpetual licenses.
- Some are single products, others offer a platform you can leverage for other use cases.
There’s a large variety of solutions to choose from that will address most of your current needs. As a result, you may be tempted by a subscription-based plan so you are not stuck with perpetual software licenses and lengthy onsite installation, configuration, support and maintenance. In addition, these vendors need to meet their SLAs, which gives you the assurance of a consistently good quality of service. Remember that they are also incentivized to deliver best-in-class customer service, because subscription models mean they need to keep you a happy customer year after year.
How Should I Decide Which is Best for My Organization?
It should be pretty straightforward. If you are passionate about IT alerting then go for it and build or hire a team, provide them with a roadmap and cross your fingers that everything will go smoothly and on time.
On the other hand, if IT alerting is outside your core business, if you don’t have the bandwidth to develop this new IT Ops tool or if you need it sooner rather than later, then I think you are better off focusing on other IT projects that will have a direct positive impact on the core business of your enterprise. Save the time and budget to better align with the business and bring the new features the business demands to market faster. If this is the case, then you may consider speaking with an IT alerting vendor. Gartner can provide insight on these vendors and how they integrate with the appropriate IT tools.
There are so many dependencies that it will always be less expensive to buy the service from a vendor whose focus is on IT alerting than it would be for you to build one from scratch. Today most companies are undergoing some kind of digital transformation, which means more and more of the services they deliver will rely on IT. The odds of something going wrong with IT and impacting the company’s business operations have never been higher, and they keep growing. Add to this the increased mobility of employees and the distributed nature of teams due to acquisitions or cost optimization, and you realize how challenging the build-it-yourself option can quickly become.
At the end of the day, you and your teams have a much greater chance of being rewarded and recognized by upper management for helping the company become more efficient and gain market share over the competition than for developing a new IT Ops tool that you could have bought from a provider!
Prior to this call, I doubt Jeff had thought through all the details of building the solution in-house and all the associated costs he would incur for building, supporting and maintaining his new IT Ops tool.
We’ll see what Jeff, his team and the CIO finally decide to do, but here’s what he said when we wrapped up: “Good thing we talked today, and I am sure we’ll be speaking again very soon.” When we do, we’ll find out which way they want to go with their IT alerting solution: build it or buy it? That is the question.
Can We Remain Sane Without IT Alerting?
Without IT service alerting, when a critical IT incident occurs, the service desk or the critical incident manager ends up manually going through the following:
- Internal procedure documents or runbooks to find out “what team(s)” should be responding to the incident.
- Spreadsheets to identify “who” within the team should be contacted, then through listings to check “who is actually available.”
And this becomes tricky when the teams are distributed across buildings, across geographies and countries, as you need to account for time zones, vacations, time-off, sick days and anything of this nature that may prevent someone from responding to notifications.
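A rough sketch of that "who is actually available" check shows how quickly the bookkeeping grows. Everything here is hypothetical: the `Responder` class, the on-call windows and the sample team are invented for illustration, and the fixed UTC offsets deliberately ignore daylight saving time, which a real tool would have to handle with proper time zone data.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time, timedelta, timezone


@dataclass
class Responder:
    name: str
    utc_offset_hours: int                       # fixed offset; ignores daylight saving
    shift_start: time = time(8, 0)              # local on-call window (hypothetical)
    shift_end: time = time(20, 0)
    days_off: set = field(default_factory=set)  # local dates: vacation, sick days

    def is_available(self, at_utc: datetime) -> bool:
        """True if the moment falls in this person's local on-call window."""
        local = at_utc.astimezone(timezone(timedelta(hours=self.utc_offset_hours)))
        if local.date() in self.days_off:
            return False
        return self.shift_start <= local.time() < self.shift_end


team = [
    Responder("dev-beijing", 8),
    Responder("desk-atlanta", -5, days_off={date(2024, 3, 3)}),
    Responder("noc-uk", 0),
]

# Sunday 02:00 in Atlanta (UTC-5) is 07:00 UTC.
incident = datetime(2024, 3, 3, 7, 0, tzinfo=timezone.utc)
print([r.name for r in team if r.is_available(incident)])
```

With spreadsheets, this same lookup is done by hand, per person, while the outage is in progress.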
Let’s say a dozen people from different teams have been identified to investigate a mission-critical application failure. That means you need to pull someone from the database team, two from the infrastructure team, three developers, two network engineers, the incident manager for this case, someone from UAT, two people from the NOC and someone from support. The developers are in Beijing, the service desk is in Atlanta and the two NOCs are located in the U.K. and the United States.
Now, they need to be contacted as quickly as possible and the war room needs to be assembled so they can start investigating the issue and come up with a resolution action plan. In some cases, more than a hundred people are involved in a single restoration call for a few hours, and sometimes for days. I’m not going to name any companies here, but you can easily find these disastrous stories online as they very often make mainstream press headlines.
In the meantime, don’t forget, the clock is ticking and every minute that passes means:
- Bigger revenue losses for the company, particularly in the case of an e-commerce website failure or slowdown,
- Employee productivity drop-off when the ERP system fails,
- Patient safety degradation if your EMR/EHR experiences a full or even a partial outage, and
- Big mess ahead when an airline reservation system has a hiccup.
As a reference point, the website GeekWire estimates that Amazon loses around $120,000 per minute when the site is not performing fast enough.
In these situations, you need all hands on deck to act quickly. Not only do you need to contact every person individually, but also ensure they respond to the call and commit to working on solving the issue. As you know, emails don’t wake people up, especially on a Sunday at 2 a.m. What if not all the IT experts answer the message within the allocated time frame? You now need to escalate to the next person on the list and try reaching out …
Also, people are not always reachable via the same channel at all times, in all countries. Some may prefer a phone call during the night and an SMS during weekends. For others, the best way to reach out is via mobile app push notifications. Ask yourself, is English OK for everybody, or do you need to take into account possible translation needs for different languages?
Even if you are able to get the right contact information for all the right people, calling each and every person and explaining the situation can take a very long time, and the clock is still ticking. By that time, users of the IT service may have already experienced service degradation and started calling the help desk or entering tickets online. Not only do you have to deal with the outage, but your IT department is now drowning in a ticket storm. The clock is ticking and the phone is now ringing!
If you are facing an e-commerce website slowdown or outage, you will also need to deal with bad reviews and negative comments being tweeted and shared about your company over social networks.
About the Author: Vincent Geffray
Vincent Geffray is Senior Director of Product Marketing at Everbridge with focus on IT Service Alerting & Communications Automation and IoT.
He has more than 14 years of experience in the information technology business, designing, promoting and selling Enterprise IT Operations Management solutions, including Critical Communications, Application Performance Management, IT Process and Workload Automation. He also has international experience as he started his career in Europe. Vincent holds a Master of Science (Mechanical Engineering and Computer Science) and executive certificates from the MIT – Sloan School of Management.
Vincent’s LinkedIn: www.linkedin.com/in/vgeffray
Vincent’s Twitter: @vgeffray