IT Operations is often accused of being the bottleneck in application release processes and a general constraint on agility. But why is that, what are the root causes, and how can we address them?
There is indeed a wide gap between the desired world of DevOps (Infrastructure-as-Code, Docker, microservices architectures, private cloud and the software-defined data center) and the reality of most software delivery in enterprise IT today.
Ideally, tools are shared between dev and ops, information flows in real time, everything is measured and visualized through dashboards, and release pipelines are fully automated.
In reality, however, each major release of enterprise software requires a myriad of preparation activities from IT operations and infrastructure to ensure the operational readiness of the release in production. We call the set of these activities production integration.
We put these activities into four groups:
- Batch automation. New application releases often require new jobs and changes to jobs and job plans. Their impact on existing production chains must be assessed and controlled.
- Data Migration automation. Operations and infrastructure frequently have to support the data migration process and provide the technical foundation and automation for it.
- Deployment automation. Changes to deployment plans and application components will impact existing deployment scripts and routines. They need to be updated and re-tested.
- System Provisioning and Configuration. In the course of a new release, infrastructure such as servers, storage and network bandwidth often has to be added. On top of that, middleware and databases need to be prepared for the new release.
Why do these activities so often become a bottleneck? Through more than 300 customer interviews, we identified the main reasons:
- Lack of software engineering methodology: Employees with sound software engineering skills are very rare in ops departments. However, sound software engineering practices are critical for building reusable and modular automation frameworks and libraries. Without them, there is a high risk that automation is built again and again for similar purposes instead of reusing what is already available.
- Priority Conflicts: In many organizations, the same teams work on production integration and production support. Urgent production issues cause shifts in priorities and delays in integration activities. The problem is exacerbated because specifications for production integration tasks are often received late in the process, leaving the teams too little time to deliver.
- Organizational Constraints: Over-specialization causes bottlenecks. We all know the database administrator who is the only person in the company able to find the root causes of performance problems, or the application server expert who exclusively owns the deployment scripts. In addition, teams spread across different locations cause communication overhead and slow down the process.
We have identified a set of measures to address the most obvious obstacles to more agility and speed in the production integration process and increase its transparency. We call this the DevOps Production Integration toolkit. It is based on three pillars.
Agile Enablement of ITO (IT Operations): We recommend organizing the delivery of production integration tasks through a tailored version of Kanban. We have developed a methodology to integrate Kanban-driven ops with Scrum-driven development in a scaled organization.
Kanban is based on managing a pipeline of tasks and visualizing their flow. With Kanban, team members pull from a prioritized backlog, and the available delivery capacity is managed to avoid bottlenecks. Kanban allows for constant re-prioritization of tasks in the pipeline.
All activities of a team, including both the analysis and resolution of production problems and production integration tasks, are managed through a single pipeline of tasks. Color-coding helps to separate different releases and production issues and to identify dependencies. The Kanban board is organized by swim lanes to account for different roles and skill sets in the team, for example database administrators, middleware administrators and production control experts.
A single prioritized backlog may be implemented either for each Kanban team or across multiple Kanban teams. The backlog contains all items these teams are responsible for: all production integration requirements (we call them automation stories) and all production issues that need to be resolved.
The backlog is managed by a product owner (PO), who collects and maintains requirements from all stakeholders (development, operations, security, first line support and others) and is responsible for their constant (re-)prioritization in order to create and maintain a fully ranked list of items. Requirements from development are typically gathered in the development sprint planning meetings.
The PO writes automation stories, which cover the “what” and the “why”, adds them to the backlog and works with the Kanban team on the creation of tasks.
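The pull principle described above can be illustrated with a small Python sketch. It is only a minimal model under our own assumptions, not a description of any particular Kanban tool: the names `BacklogItem`, `KanbanBoard`, `wip_limits` and the numeric `rank` field are hypothetical, and "managed capacity" is reduced to a simple work-in-progress limit per swim lane.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    title: str
    kind: str   # "automation_story" or "production_issue"
    lane: str   # swim lane, e.g. "dba" or "middleware"
    rank: int   # position in the ranked backlog; lower means more urgent


class KanbanBoard:
    """Hypothetical board: one ranked backlog, one WIP limit per swim lane."""

    def __init__(self, wip_limits):
        self.wip_limits = dict(wip_limits)
        self.backlog = []
        self.in_progress = {lane: [] for lane in self.wip_limits}

    def add(self, item):
        self.backlog.append(item)

    def reprioritize(self, title, new_rank):
        # The product owner can re-rank items at any time.
        for item in self.backlog:
            if item.title == title:
                item.rank = new_rank

    def pull(self, lane):
        """Pull the highest-ranked item for a lane, unless the lane is full."""
        if len(self.in_progress[lane]) >= self.wip_limits[lane]:
            return None  # capacity exhausted: finish something first
        candidates = [item for item in self.backlog if item.lane == lane]
        if not candidates:
            return None
        item = min(candidates, key=lambda i: i.rank)
        self.backlog.remove(item)
        self.in_progress[lane].append(item)
        return item

    def finish(self, item):
        self.in_progress[item.lane].remove(item)


# Usage: production issues and automation stories share one backlog.
board = KanbanBoard({"dba": 1, "middleware": 2})
board.add(BacklogItem("Schema migration job", "automation_story", "dba", rank=3))
board.add(BacklogItem("Slow query in production", "production_issue", "dba", rank=1))
current = board.pull("dba")   # the urgent production issue is pulled first
blocked = board.pull("dba")   # None while the lane is at its WIP limit
```

Keeping production issues and automation stories in the same structure is the point of the sketch: an urgent issue changes the ranking, but it cannot silently push the team beyond its capacity.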
Version Control for ITO: While advanced version control is standard practice in application development teams today, it is not yet widespread in IT ops and infrastructure. However, as release velocity increases, the number of versions of scripts and other automation artifacts that are created, used and modified across different environments grows significantly, and automated version control becomes very important.
Modern version control systems support storing, tagging, branching, comparing and merging many versions. The most recent generation of these tools, such as Git, is based on the concept of decentralized repositories and on pulling rather than pushing changes.
However, the administration UIs and consoles of ITOM (IT Operations Management) software often do not support out-of-the-box integration with Git or other version control systems. They usually store objects in proprietary databases. Quick and easy integrations of version control with these UIs and consoles therefore become essential.
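One simple way to approach such an integration is to export the objects from the tool as text files and commit them to a Git repository. The Python sketch below illustrates the idea under several assumptions: the objects have already been read from the ITOM tool into a dictionary (how to do that is tool-specific and not shown), the functions `export_objects` and `commit_snapshot` are hypothetical names, and the `git` command line client is installed.

```python
import json
import subprocess
from pathlib import Path


def export_objects(objects, repo_dir):
    """Write each object as a JSON file with sorted keys.

    A stable file layout and key order keep the diffs between two
    exports limited to real changes.
    """
    repo = Path(repo_dir)
    written = []
    for name, definition in sorted(objects.items()):
        path = repo / f"{name}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        text = json.dumps(definition, indent=2, sort_keys=True) + "\n"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


def commit_snapshot(repo_dir, message):
    """Commit the exported files; return False if nothing changed."""
    def git(*args):
        result = subprocess.run(
            ["git", *args], cwd=repo_dir, check=True,
            capture_output=True, text=True,
        )
        return result.stdout

    if not (Path(repo_dir) / ".git").exists():
        git("init")
    git("add", "--all")
    if not git("status", "--porcelain").strip():
        return False  # export is identical to the last commit
    # Placeholder identity for the export job (an assumption of this sketch).
    git("-c", "user.name=ito-export",
        "-c", "user.email=ito-export@example.invalid",
        "commit", "-m", message)
    return True
```

A scheduled job could call `export_objects` followed by `commit_snapshot` after each change window, so that every modification made through the console ends up as a commit that can be compared, tagged or reverted.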
Agile Metrics and Dashboards for ITO: Today, availability indicators such as MTBF (Mean Time Between Failures) are the most commonly used performance indicators in ITO. However, to incentivize and motivate your teams towards more agility, it is important to combine them with agile indicators. Examples of agile indicators are:
- The wait time of a requirement in the backlog until it is processed
- The cycle time of a task through the development pipeline
- The ratio of idle time to processing time for tasks in the pipeline
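To make these three indicators concrete, here is a minimal Python sketch of how they could be computed from task timestamps. The `TaskRecord` fields are assumptions made for illustration; in particular, the `worked` field (the time a task was actively processed) would have to come from whatever tracking your collaboration tool offers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskRecord:
    created: datetime    # requirement entered the backlog
    started: datetime    # task was pulled into the pipeline
    finished: datetime   # task left the pipeline
    worked: timedelta    # time the task was actively processed (assumed field)


def wait_time(task):
    """Time the requirement spent in the backlog before being processed."""
    return task.started - task.created


def cycle_time(task):
    """Time the task spent in the pipeline from pull to completion."""
    return task.finished - task.started


def idle_ratio(task):
    """Idle time in the pipeline divided by active processing time."""
    if task.worked <= timedelta(0):
        raise ValueError("processing time must be positive")
    idle = cycle_time(task) - task.worked
    return idle / task.worked


# Usage with made-up timestamps.
task = TaskRecord(
    created=datetime(2024, 1, 1, 9, 0),
    started=datetime(2024, 1, 3, 9, 0),
    finished=datetime(2024, 1, 5, 9, 0),
    worked=timedelta(hours=12),
)
print(wait_time(task), cycle_time(task), idle_ratio(task))
```

In this made-up example the task waits two days in the backlog, spends two days in the pipeline, and is idle three times as long as it is worked on.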
Most agile collaboration tools make it easy to measure and extract this data. It is useful to:
- Combine selected agile and availability indicators into KPIs; their relative weights depend on the context and the objectives of the organization
- Measure their current baseline
- Set realistic KPI targets
- Implement an automated measurement system
- Extract and visualize consolidated data through a dashboard
- Hold regular performance reviews or improvement meetings to analyze performance and performance trends, identify root causes of weak performance and define measures to improve it
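How indicators are combined into a KPI depends on the organization, so the following Python sketch shows only one possible scheme, not a recommended formula. It assumes that each indicator is scored by its progress from the measured baseline (0.0) towards the target (1.0) and that the scores are then combined using relative weights. The names `Indicator`, `progress` and `composite_kpi` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    baseline: float   # measured starting point
    target: float     # agreed KPI target
    current: float    # latest measurement
    weight: float     # relative weight chosen by the organization


def progress(ind):
    """Progress from baseline (0.0) to target (1.0), clamped to that range.

    The same formula works whether higher values are better (MTBF) or
    lower values are better (wait time), because the direction is
    implied by the order of baseline and target.
    """
    if ind.target == ind.baseline:
        raise ValueError(f"{ind.name}: target must differ from baseline")
    raw = (ind.current - ind.baseline) / (ind.target - ind.baseline)
    return max(0.0, min(1.0, raw))


def composite_kpi(indicators):
    """Weighted average of the progress of all indicators."""
    total_weight = sum(ind.weight for ind in indicators)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(ind.weight * progress(ind) for ind in indicators)
    return weighted / total_weight


# Usage with made-up numbers: one availability and one agile indicator.
indicators = [
    Indicator("MTBF (hours)", baseline=200, target=300, current=250, weight=0.5),
    Indicator("Backlog wait time (days)", baseline=10, target=5, current=6, weight=0.5),
]
score = composite_kpi(indicators)
```

A score computed this way can feed a dashboard directly, and tracking it over time gives the trend to discuss in the improvement meetings.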
The DevOps Production Integration toolkit will not solve all agility constraints in ops, but, complemented by automation tools and other measures, it can be a powerful answer to many DevOps challenges. It can be implemented locally in ITO and does not require a major replacement of tools or an organizational redesign.