Walgreens reminds us to keep our database healthy

It’s clear that the super app you developed with your team, is really something, and it has high chance to be the “Bread and butter 3.0”, but are you prepared for the worst? There is a good chance that some combination of data, events and circumstances out of your control will eventually collude against you, to cause a failure or a software defect. You can try to anticipate and reason about all of the things that can go wrong with your application to prepare, but in a complex system it’s impossible to predict everything.

Not even big organizations, such as Walgreens or the State Department, both with huge budgets for IT and a large technical workforce, can avoid software disasters.

A few days ago, half of Walgreens’s stores were unable to fill prescriptions, for several hours, due to a database failure. Company spokesman Michael Polzin hinted that the problem arose during a database update that went awry, requiring the entire system to be shut down. This incident shows the risk of continuous delivery approach, with frequent updates to a software that it’s running, especially if the release includes changes database schema. What can they have done better? Could it have been avoided. And more importantly in the spirit of DevOps, could that have reverted to a previous version faster.

For these types of problems the consequences can be even more severe, for example the State Department’s global database for issuing travel documents, one of the largest Oracle-based data warehouses in the world that crashed last month, resulted in major delays for possibly millions of people waiting for passports and visas.

What are some of the ways organizations can avoid database updates impacting even trivial front-end software updates ?

Keep a 24/7 eyes on the system to catch problems early on and find unusual patterns that could mean underlying issues. Slow database response time, database load issues, unpredictable performance spikes and locking problems forecast serious complications that can end in crash. To constantly watch the system you will need the assistance of a software like AppDynamics’s Database Monitoring, that shows trend key performance metrics and sends alerts when database thresholds are violated or LogEntries Anomaly Detection that enables you to pro-actively identify and resolve performance issues through real-time alerting to abnormal activity or system events.
Start by planning and be ready to recover the database after a system crash.
Usually, when a database server is down, a procedure trying to restore the database to a consistent state by performing crash recovery automatically, when a user attempts to connect, is performed. Using the information saved into a transaction log, all effects of all committed transactions should be persisted in the database, and all uncommitted transactions should be rolled-back. If it succeeds the database is transactionally consistent and available for use, but if not I hope you have backups. As our case in point, the Walgreens crash, shows, you need even a better safety net: a rollback strategy, which is mandatory to prepare when you implement a change, thus recovering from disaster becomes fast, simple and verifiable.
Problems can occur through overwritten or mismatched modifications, for example, when the developer works on the application’s database mapped objects and he DBA is processing batches of changes at the same time. A powerful database change management system is the most effective way to overcome these challenges. Making all database changes to be executed using the same tools and processes across departments and individuals, with features like database version control (and not only versioning of the scripts or model, but have a second level of versioning that takes place in each database instance), continuous integration, and automation, will enable DBAs and developers to better communicate and collaborate with each other.
Architect your system with “It will crash” philosophy in mind, so with the belief that servers can die and routers or network links can crash at any time. If you have “no single point of failure” databases, where data is duplicated to several nodes, in order to avoid data loss during a node failure, and new machines can supplement the existing ones, to increase the capacity and data protection of your cluster, it will not matter if one is down. Amazon Dynamo and Apache Cassandra are databases in this category and their failover procedure allows, if a node fails or becomes unreachable, to automatically try other nodes in the cluster and to schedule reconnections to the dead node(s) in the background.
Evaluate target environments for conflicts and dependencies early on using for example a simulator like Forecast (from Datical) This software that is creating a model of the target environment in memory, simulates proposed changes on top of that model, and alerts if error conditions occur, or if data loss and performance issues are happening, without actually touching the target database. Because of it, DB admins can make changes several times during the development cycle, to go past issues that would normally be discovered only in the pre-release stage.

“USA Today” reported that “At least 55 major contractors were involved in creating, managing, or supporting HealthCare.gov,” and Forbes advanced the supposition that part of the data glitches for the site originated in the requirements phase. If that was case, it shows that in a large project involving hundreds of people, analysts and developers were disconnected, possible because the development process was overseen separately from the one of gathering the requirements.

New methodologies, such as Kanban provide a better choice to use with DevOps, including the entire flow, finalized with delivery, instead of just the development process, so consider the choice of changing your methodology. Dumping Scrum for Kanban, which changed the workflow, allowing to be added, beside other things, an automated database deployment and release to the DevOps pipeline, could be a better option.

But it’s not that simple, as several challenges make databases difficult to deploy: their complexity, sheer size (data volume), production-specific tooling and configuration, to name just a few. None of these are necessarily an imperative issue in itself, however, taken together, they can create a serious threat to a successful deployment.

What to do for better database deployability?

To do performance tuning, is tempting, but it should only be made in a pre-live environment, as changes should be minimized in production, ideally, the only database changes that should be made in production are changes to live data.
Designing it right and by moving the non-core data and logic out of the main tables it will reduce the database complexity and make it more straight forward to human understanding.
Database archiving should be a mandatory part of your enterprise strategy. Archiving not often requested data can improve performance and radically decrease the size of the database.
Divide a database into several. Consider information architecture and analyze your data system and separate the data that rightfully belongs in a large central database from the data that has been added out of inertia or convenience, and store them separately. Patterns, such as Event Sourcing, Command Query Responsibility Segregation, and message-queue based publish/subscribe asynchronous can handle cases with several data stores and can help you in the design.
It’s not a good idea to get BI(Business Intelligence) by combining reporting and production queries on the same transactional database. Production queries are short, doing besides reading, updating and inserting, while reporting queries are read only, but longer running and may involve many records from multiple tables to produce aggregate results. The problem is that a longer running reporting query could get in the way of production queries, preventing their timely execution and creating performance issues, so sidestep trouble by using a separate reporting database.
Create a lab. This is key. One that includes not just the software layer but the backend and all infrastructure matching your production environment. There is lot of initial overhead at first but IaaS tools like AWS, Azure, CloudShare can help speed the setup and maintenance a lot. Once the lab is setup using a tool like ScriptRock to ensure consistency, or regularly scripting the lab with consistent Chef and Puppet scripts will ensure consistency.
It’s difficult to predict behavior before the deployment, if database technologies, configurations, and features are not the same in development as in production. If the cost of some license is prohibitive, to the point that it can be only used in one environment, better avoid it because the unpredictability of a system without testing is not an acceptable risk you should take. Do not use production-only database technologies, tooling and configuration.
DBAs usually like to make the database as easy as possible to manage, especially if it has significant complexity, but while optimizing data for easy administration purposes sound appealing, it could create deployment headaches. For example, aggregating data in the same transactional database to make it effortless to audit, makes it painless for the DBA, but it works against deployability. So do not optimize the data for easy administration purposes.

The new type of database role in a DevOps organization is the Applicative DBA, part of development and operations team, not IT, which should be not only an expert on data in the context of development, but acquainted to scaling and disaster scenarios, able to foresee these outages and reduce their impact.

Not only can you help reduce the impact of database issues during deployment, you can make it much easier to revert to previous versions of the application are stable. Keeping you out of the news like Walgreens.