Understanding the artifacts of a given technology component is critical to an enterprise-class DevOps service offering. Second-most critical is understanding the dependencies between given components. Testing helps illuminate those dependencies.
For example, let’s assume I have a server running the previous version of the Windows operating system (OS) that I now want to upgrade to the latest version. The deployment package for the previous version of the OS is an artifact. It should also include all of the test automation I run to validate that the OS is working after I install it, and it might include the configuration files I deploy when I install that version of the OS on this type of server in this class of environment. Having these bits of automation bundled together allows me to do a successful install of my OS (assuming no hardware errors) across many servers.
As I create a new artifact for the latest Windows OS, I keep in mind everything I had in my last version. I might include a newer config file and a set of tests (albeit new ones) to make sure it is working when I am done. Bundling all these bits gives me a new Windows OS artifact or package that I can use over and over on this kind of server in this class of environment. The idea is that I am creating a single kind of artifact or package to cover this kind of technology—in this case, the latest version of the Windows OS. All my artifacts should be version-controlled. So when I’m done, I have an artifact for the previous Windows OS and one for the latest version. Each artifact version has embedded or included testing (pointers to, or discrete, automation) to verify the validity of the install, and perhaps a little more to prove the OS works.
Within my DevOps tooling, I likely have some sort of artifact management system. It may not be as formal as a full-on configuration management database (CMDB) system, but whichever tool I use will have some sort of database to manage my artifacts or packages by version. How long I retain these artifacts will be defined in my retention policies, but likely I will retain every version in use at my company today. This is step one, and it forms the foundation for my next effort in cataloging the dependencies that exist between artifacts.
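To make the idea concrete, here is a minimal sketch of a versioned artifact registry. All names (`Artifact`, `ArtifactRegistry`, the package URIs, config files, and test names) are hypothetical, not taken from any particular tool; a real system would likely be an artifact repository or a CMDB, as noted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """A version-controlled package: install payload, configs, and tests."""
    name: str        # component, e.g. "windows-os"
    version: str     # e.g. "2019" or "2022" (illustrative)
    payload: str     # URI of the deployment package
    configs: tuple   # config files for this server class/environment
    tests: tuple     # test-automation entry points bundled with the artifact

class ArtifactRegistry:
    """Retains every version still in use, per the retention policy."""
    def __init__(self):
        self._store = {}  # (name, version) -> Artifact

    def publish(self, artifact: Artifact):
        self._store[(artifact.name, artifact.version)] = artifact

    def get(self, name: str, version: str) -> Artifact:
        return self._store[(name, version)]

    def versions(self, name: str) -> list:
        return sorted(v for (n, v) in self._store if n == name)

# Both OS versions live side by side, each with its own configs and tests.
registry = ArtifactRegistry()
registry.publish(Artifact("windows-os", "2019", "pkg://os/2019",
                          ("base.cfg",), ("smoke_os",)))
registry.publish(Artifact("windows-os", "2022", "pkg://os/2022",
                          ("base.cfg", "tls.cfg"), ("smoke_os", "net_check")))
print(registry.versions("windows-os"))  # ['2019', '2022']
```

The key point is that the tests travel with the artifact: fetching a version of the package also fetches the automation that proves it installed correctly.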
Enter the Spider’s Web
Under the scenario I have described, when I deploy either the previous OS or the latest one, I would run a series of tests specific to that version to ensure the OS is working properly. As an infrastructure engineer, I tend to think my job is done at this point. But it isn’t. To truly validate that my OS is working, I really need to know what is running on the server I just deployed to. If my server was bare or new, then I am really done. But if my server has middleware running on it, then the job is not finished until I re-deploy the right version of the middleware on top of my new OS and verify it is still working properly.
To pull this off, I will need a catalog of the technologies that depend on each other. The OS example is kind of simple, because it lives at the bottom of the stack where a server is concerned. But on top of the OS might reside a middleware platform (perhaps messaging or a database). On top of the middleware might reside an application or perhaps several applications, depending on the horsepower of the server in question. Upgrading the OS component implies I am redeploying the middleware layer and then the application layer on top of that.
Enter full-stack engineering. It takes an application-centric view of the world—but a single-application-centric view. It looks at one app from top to bottom and treats the entire stack as a single version-controlled artifact. But if you have legacy environments that run multiple applications on the same physical box, perhaps on the same virtual instance, you have the makings of a spider web of dependencies. No full-stack solution (to my knowledge) has figured out how to incorporate multiple applications at the top of the same stack, varying middleware in the middle, and “n” number of virtual instances riding on one OS on one physical box. Instead, each piece of the technology tends to be treated separately. Thus the need for some sort of catalog that tracks the interdependencies between them all.
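Such a catalog can be as simple as a graph of who runs on top of whom. The sketch below uses hypothetical component names (WebSphere, Oracle, and some made-up apps) purely to illustrate walking the web of dependents from a changed component upward:

```python
# Hypothetical dependency catalog: each component lists what runs
# directly on top of it. One OS can carry several middleware platforms,
# and each platform can carry several applications.
DEPENDENTS = {
    "windows-os": ["websphere", "oracle-db"],
    "websphere":  ["app-billing", "app-orders"],
    "oracle-db":  ["app-reporting"],
}

def all_dependents(component, catalog):
    """Everything 'above' a component -- the web a change touches."""
    seen, stack = [], [component]
    while stack:
        for child in catalog.get(stack.pop(), []):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen

# Upgrading the OS touches both middleware platforms and all three apps.
print(all_dependents("windows-os", DEPENDENTS))
```

With this in place, "what do I have to retest if I upgrade X?" becomes a simple graph traversal rather than tribal knowledge.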
Starting the Domino Testing
So once I have discrete testing that is version-controlled with (or associated with) each type of technology component, I will have to re-execute those tests if anything “under” me undergoes a change. For example, if my WebSphere version is upgraded, then beyond the core WebSphere testing, I really should execute the tests associated with each application to ensure that app is still working after the upgrade. You could apply the same premise to a database layer (say, upgrading the current version of Oracle). To really know if the database upgrade was successful, you should run the tests of each dependent app and ensure that app still works after the upgrade. The results are likely to be surprising.
To execute this kind of domino testing—assuming you have the web of dependencies defined and the artifacts or packages managed—you can take one of two approaches. The first is to redeploy everything, although a “smart overwrite” deployment may skip the tests if it detects that the versions did not change from the last deployment. The second is to set up discrete testing jobs within your release orchestration software that execute across the entire web of dependencies and track the results. In either case, you will need human oversight to decide whether it is better to keep the new version or roll back to the old one. Only a human can weigh the impact—seemingly small or outright catastrophic—of a failure in a given app or set of apps.
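The second approach—discrete test jobs driven by the dependency catalog—can be sketched as follows. The component names and pass/fail outcomes here are invented for illustration; in practice each lambda would be a real test job in your release orchestration tool, and the report would feed the human keep-or-roll-back decision:

```python
# Hypothetical domino-test run: after upgrading one component, execute
# the test suite of that component and of every dependent, then report
# the results for a human to review.
DEPENDENTS = {
    "websphere": ["app-billing", "app-orders"],
}

TEST_SUITES = {
    "websphere":   lambda: True,   # core middleware checks pass
    "app-billing": lambda: True,
    "app-orders":  lambda: False,  # this app breaks after the upgrade
}

def domino_test(changed, dependents, suites):
    """Run the changed component's tests, then cascade to dependents."""
    results = {changed: suites[changed]()}
    queue = list(dependents.get(changed, []))
    while queue:
        comp = queue.pop(0)
        results[comp] = suites[comp]()
        queue.extend(dependents.get(comp, []))
    return results

results = domino_test("websphere", DEPENDENTS, TEST_SUITES)
for comp, ok in results.items():
    print(f"{comp}: {'PASS' if ok else 'FAIL -- needs human review'}")
```

Note that the middleware’s own tests pass here; only the cascade to the apps reveals the problem—which is exactly the point of domino testing.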
Impact to the Bottom Line
True quality can be assessed only through testing. Part of breaking down silo-thinking is taking a holistic view of the technology your organization supports. Just because the OS support engineer’s testing passes does not mean the upgrade was a success. For an upgrade to be a success, the tests of everyone else who relies on that technology underpinning must pass to prove it worked. The entire team needs validation when the underlying infrastructure undergoes change. The dominoes of tests must pass from end to end. Only then can true quality or true success be measured.
Of course, most engineers are reluctant to let the “app software” folks influence whether they did their own jobs right. After all, most software quality problems are in the code, not due to infrastructure errors. But if the applications your business relies on stop working after an infrastructure upgrade, no matter the reason, it is a problem for the entire business, not just the software teams. To tear down the silos, we need to look at impacting events for change across boundaries and drive toward overall results, not just departmental ones. When this occurs, the cost of quality is significantly reduced.
To continue the conversation, feel free to contact me.