It is interesting that we no longer consider teams that super-tool their apps to be slightly ‘off’, or assume they must be working in a field where systems are mission critical, even life-and-death. Now we are all collecting and reporting data points in real time, not just across the SDLC but across the entire environment. The amount of data available to the average DevOps team about what is going on with their applications is wild. But it’s not all being used.
No, seriously, it is unreal. Let me offer two anecdotes that explain why I know you’re not using it all.
Most of my regular readers know that I worked at F5 for years, the first couple of them on the DevCentral team, doing the Dev part. Their flagship product, the BIG-IP, collected an incredible amount of super-granular data. I am not exaggerating when I say an incredible amount. They had data points from the wire all the way up through how traffic routed through the load balancing algorithm to the nodes. And it was (and likely still is) all available to users via APIs. Whatever data you needed about traffic passing through the BIG-IP, the nodes behind it, or the connections between users and those nodes, you could get.
One of my projects was to offer a simplified interface to the APIs. They were complex and granular, and customers wanted fast and easy, so I was publishing code that worked as an intermediate layer.
Note: F5’s APIs, like everyone else’s, are far more advanced today than they were then, and it is easier to do what you need to do because they took the “create a user layer” part seriously. It’s complex machinery so it still has complex APIs, but accessibility is way up from those early days.
Anyway, BIG-IP collected so much data that I could not name all of the fields, let alone say when each one was pertinent. I gave up trying after a while, because if a board-level engineer needed a data point, the idea at the time was “Why not expose it?” So ‘number of connections to a given node under this algorithm’ was hidden among things like ‘microseconds to connect to DNS on that connection’ (I’m making those up; it’s been years). Some were terribly useful to end users and others were not. All were accessed in the same way, and experimenting was the only real way to tell the difference.
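To make that “intermediate layer” idea concrete, here is a minimal sketch in Python of the kind of wrapper I mean. The endpoint path and field names are invented stand-ins, not the real BIG-IP API; the point is simply that the layer surfaces the handful of fields most users actually want and hides the rest.

```python
import requests

# Hypothetical stats endpoint and field names -- stand-ins for a real,
# far more granular device API.
STATS_URL = "https://device.example.com/api/stats/nodes"

# The handful of fields most users actually care about, out of hundreds exposed.
USEFUL_FIELDS = ("node", "current_connections", "requests_per_second", "health")

def node_summary(session: requests.Session) -> list[dict]:
    """Return a simplified view of per-node stats, hiding the rest of the fields."""
    raw = session.get(STATS_URL, timeout=10).json()
    return [
        {field: entry.get(field) for field in USEFUL_FIELDS}
        for entry in raw.get("nodes", [])
    ]

if __name__ == "__main__":
    with requests.Session() as s:
        for row in node_summary(s):
            print(row)
```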
Users did not need all of that data. Users did not need the complexity and obfuscation created by supporting all of that unnecessary data.
My other example is a massive system of interconnected IoT devices that were centrally connected/controlled with a private, specialized network to manage them (actually, there were several different networks on several different mediums, and that matters where data points are concerned).
When setting this system up, we estimated about half a million IoT devices, with capabilities that ranged from passive monitoring to active control of physical equipment. Four different communications mechanisms, a couple dozen different device types, every major database (and a couple of minor ones), and command-and-control software on every major OS, managing anywhere from a handful of devices to hundreds of thousands. And all of it was super-tooled because the IoT network was mission-critical. Our databases and the infrastructure to support and protect them were a major part of our data center budget because of the sheer amount of data.
The amount of data was a flood. The management software was developed by different companies with different transports and endpoints in mind, meaning each had a different dataset. Once all of the networks were running satisfactorily, our next project was to create a unified dataset for reporting and troubleshooting. I was in charge of IT, but the top-level person was a physical engineer, because each of those half a million devices was a physical install. We fought non-stop about what was important. I was trying to map and convert data points between systems; he was trying to maximize the data available for reporting up. He often said, “Well no, we don’t need that today, but we might.” That put us in a zone where “number of devices on this system” was on par with “attenuation of the communications line to the nodes on this physical subnet.” Totally unworkable. Our compromise was to put together a small IT team to build a layer over all that raw data and offer an intelligent interface side-by-side with the raw data dumps.
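In spirit, that layer looked something like the sketch below: a per-source field map that normalizes whatever each management system reported into one small, shared record, while the raw dumps stay available for troubleshooting. The vendor names and field names here are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class DeviceRecord:
    """Unified, reporting-friendly view of a device, regardless of source system."""
    device_id: str
    network: str
    online: bool

# Each management system used its own field names; these maps are illustrative only.
FIELD_MAPS = {
    "vendor_a": {"device_id": "devId", "network": "net", "online": "isUp"},
    "vendor_b": {"device_id": "serial", "network": "segment", "online": "reachable"},
}

def normalize(source: str, raw: dict) -> DeviceRecord:
    """Map one source system's raw datapoint names onto the unified schema."""
    fields = FIELD_MAPS[source]
    return DeviceRecord(
        device_id=str(raw[fields["device_id"]]),
        network=str(raw[fields["network"]]),
        online=bool(raw[fields["online"]]),
    )

# Reporting reads the unified records; raw dumps remain available on the side.
print(normalize("vendor_a", {"devId": "A-1001", "net": "plant-3", "isUp": 1}))
print(normalize("vendor_b", {"serial": "B-77", "segment": "yard", "reachable": True}))
```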
We are there again today. I hear many of my friends in the enterprise space saying, “Because of ML, we might need that data and storage is cheap.” I disagree. Storage is an overhead that is wasted if you are not actively using the data. We are past the “all data is good data” craze; it’s time to adapt. Useful data that expands insights is all you need tooling for. Report data that will help solve problems, and if the team finds tomorrow that it needs a data point that is not currently collected, it can be added in.
There is room for debate here. The rate of change for ML systems (please stop calling them AI; they’re not, yet, and we’ll need that word when they are cognizant) means there is a reasonable argument for “all of the data.” And I’m not proposing limiting data collection. Honestly, application security specifically, and enterprise IT security overall, have benefited from data point sharing between formerly separate market spaces. If the SCA scan shows framework X is used, then the IAST can narrow its tests down to those that target X and not its competitors, for example. But honestly, don’t drown yourself in a data lake when all you need is a fish tank where you can see all the guppies.
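As a rough illustration of that SCA-to-IAST hand-off (the framework names and test suite names below are made up), the mechanism is just a filter: only run the test groups relevant to the frameworks the dependency scan actually found.

```python
# Hypothetical SCA output: frameworks the dependency scan actually found in the app.
sca_findings = {"frameworks": ["django", "celery"]}

# Hypothetical IAST test catalog, grouped by the framework each suite targets.
iast_test_suites = {
    "django": ["django-orm-injection", "django-template-xss"],
    "flask": ["flask-debug-exposure"],
    "spring": ["spring-el-injection"],
    "celery": ["celery-pickle-deserialization"],
}

# Narrow the IAST run to suites for frameworks that are actually present.
selected = [
    suite
    for framework in sca_findings["frameworks"]
    for suite in iast_test_suites.get(framework, [])
]

print(selected)  # ['django-orm-injection', 'django-template-xss', 'celery-pickle-deserialization']
```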
And keep rocking it. Even if you entered IT two years ago, you are now a seasoned vet and things have changed–beyond all of our expectations, I’d say. Yet you’re still plugging away, keeping the enterprise running. Don’t stop—just try to keep the data hoarding to a reasonable level and make sure the data you have is being used, whether as useful feeds to an ML tool or as consumable information for other people/systems.