In today’s data-intensive world, data catalogs are, as the name implies, simply catalogs of information, some of it human-entered and some extracted from system metadata. The importance of data in the enterprise has grown, and analytics plays an increasingly important role in generating value. So while it is important to collect, store, and organize all this data, data catalogs are missing a major piece of the puzzle: they fail to capture accurate relationships between data assets, how they relate to each other, and how they are calculated from one another.
It is like having a catalog that describes every electronic component in the world, but without an understanding of how to assemble those components into a clock, a radio or even a computer, or how to diagnose what needs replacing when the device stops working.
Data catalogs cannot answer a whole host of questions along the lines of, “What are the consequences if I make a change to this dataset?” Even if the information in the data catalog is religiously updated, users impacted by a change, whether directly or indirectly, have no way of knowing that the change was made. As a result, the risk of unidentified issues is high.
The table below summarizes the strengths and weaknesses of traditional data catalog products.
| Issue | Data Catalog Strengths | Data Catalog Weaknesses |
| --- | --- | --- |
| Data Discovery | Automated collection and aggregation of metadata; frameworks for documenting datasets | No reliable view of how datasets and processing depend on each other, or of who uses a dataset directly or indirectly |
| Data Governance | Allow authors to tag datasets for governance | Depend on author compliance; no propagation of access rules to derived datasets |
| Infosec Compliance | Facilitate documentation of processing that touches sensitive data | Rely on manual documentation and on inspection of the actual data, which is itself a risk |
| Data Quality | Help locate the datasets where issues are found | Cannot identify the direct and indirect consumers impacted by a quality problem |
Data Discovery
Data catalog solutions provide powerful frameworks for collecting metadata and facilitating documentation of datasets. The collection and aggregation of metadata is automated, and data engineers and analysts are, ideally, diligent about documenting their work so others can leverage it, though in practice they rarely are.
Unfortunately, data catalogs fail to identify how datasets and the processing that produces them rely on each other. There is no reliable way to identify which users access and depend on particular datasets, whether directly or indirectly.
Tasked with this objective, a data engineer can manually answer these questions by searching and analyzing the relevant data processing code. This approach consumes expensive data engineering resources, takes time, and reduces analyst and data engineering productivity.
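For a sense of what that manual analysis involves, here is a minimal sketch that scans SQL processing code for table references and builds a crude dependency map. The regular expressions, SQL snippet and table names are assumptions made for illustration; real pipelines (dynamic SQL, UDFs, notebooks) are far harder to analyze, which is why the manual approach is so costly.

```python
import re
from collections import defaultdict

# Hypothetical example: infer crude table-level dependencies from SQL scripts.
# Real processing code (dynamic SQL, views, UDFs) is much harder to analyze,
# which is exactly why manual review is slow and error-prone.

TARGET_RE = re.compile(r"(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_dependencies(sql_text):
    """Return {target_table: set(source_tables)} for one SQL script."""
    deps = defaultdict(set)
    targets = TARGET_RE.findall(sql_text)
    sources = set(SOURCE_RE.findall(sql_text))
    for target in targets:
        deps[target] |= sources - {target}
    return deps

if __name__ == "__main__":
    sql = """
    INSERT INTO analytics.daily_revenue
    SELECT o.order_date, SUM(o.amount)
    FROM sales.orders o
    JOIN sales.refunds r ON r.order_id = o.id
    GROUP BY o.order_date;
    """
    # e.g. {'analytics.daily_revenue': {'sales.orders', 'sales.refunds'}}
    print(dict(extract_dependencies(sql)))
```

Even this toy example shows why the work does not scale: every script must be read, and indirect dependencies only emerge after the whole graph has been assembled.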
Data Governance
When new datasets are created, their authors must tag the data so that governance controls can be applied. Not only does this rely on compliance from the author, but the rules must also be propagated to determine who can use the newly processed data. How should the rules be propagated? This is a complex question. For example, the city in an address may be restricted, while aggregates across that city may not be. Defining these rules is complex in and of itself; applying them is even more so.
Relying on data engineers and analysts to comply with data governance documentation and enforcement systems is error-prone and expensive. Sensitive datasets are at high risk of being exposed.
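As a rough illustration of the propagation problem, the sketch below carries column-level sensitivity tags through a derived dataset and lifts a restriction only when an aggregation removes the restricted detail. The tag format, the rule and the transformation model are all assumptions invented for this example, not a real policy engine.

```python
from dataclasses import dataclass, field

# Hypothetical governance-propagation sketch. Tags such as "restricted:city"
# are invented for illustration; real policies are far richer.

@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)

def propagate_tags(inputs, output_name, aggregated_over=()):
    """Derive tags for an output column from its input columns.

    A tag of the form 'restricted:<field>' is inherited by the output unless
    the output aggregates over that field (the record-level detail that made
    the data sensitive disappears once it is aggregated away).
    """
    tags = set()
    for col in inputs:
        for tag in col.tags:
            field_name = tag.split(":", 1)[-1]
            if field_name in aggregated_over:
                continue  # detail removed by aggregation, restriction lifted
            tags.add(tag)
    return Column(output_name, tags)

if __name__ == "__main__":
    city = Column("city", {"restricted:city"})
    amount = Column("amount", set())

    # Per-city revenue keeps the restriction...
    per_city = propagate_tags([city, amount], "revenue_by_city")
    # ...while a total aggregated across cities does not.
    national = propagate_tags([city, amount], "national_revenue",
                              aggregated_over={"city"})
    print(per_city.tags)   # {'restricted:city'}
    print(national.tags)   # set()
```

Even this simplified model has to be applied to every derived dataset, automatically and consistently, for governance to hold; relying on each author to re-derive the rules by hand is where the approach breaks down.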
Infosec Compliance
Current solutions to information security compliance rely on:
- Users manually documenting whether their data processing activities use (or produce) infosec-sensitive data.
- Automated identification of infosec-sensitive data by inspecting the actual data and applying algorithms to determine how it needs to be secured.
Data catalogs facilitate user documentation of data processing activities to meet infosec requirements. This relies on humans manually maintaining documentation and systems, and on inspecting the actual data. These processes present significant security risks in and of themselves, because both data inspection, which is itself a risk, and skilled manual work are required.
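To make the second approach concrete, here is a minimal sketch of pattern-based detection of sensitive values. The detectors, sample data and threshold are assumptions for illustration; note that the scan must read the raw values, which is exactly the exposure described above.

```python
import re

# Hypothetical sketch of automated sensitive-data detection by inspection.
# Patterns and the 80% threshold are illustrative only; real scanners use
# many more detectors (and often ML), but the point stands: the scanner must
# read the raw values, which is itself a security exposure.

DETECTORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}

def classify_column(values, threshold=0.8):
    """Return detector names whose pattern matches most sampled values."""
    hits = {}
    non_null = [v for v in values if v]
    for name, pattern in DETECTORS.items():
        matches = sum(1 for v in non_null if pattern.match(str(v)))
        if non_null and matches / len(non_null) >= threshold:
            hits[name] = matches / len(non_null)
    return hits

if __name__ == "__main__":
    sample = ["alice@example.com", "bob@example.org", None, "carol@example.net"]
    print(classify_column(sample))  # e.g. {'email': 1.0}
```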
Data Quality
Data engineering teams already have processes to identify when there is an issue landing data in a table. Automated systems are available that inspect data values and make intelligent decisions (some even based on machine learning) about whether or not the data is acceptable. These systems exist outside of traditional data catalogs.
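A minimal sketch of the kind of check such systems run before accepting a batch is shown below; the column names and thresholds are assumptions chosen for the example.

```python
# Hypothetical pre-landing data quality check. Column names and thresholds
# are assumptions for illustration; real systems add schema checks, drift
# detection and, in some products, ML-based anomaly scoring.

def check_batch(rows, min_rows=100, max_null_rate=0.05):
    """Return a list of problems found in a batch of order records."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows (expected >= {min_rows})")

    null_amounts = sum(1 for r in rows if r.get("amount") is None)
    if rows and null_amounts / len(rows) > max_null_rate:
        problems.append(f"amount null rate {null_amounts / len(rows):.1%} too high")

    negative = [r for r in rows if r.get("amount") is not None and r["amount"] < 0]
    if negative:
        problems.append(f"{len(negative)} rows with negative amount")
    return problems

if __name__ == "__main__":
    batch = [{"amount": 10.0}] * 50 + [{"amount": None}] * 10 + [{"amount": -5.0}]
    for problem in check_batch(batch):
        print("DQ issue:", problem)
```

Detecting the bad batch, however, is only half the job.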
The question is, whom will that problem impact? You must identify the immediate users of the data, and you must also identify the indirect users of the poor-quality data downstream.
The current approach is to have skilled data engineers review all data processing code and manually identify users to warn, which is both expensive and time-consuming. Beyond that, there is the risk that a user is missed: what if the error impacts a report that lands on the CFO’s desk and no one warns them?
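Once a lineage graph exists (the kind a data engineer would otherwise reconstruct by hand), finding everyone to warn is a straightforward graph traversal, as the sketch below shows. The edges and the dataset-to-owner mapping are invented for illustration; maintaining that graph accurately is precisely what data catalogs do not do.

```python
from collections import deque

# Hypothetical downstream-impact traversal. The lineage edges and the
# dataset -> owner mapping are invented for illustration.

LINEAGE = {  # dataset -> datasets derived from it
    "sales.orders": ["analytics.daily_revenue"],
    "analytics.daily_revenue": ["finance.cfo_weekly_report", "marketing.kpis"],
}
OWNERS = {
    "analytics.daily_revenue": "analytics-eng@acme.example",
    "finance.cfo_weekly_report": "cfo-office@acme.example",
    "marketing.kpis": "growth-team@acme.example",
}

def downstream_impact(dataset):
    """Return every dataset directly or indirectly derived from `dataset`."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

if __name__ == "__main__":
    for ds in sorted(downstream_impact("sales.orders")):
        print(f"warn {OWNERS.get(ds, 'unknown owner')} about {ds}")
```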
Data catalog solutions are good at solving the problem for which they were initially designed – finding datasets that are of interest to data engineers and data analysts.
The risk in DevOps is that data catalogs are relied upon for data discovery, data governance, infosec compliance and data quality, even though they fail to address these issues in today’s complex data processing environments.