In today’s data-intensive world, data catalogs are, as the name implies, simply catalogs of information, some of it human-entered and some extracted from system metadata. The importance of data in the enterprise has grown, and analytics plays an increasingly important role in generating value. So while it is important to collect, store, and organize all this data, data catalogs are missing a major piece of the puzzle: they fail to capture accurate relationships between data assets, how they relate to each other, and how they are calculated from one another.
It is like having a catalog that describes every electronic component in the world, but without an understanding of how to assemble those components into a clock, a radio or even a computer, or how to diagnose what needs replacing when the device stops working.
Data catalogs cannot answer a whole host of questions along the lines of, “What are the consequences if I make a change to this dataset?” Even if the information in the data catalog is religiously updated, users impacted by a change, whether directly or indirectly, have no way of knowing that the change was made. As a result, the risk of unidentified issues is high.
The table below summarizes the strengths and weaknesses of traditional data catalog products.
| Issue | Data Catalog Strengths | Data Catalog Weaknesses |
| --- | --- | --- |
| Data Discovery | Automated collection and aggregation of metadata; frameworks for documenting datasets | No reliable view of how datasets and processing depend on each other, or of who uses a dataset directly or indirectly |
| Data Governance | Allow authors to tag datasets for governance | Depend on author compliance; no propagation of access rules to derived datasets |
| Infosec Compliance | Facilitate documentation of processing that touches sensitive data | Rely on manual documentation and on inspection of the actual data, which is itself a risk |
| Data Quality | Help locate the datasets where issues are found | Cannot identify the direct and indirect consumers impacted by a quality problem |
Data Discovery
Data catalog solutions provide powerful frameworks for collecting metadata and facilitating documentation of datasets. The collection and aggregation of metadata is automated, and data engineers and analysts are, ideally, diligent about documenting their work so others can leverage it, though in practice they rarely are.
Unfortunately, data catalogs fail to identify how datasets and the processing that produces them rely on each other. There is no reliable way to identify which users access and depend on particular datasets, whether directly or indirectly.
Tasked with this objective, a data engineer can manually answer these questions by searching and analyzing the relevant data processing code. This approach consumes expensive data engineering resources, takes time, and reduces analyst and data engineering productivity.
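For a sense of what that manual analysis involves, here is a minimal sketch that scans SQL processing code for table references and builds a crude dependency map. The regular expressions, SQL snippet and table names are assumptions made for illustration; real pipelines (dynamic SQL, UDFs, notebooks) are far harder to analyze, which is why the manual approach is so costly.

```python
import re
from collections import defaultdict

# Hypothetical example: infer crude table-level dependencies from SQL scripts.
# Real processing code (dynamic SQL, views, UDFs) is much harder to analyze,
# which is exactly why manual review is slow and error-prone.

TARGET_RE = re.compile(r"(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_dependencies(sql_text):
    """Return {target_table: set(source_tables)} for one SQL script."""
    deps = defaultdict(set)
    targets = TARGET_RE.findall(sql_text)
    sources = set(SOURCE_RE.findall(sql_text))
    for target in targets:
        deps[target] |= sources - {target}
    return deps

if __name__ == "__main__":
    sql = """
    INSERT INTO analytics.daily_revenue
    SELECT o.order_date, SUM(o.amount)
    FROM sales.orders o
    JOIN sales.refunds r ON r.order_id = o.id
    GROUP BY o.order_date;
    """
    # e.g. {'analytics.daily_revenue': {'sales.orders', 'sales.refunds'}}
    print(dict(extract_dependencies(sql)))
```

Even this toy example shows why the work does not scale: every script must be read, and indirect dependencies only emerge after the whole graph has been assembled.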
Data Governance
When new datasets are created, their authors must tag the data so that governance controls can be applied. Not only does this rely on compliance from the author, but the rules must also be propagated to determine who can use the newly processed data. How should the rules be propagated? This is a complex question. For example, the city in an address may be restricted, while aggregates across that city may not be. Defining these rules is complex in and of itself; applying them is even more so.
Relying on data engineers and analysts to comply with data governance documentation and enforcement systems is error-prone and expensive. Sensitive datasets are at high risk of being exposed.
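As a rough illustration of the propagation problem, the sketch below carries column-level sensitivity tags through a derived dataset and lifts a restriction only when an aggregation removes the restricted detail. The tag format, the rule and the transformation model are all assumptions invented for this example, not a real policy engine.

```python
from dataclasses import dataclass, field

# Hypothetical governance-propagation sketch. Tags such as "restricted:city"
# are invented for illustration; real policies are far richer.

@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)

def propagate_tags(inputs, output_name, aggregated_over=()):
    """Derive tags for an output column from its input columns.

    A tag of the form 'restricted:<field>' is inherited by the output unless
    the output aggregates over that field (the record-level detail that made
    the data sensitive disappears once it is aggregated away).
    """
    tags = set()
    for col in inputs:
        for tag in col.tags:
            field_name = tag.split(":", 1)[-1]
            if field_name in aggregated_over:
                continue  # detail removed by aggregation, restriction lifted
            tags.add(tag)
    return Column(output_name, tags)

if __name__ == "__main__":
    city = Column("city", {"restricted:city"})
    amount = Column("amount", set())

    # Per-city revenue keeps the restriction...
    per_city = propagate_tags([city, amount], "revenue_by_city")
    # ...while a total aggregated across cities does not.
    national = propagate_tags([city, amount], "national_revenue",
                              aggregated_over={"city"})
    print(per_city.tags)   # {'restricted:city'}
    print(national.tags)   # set()
```

Even this simplified model has to be applied to every derived dataset, automatically and consistently, for governance to hold; relying on each author to re-derive the rules by hand is where the approach breaks down.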
Infosec Compliance
Current solutions to information security compliance rely on:
- Users manually documenting whether their data processing activities use (or produce) infosec-sensitive data.
- Automated identification of infosec-sensitive data by inspecting the actual data and applying algorithms to determine how it needs to be secured.
Data catalogs facilitate user documentation of data processing activities to meet infosec requirements. This relies on humans manually maintaining documentation and systems, and on inspecting the actual data. These processes present significant security risks in and of themselves, because both data inspection, which is itself a risk, and skilled manual work are required.
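To make the second approach concrete, here is a minimal sketch of pattern-based detection of sensitive values. The detectors, sample data and threshold are assumptions for illustration; note that the scan must read the raw values, which is exactly the exposure described above.

```python
import re

# Hypothetical sketch of automated sensitive-data detection by inspection.
# Patterns and the 80% threshold are illustrative only; real scanners use
# many more detectors (and often ML), but the point stands: the scanner must
# read the raw values, which is itself a security exposure.

DETECTORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}

def classify_column(values, threshold=0.8):
    """Return detector names whose pattern matches most sampled values."""
    hits = {}
    non_null = [v for v in values if v]
    for name, pattern in DETECTORS.items():
        matches = sum(1 for v in non_null if pattern.match(str(v)))
        if non_null and matches / len(non_null) >= threshold:
            hits[name] = matches / len(non_null)
    return hits

if __name__ == "__main__":
    sample = ["alice@example.com", "bob@example.org", None, "carol@example.net"]
    print(classify_column(sample))  # e.g. {'email': 1.0}
```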
Data Quality
Data engineering teams already have processes to identify when there is an issue landing data in a table. Automated systems are available that inspect data values and make intelligent decisions (some even based on machine learning) about whether or not the data is acceptable. These systems exist outside of traditional data catalogs.
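A minimal sketch of the kind of check such systems run before accepting a batch is shown below; the column names and thresholds are assumptions chosen for the example.

```python
# Hypothetical pre-landing data quality check. Column names and thresholds
# are assumptions for illustration; real systems add schema checks, drift
# detection and, in some products, ML-based anomaly scoring.

def check_batch(rows, min_rows=100, max_null_rate=0.05):
    """Return a list of problems found in a batch of order records."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows (expected >= {min_rows})")

    null_amounts = sum(1 for r in rows if r.get("amount") is None)
    if rows and null_amounts / len(rows) > max_null_rate:
        problems.append(f"amount null rate {null_amounts / len(rows):.1%} too high")

    negative = [r for r in rows if r.get("amount") is not None and r["amount"] < 0]
    if negative:
        problems.append(f"{len(negative)} rows with negative amount")
    return problems

if __name__ == "__main__":
    batch = [{"amount": 10.0}] * 50 + [{"amount": None}] * 10 + [{"amount": -5.0}]
    for problem in check_batch(batch):
        print("DQ issue:", problem)
```

Detecting the bad batch, however, is only half the job.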
The question is, whom will that problem impact? You must identify the immediate users of the data, and you must also identify the indirect users of the poor-quality data downstream.
The current approach is to have skilled data engineers review all data processing code and manually identify users to warn, which is both expensive and time-consuming. Beyond that, there is the risk that a user is missed: what if the error impacts a report that lands on the CFO’s desk and no one warns them?
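Once a lineage graph exists (the kind a data engineer would otherwise reconstruct by hand), finding everyone to warn is a straightforward graph traversal, as the sketch below shows. The edges and the dataset-to-owner mapping are invented for illustration; maintaining that graph accurately is precisely what data catalogs do not do.

```python
from collections import deque

# Hypothetical downstream-impact traversal. The lineage edges and the
# dataset -> owner mapping are invented for illustration.

LINEAGE = {  # dataset -> datasets derived from it
    "sales.orders": ["analytics.daily_revenue"],
    "analytics.daily_revenue": ["finance.cfo_weekly_report", "marketing.kpis"],
}
OWNERS = {
    "analytics.daily_revenue": "analytics-eng@acme.example",
    "finance.cfo_weekly_report": "cfo-office@acme.example",
    "marketing.kpis": "growth-team@acme.example",
}

def downstream_impact(dataset):
    """Return every dataset directly or indirectly derived from `dataset`."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

if __name__ == "__main__":
    for ds in sorted(downstream_impact("sales.orders")):
        print(f"warn {OWNERS.get(ds, 'unknown owner')} about {ds}")
```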
Data catalog solutions are good at solving the problem for which they were initially designed – finding datasets that are of interest to data engineers and data analysts.
The risk in DevOps is that data catalogs are relied upon for data discovery, data governance, infosec compliance and data quality, even though they fail to address these issues in today’s complex data processing environments.