DevOps.com

  • Latest
    • Articles
    • Features
    • Most Read
    • News
    • News Releases
  • Topics
    • AI
    • Continuous Delivery
    • Continuous Testing
    • Cloud
    • Culture
    • DevSecOps
    • Enterprise DevOps
    • Leadership Suite
    • DevOps Practice
    • ROELBOB
    • DevOps Toolbox
    • IT as Code
  • Videos/Podcasts
    • DevOps Chats
    • DevOps Unbound
  • Webinars
    • Upcoming
    • On-Demand Webinars
  • Library
  • Events
    • Upcoming Events
    • On-Demand Events
  • Sponsored Communities
    • AWS Community Hub
    • CloudBees
    • IT as Code
    • Rocket on DevOps.com
    • Traceable on DevOps.com
    • Quali on DevOps.com
  • Related Sites
    • Techstrong Group
    • Container Journal
    • Security Boulevard
    • Techstrong Research
    • DevOps Chat
    • DevOps Dozen
    • DevOps TV
    • Digital Anarchist
  • Media Kit
  • About
  • AI
  • Cloud
  • Continuous Delivery
  • Continuous Testing
  • DevSecOps
  • DevOps Onramp
  • Practices
  • ROELBOB
  • Low-Code/No-Code
  • IT as Code
  • More
    • Application Performance Management/Monitoring
    • Culture
    • Enterprise DevOps

Home » Blogs » What Data Catalogs Don’t Reveal

data catalog Spotify

What Data Catalogs Don’t Reveal

By: Shevek on May 12, 2021 Leave a Comment

In today’s data intensive world, data catalogs, as the name implies, are simply catalogs of information (some human-entered and some extracted from metadata in systems). The importance of data in the enterprise has grown, and analytics plays an increasingly important role in generating value. So, while it’s important to collect and store and organize all this data, data catalogs are missing a major piece of the puzzle – they fail to draw accurate relationships between data assets and how they relate to (or are calculated from) each other.

Related Posts
  • What Data Catalogs Don’t Reveal
  • Defining the Next Cycle of Technology
  • A Developer’s Guide to CCPA, GDPR Compliance
    Related Categories
  • Application Performance Management/Monitoring
  • Blogs
  • DevOps Toolbox
    Related Topics
  • data
  • data catalogs
  • devsecops
Show more
Show less

It is like having a catalog that contains a description of all the electronics components in the world, but without an understanding of how to assemble those components into a clock, a radio or even a computer. Or how to diagnose what needs replacing when the device stops working.

AppSec/API Security 2022

Data catalogs cannot answer a whole host of questions along the lines of, “What are the consequences if I make a change to this dataset?” Even if the information in the data catalog is religiously updated, if a change was, in fact, made, users impacted by that change, whether directly and indirectly, have no way of knowing that a change was made. As a result, risk of unidentified issues is high.

The table below summarizes the strengths and weaknesses of traditional data catalog products.

Issue

Data Catalog Strengths

Data Catalog Weaknesses

Data Discovery

  • Collect metadata information from a wide variety of systems
  • Give users flexibility to manually update information
  • Simplify searching for data assets across a wide variety of systems
  • Make it easy to identify the subject matter expert (SME) related to a dataset
  • Rely on manually maintained documentation
  • Do not automatically relate data sets to each other
  • Do not understand how data assets are calculated from each other
  • Fail to identify who uses a particular dataset (and how they use it)

Data Governance

  • Provide a robust framework for managing the availability, usability, integrity and security of the data
  • Rely on manually maintained user-defined rules for dictating data access
  • Cannot automatically propagate access rules from one dataset to antecedent datasets

Infosec Compliance

  • Make it easy for users to label personally identifiable information (PII) and other infosec risk data sets
  • Do not automatically infer the implications of data processing on the movement of PII inside the enterprise

Data Quality

  • Data catalogs do not actually inspect data – so they do not assist with data quality issues
  • Complete failure to identify issues and communicate who those issues impact downstream from a data quality issue

 

Data Discovery

Data catalog solutions provide powerful frameworks for collecting metadata and facilitating documentation of datasets. The collection and aggregation of metadata is automated, and data engineers/analysts are, hopefully (but rarely), compliant in documenting their work so others can leverage it.

Unfortunately, data catalogs fail to identify how datasets and processing rely on each other. There is no way to reliably identify which users access and rely on particular datasets, both directly and indirectly.

Tasked with this objective, a data engineer can manually answer these questions by searching and analyzing the relevant data processing code. This approach uses expensive data engineering resources, takes time and results in reduced analyst and data engineering productivity.

Data Governance

When new data sets are created, the authors must tag the data to control data governance. Not only does this rely on compliance from the author, but the rules must be propagated to determine who can use the newly processed data. How should the rules be propagated? This is a complex issue. For example, the city in an address may be restricted, but aggregates across that city may not be. Defining these rules, in and of themselves, is complex — applying them is even more complex.

Reliance on data engineers and analysts to comply with data governance documentation and systems for enforcement is error-prone and expensive. Sensitive datasets are at high risk of being exposed.

Infosec Compliance

Current solutions to information security compliance rely on:

  • Users manually documenting if their data processing activities use (or produce) infosec-sensitive data.
  • Automated identification of infosec-sensitive data by inspecting the actual data and using algorithms to identify how the data needs to be secured.

Data catalogs are used to facilitate user documentation of data processing activities to comply with infosec compliance requirements. This process relies on humans manually maintaining documentation and systems (and performing actual data inspection.) These processes present significant security risks in and of themselves, because both data inspection – which itself presents a risk – and skilled manual work is required.

Data Quality

Data engineering already has processes to identify when there is an issue landing data in a table. Automated systems are available that inspect data values and make intelligent decisions (some even based on machine learning) about whether or not the data is acceptable. These systems exist outside of traditional data catalogs.

The question is, who will that problem impact? You must identify who are the immediate users of the data. You must also identify the indirect users of the poor-quality data.

The current approach to this problem is to have skilled data engineers review all data processing code and manually identify users to warn – this is both expensive and time consuming. Beyond this, the risk is that a user is not identified; what if the error impacts a report that lands on the CFO’s desk and they are not warned?

Data catalog solutions are good at solving the problem for which they were initially designed – finding datasets that are of interest to data engineers and data analysts.

The risk in DevOps is that data catalogs are relied upon for data discovery, data governance, infosec compliance and data quality issues that they fail to address in today’s complex data processing environments.

Filed Under: Application Performance Management/Monitoring, Blogs, DevOps Toolbox Tagged With: data, data catalogs, devsecops

Sponsored Content
Featured eBook
The Automated Enterprise

The Automated Enterprise

“The Automated Enterprise” e-book shows the important role IT automation plays in business today. Optimize resources and speed development with Red Hat® management solutions, powered by Red Hat Ansible® Automation. IT automation helps your business better serve your customers, so you can be successful as you: Optimize resources by automating ... Read More
« Consider Telemetry When Rearchitecting Applications
Application Security and API Security »

TechStrong TV – Live

Click full-screen to enable volume control
Watch latest episodes and shows

Upcoming Webinars

Bring Your Mission-Critical Data to Your Cloud Apps and Analytics
Tuesday, August 16, 2022 - 11:00 am EDT
Mistakes You Are Probably Making in Kubernetes
Tuesday, August 16, 2022 - 1:00 pm EDT
Taking Your SRE Team to the Next Level
Tuesday, August 16, 2022 - 3:00 pm EDT

Latest from DevOps.com

Techstrong TV: Leveraging Low-Code Technology with Tools & Digital Transformation
August 15, 2022 | Mitch Ashley
Five Great DevOps Job Opportunities
August 15, 2022 | Mike Vizard
Dynatrace Extends Reach of Application Security Module
August 15, 2022 | Mike Vizard
The Rogers Outage of 2022: Takeaways for SREs
August 15, 2022 | JP Cheung
5 Ways to Prevent an Outage
August 15, 2022 | Ashley Stirrup

Get The Top Stories of the Week

  • View DevOps.com Privacy Policy
  • This field is for validation purposes and should be left unchanged.

Download Free eBook

DevOps: Mastering the Human Element
DevOps: Mastering the Human Element

Most Read on DevOps.com

MLOps Vs. DevOps: What’s the Difference?
August 10, 2022 | Gilad David Maayan
We Must Kill ‘Dinosaur’ JavaScript | Microsoft Open Sources ...
August 11, 2022 | Richi Jennings
CREST Defines Quality Verification Standard for AppSec Testi...
August 9, 2022 | Mike Vizard
GitHub Brings 2FA to JavaScript Package Manager
August 9, 2022 | Mike Vizard
What GitHub’s 2FA Mandate Means for Devs Everywhere
August 11, 2022 | Doug Kersten

On-Demand Webinars

DevOps.com Webinar ReplaysDevOps.com Webinar Replays
  • Home
  • About DevOps.com
  • Meet our Authors
  • Write for DevOps.com
  • Media Kit
  • Sponsor Info
  • Copyright
  • TOS
  • Privacy Policy

Powered by Techstrong Group, Inc.

© 2022 ·Techstrong Group, Inc.All rights reserved.