DevOps.com

What Data Catalogs Don’t Reveal

By: Shevek on May 12, 2021

In today’s data-intensive world, data catalogs are, as the name implies, simply catalogs of information, some entered by people and some extracted from system metadata. The importance of data in the enterprise has grown, and analytics plays an increasingly important role in generating value. So, while it’s important to collect, store and organize all this data, data catalogs are missing a major piece of the puzzle: they fail to capture how data assets relate to, or are calculated from, each other.

It is like having a catalog that contains a description of all the electronics components in the world, but without an understanding of how to assemble those components into a clock, a radio or even a computer, or of how to diagnose what needs replacing when the device stops working.

Data catalogs cannot answer a whole host of questions along the lines of, “What are the consequences if I make a change to this dataset?” Even if the information in the data catalog is religiously updated, users affected by a change, whether directly or indirectly, have no way of knowing that the change was made. As a result, the risk of unidentified issues is high.

The table below summarizes the strengths and weaknesses of traditional data catalog products.

Issue: Data Discovery

Data Catalog Strengths
  • Collect metadata information from a wide variety of systems
  • Give users flexibility to manually update information
  • Simplify searching for data assets across a wide variety of systems
  • Make it easy to identify the subject matter expert (SME) related to a dataset

Data Catalog Weaknesses
  • Rely on manually maintained documentation
  • Do not automatically relate data sets to each other
  • Do not understand how data assets are calculated from each other
  • Fail to identify who uses a particular dataset (and how they use it)

Issue: Data Governance

Data Catalog Strengths
  • Provide a robust framework for managing the availability, usability, integrity and security of the data

Data Catalog Weaknesses
  • Rely on manually maintained user-defined rules for dictating data access
  • Cannot automatically propagate access rules from one dataset to antecedent datasets

Issue: Infosec Compliance

Data Catalog Strengths
  • Make it easy for users to label personally identifiable information (PII) and other infosec risk data sets

Data Catalog Weaknesses
  • Do not automatically infer the implications of data processing on the movement of PII inside the enterprise

Issue: Data Quality

Data Catalog Strengths
  • None listed

Data Catalog Weaknesses
  • Data catalogs do not actually inspect data – so they do not assist with data quality issues
  • Complete failure to identify issues and communicate who those issues impact downstream from a data quality issue

Data Discovery

Data catalog solutions provide powerful frameworks for collecting metadata and facilitating documentation of datasets. The collection and aggregation of metadata is automated, and data engineers and analysts are expected to document their work so others can leverage it, although in practice they rarely do.

Unfortunately, data catalogs fail to identify how datasets and processing rely on each other. There is no way to reliably identify which users access and rely on particular datasets, both directly and indirectly.

Tasked with this objective, a data engineer can manually answer these questions by searching and analyzing the relevant data processing code. This approach uses expensive data engineering resources, takes time and results in reduced analyst and data engineering productivity.
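
As a rough illustration of what that manual search amounts to, the sketch below pulls table-to-table dependencies out of a few SQL statements with regular expressions. The statements and table names are invented for the example, and the patterns are deliberately naive: they assume simple `INSERT INTO ... SELECT` statements and would miss subqueries, CTEs, views and quoted identifiers that a real SQL parser would handle.

```python
import re
from collections import defaultdict

# Hypothetical processing code: each statement writes one table from others.
STATEMENTS = [
    "INSERT INTO clean_orders SELECT * FROM raw_orders WHERE amount > 0",
    "INSERT INTO daily_revenue SELECT day, SUM(amount) FROM clean_orders GROUP BY day",
    "INSERT INTO cfo_report SELECT * FROM daily_revenue "
    "JOIN fx_rates ON daily_revenue.day = fx_rates.day",
]

# Naive patterns (an assumption of this sketch, not a full SQL grammar).
TARGET = re.compile(r"\bINSERT\s+INTO\s+(\w+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def extract_lineage(statements):
    """Map each written table to the set of tables it reads from."""
    lineage = defaultdict(set)
    for sql in statements:
        target = TARGET.search(sql)
        if target is None:
            continue  # not a write statement; nothing to record
        lineage[target.group(1)].update(SOURCE.findall(sql))
    return dict(lineage)


lineage = extract_lineage(STATEMENTS)
for table, sources in lineage.items():
    print(table, "<-", sorted(sources))
```

Even this toy version shows why the manual approach is costly: every processing job has to be read and interpreted before a single dependency question can be answered.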

Data Governance

When new datasets are created, the authors must tag the data to control data governance. Not only does this rely on compliance from the author, but the rules must also be propagated to determine who can use the newly processed data. How should the rules be propagated? This is a complex issue. For example, the city in an address may be restricted, but aggregates across that city may not be. Defining these rules is complex in itself; applying them is even more complex.

Reliance on data engineers and analysts to comply with data governance documentation and systems for enforcement is error-prone and expensive. Sensitive datasets are at high risk of being exposed.
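
To make the propagation question concrete, here is a minimal sketch of one possible rule, using the city example above. Everything in it is an assumption made for illustration: the `Step` record, the dataset names, and above all the policy itself (a derived dataset inherits a restriction unless the step that produces it only emits aggregates). A real policy would need considerably more nuance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One processing step: reads `sources`, writes `target`."""
    target: str
    sources: tuple
    aggregates: bool = False  # True if the step only emits aggregates


def propagate_restrictions(steps, restricted):
    """Return every dataset that should inherit a restriction.

    Assumed policy (illustrative only): a derived dataset is restricted if
    any of its sources is restricted, unless the step only emits aggregates.
    `steps` must be listed in execution order.
    """
    result = set(restricted)
    for step in steps:
        if not step.aggregates and any(s in result for s in step.sources):
            result.add(step.target)
    return result


steps = [
    Step("customer_cities", ("customer_addresses",)),
    Step("orders_by_city", ("customer_cities", "orders"), aggregates=True),
    Step("city_mailing_list", ("customer_cities",)),
]
restricted = propagate_restrictions(steps, {"customer_addresses"})
print(sorted(restricted))
```

Here `city_mailing_list` inherits the restriction through `customer_cities`, while the aggregate `orders_by_city` does not. The point is only that such rules can be applied mechanically once the dependencies between datasets are known.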

Infosec Compliance

Current solutions to information security compliance rely on:

  • Users manually documenting whether their data processing activities use (or produce) infosec-sensitive data.
  • Automated identification of infosec-sensitive data by inspecting the actual data and using algorithms to identify how the data needs to be secured.

Data catalogs are used to facilitate user documentation of data processing activities in order to meet infosec compliance requirements. This process relies on humans manually maintaining documentation and systems (and performing actual data inspection). These processes present significant security risks in and of themselves, because they require both data inspection, which itself presents a risk, and skilled manual work.

Data Quality

Data engineering already has processes to identify when there is an issue landing data in a table. Automated systems are available that inspect data values and make intelligent decisions (some even based on machine learning) about whether or not the data is acceptable. These systems exist outside of traditional data catalogs.

The question is, who will a data quality problem impact? You must identify the immediate users of the poor-quality data, and you must also identify its indirect users.

The current approach to this problem is to have skilled data engineers review all data processing code and manually identify users to warn, which is both expensive and time-consuming. Beyond this, there is the risk that a user is missed: what if the error impacts a report that lands on the CFO’s desk and they are not warned?
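
The lookup that engineers perform by hand can be pictured as a graph walk. The sketch below assumes two mappings that a traditional catalog does not provide: which datasets are computed from which, and who reads each dataset. All names are hypothetical.

```python
from collections import deque

# Hypothetical lineage: dataset -> datasets computed directly from it.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "churn_features"],
    "daily_revenue": ["cfo_report"],
}

# Hypothetical usage records: dataset -> people who read it.
READERS = {
    "clean_orders": {"analyst_a"},
    "churn_features": {"data_scientist_b"},
    "cfo_report": {"cfo"},
}


def impacted_readers(bad_dataset):
    """Walk breadth-first from the bad dataset and collect every reader,
    whether they use the data directly or through a derived dataset."""
    seen = {bad_dataset}
    queue = deque([bad_dataset])
    readers = set()
    while queue:
        current = queue.popleft()
        readers |= READERS.get(current, set())
        for child in DOWNSTREAM.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return readers


to_warn = impacted_readers("raw_orders")
print(sorted(to_warn))
```

In this example a problem in `raw_orders` reaches the CFO three steps downstream, which is exactly the kind of indirect user a manual review is likely to miss.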

Data catalog solutions are good at solving the problem for which they were initially designed – finding datasets that are of interest to data engineers and data analysts.

The risk in DevOps is that data catalogs are relied upon for data discovery, data governance, infosec compliance and data quality issues that they fail to address in today’s complex data processing environments.
