Data Governance: What, Why and How

Sairam Krish
4 min read · Dec 17, 2023


Data governance is the process of managing the availability, usability, integrity, and security of the data in enterprise systems. It involves defining and implementing policies, standards, roles, and responsibilities for data quality, access, and usage. Data governance helps organizations ensure that their data is trustworthy, consistent, and compliant with internal policies and external regulations.

What are the Subcategories of Data Governance?

Data governance can be divided into several subcategories, depending on the focus and scope of the governance activities. Some of the common subcategories are:

  • Metadata management: This involves creating and maintaining a catalog of the data assets, their definitions, attributes, relationships, and lineage. Metadata management helps to improve data understanding, discovery, and integration across the enterprise.
  • Data quality management: This involves measuring, monitoring, and improving the accuracy, completeness, validity, and timeliness of the data. Data quality management helps to ensure that the data meets the expectations and requirements of the data consumers and stakeholders.
  • Data security and privacy management: This involves protecting the data from unauthorized access, use, modification, or disclosure. Data security and privacy management helps to comply with data protection laws and regulations, such as GDPR and CCPA, and to safeguard the sensitive and confidential data of the organization and its customers.
  • Data access and usage management: This involves defining and enforcing the rules and permissions for data access and usage across the enterprise. Data access and usage management helps to distribute the data to the right people, at the right time, for the right purpose, and in the right format.
  • Data lifecycle management: This involves managing the data from its creation to its deletion, including data acquisition, storage, processing, analysis, distribution, archiving, and disposal. Data lifecycle management helps to optimize the data resources, reduce the data costs, and comply with the data retention and disposal policies.
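
To make the data quality subcategory more concrete, here is a minimal sketch of how two of the dimensions mentioned above, completeness and validity, could be measured. The records, field names, and the `looks_like_email` check are all made up for illustration; a real setup would read from actual tables and use stricter rules.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "signup": date(2023, 1, 5)},
    {"id": 2, "email": None, "signup": date(2023, 2, 11)},
    {"id": 3, "email": "not-an-email", "signup": date(2023, 3, 2)},
    {"id": 4, "email": "d@example.com", "signup": None},
]


def completeness(rows, field):
    """Fraction of rows where `field` is present (not None)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)


def validity(rows, field, is_valid):
    """Fraction of non-missing values of `field` that pass `is_valid`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def looks_like_email(value):
    # Deliberately simple check for illustration; not a full validator.
    return "@" in value and "." in value.split("@")[-1]


email_completeness = completeness(records, "email")
email_validity = validity(records, "email", looks_like_email)
print(f"email completeness: {email_completeness:.2f}")
print(f"email validity:     {email_validity:.2f}")
```

Tracking numbers like these over time is one simple way to turn "monitoring data quality" into something a team can alert on.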

What are the Open Source Tools for Data Governance?

There are many open source tools available for data governance, each offering different features and capabilities. Some of the popular open source tools are:

  • Amundsen: Amundsen is a data discovery and metadata platform, originally developed by Lyft. It allows users to search and browse the metadata of the data assets, such as tables, columns, schemas, tags, badges, owners, and descriptions. It also provides data lineage and usage information, as well as collaboration features, such as comments and feedback. Amundsen supports various data sources, such as Hive, Presto, Redshift, BigQuery, and Snowflake, and integrates with Apache Atlas for metadata management.
  • DataHub: DataHub is a metadata platform for the modern data stack, originally developed by LinkedIn. It enables users to discover, understand, and trust the data assets across the enterprise. It provides a rich metadata model, covering various aspects of the data, such as schemas, ownership, lineage, glossary, documentation, and quality. It also supports data observability, data privacy, and data access control. DataHub supports various data sources, such as Kafka, MySQL, Oracle, Hive, Spark, Elasticsearch, and Airflow, and integrates with Apache Atlas and Amundsen for metadata management.
  • Apache Atlas: Apache Atlas is a scalable and extensible framework for metadata management and governance. It provides a common metadata store, a business glossary, a data lineage service, a REST API, and a web UI. It also supports data classification, data security, data lifecycle, and data quality. Apache Atlas supports various data sources, such as Hadoop, Hive, HBase, Kafka, Storm, and Sqoop, and integrates with Apache Ranger for data security and access control.
  • Egeria: Egeria is an open source project dedicated to enabling teams to collaborate by making metadata open and automatically exchanged between tools and platforms, no matter which vendor they come from. Egeria defines the open metadata standard schema for over 800 types of metadata needed by enterprises to manage their digital resources. It also provides open APIs, frameworks, connectors, and interchange protocols for these standard types to allow tools and metadata repositories to share and exchange metadata using these open standards. Egeria supports various aspects of data governance, such as metadata management, data lineage, data quality, data security, data privacy, data access, and data lifecycle. Egeria supports various data sources, such as Hadoop, Hive, HBase, Kafka, Storm, and Sqoop, and integrates with Apache Atlas, Amundsen, and DataHub for metadata management.
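
All of the tools above mention data lineage. As a rough illustration of what a lineage service answers, the sketch below stores lineage as a plain dictionary and walks it to find everything a dataset depends on. The dataset names and the `UPSTREAMS` structure are hypothetical and are not taken from any of these tools, which keep lineage in their own metadata stores.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
UPSTREAMS = {
    "reports.daily_revenue": ["warehouse.orders_clean"],
    "warehouse.orders_clean": ["raw.orders", "raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}


def all_upstreams(dataset, upstreams):
    """Return every dataset that `dataset` depends on, directly or transitively."""
    seen = set()
    queue = deque(upstreams.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(upstreams.get(current, []))
    return seen


print(sorted(all_upstreams("reports.daily_revenue", UPSTREAMS)))
```

The same traversal, run in the other direction, answers the impact question: which reports break if a raw table changes.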

Comparison of different open source tools

My choice and reason behind it

Data governance is a challenging but essential task for any data-driven organization. To simplify it, I chose DataHub (the datahub-project repository on GitHub) as my preferred open source tool. Here are some of the benefits it offers developers:

  • It has a commercially friendly license, Apache-2.0, which permits commercial use and modification as long as the license's notice and attribution terms are followed.
  • It provides a REST API for integrating with existing microservices, making it easy to connect DataHub with other data platforms and tools.
  • It has an active community on GitHub and frequent knowledge-sharing sessions on YouTube. These resources help me learn from other users' experiences, keep up with the latest features, and get support from the developers and contributors.
  • It has a Python SDK and client library, which lets me programmatically ingest, query, and manipulate metadata from within our existing Python-based implementation.
  • It has good documentation covering installation, configuration, usage, and customization. The documentation also includes tutorials, guides, and best practices that helped me get started and optimize my data governance workflow.
  • It brings a set of good practices up front, so I don't have to design everything from scratch. For example, it supports data discovery, data observability, and federated governance through a push-based architecture, a modular design, and a rich metadata model. It also integrates with various data sources, such as Kafka, MySQL, Snowflake, and BigQuery.
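
To give a feel for the push-based idea, here is a dependency-free sketch of a producer describing one of its own datasets. The URN string follows the dataset URN format DataHub uses, and the Python SDK (the `acryl-datahub` package) ships a `make_dataset_urn` helper in `datahub.emitter.mce_builder` for this. The `dataset_urn` and `build_description_event` functions, the event shape, and the dataset name below are my own simplified stand-ins, not DataHub's actual wire format.

```python
import json


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN by hand (stand-in for the SDK helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_description_event(urn: str, description: str) -> dict:
    """Hypothetical, simplified event; the real SDK uses typed aspect classes."""
    return {
        "entityUrn": urn,
        "aspect": "datasetProperties",
        "value": {"description": description},
    }


# In a push-based setup, the system that owns the data builds and sends
# this kind of event itself, instead of waiting for a crawler to find it.
urn = dataset_urn("hive", "sales.orders")
event = build_description_event(urn, "One row per customer order")
print(json.dumps(event, indent=2))
```

In a real integration, the event would be sent with the SDK's REST emitter rather than printed.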

Written by Sairam Krish

Software Architect ★ Data Architect