Generative AI has already started shaking the world of Data Governance, and it is set to keep doing so.
It has been just six months since ChatGPT’s release, but it already feels like we need a retrospective. In this piece, I’ll explore how generative AI is impacting data governance, and where it’s likely to take us in the near future. Let me emphasize near, because things evolve quickly and can go a lot of different ways. This article isn’t about forecasting the next 100 years of data governance, but rather a practical look at the changes happening now and those just on the horizon.
Before diving in, let’s remind ourselves of what data governance deals with.
Keeping things simple, data governance is the set of rules and processes that an organization follows to ensure its data is trustworthy. It involves five key areas:
- Metadata and Documentation
- Search and Discovery
- Policies and Standards
- Data Privacy and Security
- Data Quality
In this piece, we’ll look at how each of these areas is set to evolve once we bring generative AI into the mix.
Let’s do this!
Metadata and documentation is probably the most important part of data governance, and the other parts depend heavily on this one being done properly. AI has already started to change the way we create data context, and it will continue to do so. But I don’t want to get your hopes too high. We still need humans in the loop when it comes to documentation.
Producing context around data, or documenting the data, has two parts. The first part, which makes up about 70% of the job, involves documenting general information that is common to many companies. A very basic example is the definition of “email”, which is the same across companies. The second part is about writing down the specific know-how that’s unique to your company.
Here’s the exciting part: AI can do a lot of the heavy lifting for the first 70%. It’s because the first element involves general knowledge, and generative AI is excellent at handling that.
Now, what about knowledge that’s peculiar to your company? Every organization is unique, and this uniqueness gives rise to your own specific company language. This language is your metrics, KPIs, and business definitions. And it isn’t something that can be imported from outside. It’s born from the people who know the business best: its employees.
In my conversations with data leaders, I often discuss how to create a shared understanding of these business concepts. Many leaders share that to achieve this alignment, they bring domain teams into the same room to talk, debate, and agree upon the definitions that best fit their business model.
Let’s take, for example, the definition of a ‘customer.’ For a subscription-based business, a customer could be someone who’s currently subscribed to their service. But for a retail business, a customer might be anyone who’s made a purchase in the last 12 months. Each company defines ‘customer’ in a way that makes the most sense for them, and this understanding usually emerges from within the organization.
When it comes to such company-specific knowledge, AI, as smart as it is, can’t do this part just yet. It can’t sit in on your meetings, join in the discussion, or help new concepts bloom. According to Andreessen Horowitz, this might become possible when the second wave of AI hits. For now, we are still at wave 1.
I’d also like to touch on a question posed by Benn Stancil. Benn asks: If a bot can write data documentation on demand for us, what’s the point of writing it down at all?
There is some truth to this: if generative AI can generate content on demand, why not just generate it when you need it, instead of bothering with documenting everything? Unfortunately, it does not work like this, for two reasons.
First, as I’ve previously explained, a part of documentation covers the unique aspects of a company that AI cannot capture yet. This calls for human expertise. It cannot be generated on the fly by AI.
Second, while AI is advanced, it’s not infallible. The content it generates isn’t always accurate. You need to make sure a human checks and confirms all AI-produced content.
Generative AI is not just changing the way we create documentation but also how we consume it. In fact, we’re witnessing a paradigm shift in search and discovery methods. The traditional methods, where analysts search through your data catalog seeking out relevant information, are quickly becoming outdated.
A true game changer lies in AI’s ability to become a personal data assistant to everyone in the company. In some data catalogs, you can already approach the AI with your specific data inquiries. You can ask questions such as, “Is it possible to perform action X with the data?”, “Why am I unable to use the data to achieve Y?”, or “Do we possess data that illustrates Z?”. If your data is enriched with the right context, AI will help disseminate this context across the whole company.
Another development we’re expecting is that AI will transform the data catalog from a passive entity to an active helper. Think about it this way: if you’re using a formula incorrectly, the AI assistant could give you a heads-up. Likewise, if you’re about to write a query that already exists, the AI could let you know and guide you to the existing piece of work.
In the past, data catalogs just sat there, waiting for you to sift through them for answers. But with AI, catalogs could start actively helping you, offering insights and solutions before you even realize you need them. This would be a complete shift in how we engage with data, and it might be happening very soon.
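The "query that already exists" heads-up mentioned above can be approximated without any AI at all, which helps show what the assistant would be doing under the hood. The sketch below is only an illustration: the catalog contents, the function names, and the 0.9 similarity cutoff are assumptions, and a real assistant would likely compare queries semantically rather than by text similarity.

```python
import difflib
import re
from typing import Dict, List, Tuple


def normalize(sql: str) -> str:
    """Lowercase the query, collapse whitespace, and drop a trailing semicolon."""
    collapsed = re.sub(r"\s+", " ", sql.strip().lower())
    return collapsed.rstrip(";").strip()


def find_similar_queries(
    new_query: str, catalog: Dict[str, str], cutoff: float = 0.9
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs for saved queries that closely match new_query."""
    target = normalize(new_query)
    matches = []
    for name, sql in catalog.items():
        score = difflib.SequenceMatcher(None, target, normalize(sql)).ratio()
        if score >= cutoff:
            matches.append((name, score))
    # Best match first.
    return sorted(matches, key=lambda match: match[1], reverse=True)


# Hypothetical saved queries; names and SQL are illustrative only.
SAVED_QUERIES = {
    "monthly_revenue": "SELECT month, SUM(amount) FROM orders GROUP BY month;",
    "active_users": "SELECT COUNT(*) FROM users WHERE active = true",
}
```

Calling `find_similar_queries` with a draft that differs from a saved query only in casing or spacing returns that saved query, which the catalog could then surface to the analyst before they rewrite it.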
Yet, there is a condition for the AI assistant to work effectively: your data catalog must be maintained. To ensure that the AI assistant provides reliable guidance to stakeholders, the underlying documentation must be 100% trustworthy. If the catalog is not properly maintained, or if the policies are not clearly defined, then the AI assistant will spread incorrect information throughout the company. This would be more detrimental than having no information at all, as it could lead to poor decision-making based on the wrong context.
You’ve probably understood it: AI and data governance are interdependent. AI can enhance data governance, but in turn, robust data governance is required to fuel the capabilities of AI. This results in a virtuous cycle where each component boosts the other. But you need to keep in mind that no element can replace the other.
Another key component of data governance is the formulation and implementation of governance rules.
This usually involves defining data ownership and domains within the organization. Right now, AI isn’t up to the task when it comes to defining these policies and standards. AI shines when it comes to executing rules or flagging infractions, but it is lacking when tasked with creating the rules themselves.
This is for a simple reason. Defining ownership and domains pertains to human politics. For example, ownership means deciding who within the organization has authority over specific datasets. This could include the power to make decisions about how and when the data is used, who has access to it, and how it’s maintained and secured. Making these decisions often involves negotiating between individuals, teams, or departments, each with their own interests and perspectives. And human politics, for obvious reasons, cannot be replaced by AI.
We thus expect that humans will continue to play a significant role in this aspect of governance in the near future. Generative AI can play a role in drafting an ownership framework or suggesting data domains. Even so, keeping humans in the loop remains a must.
However, generative AI is set to shake things up in the privacy department of governance. Managing privacy rights is a traditionally feared aspect of governance. Nobody enjoys it. It involves manually creating a complex architecture of permissions to make sure sensitive data is protected.
The good news is: AI can automate much of this process. Given parameters such as the number of users and their respective roles, AI can create rules for access rights. The architectural aspect of access rights, being fundamentally code-based, aligns well with AI’s capabilities. The AI system can process these parameters, generate relevant code, and apply it to manage data access efficiently.
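To make the "fundamentally code-based" point concrete, here is a minimal sketch of the kind of output such a system would need to produce: a role-to-permission mapping turned into SQL `GRANT` statements. The role names, table names, and the `build_grants` helper are hypothetical, and the mapping is written by hand here where an AI system would propose it for human review.

```python
from typing import Dict, List

# Hypothetical mapping of role -> table -> privileges (illustrative names only).
ROLE_PERMISSIONS: Dict[str, Dict[str, List[str]]] = {
    "analyst": {"analytics.orders": ["SELECT"]},
    "data_engineer": {"analytics.orders": ["SELECT", "INSERT", "UPDATE"]},
}


def build_grants(role_permissions: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Turn a role/table/privilege mapping into SQL GRANT statements."""
    statements = []
    # Sort roles and tables so the output is stable and easy to review in a diff.
    for role in sorted(role_permissions):
        for table in sorted(role_permissions[role]):
            privileges = ", ".join(role_permissions[role][table])
            statements.append(f"GRANT {privileges} ON {table} TO {role};")
    return statements


if __name__ == "__main__":
    for statement in build_grants(ROLE_PERMISSIONS):
        print(statement)
```

Keeping the mapping as plain data separates the decision (who should access what) from the generated statements, so a human can review the former before anything is applied.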
Another area where AI can make a big impact is in the management of Personally Identifiable Information (PII). Today, PII tagging is usually done manually, making it a burden for the person in charge of it. This is something AI can automate completely. By leveraging AI’s pattern recognition capabilities, PII tagging can be conducted more accurately than when it’s done by a human. In this sense, using AI could actually improve the way we manage privacy protection.
This does not imply that AI will completely replace human involvement. Despite AI’s capabilities, we still need human oversight to manage unexpected situations and make judgment calls when needed.
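As a rough illustration of pattern-based PII tagging, the sketch below scans sample values from each column and proposes a tag when most of them match a pattern. It is a deliberately simplified, rule-based stand-in and not an AI model: the two regular expressions, the 0.8 threshold, and the column names are assumptions, and the output is meant as a suggestion for a human to confirm.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s\-()]{6,}\d$"),
}


def tag_pii_columns(
    samples: Dict[str, List[str]], threshold: float = 0.8
) -> Dict[str, str]:
    """Suggest a PII tag for each column whose sample values mostly match a pattern."""
    tags: Dict[str, str] = {}
    for column, values in samples.items():
        non_empty = [value.strip() for value in values if value and value.strip()]
        if not non_empty:
            continue  # Nothing to inspect, so no suggestion for this column.
        for tag, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in non_empty if pattern.match(value))
            if hits / len(non_empty) >= threshold:
                tags[column] = tag
                break  # Keep the first pattern that clears the threshold.
    return tags
```

Columns that clear the threshold get a suggested tag and everything else is left untagged, which keeps the human reviewer focused on confirming or rejecting a short list.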
Let’s not forget about data quality, which is an important pillar of governance. Data quality ensures that the information used by a company is accurate, consistent, and reliable. Maintaining data quality has always been a complex endeavor, but things are already changing with generative AI.
As I mentioned above, AI is great at applying rules and flagging infractions. This makes it easy for algorithms to identify anomalies in the data. You can find a detailed account of how AI affects different aspects of data quality in this article.
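For a sense of what flagging an anomaly can look like in its simplest form, here is a minimal sketch using a z-score rule. It is a classical statistical check and not generative AI, and the function name, the 2.5 threshold, and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List


def flag_anomalies(values: List[float], z_threshold: float = 2.5) -> List[int]:
    """Return the indexes of values whose z-score exceeds z_threshold."""
    if len(values) < 2:
        return []  # A sample standard deviation needs at least two points.
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # All values are identical, so nothing stands out.
    return [
        index
        for index, value in enumerate(values)
        if abs(value - center) / spread > z_threshold
    ]


# Hypothetical daily order counts with one obviously suspicious value at the end.
daily_orders = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
suspicious = flag_anomalies(daily_orders)
```

With small samples a single outlier inflates the standard deviation, so the threshold has to be tuned to the data; that tuning is one of the places where human judgment still comes in.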
AI can also lower the technical barrier of data quality. This is something SODA is already putting in place. Their new tool, SodaGPT, offers a no-code approach to express data quality checks, enabling users to perform quality checks using natural language alone. This allows data quality maintenance to become much more intuitive and accessible.
We’ve seen that AI can supercharge Data Governance in a way that is triggering the beginning of a paradigm shift. A lot of changes are already happening, and they are here to stay.
However, AI can only build on a foundation that’s already solid. For AI to change the search and discovery experience in your company, you must already be maintaining your documentation. AI is powerful, but it can’t miraculously mend a system that is flawed.
The second point to keep in mind is that even if AI can be used to generate most of the context around data, it cannot replace the human element entirely. We still need humans in the loop for validation and for documenting the knowledge unique to each company. So our one-sentence prediction for the future of governance: turbocharged by AI, anchored in human discernment and cognition.
At CastorDoc, we are building a data documentation tool for the Notion, Figma, Slack generation.
Want to check it out? Reach out to us and we will show you a demo.