From Data Lakes to Data Mesh: A Guide to the Latest Enterprise Data Architecture | by Col Jung | May, 2023


Problem 3 — Fence-throwing

Dehghani calls the third and final failure mode siloed and hyper-specialised ownership, which I like to think of as resulting in unproductive fence-throwing.

Our hyper-specialised data engineers working in the data lake are organisationally siloed away from where the data originates and where it will be consumed.

Siloed hyper-specialised data platform team. Source: Z. Dehghani at MartinFowler.com (with permission)

This creates a poor incentive structure that does not promote good delivery outcomes. Dehghani articulates this as…

“I personally don’t envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain’s experts.

What we find are disconnected source teams, frustrated consumers fighting for a spot on top of the data platform team backlog and an over stretched data platform team.”

Data producers will ‘pack together’ some of their data and throw it over the fence to the data engineers.

Your problem now! Good luck guys!

Overworked data engineers, who may or may not have done justice to the ingested data given that they’re not data domain experts, will themselves throw some processed data out of the lake to serve downstream consumers.

Good luck, analysts and data scientists! Time for a quick nap and then I’m off to fix the fifty broken ETL pipelines on my backlog.

As you can see from Problems 2 and 3, the challenges that have arisen from the data lake experiment are as much organisational as technological.

Takeaways:

By federating data management to individual business domains, perhaps we could foster a culture of data ownership and collaboration and empower data producers, engineers and consumers to work together?

And hey, can we give these domains a real stake in the game?

Empower them to take pride in building strategic data assets by incentivising them to treat data like a hot-selling product?

In 2019, Dehghani proposed data mesh as the next-generation data architecture that embraces a decentralised approach to data management.

Her initial articles generated significant interest in the enterprise data community and have since prompted many organisations worldwide, including mine, to begin their own data mesh journey.

Rather than pump data into a centralised lake, data mesh federates data ownership and processing to domain-specific teams that control and deliver data as a product. The aim is to make data easily accessible and interconnected across the entire organisation, enabling faster decision-making and promoting innovation.

Overview of data mesh. Source: Data Mesh Architecture (with permission)

The data mesh dream is to create a foundation for extracting value from analytical data at scale, with scale being applied to:

  • An ever-changing business, data and technology landscape.
  • Growth of data producers and consumers.
  • Varied data processing requirements. A diversity of use cases demands a diversity of tools for transformation and processing. For instance, real-time anomaly detection might leverage Apache Kafka; an NLP system for customer support often leads to data science prototyping with Python packages like NLTK; image recognition leverages deep learning frameworks like TensorFlow and PyTorch; and the fraud detection team at my bank would love to process our big data with Apache Spark.

All these requirements have created technical debt for warehouses (in the form of a mountain of unmaintainable ETL jobs) and a bottleneck for data lakes (due to the mountain of diverse work that’s squeezed through a small centralised data team).

Organisations eventually reach a threshold of complexity where the technical debt outweighs the value provided.

It’s a terrible situation.

To address these problems, Dehghani proposed four principles that any data mesh implementation must embody in order to realise the promise of scale, quality and usability.

The 4 Principles of Data Mesh. Source: Data Mesh Architecture (with permission)
  1. Domain Ownership of Data: By placing data ownership in the hands of domain-specific teams, you empower those closest to the data to take charge. This approach makes teams more agile in responding to changing business requirements and more effective at leveraging data-driven insights, which ultimately leads to better and more innovative products and services, faster.
  2. Data as a Product: Each business unit or domain is empowered to apply product thinking to craft, own and improve high-quality, reusable data products: self-contained and accessible data sets that their producers treat as products. The goal is to publish and share data products across the data mesh with consumers sitting in other domains (considered as nodes on the mesh) so that these strategic data assets can be leveraged by all.
  3. Self-Serve Data Platform: Empowering users with self-serve capabilities paves the way for accelerated data access and exploration. By providing a user-friendly platform equipped with the necessary tools, resources, and services, you empower teams to become self-sufficient in their data needs. This democratisation of data promotes faster decision-making and a culture of data-driven excellence.
  4. Federated Governance: Centralised control stifles innovation and hampers agility. A federated approach ensures that decision-making authority is distributed across teams, enabling them to make autonomous choices when it counts. By striking the right balance between control and autonomy, you foster accountability, collaboration and innovation.

Wondering how to build and deploy a data mesh? What does that look like?

For most organisations, the mesh won’t be a side-project you deploy once ready. In all likelihood, you’ll need to cleverly federate your existing data lake piece-by-piece until you reach a platform that is ‘sufficiently mesh’.

Think swapping out two aircraft engines for four smaller ones mid-flight, rather than buying a new plane in a nice shady hangar somewhere.

Or upgrading a road while keeping some lanes open to traffic at all times, instead of building a new road siloed away somewhere and opening it once everything is nicely paved.

Full mesh maturity may take a long time, because data mesh is primarily an organisational construct. It is as much about operating models (in other words, people) as it is about the technology itself, so cultural uplift and bringing people along for the journey are essential.

Rest assured however — slowly but surely, your centralised domain-agnostic monolithic data lake will become a decentralised domain-oriented modular data mesh.

Some considerations for the design phase. Check out datamesh-architecture.com for a deeper dive.

  • Domains. A data mesh architecture comprises a set of business domains, each with a domain data team that can perform cross-domain data analysis on its own. An enabling team, often part of the organisation's transformation office, spreads the idea of mesh across the organisation and serves as its advocate. It helps individual domains on a consultancy basis on their journey to become a ‘full member’ of the data mesh. The enabling team will comprise experts on data architecture, data analytics, data engineering and data governance.
  • Data products. Domains will ingest their own operational data, which they sit very close to and understand, and build analytical data models as data products that can be published on the mesh. Each data product is owned by its domain, which is responsible for the product's operations, quality and uplift during its entire lifecycle. Clear accountability is what ensures effective data.
The sharing of data products across the mesh. Source: Data Mesh Architecture (with permission)
  • Self-serve. Remember those ‘multicultural food days’ at school, where everyone brought their delicious dishes and shared them at a self-serve table? The teacher’s minimalist role was to oversee operations and ensure everything went smoothly. In a similar vein, mesh’s newly streamlined central data team endeavours to provide and maintain a domain-agnostic ‘buffet table’ of diverse data products from which to self-serve. Business teams can perform their own analysis with little overhead and offer up their own data products to their peers. A delicious data feast where everyone can also be the chef.
  • Federated governance. Each domain will self-govern its own data and be empowered to march to the beat of its own drum, like European Union member states. On certain matters where it makes sense to unite and standardise, domains will strike agreements with one another on global policies, such as documentation standards, interoperability and security, in a federated governance group (like the European Parliament), so that individual domains can easily discover, understand, use and integrate data products available on the mesh.
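
The four considerations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the `DataProduct` and `MeshCatalog` classes, their fields and the two global policies are hypothetical names invented here to show how domain ownership, a self-serve catalogue and federated policies might fit together.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A data product published by a domain (hypothetical shape)."""
    name: str
    owner_domain: str                            # the domain accountable for it
    description: str = ""
    schema: dict = field(default_factory=dict)   # column name -> type


# Global policies agreed by the federated governance group (assumed examples).
# Each policy returns an error message, or None if the product complies.
def has_description(product):
    return None if product.description.strip() else "missing description"


def has_schema(product):
    return None if product.schema else "missing schema"


GLOBAL_POLICIES = [has_description, has_schema]


class MeshCatalog:
    """Domain-agnostic 'buffet table' run by the central platform team."""

    def __init__(self, policies):
        self._policies = list(policies)
        self._products = {}

    def publish(self, product):
        # Enforce the global policies; everything else is left to the domain.
        violations = [msg for msg in (p(product) for p in self._policies) if msg]
        if violations:
            raise ValueError(f"{product.name}: " + ", ".join(violations))
        self._products[(product.owner_domain, product.name)] = product

    def discover(self, keyword):
        # Simple keyword search over product names and descriptions.
        keyword = keyword.lower()
        return [
            p for p in self._products.values()
            if keyword in p.name.lower() or keyword in p.description.lower()
        ]


catalog = MeshCatalog(GLOBAL_POLICIES)
catalog.publish(DataProduct(
    name="card_transactions",
    owner_domain="payments",
    description="Daily settled card transactions",
    schema={"txn_id": "string", "amount": "decimal"},
))
found = catalog.discover("transactions")
```

The design choice worth noting is that the catalogue only checks the handful of rules every domain has agreed to; how a domain builds and runs its product stays its own business.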

Here’s the exciting bit — when will our mesh hit maturity?

The mesh emerges when teams start using other domains’ data products.

This is a useful benchmark to aim for: once it happens, your data mesh journey has reached a threshold level of maturity.
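
One way to make this benchmark measurable is to count how often data products are read by a domain other than the one that owns them. The sketch below assumes you can export an access log of (consumer domain, owner domain, product) records; the `Access` record, both function names and the sample log are hypothetical.

```python
from collections import namedtuple

# Hypothetical access-log record: who read which domain's product.
Access = namedtuple("Access", ["consumer_domain", "owner_domain", "product"])


def cross_domain_share(accesses):
    """Fraction of accesses where the consumer is not the owning domain."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    cross = sum(1 for a in accesses if a.consumer_domain != a.owner_domain)
    return cross / len(accesses)


def products_used_across_domains(accesses):
    """Set of (owner_domain, product) pairs read by at least one other domain."""
    return {
        (a.owner_domain, a.product)
        for a in accesses
        if a.consumer_domain != a.owner_domain
    }


log = [
    Access("payments", "payments", "card_transactions"),
    Access("fraud", "payments", "card_transactions"),
    Access("marketing", "customers", "customer_profile"),
    Access("customers", "customers", "customer_profile"),
]
share = cross_domain_share(log)
shared_products = products_used_across_domains(log)
```

What share counts as "mature" is a judgement call for each organisation; the point is simply to track whether cross-domain use is growing over time.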

A good time to pop the champagne.

Data mesh is a relatively new idea, having been conceived only around 2018 by architect Zhamak Dehghani.

It has gained significant momentum in the data architecture and analytics communities as an increasing number of organisations grapple with the scalability problems of a centralised data lake.

By moving away from an organisational structure where data is controlled by a single team and towards a decentralised model where data is owned and managed by the teams that use it the most, different parts of the organisation can work independently — with greater autonomy and agility — while still ensuring that the data is consistent, reliable and well-governed.

Data mesh promotes a culture of accountability, ownership and collaboration, where data is productised and treated as a first-class citizen that’s proudly shared across the company in a seamless and controlled manner.

The aim is to attain a truly scalable and flexible data architecture that aligns with the needs of modern organisations where data is central to driving business value and innovation.

Summarising the Four Principles of Data Mesh. Credit: Z. Dehghani at MartinFowler.com (with permission)

My company’s own journey towards data mesh is expected to take a couple of years for the main migration, and longer for full maturity.

We’re working on three major parts simultaneously:

  • Cloud. An uplift from our Cloudera stack on Microsoft Azure IaaS to native cloud services on Azure PaaS.
  • Data products. An initial array of foundational data products is being rolled out, which can be used and re-assembled in different combinations, like Lego bricks, to form larger, more valuable data products.
  • Mesh. We’re decentralising our data lake to a target state of at least five nodes.
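
To make the Lego-brick idea concrete, here is a toy sketch of two foundational data products being joined into a larger one. It is not how our platform does it: the `compose` helper, the product names and the sample records are all hypothetical, and it assumes each key appears at most once in the right-hand product.

```python
def compose(left, right, key):
    """Inner-join two lists of dict records on `key`.

    Assumes `key` is unique within `right`; later duplicates would
    overwrite earlier ones in the lookup index.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Two hypothetical foundational data products from different domains.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
]
balances = [
    {"customer_id": 1, "balance": 250.0},
    {"customer_id": 3, "balance": 90.0},
]

# A larger, derived data product built from the two bricks.
customer_view = compose(customers, balances, key="customer_id")
```

Only customer 1 appears in both products, so the derived product contains a single combined record.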

What a ride it has been. When I started half a decade ago, we had only just begun building out our data lake using Apache Hadoop on top of on-prem infrastructure.

Countless challenges and invaluable lessons have shaped our journey.

Like any determined team, we fail fast and fail forward. Five short years later, we have completely transformed our enterprise data landscape.

Who knows what things will look like in another five years? I look forward to it.

