There are many tools for data engineering. Data engineering isn't limited to Python or SQL. With the era of big data and cloud computing, more data-related open-source projects are emerging than ever before. We don't just have multiple options for each task; we have so many that data engineers must know the pros and cons of each tool before picking one.
There are many similar tools for data engineering. The market is competitive, and many open-source vendors provide nearly identical services with minor tweaks: all claim to be best in class, to have lightning-fast performance, and to have an engaging community.
Tools equipped me like a handyperson: for each situation, I can pull a suitable data framework out of my toolbox. A few years ago, I loved chasing down the latest, undiscovered data engineering tools. Many of those projects never became popular, and many stopped being updated. It's disheartening to find an excellent project on GitHub, only to see it abandoned the following year. That's how quickly a data engineering framework dies out if it doesn't get enough usage.
One key lesson I learned while chasing the latest tools: tools are great, but many data engineering problems cannot be solved by the newest tool. They are solved by humans, the data engineers.
I want to share my thoughts on why data engineering is about much more than the tools you use.
Why do data engineers spend hours writing a data pipeline job that runs daily or hourly? THE BUSINESS NEEDS IT.
No matter the industry, your business requires collecting data, processing it, and summarizing it into insights. Even if you run the business on paper or in Excel, data is essential to business success.
Why is data essential for business? If we strip away the fabulous, shiny terms like forecasting, machine learning, and AI, the fundamental reason a business needs data is to ensure the company can keep running. Without a business use case, even the newest and most excellent data engineering tool loses its shine.
I once deployed the latest data engineering framework to process data in a streaming fashion. It worked well at first as a proof-of-concept project.
However, after the initial excitement of demonstrating a new way to run the job in a streaming manner, I struggled to productionize it. There was no robust use case for streaming, and people were still happy with the batch approach. I tried to persuade others to convert batch jobs to streaming, and it didn't go well. I also found that migration was costly, 24/7 streaming support was painful, and the many bugs in such a new framework drained my time.
Data engineering only exists when there are use cases for data. If we look at open-source data engineering projects, many struggle to find a group of users whose use cases fit those projects.
If your business has the use case, explore the tooling side further. But don't let the tooling drive the business use case.
Many data engineers joke that they are plumbers. This metaphor probably covers 70% of a data engineer's responsibility: building the data pipelines.
Whether your data pipelines use a distributed framework like Spark/Flink/Beam, pure SQL, or even a Python/R function, there are three key things to consider:
- Where is the data coming from? (Extract)
- How should we change the data? (Transformation)
- Where should the data go? (Loading)
We can summarize this process as ETL (Extract, Transformation, Loading). Extract and Loading are about the data's location and the storage's read/write efficiency. Transformation is the core part that applies business logic.
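The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline; the orders CSV, the "keep US orders, convert cents to dollars" rule, and the SQLite destination are all hypothetical stand-ins for a real source, business logic, and warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from an API, a file, or a queue.
RAW_CSV = """order_id,amount_cents,country
1,1250,US
2,830,DE
3,4100,US
"""

def extract(raw: str) -> list[dict]:
    """Extract: read rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transformation: apply business logic (keep US orders, convert cents to dollars)."""
    return [
        (int(r["order_id"]), int(r["amount_cents"]) / 100)
        for r in rows
        if r["country"] == "US"
    ]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Loading: write the cleaned rows to the destination store."""
    conn.execute("CREATE TABLE IF NOT EXISTS us_orders (order_id INTEGER, amount_usd REAL)")
    conn.executemany("INSERT INTO us_orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount_usd) FROM us_orders").fetchone())
# (2, 53.5)
```

Notice that Extract and Load are generic plumbing, while Transformation is the only function that encodes anything about the business.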
Data engineering sits in the middle, at the center of the data. Although the work often only starts once an event or log is generated, getting involved early in data generation is critical: it makes the data much easier to read and far less chaotic to consume.
Once the data lands in data storage, whether a Kafka queue, a database, or a file in S3, that usually marks the starting point for many data engineering projects.
The goal of data engineering in this ETL process is to ensure data lands on time, so end users can derive value from it and help the business make decisions.
Usually, the data engineering team doesn't stand on stage to present the sales forecast or the traffic increase from a newly deployed LLM model. Many data engineers work behind the scenes; they are the show's backbone.
Deriving business value is more critical than tooling. Replacing an existing tool to make a pipeline run 10% faster is less important than building essential metrics that can drive a 1% revenue gain.
The remaining 30% of the responsibility is to gain domain knowledge, understand how the data is generated in the first place (not just transformed and handed to you by another team), learn end users' preferences for how metrics and dimensions are built, and collaborate (data engineering isn't a solo role).
Tools can save you time building ETL metrics and improve your velocity. But the core part, digging business value out of massive data, is unique to your business and rarely achieved just by adding another tool.
Eventually, the data is served to users. All the hard work pays off at this stage.
You can tier the data for consumption or serve it in analytics-friendly storage like an OLAP query engine. The business's most critical metrics and dimensions should be placed in a gold dataset that serves the most crucial business use cases.
We often discuss leveraging OLAP, columnar storage like Parquet, and parallel processing. However, the schema, the data model design, user-friendly naming, and the data catalog deserve just as much evaluation.
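User-friendly naming doesn't require a new tool; even a plain SQL view can present cryptic source columns under clear names and units. The sketch below uses SQLite for illustration; the `raw_evt` table and its column names are hypothetical examples of what a source system might emit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table with cryptic, source-system column names.
conn.execute("CREATE TABLE raw_evt (e_ts TEXT, u_id INTEGER, amt_c INTEGER)")
conn.execute("INSERT INTO raw_evt VALUES ('2024-01-01', 42, 1999)")

# A view exposes friendly names and units without copying any data:
# one cheap, tool-agnostic piece of the "user-friendly naming" work.
conn.execute("""
    CREATE VIEW orders AS
    SELECT
        e_ts          AS order_date,
        u_id          AS customer_id,
        amt_c / 100.0 AS amount_usd
    FROM raw_evt
""")

print(conn.execute("SELECT order_date, customer_id, amount_usd FROM orders").fetchone())
```

End users querying `orders` never need to learn what `amt_c` meant or that it was stored in cents.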
It's a hard lesson that many issues can't be fixed simply by throwing more computing resources at them. Adding more resources may work the first time, but it won't always. Eventually, a more fundamental question surfaces: is there something wrong with how the data is being queried, or a flaw in the design?
Before adding wasteful resources and redundant work, rethink the data modeling design, talk with your users, and collect real use cases to evaluate whether they are covered. This step is often neglected while data engineers are busy assessing and trying the available tools.
In my experience, any popular tool can achieve 90% of the requirements for data engineering tasks. We shouldn't spend 90% of our time chasing down the best and fastest tool and leave only 10% for the most critical thing: serving the data so the business can derive insights.
If you’d like to learn more about data warehouses with dimensional modeling, please refer to my data warehouse 101 articles.
Adopting a new tool can mean dropping old logic. Have you ever had a piece of code or logic that no one dared to touch? Asking "why" multiple times before throwing away old business logic saves a lot of hurdles later.
The old logic exists for a reason; try to embrace and understand how it reached this state instead of touting the fanciness of the new tools with little respect for the old world.
So don't think data engineering is just mastering a variety of tools. Deeply understanding where data engineering sits in the data life cycle for a business use case, and serving that use case better, matters more than any tooling.
I hope this article helps you rethink the foundation of data engineering. I highly recommend Joe Reis and Matt Housley's book, Fundamentals of Data Engineering: Plan and Build Robust Data Systems.