Today I’m going to share my experience of building a data engineering project that I take pride in. You will learn why I chose the tools and AWS components I did, and how I designed the architecture.
Disclaimer: The content of this text is inspired by my experience with an unnamed entity. However, certain critical commercial interests and details have intentionally been replaced with fictional data and code, or omitted, to maintain confidentiality and privacy. The full and accurate extent of the actual commercial interests involved is therefore withheld.
- Knowledge of Python
- Understanding of AWS components such as DynamoDB, Lambda, SQS and CloudWatch
- Comfort working with YAML and the SAM CLI
Let’s say you are a data engineer and you need to constantly update the data in the warehouse. For example, you are responsible for keeping the sales records of Dunder Mifflin Paper Co. in sync on a regular basis. (I understand this is not a realistic scenario, but have fun 🙂!) The data is sent to you via a vendor’s API, and you are held accountable for making sure the information on branches, employees (actually, only salespersons are considered), and sales is up to date. The provided API has the following 3 paths:
/branches, accepting branch name as a query parameter for retrieving the metadata of a specified branch;
/employees, accepting branch ID as a query parameter for retrieving information on all employees of a given branch; the response includes a key-value pair that indicates each employee’s occupation;
/sales, accepting employee ID as a query parameter for retrieving the all-time sales records of a salesperson; the response includes a key-value pair that indicates when each transaction was completed.
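To make the three paths concrete, here is a minimal sketch of how the request URLs might be built in Python. The base URL and the query parameter names (`name`, `branch_id`, `employee_id`) are my assumptions for illustration only; the vendor's real names may differ.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real vendor endpoint is not given in this article.
BASE_URL = "https://api.example.com"


def build_url(path: str, **params: str) -> str:
    """Join the base URL, an API path, and URL-encoded query parameters."""
    return f"{BASE_URL}{path}?{urlencode(params)}"


def branches_url(branch_name: str) -> str:
    # Assumed parameter name: "name"
    return build_url("/branches", name=branch_name)


def employees_url(branch_id: str) -> str:
    # Assumed parameter name: "branch_id"
    return build_url("/employees", branch_id=branch_id)


def sales_url(employee_id: str) -> str:
    # Assumed parameter name: "employee_id"
    return build_url("/sales", employee_id=employee_id)
```

Note the dependency between the calls: a branch name leads to a branch ID, a branch ID leads to employee IDs, and each employee ID leads to that person's sales records. That chain shapes the rest of the design.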
Generally speaking, the API responses look like this: