How to Automate PySpark Pipelines on AWS EMR With Airflow


Optimising big data workflow orchestration


In the dynamic landscape of data engineering and analytics, building scalable and automated pipelines is paramount.

Spark enthusiasts who have been working with Airflow for a while might be wondering:

How to execute a Spark job on a remote cluster using Airflow?

How to automate Spark pipelines with AWS EMR and Airflow?

In this tutorial, we are going to integrate these two technologies by showing how to:

  1. Configure and fetch essential parameters from the Airflow UI.
  2. Create auxiliary functions to automatically generate the preferred spark-submit command.
  3. Use Airflow’s EmrAddStepsOperator() to build a task that submits and executes a PySpark job on EMR.
  4. Use Airflow’s EmrStepSensor() to monitor the script execution (a minimal sketch covering these four steps follows this list).
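As a preview, below is one way these four steps could be wired together in a DAG. This is a minimal sketch rather than the tutorial’s final code: it assumes the apache-airflow-providers-amazon package, and the Variable keys, bucket name, script path and cluster ID are illustrative placeholders.

# Minimal sketch of steps 1-4 (illustrative names throughout).
# Assumes apache-airflow-providers-amazon with the consolidated module paths;
# older provider versions expose these classes in per-operator modules.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# 1. Fetch essential parameters configured in the Airflow UI (Admin -> Variables).
#    The variable keys below are assumptions made for this sketch.
S3_BUCKET = Variable.get("s3_bucket", default_var="my-tutorial-bucket")
EMR_CLUSTER_ID = Variable.get("emr_cluster_id", default_var="j-XXXXXXXXXXXXX")
PYSPARK_SCRIPT = f"s3://{S3_BUCKET}/scripts/balances_job.py"


# 2. Auxiliary function generating the preferred spark-submit command
#    as an EMR step definition (executed on the cluster via command-runner.jar).
def build_spark_steps(script_path):
    return [
        {
            "Name": "run_pyspark_balances_job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_path],
            },
        }
    ]


with DAG(
    dag_id="emr_pyspark_pipeline",
    start_date=datetime(2023, 8, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # 3. Submit the PySpark job to the running EMR cluster as a new step.
    add_step = EmrAddStepsOperator(
        task_id="add_emr_step",
        job_flow_id=EMR_CLUSTER_ID,
        aws_conn_id="aws_default",
        steps=build_spark_steps(PYSPARK_SCRIPT),
    )

    # 4. Monitor the submitted step until it succeeds or fails.
    watch_step = EmrStepSensor(
        task_id="watch_emr_step",
        job_flow_id=EMR_CLUSTER_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_emr_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )

    add_step >> watch_step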

The code used in this tutorial is available on GitHub.

Before diving in, make sure the following prerequisites are in place:

  • An AWS account with an S3 bucket and an EMR cluster configured in the same region (in this case eu-north-1). The EMR cluster should be available and in the WAITING state; in our case it has been named emr-cluster-tutorial.
  • Some mock balances data already available in the S3 bucket under the src/balances folder. The data can be generated and written to that location using the data producer script (a sketch of such a producer follows this list).
  • The required JARs should already be downloaded from Maven and available in the S3 bucket.
  • Docker installed and running on the local machine with 4-6 GB of allocated memory.
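For context, the data producer referenced above generates exactly this kind of mock dataset. The sketch below is an illustrative stand-in rather than the repository’s actual script: it assumes pandas with pyarrow and s3fs installed, and the column names and bucket name are invented.

# Illustrative mock "balances" data producer (not necessarily the repo's script).
# Assumes pandas, pyarrow and s3fs are installed and AWS credentials are configured.
import random
from datetime import date

import pandas as pd

# Assumption: replace with the name of your own S3 bucket.
S3_BUCKET = "my-tutorial-bucket"


def generate_mock_balances(n_rows=1000):
    """Generate random account balances for demonstration purposes."""
    return pd.DataFrame(
        {
            "account_id": [f"ACC-{i:06d}" for i in range(n_rows)],
            "balance": [round(random.uniform(-5000, 50000), 2) for _ in range(n_rows)],
            "currency": [random.choice(["EUR", "USD", "GBP"]) for _ in range(n_rows)],
            "as_of_date": [date.today().isoformat()] * n_rows,
        }
    )


if __name__ == "__main__":
    df = generate_mock_balances()
    # Write the Parquet output under the src/balances prefix used in the tutorial.
    df.to_parquet(f"s3://{S3_BUCKET}/src/balances/balances.parquet", index=False)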

The goal is to write some mock data in Parquet format to an S3a bucket and then build a DAG that:

  • Fetches the required configuration from the Airflow UI;
  • Uploads a PySpark script to the same S3a bucket (see the sketch after this list);
  • Submits the PySpark job to the EMR cluster as a new step;
  • Monitors the step execution until it completes.
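The upload step can, for example, be implemented with the Amazon provider’s S3Hook inside a TaskFlow task. The sketch below is only an illustration under that assumption: the local path, S3 key and bucket name are placeholders rather than the tutorial’s actual values.

# Sketch of an upload task using S3Hook (paths and names are illustrative).
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@task(task_id="upload_pyspark_script")
def upload_pyspark_script():
    """Copy the local PySpark script into the S3 bucket so EMR can reach it."""
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_file(
        filename="/opt/airflow/dags/scripts/balances_job.py",  # local path inside the Airflow container
        key="scripts/balances_job.py",                          # destination key in the bucket
        bucket_name="my-tutorial-bucket",                       # assumption: your bucket name
        replace=True,
    )

# Calling upload_pyspark_script() inside the DAG context registers it as a task,
# which can then be chained ahead of the EMR steps shown earlier.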


