Boto3 vs AWS Wrangler: Simplifying S3 Operations with Python | by Antonello Benedetto | Jun, 2023


A comparative analysis for AWS S3 development

Picture by Hemerson Coelho on Unsplash

In this tutorial, we will delve into the world of AWS S3 development with Python by exploring and comparing two powerful libraries: boto3 and awswrangler.

If you’ve ever wondered

“What is the best Python tool to interact with AWS S3 Buckets?”

“How to perform S3 operations in the most efficient way?”

then you’ve come to the right place.

Indeed, throughout this post, we will cover a range of common operations essential for working with AWS S3 buckets, among which:

  1. listing objects,
  2. checking object existence,
  3. downloading objects,
  4. uploading objects,
  5. deleting objects,
  6. writing objects,
  7. reading objects (standard way or with SQL)

By comparing the two libraries, we will identify their similarities, differences, and optimal use cases for each operation. By the end, you will have a clear understanding of which library is better suited for specific S3 tasks.

Additionally, for those who read to the very bottom, we will also explore how to leverage boto3 and awswrangler to read data from S3 using friendly SQL queries.

So let’s dive in and discover the best tools for interacting with AWS S3 and learn how to perform these operations efficiently with Python using both libraries.

The package versions used in this tutorial are:

  • boto3==1.26.80
  • awswrangler==2.19.0

Also, three initial files containing randomly generated account_balances data have been uploaded to an S3 bucket named coding-tutorials:
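For context, here is a minimal sketch of how such a file could be generated (this is not the original generator script; the AS_OF_DATE and COMPANY_CODE columns come from later examples, while the remaining field and its values are hypothetical):

# HYPOTHETICAL GENERATOR SKETCH (illustrative only, not the original generator.py)
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame({
    'AS_OF_DATE': pd.to_datetime(np.random.choice(pd.date_range('2023-01-01', '2023-01-31'), n)),
    'COMPANY_CODE': np.random.choice(['C001', 'C002', 'C003', 'C004'], n),  # assumed codes
    'BALANCE': np.round(np.random.uniform(-10_000, 10_000, n), 2)           # assumed field
})

df.to_parquet('account_balances_jan2023.parquet')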

Although a number of ways exist to establish a connection to an S3 bucket, in this case we are going to use setup_default_session() from boto3:

# CONNECTING TO S3 BUCKET
import os
import io
import boto3
import awswrangler as wr
import pandas as pd

boto3.setup_default_session(aws_access_key_id='your_access_key',
                            aws_secret_access_key='your_secret_access_key')

bucket = 'coding-tutorials'

This method is handy as, once the session has been set, it can be shared by both boto3 and awswrangler, meaning that we won’t need to pass any more secrets down the road.
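Alternatively, if you prefer not to rely on a default session, an explicit boto3 Session can be created and passed to awswrangler calls through their boto3_session parameter (a minimal sketch using the same placeholder credentials):

# ALTERNATIVE: explicit session shared by both libraries
session = boto3.Session(aws_access_key_id='your_access_key',
                        aws_secret_access_key='your_secret_access_key')

client = session.client('s3')  # boto3 clients are created from the session
objects = wr.s3.list_objects('s3://coding-tutorials/', boto3_session=session)  # awswrangler accepts it explicitly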

Now let’s compare boto3 and awswrangler while performing a number of common operations, and find out which is the best tool for the job.

The full notebook including the code that follows can be found in this GitHub folder.

# 1 Listing Objects

Listing objects is probably the first operation we should perform while exploring a new S3 bucket, and it is a simple way to check whether a session has been correctly set.

With boto3 objects can be listed using:

  • boto3.client('s3').list_objects()
  • boto3.resource('s3').Bucket().objects.all()
print('--BOTO3--')
# BOTO3 - Preferred Method
client = boto3.client('s3')

for obj in client.list_objects(Bucket=bucket)['Contents']:
    print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024*1024), 2), 'MB')

print('----')

# BOTO3 - Alternative Method
resource = boto3.resource('s3')

for obj in resource.Bucket(bucket).objects.all():
    print('File Name:', obj.key, 'Size:', round(obj.size / (1024*1024), 2), 'MB')

Although both the client and resource classes do a decent job, the client class should be preferred, as it is more elegant and provides a large amount of easily accessible low-level metadata as a nested JSON (among which the object size).
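One caveat worth noting: a single list_objects() call returns at most 1,000 keys, so for larger buckets a paginator is the usual approach (a short sketch reusing the client created above):

# BOTO3 - Listing buckets with more than 1,000 objects via a paginator
paginator = client.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024*1024), 2), 'MB')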

On the other hand, awswrangler only provides a single method to list objects:

  • wr.s3.list_objects()

Being a high-level method, this does not return any low-level metadata about the objects, so to find the file size we also need to call:

  • wr.s3.size_objects()

print('--AWS_WRANGLER--')
# AWS WRANGLER

for obj in wr.s3.list_objects("s3://coding-tutorials/"):
    print('File Name:', obj.replace('s3://coding-tutorials/', ''))

print('----')

for obj, size in wr.s3.size_objects("s3://coding-tutorials/").items():
    print('File Name:', obj.replace('s3://coding-tutorials/', ''), 'Size:', round(size / (1024*1024), 2), 'MB')

The code above returns:

Comparison → Boto3 Wins

Although awswrangler is more straightforward to use, boto3 wins the challenge when it comes to listing S3 objects. In fact, its low-level implementation means that much more object metadata can be retrieved using either of its classes. Such information is extremely useful when accessing an S3 bucket in a programmatic way.

# 2 Checking Object Existence

The ability to check an object’s existence is required when we wish additional operations to be triggered depending on whether an object is already available in S3 or not.

With boto3 such checks can be performed using:

  • boto3.client('s3').head_object()
object_key = 'account_balances_jan2023.parquet'

# BOTO3
print('--BOTO3--')
client = boto3.client('s3')

try:
    # head_object() raises a generic 404 ClientError (rather than NoSuchKey) when the key is missing
    client.head_object(Bucket=bucket, Key=object_key)
    print(f"The object exists in the bucket {bucket}.")
except client.exceptions.ClientError:
    print(f"The object does not exist in the bucket {bucket}.")

Instead, awswrangler provides a dedicated method:

  • wr.s3.does_object_exist()
# AWS WRANGLER
print('--AWS_WRANGLER--')

# does_object_exist() returns a boolean, so a simple if/else is enough
if wr.s3.does_object_exist(f's3://{bucket}/{object_key}'):
    print(f"The object exists in the bucket {bucket}.")
else:
    print(f"The object does not exist in the bucket {bucket}.")

The code above returns:

Comparison → AWSWrangler Wins

Let’s admit it: the boto3 method name [head_object()] is not that intuitive.

Also, having a dedicated method is undoubtedly an advantage for awswrangler, which wins this match.

# 3 Downloading Objects

Downloading objects locally is extremely simple with both boto3 and awswrangler, using the following methods:

  • boto3.client('s3').download_file() or
  • wr.s3.download()

The only difference is that download_file() takes bucket, object_key and local_file as input variables, whereas download() only requires the S3 path and local_file:

object_key = 'account_balances_jan2023.parquet'

# BOTO3
client = boto3.client('s3')
client.download_file(bucket, object_key, 'tmp/account_balances_jan2023_v2.parquet')

# AWS WRANGLER
wr.s3.download(path=f's3://{bucket}/{object_key}', local_file='tmp/account_balances_jan2023_v3.parquet')

When the code is executed, both versions of the same object are indeed downloaded locally into the tmp/ folder:

Comparison → Draw

We can consider both libraries equivalent as far as downloading files is concerned, so let’s call it a draw.

# 4 Uploading Objects

The same reasoning applies when uploading files from the local environment to S3. The methods that can be employed are:

  • boto3.client('s3').upload_file() or
  • wr.s3.upload()
object_key_1 = 'account_balances_apr2023.parquet'
object_key_2 = 'account_balances_may2023.parquet'

file_path_1 = os.path.dirname(os.path.realpath(object_key_1)) + '/' + object_key_1
file_path_2 = os.path.dirname(os.path.realpath(object_key_2)) + '/' + object_key_2

# BOTO3
client = boto3.client('s3')
client.upload_file(file_path_1, bucket, object_key_1)

# AWS WRANGLER
wr.s3.upload(local_file=file_path_2, path=f's3://{bucket}/{object_key_2}')

Executing the code uploads two new account_balances objects (for the months of April and May 2023) to the coding-tutorials bucket:

Comparison → Draw

This is another draw. So far there’s absolute parity between the two libraries!

# 5 Deleting Objects

Let’s now assume we wished to delete the following objects:

# SINGLE OBJECT
object_key = 'account_balances_jan2023.parquet'

# MULTIPLE OBJECTS
object_keys = ['account_balances_jan2023.parquet',
               'account_balances_feb2023.parquet',
               'account_balances_mar2023.parquet']

boto3 allows us to delete objects one by one or in bulk, using the following methods:

  • boto3.client('s3').delete_object()
  • boto3.client('s3').delete_objects()

Both methods return a response including ResponseMetadata that can be used to verify whether objects have been deleted successfully or not. For instance:

  • while deleting a single object, an HTTPStatusCode of 204 indicates that the operation has been completed successfully (if the object was found in the S3 bucket);
  • while deleting multiple objects, a Deleted list is returned with the names of successfully deleted items.
# BOTO3
print('--BOTO3--')
client = boto3.client('s3')

# Delete Single Object
response = client.delete_object(Bucket=bucket, Key=object_key)
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if response['ResponseMetadata']['HTTPStatusCode'] == 204:
    print(f'Object {object_key} deleted successfully on {deletion_date}.')
else:
    print('Object could not be deleted.')

# Delete Multiple Objects
objects = [{'Key': key} for key in object_keys]

response = client.delete_objects(Bucket=bucket, Delete={'Objects': objects})
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if len(object_keys) == len(response['Deleted']):
    print(f'All objects were deleted successfully on {deletion_date}')
else:
    print('Some objects could not be deleted.')

On the other hand, awswrangler provides a single method that can be used for both single and bulk deletions:

  • wr.s3.delete_objects()

Since the object_keys can be passed directly to the method as a list comprehension, rather than having to be converted to a dictionary first like before, using this syntax is a real pleasure.

# AWS WRANGLER
print('--AWS_WRANGLER--')

# Delete Single Object
wr.s3.delete_objects(path=f's3://{bucket}/{object_key}')

# Delete Multiple Objects
try:
    wr.s3.delete_objects(path=[f's3://{bucket}/{key}' for key in object_keys])
    print('All objects deleted successfully.')
except:
    print('Objects could not be deleted.')

Executing the code above deletes the objects in S3 and then returns:

Comparison → Boto3 Wins

This is a tricky one: awswrangler has a simpler syntax for deleting multiple objects, as we can simply pass the full list to the method.

However, boto3 returns a wealth of information in the response that makes for extremely useful logs when deleting objects programmatically.

Because in a production environment, low-level metadata is better than almost no metadata, boto3 wins this challenge and now leads 2–1.

# 6 Writing Objects

When it comes to writing files to S3, boto3 does not even provide an out-of-the-box method to perform such an operation.

For example, if we wanted to create a new parquet file using boto3, we would first need to persist the object on the local disk (using the to_parquet() method from pandas) and then upload it to S3 using the upload_fileobj() method.

Differently from upload_file() (explored at point 4), the upload_fileobj() method is a managed transfer that will perform a multipart upload in multiple threads, if necessary:

object_key_1 = 'account_balances_june2023.parquet'

# RUN THE GENERATOR.PY SCRIPT

df.to_parquet(object_key_1)

# BOTO3
client = boto3.client('s3')

# Upload the Parquet file to S3
with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1)

On the other hand, one of the main advantages of the awswrangler library (while working with pandas) is that it can be used to write objects directly to the S3 bucket (without saving them to the local disk first), which is both elegant and efficient.

Moreover, awswrangler offers great flexibility allowing users to:

  • apply specific compression algorithms like snappy, gzip and zstd;
  • append to or overwrite existing files via the mode parameter when dataset=True (see the sketch further below);
  • specify one or more partition columns via the partition_cols parameter.
object_key_2 = 'account_balances_july2023.parquet'

# AWS WRANGLER
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 compression='gzip',
                 partition_cols=['COMPANY_CODE'],
                 dataset=True)

Once executed, the code above writes account_balances_june2023 as a single parquet file, and account_balances_july2023 as a folder with four files already partitioned by COMPANY_CODE:
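As an aside, this is roughly how the mode parameter mentioned earlier could be used to append new rows to (or overwrite) the dataset just written; a sketch, not part of the walkthrough above:

# AWS WRANGLER - appending to the existing partitioned dataset (sketch)
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 compression='gzip',
                 partition_cols=['COMPANY_CODE'],
                 dataset=True,
                 mode='append')   # 'overwrite' and 'overwrite_partitions' are also accepted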

Comparison → AWSWrangler Wins

If working with pandas is an option, awswrangler offers a much more advanced set of operations for writing files to S3, particularly when compared to boto3, which in this case is not exactly the best tool for the job.

# 7.1 Reading Objects (Python)

A similar reasoning applies when trying to read objects from S3 using boto3: since this library does not offer a built-in read method, the best option we have is to perform an API call (get_object()), read the Body of the response and then pass the parquet_object to pandas.

Note that the pd.read_parquet() method expects a file-like object as input, which is why we need to pass the content read from the parquet_object as a binary stream.

Indeed, by using io.BytesIO() we create a temporary file-like object in memory, avoiding the need to save the Parquet file locally before reading it. This in turn improves performance, especially when working with large files:

object_key = 'account_balances_may2023.parquet'

# BOTO3
client = boto3.client('s3')

# Read the Parquet file
response = client.get_object(Bucket=bucket, Key=object_key)
parquet_object = response['Body'].read()

df = pd.read_parquet(io.BytesIO(parquet_object))
df.head()

As expected, awswrangler instead excels at reading objects from S3, returning a pandas df as an output.

It supports a number of input formats like csv, json, parquet and, more recently, delta tables. Also, passing the chunked parameter allows objects to be read in a memory-friendly way (as sketched further below):

# AWS WRANGLER
df = wr.s3.read_parquet(path=f's3://{bucket}/{object_key}')
df.head()

# wr.s3.read_csv()
# wr.s3.read_json()
# wr.s3.read_parquet_table()
# wr.s3.read_deltalake()

Executing the code above returns a pandas df with May data:
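As for the chunked parameter mentioned earlier, this is roughly how it can be used: with chunked=True, read_parquet() returns an iterator of smaller DataFrames rather than a single one (a minimal sketch):

# AWS WRANGLER - memory-friendly reading with chunked=True (sketch)
for chunk_df in wr.s3.read_parquet(path=f's3://{bucket}/{object_key}', chunked=True):
    print(chunk_df.shape)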

Comparison → AWSWrangler Wins

Yes, there are ways around the lack of proper methods in boto3. However, awswrangler is a library conceived to read S3 objects efficiently, hence it also wins this challenge.

# 7.2 Reading Objects (SQL)

Those who managed to read this far deserve a bonus, and that bonus is reading objects from S3 using plain SQL.

Let’s suppose we wished to fetch data from the account_balances_may2023.parquet object using the query below (which filters data by AS_OF_DATE):

object_key = 'account_balances_may2023.parquet'
query = """SELECT * FROM s3object s
WHERE AS_OF_DATE > CAST('2023-05-13T' AS TIMESTAMP)"""

In boto3 this can be achieved via the select_object_content() method. Note how we also need to specify the InputSerialization and OutputSerialization formats:

# BOTO3
client = boto3.client('s3')

resp = client.select_object_content(
    Bucket=bucket,
    Key=object_key,
    Expression=query,
    ExpressionType='SQL',
    InputSerialization={"Parquet": {}},
    OutputSerialization={'JSON': {}},
)

records = []

# Process the response
for event in resp['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'].decode('utf-8'))

# Concatenate the JSON records into a single string
json_string = ''.join(records)

# Load the JSON data into a Pandas DataFrame
df = pd.read_json(json_string, lines=True)

# Print the DataFrame
df.head()

If working with a pandas df is an option, awswrangler also offers a very handy select_query() method that requires minimal code:

# AWS WRANGLER
df = wr.s3.select_query(
    sql=query,
    path=f's3://{bucket}/{object_key}',
    input_serialization="Parquet",
    input_serialization_params={}
)
df.head()

For both libraries, the returned df will look like this:

In this tutorial we explored 7 common operations that can be performed on S3 buckets and ran a comparative analysis between the boto3 and awswrangler libraries.

Both approaches allow us to interact with S3 buckets; however, the main difference is that the boto3 client provides low-level access to AWS services, while awswrangler offers a simplified, more high-level interface for various data engineering tasks.

Overall, awswrangler is our winner with 3 points (checking object existence, writing objects, reading objects) vs 2 points scored by boto3 (listing objects, deleting objects). Both the upload and download categories were draws and did not assign points.

Despite the result above, the truth is that both libraries give their best when used side by side, each excelling at the tasks it has been built for.
