Deploying LLMs On Amazon SageMaker With DJL Serving | by Ram Vegiraju | Jun, 2023

Deploy BART on Amazon SageMaker Real-Time Inference


Large Language Models (LLMs) and Generative AI continue to take over the Machine Learning and general tech space in 2023. With the LLM expansion has come an influx of new models that continue to improve at a stunning rate.

While the accuracy and performance of these models are impressive, hosting them brings its own set of challenges. Without model hosting, it is hard to realize the value that LLMs provide in real-world applications. What are the specific challenges with LLM hosting and performance tuning?

  • How can we load these larger models, which can exceed hundreds of GBs in size?
  • How can we properly apply model partitioning techniques to utilize hardware efficiently without compromising model accuracy?
  • How can we fit these models on a single GPU, or split them across multiple GPUs?

These are all challenging questions that are addressed and abstracted away by a model server known as DJL Serving. DJL Serving is a high-performance, universal serving solution that integrates directly with model partitioning frameworks such as HuggingFace Accelerate, DeepSpeed, and FasterTransformer. With DJL Serving you can configure your serving stack to use these partitioning frameworks and optimize inference at scale across multiple GPUs for larger models.

In today’s article we explore one of the smaller language models, BART, for Feature Extraction. We will showcase how you can use DJL Serving to configure your serving stack and host a HuggingFace model of your choice. This example can serve as a template to build upon when you adopt the model partitioning frameworks mentioned above. We will then take our DJL-specific code and integrate it with SageMaker to create a Real-Time Endpoint that you can use for inference.

NOTE: For those of you new to AWS, make sure you make an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment. I would suggest following this article to understand Deployment/Inference in more depth.

What is a Model Server? What Model Servers Does Amazon SageMaker Support?

Model servers, at their most basic, are “inference as a service”. We need an easy way to expose our models via an API, and model servers take care of the grunt work behind the scenes. They load and unload our model artifacts and provide the runtime environment for the ML models you are hosting. They can also be tuned, depending on what they expose to the user. For example, TensorFlow Serving gives the choice of gRPC vs REST for your API calls.

Amazon SageMaker integrates with a variety of these model servers, which are exposed through the different Deep Learning Containers that AWS provides. Examples that come up in this article include TensorFlow Serving, Multi Model Server (MMS), and DJL Serving.

For this specific example we will use DJL Serving, as it is tailored for Large Language Model hosting through the model partitioning frameworks it supports. That does not mean the server is limited to LLMs. You can also use it for other models, as long as you configure the environment to install and load any other dependencies.

At a high level, what changes from one model server to another is how you package the artifacts you provide to the server, along with the model frameworks and environments each server supports.

DJL Serving vs JumpStart

In my previous article we explored how we could deploy Cohere’s Language Models via SageMaker JumpStart. Why not use SageMaker JumpStart in this case? At the moment not all LLMs are supported by SageMaker JumpStart. In the case that there’s a specific LLM that JumpStart does not support it makes sense to use DJL Serving.

The other major use case for DJL Serving is when it comes to customization and performance optimization. With JumpStart you are constrained to the model offering and whatever limitations exist with the container that’s already been pre-baked for you. With DJL there is more code work at a container level but you can apply performance optimization techniques of your choice with the different partitioning frameworks that exist.

DJL Serving Setup

For this code example we will be using an ml.c5.9xlarge SageMaker Classic Notebook Instance with a conda_amazonei_pytorch_latest_p37 kernel for development.

Before we get to the DJL Serving setup, let's quickly explore the BART model itself. This model can be found in the HuggingFace Model Hub and can be used for a variety of tasks such as Feature Extraction and Summarization. The following code snippet shows how you can use the BART Tokenizer and Model for a sample inference locally.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModel.from_pretrained("facebook/bart-large")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

Now we can map this model to DJL Serving with a few specific files. First we define a serving.properties file, which holds the configuration for your model deployment. In this case we specify a few parameters.

  • Engine: We are using Python for the DJL Engine. The other options here are DeepSpeed, FasterTransformer, and Accelerate.
  • Model_ID: On the HuggingFace Hub each model has a model_id that can be used as an identifier. We can feed this into our model script for model loading.
  • Task: For HuggingFace models you can include a task, as many of these models support various language tasks. In this case we specify Feature Extraction.

Other configurations you can specify for DJL include tensor_parallel_degree and the minimum and maximum number of workers on a per-model basis. For an extensive list of properties you can configure, please refer to the following documentation.
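As a rough illustration of the three parameters above, the following Python sketch writes out a minimal serving.properties file. The exact keys and values (the `option.` prefix, `feature-extraction` as the task string) are assumptions based on common DJL Serving conventions rather than a copy of the author's file, and `render_properties` is a hypothetical helper, so check them against the DJL documentation for your container version.

```python
from pathlib import Path

# Assumed values for this walkthrough; adjust them for your own model.
properties = {
    "engine": "Python",
    "option.model_id": "facebook/bart-large",
    "option.task": "feature-extraction",
}


def render_properties(props: dict) -> str:
    """Render a dict as Java-style key=value lines (hypothetical helper)."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


serving_properties = render_properties(properties)

# Write the file next to the inference script so both can be packaged together.
Path("serving.properties").write_text(serving_properties)
print(serving_properties)
```

Under this assumption, entries prefixed with `option.` are what the inference script later reads without the prefix, for example `properties.get("model_id")`.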

The next files we provide are our actual model artifact and a requirements.txt for any additional libraries you will utilize in your inference script.


In this case we have no model artifacts as we will directly load the model from the HuggingFace Hub in our inference script.

In our Inference Script (model.py) we can create a class that captures both model loading and inference.

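import logging
from djl_python import Input, Output
from transformers import AutoModel, AutoTokenizer

class BartModel(object):
    """Deploying BART with DJL Serving."""

    def __init__(self):
        self.initialized = False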

Our initialize method will parse our serving.properties file and load the BART Model and Tokenizer from the HuggingFace Model Hub. The properties object essentially contains everything you have defined in that file.

    def initialize(self, properties: dict):
        """Initialize model."""
        # model_id and task come from the serving configuration
        self.model_name = properties.get("model_id")
        self.task = properties.get("task")
        # Load the model and tokenizer once, using the configured model_id
        self.model = AutoModel.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.initialized = True

We then define an inference method which accepts a string input and tokenizes the text for the BART Model inference that we can copy from the local inference example above.

    def inference(self, inputs):
        """
        Custom service entry point function.

        :param inputs: the Input object holds the text for the BART model to infer upon
        :return: the Output object to be sent back
        """
        # sample input: "This is the sample text that I am passing in"
        try:
            data = inputs.get_as_string()
            inputs = self.tokenizer(data, return_tensors="pt")
            preds = self.model(**inputs)
            # convert to a JSON-serializable object
            res = preds.last_hidden_state.detach().cpu().numpy().tolist()
            outputs = Output()
            outputs.add_as_json(res)
        except Exception as e:
            logging.exception("inference failed")
            # error handling
            outputs = Output().error(str(e))

        return outputs

We then instantiate this class and wrap all of this in the “handle” function. By default, this is the function that DJL Serving looks for in the inference script.

_service = BartModel()

def handle(inputs: Input):
    """Default handler function."""
    if not _service.initialized:
        # stateful model: load it once, on the first request
        _service.initialize(inputs.get_properties())

    if inputs.is_empty():
        return None

    return _service.inference(inputs)

We now have all the necessary artifacts on the DJL Serving side and can configure these files to fit the SageMaker constructs to create a Real-Time Endpoint.

SageMaker Endpoint Creation & Inference

For creating a SageMaker Endpoint the process is very similar to that of other Model Servers such as MMS. We need two artifacts to create a SageMaker Model Entity:

  • model.tar.gz: This will contain our DJL specific files and we organize these in a format that the model server expects.
  • Container Image: SageMaker Inference always expects a container. In this case we use the DJL DeepSpeed image provided and maintained by AWS.

We can create our model tarball, upload it to S3 and then retrieve our image to get the artifacts ready for Inference.

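import subprocess
import boto3
import sagemaker
from sagemaker import image_uris

# Session setup: region, default bucket, and an S3 resource for the upload
region = boto3.Session().region_name
bucket = sagemaker.Session().default_bucket()
s3 = boto3.resource("s3")

# retrieve DeepSpeed image
img_uri = image_uris.retrieve(framework="djl-deepspeed",
                              region=region, version="0.21.0")

# create model tarball with the DJL files at the archive root
bashCommand = "tar -cvpzf model.tar.gz model.py requirements.txt serving.properties"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

# Upload tar.gz to bucket
model_artifacts = f"s3://{bucket}/model.tar.gz"
response = s3.meta.client.upload_file('model.tar.gz', bucket, 'model.tar.gz')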
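If you would rather not shell out to tar, the same archive can be built with Python's standard tarfile module. This is a minimal sketch, not the author's code: `build_model_tarball` is a hypothetical helper, and it assumes the three DJL files should sit at the root of the archive.

```python
import tarfile
from pathlib import Path


def build_model_tarball(files, tarball_path="model.tar.gz"):
    """Pack the given files at the archive root and return the member names."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        for file_path in files:
            # arcname drops any leading directories so files land at the root
            tar.add(file_path, arcname=Path(file_path).name)

    # Re-open the archive to confirm what was actually written
    with tarfile.open(tarball_path, "r:gz") as tar:
        return tar.getnames()
```

A call such as `build_model_tarball(["model.py", "serving.properties", "requirements.txt"])` returns the member names, which is a quick check that nothing was left out before uploading to S3.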

We can then utilize the Boto3 SDK to conduct our Model, Endpoint Configuration, and Endpoint creation. The only change from the usual three API calls is that in the Endpoint Configuration API call we specify Model Download Timeout and Container Health Check Timeout parameters to higher numbers as we are dealing with a larger model in this case. We also utilize a g5 family instance for the additional GPU compute power. For most LLMs, GPUs are mandatory to be able to host models at this size and scale.

from time import gmtime, strftime

client = boto3.client(service_name="sagemaker")
# IAM role for the model; get_execution_role() works inside SageMaker notebooks
role = sagemaker.get_execution_role()

model_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)
create_model_response = client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": img_uri, "ModelDataUrl": model_artifacts},
)
print("Model Arn: " + create_model_response["ModelArn"])

endpoint_config_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

production_variants = [
    {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InitialInstanceCount": 1,
        "InstanceType": 'ml.g5.12xlarge',
        # raised timeouts to allow for larger model downloads and startup
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    }
]

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

endpoint_config_response = client.create_endpoint_config(**endpoint_config)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

endpoint_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
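The create_endpoint call returns right away, while the endpoint itself can take several minutes to provision. One way to block until it is ready is a simple polling loop around describe_endpoint, sketched below. `wait_for_endpoint` is a hypothetical helper, and the polling interval and attempt limit are arbitrary assumptions; Boto3 also ships an `endpoint_in_service` waiter that serves the same purpose.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll describe_endpoint until the endpoint leaves the Creating state.

    Returns the final status string (for example "InService" or "Failed").
    """
    for _ in range(max_polls):
        description = client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status != "Creating":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} is still being created")
```

With the SageMaker client from the previous step, `wait_for_endpoint(client, endpoint_name)` would typically be called before sending any inference request, and the returned status should be checked for "InService".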

Once the endpoint is in service, we can perform a sample inference with the invoke_endpoint API call. The response is a JSON-encoded nested list holding the model's last hidden states.

import json

runtime = boto3.client(service_name="sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/plain",
    Body="I think my dog is really cute!",
)
result = json.loads(response['Body'].read().decode())
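To make sense of the response, it helps to look at its shape: one entry per input sequence, one vector per token, and one value per hidden dimension (1024 for facebook/bart-large). The sketch below uses a tiny hand-written list in place of a real response. Mean pooling over tokens is shown only as one common way to get a single vector per input, not something the handler above does, and both helper functions are hypothetical.

```python
def embedding_shape(result):
    """Return (batch, tokens, hidden) for a nested-list hidden-state response."""
    return (len(result), len(result[0]), len(result[0][0]))


def mean_pool(result):
    """Average the token vectors of each sequence into one vector per input."""
    pooled = []
    for sequence in result:
        num_tokens = len(sequence)
        # zip(*sequence) iterates over hidden dimensions across all tokens
        pooled.append([sum(column) / num_tokens for column in zip(*sequence)])
    return pooled


# Toy stand-in for an endpoint response: 1 sequence, 2 tokens, 2 hidden dims
toy_result = [[[1.0, 2.0], [3.0, 4.0]]]
toy_shape = embedding_shape(toy_result)
toy_pooled = mean_pool(toy_result)
print(toy_shape, toy_pooled)
```

Applied to the `result` returned by the endpoint, `embedding_shape(result)` is a quick sanity check that the hidden dimension matches what you expect for the model you deployed.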

Additional Resources & Conclusion

You can find the code for the entire example at the link above. LLM Hosting is still a growing space with many challenges that DJL Serving can help simplify. Paired with the hardware and optimizations SageMaker provides this can help enhance your inference performance for LLMs.

As always feel free to leave any feedback or questions around the article. Thank you for reading and stay tuned for more content in the LLM space.
