Let’s check out some code excerpts for our small, illustrative example. You’ll find the full code in the accompanying notebook; a more complete implementation, which we use in the following articles, is in the same repository.

Let’s start with how we set up an adapter. We pass in a reference to the module to be adapted, which we now call the `adaptee`. We store a reference to its original `forward` method and then point the `adaptee`’s `forward` method to the adapter’s `forward` implementation.

```python
import math

import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    def __init__(self,
                 adaptee,  # <- module to be adapted
                 r):
        super().__init__()
        self.r = r
        self.adaptee = adaptee

        # Store a pointer to the original forward implementation
        # of the module to be adapted.
        # Then point its forward method to this adapter module.
        self.orig_forward = adaptee.forward
        adaptee.forward = self.forward
        [..]
```

Now that we have set up the mechanics of the integration, we also initialize the parameters of our low-rank matrices. Note that we initialize one matrix with zeros and the other randomly:

```python
        [..]
        # Adding the weight matrices directly to the adaptee,
        # which makes it more practical to report the parameters,
        # and to remove them later.
        adaptee.lora_A = nn.Parameter(torch.randn(adaptee.in_features, r) /
                                      math.sqrt(adaptee.in_features))
        adaptee.lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))
```

And finally, still part of the `LoRAAdapter` class, we have our `forward` method. It first calls the `adaptee`’s `forward` method with our input `x`; that is the original path executed in the original module. We then add that result to the one from our adapted branch, where we matrix-multiply the input `x` with `A` and `B`.

```python
    def forward(self, x, *args, **kwargs):
        return (
            self.orig_forward(x, *args, **kwargs) +
            x @ self.adaptee.lora_A @ self.adaptee.lora_B
        )
```

This simplicity looks elegant to my eye.
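
As a quick sanity check, here is a minimal usage sketch, assuming the `__init__` excerpts above are combined into one class: since `lora_B` starts out as zeros, wrapping a module should not change its output at first.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)           # stand-in for a module we want to adapt
x = torch.randn(4, 16)
y_before = layer(x)

adapter = LoRAAdapter(layer, r=2)  # re-routes layer.forward through the adapter
y_after = layer(x)                 # original path plus x @ lora_A @ lora_B

# lora_B is initialized with zeros, so the low-rank branch contributes nothing yet
# and the adapted output matches the original one.
assert torch.allclose(y_before, y_after)
```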

There are more details that could be interesting but are best explained alongside code. You’ll find these in the accompanying notebook:

- How to first freeze the whole model
- How to then unfreeze the classifier, which is specific to our downstream task and which we train completely
- How to add adapters, which are all active, not frozen (a minimal sketch of these first three steps follows this list)
- How the dimensions of the module’s matrix relate to the two lower-rank matrices `A` and `B`
- How much smaller the number of parameters gets when using a small value for `r`
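
Below is a minimal sketch of the first three steps, assuming a Hugging Face `roberta-base` sequence-classification model and the `LoRAAdapter` from above; the notebook’s actual helper code may be organized differently, and the rank here is only illustrative.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# 1) Freeze the whole model.
for param in model.parameters():
    param.requires_grad = False

# 2) Unfreeze the classifier, which is specific to our downstream task.
for param in model.classifier.parameters():
    param.requires_grad = True

# 3) Add adapters to the modules we want to adapt. Their lora_A/lora_B
#    parameters are created with requires_grad=True, so only they and the
#    classifier are trained.
adapters = []
for enc_layer in model.roberta.encoder.layer:
    adapters.append(LoRAAdapter(enc_layer.attention.self.query, r=4))
    adapters.append(LoRAAdapter(enc_layer.output.dense, r=4))
```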

A small excerpt below shows how the parameters of the original module `output.dense` are not trained (marked with a `0`), while its LoRA matrices are trainable (marked with a `1`), as is, of course, the model’s overall classifier (also marked as trainable with a `1`):

```
[..]
roberta.encoder.layer.11.attention.output.LayerNorm.bias  0      768
roberta.encoder.layer.11.intermediate.dense.weight        0  2359296
roberta.encoder.layer.11.intermediate.dense.bias          0     3072
roberta.encoder.layer.11.output.dense.weight              0  2359296
roberta.encoder.layer.11.output.dense.bias                0      768
roberta.encoder.layer.11.output.dense.lora_A              1    12288
roberta.encoder.layer.11.output.dense.lora_B              1     3072
roberta.encoder.layer.11.output.LayerNorm.weight          0      768
roberta.encoder.layer.11.output.LayerNorm.bias            0      768
classifier.dense.weight                                    1   589824
classifier.dense.bias                                      1      768
classifier.out_proj.weight                                 1     1536
classifier.out_proj.bias                                   1        2
[..]

Total parameters: 124,978,946, thereof learnable: 923,906 (0.7392%)
```
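
The counts in the listing follow directly from the shapes. As a small worked example for `output.dense` (the rank is inferred from the listing above and may differ in other runs):

```python
in_features, out_features = 3072, 768  # output.dense of a roberta-base layer
r = 4                                  # rank inferred from lora_A = 12,288 = 3072 * 4

frozen = in_features * out_features    # 2,359,296 frozen weight entries
lora_a = in_features * r               # 12,288
lora_b = r * out_features              # 3,072

print(f"trainable fraction for this module: {(lora_a + lora_b) / frozen:.4%}")
# -> roughly 0.65% of the original weight matrix
```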

Check out the notebook for more.

Further, you will see some tests in the notebook that show that the whole setup works mechanically.

Then we run our first experiment and submit the training jobs to SageMaker: a full finetuning of the original model, followed by a training run with LoRA enabled as described here.

For our test, we train RoBERTa Large [4] on the sst-2 dataset [5] with `r=2`, adapting the `query` and `output` parameters on all layers. We use learning rates of `5e-5` for the full finetuning and `4e-4` for the LoRA finetuning.
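
Purely as an illustration, the two runs can be summarized like this; these dictionaries are assumptions for readability, not the actual SageMaker job definitions from the repository.

```python
# Illustrative summary of the two training runs; names and structure are
# assumptions, not the repository's actual job configuration.
full_finetuning = {
    "model": "roberta-large",
    "dataset": "sst-2",
    "learning_rate": 5e-5,
    "lora": None,
}
lora_finetuning = {
    "model": "roberta-large",
    "dataset": "sst-2",
    "learning_rate": 4e-4,  # roughly an order of magnitude larger
    "lora": {"r": 2, "targets": ["query", "output"]},  # adapted on all layers
}
```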

That’s the result (more in the notebook):

```
full-finetuning accuracy: 0.944
lora-finetuning accuracy: 0.933
```

So that’s … great? Not so great? What is it? First, it clearly shows that the whole setup works on a mechanical level, which is great. And an accuracy over 90% shows that it is working well.

But how well? What do we compare these numbers to? How representative are these two individual training runs? Were we just lucky, or unlucky? The LoRA numbers are a bit worse than the traditional approach’s; is that gap meaningful? And how well did we tune the traditional approach in the first place?

None of the above results is reliable. We don’t know whether a second run with the same hyperparameters would produce similar results, and the hyperparameters themselves were chosen with a semi-educated guess.

There is, of course, a better way. In the next article we will take a more rigorous approach to selecting hyperparameters and will evaluate performance more systematically:

- Establish baselines for comparison
- Search for good hyperparameters for both the baselines and the experiments
- Most importantly: Deepen our understanding of the LoRA method and the impact of design decisions, aligning our intuitions in a data-driven fashion

Until then, I hope you had fun reading this article.

Thanks to Constantin Gonzalez, Ümit Yoldas, Valerio Perrone and Elina Lesyk for providing invaluable feedback during the writing of this article.

All images by the author unless otherwise noted.