Dive Into LoRA Adapters. Exploring Parameter Efficient…


Let’s check out some code excerpts for our small, illustrative example. You can find the full code in the accompanying notebook; a more complete implementation, which we use in the following articles, is in the same repository.

Let’s start with how we set up an adapter. We pass in a reference to the module to be adapted, which we now call the adaptee. We store a reference to its original forward method and then point the adaptee’s forward method to the adapter’s forward implementation.

import math

import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    def __init__(self,
                 adaptee,  # <- module to be adapted
                 r):
        super().__init__()

        self.r = r
        self.adaptee = adaptee

        # Store a pointer to the original forward implementation
        # of the module to be adapted.
        # Then point its forward method to this adapter module.
        self.orig_forward = adaptee.forward
        adaptee.forward = self.forward
        [..]

Now that we have set up the mechanics of the integration, we also initialize the parameters of our low-rank matrices. Note that we initialize one matrix with zeros and the other randomly:

        [..]
        # Adding the weight matrices directly to the adaptee,
        # which makes it more practical to report the parameters
        # and to remove them later.
        adaptee.lora_A = nn.Parameter(
            torch.randn(adaptee.in_features, r) /
            math.sqrt(adaptee.in_features))
        adaptee.lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))

And finally, still part of the LoRAAdapter class, we have our forward method. It first calls the adaptee’s forward method with our input x; that is the original path executed in the original module. We then add to that result the output of our adapted branch, where we matrix-multiply the input x with A and B.

    def forward(self, x, *args, **kwargs):
        return (
            self.orig_forward(x, *args, **kwargs) +
            x @ self.adaptee.lora_A @ self.adaptee.lora_B
        )

This simplicity looks elegant to my eye.
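
As a quick usage sketch (my illustration, not code from the notebook): once a module is wrapped, calling the adaptee directly dispatches to the adapter’s forward method, and the low-rank matrices are visible as attributes of the adaptee:

import torch
import torch.nn as nn

layer = nn.Linear(768, 768)        # the module to be adapted
adapter = LoRAAdapter(layer, r=2)

x = torch.randn(4, 768)
y = layer(x)                       # now dispatches to adapter.forward
print(y.shape)                     # torch.Size([4, 768])
print(layer.lora_A.shape)          # torch.Size([768, 2])
print(layer.lora_B.shape)          # torch.Size([2, 768])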

There are more details that could be interesting, but they are best explained alongside code. You can find them in the accompanying notebook; a sketch of the first steps follows this list:

  • How to first freeze the whole model
  • How to then unfreeze the classifier, as it is specific to our downstream task and we train it fully.
  • How to add the adapters, which are all active, i.e., not frozen.
  • Reviewing how the dimensions of the adapted module’s weight matrix relate to the two lower-rank matrices A and B.
  • How much smaller the number of parameters gets when using a small value for r.
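
To make the freezing steps concrete, here is a minimal sketch, assuming a Hugging Face RoBERTa model for sequence classification whose task head is named classifier (the exact code is in the notebook):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")

# Freeze the whole model ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze the classifier, which we train fully.
for param in model.classifier.parameters():
    param.requires_grad = True

# Adapters added afterwards bring their own lora_A/lora_B matrices;
# nn.Parameters are trainable by default, so they stay active.

# Size comparison for the 3072x768 output.dense module with r=4:
#   original weight: 3072 * 768          = 2,359,296 (frozen)
#   LoRA matrices:   3072 * 4 + 4 * 768  =    15,360 (trainable, ~0.65%)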

The small excerpt below shows how the parameters of the original module output.dense are not trained (marked with a 0), while its LoRA matrices are trainable (marked with a 1), as is, of course, the model’s classifier (also marked with a 1):

[..]
roberta.encoder.layer.11.attention.output.LayerNorm.bias 0 768
roberta.encoder.layer.11.intermediate.dense.weight 0 2359296
roberta.encoder.layer.11.intermediate.dense.bias 0 3072
roberta.encoder.layer.11.output.dense.weight 0 2359296
roberta.encoder.layer.11.output.dense.bias 0 768
roberta.encoder.layer.11.output.dense.lora_A 1 12288
roberta.encoder.layer.11.output.dense.lora_B 1 3072
roberta.encoder.layer.11.output.LayerNorm.weight 0 768
roberta.encoder.layer.11.output.LayerNorm.bias 0 768
classifier.dense.weight 1 589824
classifier.dense.bias 1 768
classifier.out_proj.weight 1 1536
classifier.out_proj.bias 1 2
[..]
Total parameters: 124,978,946, thereof learnable: 923,906 (0.7392%)
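
For illustration, a minimal sketch of how such a report can be produced, continuing with the model from the sketch above (the notebook’s exact formatting may differ):

total, learnable = 0, 0
for name, param in model.named_parameters():
    print(name, int(param.requires_grad), param.numel())
    total += param.numel()
    if param.requires_grad:
        learnable += param.numel()
print(f"Total parameters: {total:,}, thereof learnable: "
      f"{learnable:,} ({100 * learnable / total:.4f}%)")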

Check out the notebook for more.

Further, the notebook contains some tests that show that the whole setup works mechanically.
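
One property is easy to test: because lora_B starts out as zeros, the adapted branch contributes nothing at initialization, so a wrapped module must initially reproduce the original output. A minimal sketch of such a check (my illustration, not necessarily the notebook’s exact test):

import torch
import torch.nn as nn

layer = nn.Linear(768, 768)
x = torch.randn(4, 768)
y_before = layer(x)

LoRAAdapter(layer, r=2)  # layer.forward now routes through the adapter
y_after = layer(x)

# x @ lora_A @ lora_B is exactly zero at initialization.
assert torch.allclose(y_before, y_after)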

Then we run our first experiment and submit the training jobs to SageMaker: a full fine-tuning of the original model, and a training run with LoRA enabled as described here.

For our test, we train RoBERTa Large [4] on the SST-2 dataset [5] with r=2, adapting the query and output parameters on all layers. We use learning rates of 5e-5 for the full fine-tuning and 4e-4 for the LoRA fine-tuning.
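
As a sketch of how that module selection could look, continuing with the model from the earlier sketch (the names assume the Hugging Face RoBERTa implementation, and reading “output” as the dense output projections is my assumption; the repository contains the actual selection logic):

import torch.nn as nn

# Hypothetical targeting of the query and output projections on all layers.
target_suffixes = ("attention.self.query", "output.dense")

for name, module in model.named_modules():
    if isinstance(module, nn.Linear) and name.endswith(target_suffixes):
        LoRAAdapter(module, r=2)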

That’s the result (more in the notebook):

full-finetuning accuracy: 0.944
lora-finetuning accuracy: 0.933

So is that … great? Not so great? Which is it? First, it clearly shows that the whole setup works on a mechanical level, and that is great. And an accuracy above 90% shows that it is working well.

But how well? What do we compare these numbers to? And how representative are these two individual training runs? Were we just lucky, or unlucky? The LoRA number is a bit worse than that of the traditional approach. Is that gap meaningful? And how well did we tune the traditional approach in the first place?

None of the above results are reliable. We don’t know whether a second run with the same hyperparameters would produce similar results. Also, we selected the hyperparameters with a semi-educated guess.

There is, of course, a better way. So in the next article we will apply a more serious approach to selecting hyperparameters and will evaluate the performance more systematically:

  • Establish baselines for comparisons
  • Search for good hyperparameters for both the baselines and the experiments
  • Most importantly: Deepen our understanding of the LoRA method and the impact of design decisions, aligning our intuitions in a data-driven fashion

Until then, I hope you had fun reading this article.

Thanks to Constantin Gonzalez, Ümit Yoldas, Valerio Perrone and Elina Lesyk for providing invaluable feedback during the writing of this article.

All images by the author unless otherwise noted.

[1] Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, 2020

[2] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, 2021

[3] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, 2023

[4] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019

[5] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013


