LoRA is a parameter-efficient way of adapting pretrained models. In standard fine-tuning, the model weights \(W\) are updated directly. LoRA instead keeps \(W\) frozen and learns an update \(\Delta W\), applying the adapted weights as \(W + \Delta W\). Moreover, instead of tuning \(\Delta W\) directly, which is the same size as \(W\), LoRA parameterizes it as a product of two low-rank matrices, \(\Delta W = A \cdot B\).
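
To make this concrete, here is a minimal sketch of a linear layer with a frozen \(W\) and trainable factors \(A\) and \(B\). It assumes PyTorch; the class name, rank, and initialization are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with frozen weight W plus a trainable low-rank update A @ B."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # Frozen pretrained weight W; random here, standing in for a real checkpoint.
        self.W = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: delta_W = A @ B has the same shape as W.
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.01)
        # B starts at zero, so delta_W = 0 and training begins from the pretrained model.
        self.B = nn.Parameter(torch.zeros(rank, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_W = self.A @ self.B            # (out_features, in_features)
        return x @ (self.W + delta_W).T      # y = x (W + delta_W)^T

layer = LoRALinear(in_features=10, out_features=10, rank=3)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 60 trainable
```

Only \(A\) and \(B\) receive gradients, and at inference time \(\Delta W\) can be merged into \(W\) once, so the adapted layer adds no extra latency.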

LoRA has two main advantages:

  • Because \(W\) stays fixed, a single pretrained model can be shared across downstream tasks, each with its own small \(\Delta W\).
  • As in classical low-rank matrix factorization, \(A\) and \(B\) have far fewer entries than \(W\). Suppose \(W\) is \(10 \times 10\); then \(A\) and \(B\) could be \(10 \times 3\) and \(3 \times 10\), respectively. Instead of 100 parameters, we learn \(30 + 30 = 60\) (the sketch after this list scales this up).
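
At \(10 \times 10\) the savings are modest, but they grow with the matrix size. A quick back-of-the-envelope check, using a hypothetical \(4096 \times 4096\) weight and rank 8:

```python
d, r = 4096, 8       # hypothetical hidden size and LoRA rank
full = d * d         # training delta_W directly: 16,777,216 parameters
lora = d * r + r * d # training A and B instead:      65,536 parameters
print(f"{full:,} vs {lora:,} -> {full // lora}x fewer")  # 256x fewer
```

Real transformer weight matrices are in this size regime, which is where the savings actually pay off.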

Neat.

References

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv:2106.09685, 2021.