What Is LoRA?
Training large language models is expensive. LoRA offers a shortcut: instead of updating billions of parameters, you train two small matrices that together approximate the update you need. Here is how that works, step by step.
The Problem LoRA Solves
Large language models like GPT or LLaMA have billions of parameters. When you fine-tune such a model the traditional way, you update every single one of those parameters during training. That means you need enough GPU memory to hold the entire model plus all the optimizer states and gradients for every parameter.
For a 7-billion parameter model, full fine-tuning can easily require 4 to 8 high-end GPUs. For models at the 70B scale, the hardware costs become prohibitive for most teams.
The core question LoRA answers is: do we really need to update all those billions of parameters, or can we get away with updating far fewer?
The original LoRA paper (Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”, 2021) demonstrated something interesting: the weight changes that happen during fine-tuning tend to have a low intrinsic rank. In plain language, that means the meaningful updates can be captured by matrices that are much smaller than the original weight matrices.
Thinking in Matrices
To understand how LoRA works, it helps to think about what a neural network layer actually is. When I first started learning about LLMs, I visualized layers like this:
A simplified view of how data flows through a neural network, with weights on the connections.
This visualization shows individual nodes and connections. But to understand LoRA, it is more useful to think of each layer as a matrix of weights. Let us simplify and represent a single layer as a 4×4 matrix:
A weight matrix represented as a 4×4 grid.
In a real model, this matrix might be 4096×4096, containing about 16 million parameters. During full fine-tuning, you would update all 16 million values. That is where LoRA comes in.
The Low-Rank Decomposition
Instead of directly modifying all 16 weights in our example matrix, LoRA creates two smaller matrices that together can represent an update to the original. The key insight is that we can decompose a large matrix into the product of two much smaller ones.
For our 4×4 example, we might create:
- Matrix A with dimensions 1×4 (4 parameters)
- Matrix B with dimensions 4×1 (4 parameters)
Two small LoRA matrices. Together they have only 8 parameters instead of 16.
The trick is: when we multiply these two matrices together, we get a matrix that is the same size as our original 4×4 weight matrix:
Multiplying B (4×1) by A (1×4) produces a full 4×4 matrix that we can add to the original weights.
This resulting matrix gets added to the original weight matrix, scaled by a factor that controls how much influence the LoRA update has.
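To make this concrete, here is the 4×4 example in NumPy. The values are random placeholders, chosen only to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)

W0 = rng.standard_normal((4, 4))  # frozen base weights: 4x4, 16 parameters
B = rng.standard_normal((4, 1))   # LoRA matrix B: 4x1, 4 parameters
A = rng.standard_normal((1, 4))   # LoRA matrix A: 1x4, 4 parameters

delta = B @ A  # multiplying them recovers a full 4x4 matrix
W = W0 + delta  # which can be added to the original weights

print(delta.shape)  # (4, 4)
```

Note that `delta` has rank 1: every row is a scaled copy of `A`. That is the constraint LoRA accepts in exchange for training 8 parameters instead of 16.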
The Formula
The complete LoRA update is expressed as:

$$W = W_0 + \frac{\alpha}{r} \, B A$$

Or more precisely, with dimensions shown:

$$W_{d \times k} = W_0 + \frac{\alpha}{r} \, B_{d \times r} \, A_{r \times k}$$

Where:

- $W$ is the final weight matrix used during inference
- $W_0$ is the original pre-trained weight matrix (frozen, never modified)
- $\frac{\alpha}{r}$ is the effective scaling factor. Dividing $\alpha$ by the rank keeps the learning rate roughly independent of rank, so you can change $r$ without retuning other hyperparameters
- $B$ and $A$ are the two small LoRA matrices that get trained
- $d$ and $k$ are the dimensions of the original matrix
- $r$ is the rank of the LoRA decomposition (typically much smaller than $d$ and $k$)
The rank $r$ is the key parameter. In our simplified example, $r = 1$. In practice, typical values are 4, 8, 16, or 32. Even with $r = 32$, you are training a tiny fraction of the original parameters.
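As a sketch of the formula in code (a small 16×16 layer with hypothetical values $\alpha = 16$ and $r = 8$; note that the original paper initializes $A$ randomly and $B$ to zero, so training starts from the unmodified model):

```python
import numpy as np

d, k, r, alpha = 16, 16, 8, 16
rng = np.random.default_rng(1)

W0 = rng.standard_normal((d, k))  # frozen pre-trained weights
B = np.zeros((d, r))              # B starts at zero, so the initial update is zero
A = rng.standard_normal((r, k))   # A starts random (the paper's initialization)

W = W0 + (alpha / r) * (B @ A)    # the LoRA update formula

print(np.allclose(W, W0))  # True: with B = 0, the adapted model starts unchanged
```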
The Numbers Tell the Story
Let us scale this up to realistic dimensions. A weight matrix in a typical transformer layer might be 4096×4096:
| Approach | Parameters | Percentage |
|---|---|---|
| Full fine-tuning | 16,777,216 | 100% |
| LoRA (rank 4) | 32,768 | 0.2% |
| LoRA (rank 8) | 65,536 | 0.4% |
| LoRA (rank 16) | 131,072 | 0.8% |
| LoRA (rank 64) | 524,288 | 3.1% |
Even at rank 64, you are training less than 4% of the parameters. At rank 8, it is less than half a percent.
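These counts follow directly from the shapes: full fine-tuning trains $d \times k$ parameters, while LoRA trains $d \times r + r \times k$. A few lines of Python reproduce the table:

```python
d = k = 4096
full = d * k  # 16,777,216 parameters for full fine-tuning

for r in (4, 8, 16, 64):
    lora = d * r + r * k  # parameters in B (d x r) plus A (r x k)
    print(f"rank {r:2d}: {lora:>9,} params ({100 * lora / full:.1f}%)")
# prints 0.2%, 0.4%, 0.8%, and 3.1% for ranks 4, 8, 16, and 64
```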
You Still Use the Full Model
One thing that can be confusing: during LoRA fine-tuning, you still run the full model for every training step. The forward pass goes through all the original weights. The difference is in what gets updated during the backward pass.
With full fine-tuning, the backward pass computes gradients for every parameter and updates all of them. With LoRA, gradients are only computed and applied to the small A and B matrices. The original weights remain frozen.
During inference, the input flows through both the frozen weights and the LoRA adapter. Their outputs are summed.
This is where the efficiency comes from. The forward pass costs the same either way. But the optimizer only needs to track states for the LoRA parameters. For a 7 billion parameter model with LoRA matrices totaling a few million parameters, the optimizer memory drops by orders of magnitude.
At inference time, you can merge the LoRA weights back into the base model using the formula above. The result is a single weight matrix with no additional inference latency, but you got there with a fraction of the training cost.
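The equivalence of the two paths can be checked in a few lines of NumPy. Shapes and values here are illustrative, not a particular library's API:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r, alpha = 8, 8, 2, 16
W0 = rng.standard_normal((d, k))  # frozen base weights
B = rng.standard_normal((d, r))   # trained LoRA matrices
A = rng.standard_normal((r, k))

W_merged = W0 + (alpha / r) * (B @ A)  # one-time merge into a single matrix

x = rng.standard_normal((1, k))
# Adapter path: base output plus scaled low-rank path...
y_adapter = x @ W0.T + (alpha / r) * ((x @ A.T) @ B.T)
# ...gives the same result as the merged weights:
y_merged = x @ W_merged.T
print(np.allclose(y_adapter, y_merged))  # True
```

The merged version costs a single matrix multiply at inference time, which is why LoRA adds no latency once deployed.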
Practical Implications
The parameter reduction has cascading benefits:
Memory efficiency. Since you only compute gradients for the adapter matrices, you need far less GPU memory. Fine-tuning a 7B model becomes possible on a single consumer GPU.
Adapter swapping. Because the base model stays frozen, you can train multiple LoRA adapters for different tasks and swap them at inference time. One adapter for medical text, another for legal documents, a third for code generation - all sharing the same base model weights.
Small file sizes. A LoRA adapter for a 7B model might be 10-50 MB, compared to 14+ GB for the full model. This makes sharing and distributing fine-tuned models far more practical.
Reduced catastrophic forgetting. Full fine-tuning risks overwriting knowledge the model learned during pretraining. Because LoRA keeps the original weights frozen, the base model’s knowledge is largely preserved. The adapter adds to it rather than replacing it.
The Trade-off
LoRA is not without limitations. By constraining updates to a low-rank subspace, you are making an assumption that the needed changes can be expressed in that subspace. For most fine-tuning tasks, this assumption holds well. But for tasks that require large, complex changes to model behavior, full fine-tuning may still produce better results.
The rank parameter is the main knob for controlling this trade-off. Higher rank means more expressiveness but also more parameters to train and more memory. In practice, ranks between 4 and 64 cover most use cases, with 8 or 16 being common starting points.
Wrapping Up
LoRA offers an efficient path to fine-tuning large language models by exploiting the low-rank nature of weight updates. Instead of modifying billions of parameters, you train two small matrices per layer and merge the result back into the original weights. The math is straightforward, the savings are dramatic, and the quality holds up well for most tasks.
It has become the default approach for fine-tuning in practice, and for good reason.
- Jan Willem