Introduction
Parameter-efficient fine-tuning (PEFT) is used primarily in the context of large-scale pre-trained models (such as those in NLP) to adapt a pre-trained model to a new task without drastically increasing the number of trainable parameters. The challenge is this: modern pre-trained models (like BERT, GPT, T5, etc.) contain hundreds of millions, if not billions, of parameters. Fine-tuning all of these parameters on a downstream task, especially when the available dataset for that task is small, can easily lead to overfitting; the model may simply memorize the training data instead of learning genuine patterns. Moreover, introducing additional layers or parameters during fine-tuning can drastically increase computational requirements and memory consumption.
PEFT fine-tunes only a small number of model parameters while freezing most of the parameters of the pre-trained LLM. This also helps mitigate the catastrophic forgetting issue that fully fine-tuned LLMs face, where the model loses the capabilities it was originally trained for after being fine-tuned on a new task.
Why?
Parameter-efficient fine-tuning is useful for the following reasons:
- Reduced computational costs (requires fewer GPUs and less GPU time).
- Faster training times (training finishes sooner).
- Lower hardware requirements (works with cheaper GPUs with less VRAM).
- Better modeling performance (reduces overfitting).
- Less storage (the majority of weights can be shared across different tasks).
Methods
Soft Prompt Tuning
Soft prompt tuning prepends a trainable tensor to the model's input embeddings, essentially creating a soft prompt that conditions a frozen language model to perform specific downstream tasks. Unlike discrete text prompts, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.
Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task and enables mixed-task inference with the original pre-trained model. With a T5 "XXL" model, each copy of the tuned model requires 11 billion parameters. By contrast, a tuned prompt of length 5 tokens requires only 20,480 parameters per task (5 tokens times the 4,096-dimensional embedding size), a reduction of over five orders of magnitude.
Thus, instead of using discrete text prompts, prompt tuning employs soft prompts. Soft prompts are learnable and conditioned through backpropagation, making them adaptable for specific tasks.
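As an illustration, here is a minimal PyTorch sketch of the idea (the class name, initialization, and prompt length are illustrative assumptions, not taken from the paper): a small trainable tensor is prepended to the frozen token embeddings and is the only component updated by backpropagation.

```python
import torch
import torch.nn as nn

class SoftPromptEmbedding(nn.Module):
    """Prepends a trainable soft prompt to frozen token embeddings."""

    def __init__(self, token_embedding: nn.Embedding, prompt_length: int = 5):
        super().__init__()
        self.token_embedding = token_embedding              # frozen pre-trained embeddings
        self.token_embedding.weight.requires_grad = False
        embed_dim = token_embedding.embedding_dim
        # Only these prompt_length x embed_dim parameters are trained.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) -> output: (batch, prompt_len + seq_len, embed_dim)
        tok_emb = self.token_embedding(input_ids)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        return torch.cat([prompt, tok_emb], dim=1)
```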
Use cases:
- Soft prompt tuning: particularly useful when a pre-trained model needs to be adapted to various downstream tasks without adding significant computational overhead. Since the soft prompts are learned and optimized, they can capture nuanced information necessary for the task.
- Prompting with extra context: often used when fine-tuning isn't feasible or when working with models in a zero-shot or few-shot setting. It is a way to leverage the vast knowledge contained in large pre-trained models by guiding their behavior with carefully crafted prompts.
Prefix Tuning
Proposed in Prefix-Tuning: Optimizing Continuous Prompts for Generation, prefix-tuning is a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix).
Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
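A minimal sketch of the mechanism is shown below (shapes, names, and the prefix length are assumptions for illustration, not the paper's implementation): each transformer layer gets a trainable prefix of key/value vectors that the frozen attention can attend to.

```python
import torch
import torch.nn as nn

class PrefixKV(nn.Module):
    """Trainable per-layer prefix key/value pairs for prefix-tuning."""

    def __init__(self, num_layers: int, num_heads: int, head_dim: int, prefix_len: int = 10):
        super().__init__()
        # One (key, value) prefix per layer; only these tensors are trained.
        self.prefix_keys = nn.Parameter(torch.randn(num_layers, num_heads, prefix_len, head_dim) * 0.02)
        self.prefix_values = nn.Parameter(torch.randn(num_layers, num_heads, prefix_len, head_dim) * 0.02)

    def extend(self, layer: int, keys: torch.Tensor, values: torch.Tensor):
        """Prepend the prefix to the frozen model's keys/values for one layer.

        keys, values: (batch, num_heads, seq_len, head_dim)
        """
        batch = keys.size(0)
        pk = self.prefix_keys[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        pv = self.prefix_values[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        return torch.cat([pk, keys], dim=2), torch.cat([pv, values], dim=2)
```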
Adapters
Adapter layers, often termed “Adapters”, add minimal additional parameters to the pretrained model. These adapters are inserted between existing layers of the network.
Adapters are a PEFT technique shown to achieve performance similar to tuning the top layers of the network while requiring roughly two orders of magnitude fewer parameters.
Adapter-based tuning simply inserts new modules called “adapter modules” between the layers of the pre-trained network.
During fine-tuning, only the parameters of these adapter layers are updated, while the original model parameters are kept fixed. Because the full pre-trained model stays frozen and the adapter modules are the only optimizable components, only a small number of parameters are introduced per task, yielding "compact" models. Adapters offer several benefits:
- Parameter-Efficiency: By keeping the main model frozen and only updating the adapter layers, a minimal number of parameters are added per task. This results in compact models that are memory-efficient.
- Performance: Despite the small parameter footprint, adapters often achieve performance comparable to conventional fine-tuning.
The adapter module consists of two fully connected layers with a bottleneck structure. This structure is inspired by autoencoders, which are designed to encode information into a compressed representation and then decode it back to its original form.
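A minimal PyTorch sketch of such a bottleneck adapter is shown below (class name, bottleneck size, and near-identity initialization are illustrative assumptions); it would be inserted after a frozen sub-layer and adds its output back via a residual connection.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # encode to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # decode back to the hidden size
        # Near-identity initialization so the frozen model's behavior is preserved at the start.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```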
LLaMA Adapters
Learnable Adaption Prompts: LLaMA-Adapter prepends a small set of learnable adaption prompts to the word tokens at the higher transformer layers of a frozen LLaMA model. These prompts act as a learnable prefix over the pre-trained knowledge, effectively guiding the subsequent contextual response generation.
Zero-initialized Attention Mechanism: To avoid disturbances from randomly initialized adaption prompts, which can harm fine-tuning stability and effectiveness, the paper proposes a zero-initialized attention mechanism with a learnable gating factor. This mechanism allows for a stable learning process and progressive incorporation of instructional signals during training. It ensures that the newly acquired instructional signals are effectively integrated into the transformer while retaining the pre-trained knowledge of LLaMA.
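A minimal sketch of the zero-initialized gating idea follows (simplified and hypothetical; the paper applies the gate inside the attention computation itself): the contribution of the adaption prompts is scaled by a learnable gate that starts at zero, so training begins from the unmodified pre-trained model and instructional signals are blended in progressively.

```python
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    """Learnable gate, initialized to zero, that scales the adaption-prompt contribution."""

    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # starts at 0: no disturbance initially

    def forward(self, base_output: torch.Tensor, prompt_output: torch.Tensor) -> torch.Tensor:
        # Progressively incorporate the instructional signal as the gate is learned.
        return base_output + torch.tanh(self.gate) * prompt_output
```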
Reparameterization - LoRA
Rank of a matrix = a measure of the number of linearly independent rows or columns in the matrix.
The rank of a tensor refers to the number of dimensions in the tensor.
LoRA introduces two trainable low-rank matrices, A and B, which capture the adaptation. A is initialized as a random Gaussian matrix and B as a zero matrix, so the update starts at zero. During training, the product of B and A forms a low-rank update matrix that is added to the original, pre-trained weight matrix W to produce the adapted model output h (i.e., h = Wx + BAx). This approach allows for efficient adaptation by modifying only a small subset of parameters while keeping the pre-trained weights W frozen.
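A minimal PyTorch sketch of a LoRA-augmented linear layer is shown below (the rank, scaling factor, and initialization scale are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update B @ A."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear                                     # pre-trained W (and bias), kept frozen
        for p in self.linear.parameters():
            p.requires_grad = False
        in_f, out_f = linear.in_features, linear.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)    # random Gaussian init
        self.B = nn.Parameter(torch.zeros(out_f, rank))          # zero init: update starts at 0
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (B A) x, with only A and B receiving gradients
        return self.linear(x) + self.scaling * (x @ self.A.T @ self.B.T)
```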
Limitations
While LoRA offers significant advantages in terms of parameter efficiency and memory savings, it also has some limitations. One notable limitation is the complexity involved in batching inputs for different tasks when using distinct low-rank matrices A and B. If the goal is to absorb A and B into the combined weight matrix W to avoid additional inference latency, it becomes challenging to batch inputs from different tasks in a single forward pass, because each task would require a different set of A and B matrices.
Additionally, although it is possible to avoid merging the weights and dynamically select the appropriate LoRA modules for each sample in a batch, this approach is more feasible in scenarios where latency is not a critical concern. This workaround does not fully address the need for seamless integration when low-latency inference is required across multiple tasks.
For more information, refer to Aman's notes.
QLoRA
QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
At a top-level, QLoRA operates based on the following steps:
- Quantize the pre-trained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (similar to LoRA).
- Finetune only the adapter layers while using the frozen quantized model for context.
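A minimal sketch of this recipe using the Hugging Face transformers, peft, and bitsandbytes libraries is shown below (the model name and all hyperparameters are placeholder assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: quantize the pre-trained model to 4 bits and freeze it.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Step 2: attach small, trainable LoRA adapter layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Step 3: only the adapter parameters are trainable; the 4-bit base model stays frozen.
model.print_trainable_parameters()
```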
DoRA
Weight-Decomposed Low-Rank Adaptation (DoRA) is a novel Parameter-Efficient Fine-Tuning (PEFT) method that surpasses existing techniques like LoRA by decomposing pre-trained weights into magnitude and directional components for efficient fine-tuning. This method is designed to bridge the accuracy gap between LoRA-based methods and full fine-tuning, without increasing inference costs.
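A simplified sketch of the weight-decomposition idea follows (the class name, rank, and initialization are assumptions): the pre-trained weight is split into a trainable column-wise magnitude and a direction that is updated with a LoRA-style low-rank term and then re-normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch of DoRA: magnitude/direction decomposition with a low-rank directional update."""

    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        # Frozen pre-trained weight W (out_features x in_features) and bias.
        self.register_buffer("weight", linear.weight.detach().clone())
        self.bias = None if linear.bias is None else nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Trainable magnitude: the L2 norm of each column of W.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=0))
        # LoRA-style low-rank directional update.
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direction = self.weight + self.B @ self.A                     # updated direction
        direction = direction / direction.norm(p=2, dim=0, keepdim=True)
        w = self.magnitude * direction                                # rescale columns by magnitude
        return F.linear(x, w, self.bias)
```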
LoReFT
Representation Finetuning (ReFT) is a suite of methods to modify the hidden representations of language models (LMs) for task-specific adaptation. Unlike traditional parameter-efficient finetuning (PEFT) methods that adapt by modifying weights, ReFT manipulates a small fraction of model representations, enhancing the interpretability and flexibility of the interventions.
A key variant within ReFT, named Low-rank Linear Subspace ReFT (LoReFT), leverages a low-rank projection matrix to edit representations in a linear subspace. This approach is demonstrated to be 10×–50× more parameter-efficient than existing state-of-the-art PEFTs like LoRA.
The figure below from the paper shows an illustration of ReFT. (1) The left panel depicts an intervention I: the intervention function Φ is applied to hidden representations at positions P in layer L. (2) The right panel depicts the hyperparameters tuned when experimenting with LoReFT. Specifically, it shows LoReFT applied at all layers with prefix length p = 2 and suffix length s = 2. When layer weights are not tied, separate intervention parameters are trained at each position and layer, resulting in 16 interventions with unique parameters in this example.
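A simplified sketch of a LoReFT-style intervention is given below (parameter names and initialization are assumptions; the paper additionally constrains the projection R to have orthonormal rows, which is omitted here): a hidden representation h is edited only within the low-rank subspace defined by R, as h + Rᵀ(Wh + b − Rh).

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """Edit a hidden representation inside a low-rank linear subspace."""

    def __init__(self, hidden_dim: int, rank: int = 4):
        super().__init__()
        self.R = nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)  # low-rank projection
        self.W = nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)  # learned linear map
        self.b = nn.Parameter(torch.zeros(rank))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (..., hidden_dim); only the component inside the subspace spanned by R changes.
        target = h @ self.W.T + self.b   # desired values in the subspace, (..., rank)
        current = h @ self.R.T           # current values in the subspace, (..., rank)
        return h + (target - current) @ self.R
```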
BitFit
BitFit’s methodology leverages the observation that fine-tuning often doesn’t require extensive retraining of all parameters. Instead, fine-tuning only the bias terms achieves competitive results compared to full model fine-tuning, especially with small to medium-sized datasets. In scenarios permitting slight performance degradation, the method can be constrained to adjust only two specific types of bias terms, representing just 0.04% of the total model parameters. Implementation details include freezing the transformer-encoder’s main weights and training only the bias terms along with a task-specific classification layer. This approach allows the model to handle multiple tasks efficiently in a streaming fashion without requiring simultaneous access to all task datasets.
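A minimal PyTorch sketch of the idea (the model checkpoint and classification head are placeholders): freeze everything except the bias terms of the pre-trained encoder, and train those alongside a task-specific classification layer.

```python
import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")      # any pre-trained encoder
classifier = nn.Linear(model.config.hidden_size, 2)         # task-specific head (fully trained)

# BitFit: train only the bias terms of the otherwise frozen encoder.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} bias tensors are trainable")
```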
References
https://aman.ai/primers/ai/parameter-efficient-fine-tuning/
https://lightning.ai/lightning-ai/studios/code-lora-from-scratch?section=featured