We seek to develop an efficient approach that can continually update foundation models to accommodate novel information. Ideally, this approach should not require large-scale retraining, heavy computational resources, or additional parameters, and the resulting model should generalize as well as the original foundation model.
Starting from a large model pre-trained on vast sources of data, it is reasonable to assume that the model already has some basic or related knowledge of the new incoming data. We therefore hypothesize that there is an implicit modularity in the foundation model. Striving for efficiency and the preservation of generic knowledge, we propose identifying a small set of parameters corresponding to the tasks at hand and updating only those, instead of modifying all the pre-trained model parameters.
To start with, we restrict the changes to specific layers in the pre-trained transformer backbone. Inspired by model editing methods (ROME, MEMIT), we perform causal tracing to identify which transformer component (figure, left) contributes most to the final performance. We observe that changes to the first MLP layer, to which we localize the update, have a larger effect on the model predictions than changes to the attention layers, as shown in the figure (right).
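As a rough illustration of this kind of component-level attribution (a simplified proxy, not the exact causal-tracing protocol of ROME/MEMIT), one can noise the output of a single sub-module via a forward hook and measure how far the model's predictions move; sub-modules whose perturbation shifts the output most are treated as the most influential. The helper below and the toy model it probes are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def perturbation_effect(model, module, inputs, noise_scale=0.1):
    """Mean L2 change in the model's outputs when `module`'s output is noised."""
    model.eval()
    with torch.no_grad():
        clean_logits = model(inputs)

    # A forward hook that returns a tensor replaces the module's output.
    def add_noise(mod, inp, out):
        return out + noise_scale * torch.randn_like(out)

    handle = module.register_forward_hook(add_noise)
    with torch.no_grad():
        noisy_logits = model(inputs)
    handle.remove()

    return (noisy_logits - clean_logits).norm(dim=-1).mean().item()

if __name__ == "__main__":
    # Toy stand-in for a pre-trained backbone, just to exercise the helper.
    # On a real transformer, probe each block's attention and first MLP layer.
    model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 10))
    x = torch.randn(8, 16)
    print(perturbation_effect(model, model[0], x))
```

On an actual backbone, one would loop over the blocks and compare the effect of perturbing each attention module against perturbing each first MLP layer, which is the comparison shown in the figure.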
We propose to update, for each transformer block, a subset of parameters in the first MLP layer that is specialized to the task at hand, selected before training. To perform the selection, we define a scoring function that measures the importance of each parameter in the first MLP layer by accumulating its gradient magnitude. We then select the top 10% of parameters with the highest scores for updating. In practice, we pre-compute a gradient mask for the selected parameters and apply it to the gradients during training, as sketched below.
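The following is a minimal sketch of this selection-and-masking step, under assumed names (this is not the released code): accumulate per-parameter gradient magnitudes on the new task's data for the first MLP layer of each block, keep the top 10% as a binary mask, and zero out all other gradients at every training step. The attribute pattern `mlp.fc1`, `loader`, and `loss_fn` are illustrative assumptions.

```python
import torch

def compute_masks(model, loader, loss_fn, sparsity=0.10, num_batches=10):
    """Score first-MLP parameters by accumulated gradient magnitude,
    then keep the top `sparsity` fraction as a binary mask."""
    # Assumed naming convention: the first MLP layer of each block is `mlp.fc1`.
    target_params = {name: p for name, p in model.named_parameters()
                     if "mlp.fc1" in name}
    scores = {name: torch.zeros_like(p) for name, p in target_params.items()}

    model.train()
    for step, (x, y) in enumerate(loader):
        if step >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in target_params.items():
            scores[name] += p.grad.abs()  # accumulate gradient magnitude

    masks = {}
    for name, s in scores.items():
        k = max(1, int(sparsity * s.numel()))
        threshold = s.flatten().topk(k).values.min()
        masks[name] = (s >= threshold).float()  # 1 for selected parameters
    return masks

def masked_training_step(model, masks, x, y, optimizer, loss_fn):
    """One training step that updates only the selected parameters."""
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    for name, p in model.named_parameters():
        if name in masks:
            p.grad.mul_(masks[name])   # keep gradients of selected entries only
        elif p.grad is not None:
            p.grad.zero_()             # freeze all other parameters
    optimizer.step()
```

Because the masks are computed once before training, the per-step overhead reduces to an element-wise multiply on the masked layers' gradients.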
We compare our algorithm with standard fine-tuning methods, continual learning methods, and parameter-efficient fine-tuning methods on both fine-grained and coarse-grained tasks. The figure below shows the average accuracy gain of each method over all datasets and the average accuracy drop on the held-out dataset (ImageNet-1k). Our method achieves the best performance gain while showing almost no accuracy drop on the held-out dataset.
@inproceedings{zhang2024overcoming,
title={Overcoming Generic Knowledge Loss with Selective Parameter Update},
author={Zhang, Wenxuan and Janson, Paul and Aljundi, Rahaf and Elhoseiny, Mohamed},
booktitle={CVPR},
year={2024}
}