Overcoming Generic Knowledge Loss with Selective Parameter Update

1KAUST, 2Concordia University, 3Toyota Motor Europe
CVPR 2024

We continually update the foundation model with incoming data while efficiently preserving its pre-trained ability to generalize, by identifying and optimizing only the parameters relevant to the task at hand. To achieve this, we use the data from the task at hand to identify the relevant parameters in the foundation model, and then update only those parameters.

The principles of updating foundation models

We seek an efficient approach that continually updates foundation models to accommodate novel information. Ideally, this approach should require no large-scale retraining, heavy computational resources, or additional parameters, and the resulting model should generalize as well as the original foundation model.

Why and how to perform selective update

Starting from a large model pre-trained on vast sources of data, it is reasonable to assume that the model already holds some basic or related knowledge about the new incoming data. We therefore hypothesize that there is an implicit modularity in the foundation model. Striving for efficiency and for the preservation of generic knowledge, we propose identifying a small set of parameters corresponding to the task at hand and updating only those, instead of modifying all of the pre-trained model parameters.

Localization

Transformer Architecture

To start with, we restrict the changes to specific layers in the pre-trained transformer backbone. Inspired by model editing (ROME, MEMIT), we perform causal tracing to identify which transformer component (figure, left) contributes most to the final performance. We observe that changes to the first MLP layer, to which we localize the update, have a larger effect on the model predictions than changes to the attention layers, as shown in the figure (right).

Figure: causal scores for the visual tower (left) and the text tower (right).
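As a rough illustration of the tracing procedure, the sketch below (a simplified, hypothetical PyTorch version, not the paper's implementation) caches one layer's activation on a clean input, then re-runs a corrupted input while patching that layer back to its clean activation; the more the patched output recovers the clean prediction, the more causally important the layer is.

```python
import torch
import torch.nn as nn

def trace_causal_effect(model, layer_name, clean_x, corrupt_x):
    """ROME-style causal tracing sketch: cache a layer's activation on the
    clean input, then run the corrupted input with that layer patched back
    to its clean activation, and measure how much of the clean prediction
    is recovered (1.0 = fully recovered, 0.0 = no effect)."""
    layer = dict(model.named_modules())[layer_name]
    cache = {}

    # 1) clean run: record this layer's activation and the clean output
    def save(_module, _inputs, output):
        cache["act"] = output.detach()
    handle = layer.register_forward_hook(save)
    with torch.no_grad():
        clean_out = model(clean_x)
    handle.remove()

    # 2) corrupted run with the clean activation patched back in
    def patch(_module, _inputs, _output):
        return cache["act"]  # forward hooks may return a replacement output
    handle = layer.register_forward_hook(patch)
    with torch.no_grad():
        patched_out = model(corrupt_x)
    handle.remove()

    # 3) plain corrupted run as the baseline
    with torch.no_grad():
        corrupt_out = model(corrupt_x)

    # fraction of the clean-vs-corrupt gap recovered by restoring this layer
    denom = (clean_out - corrupt_out).norm() + 1e-8
    return ((patched_out - corrupt_out).norm() / denom).item()
```

In a transformer one would apply this per block, comparing the recovery score when restoring the first MLP layer versus the attention output, which is how the visual- and text-tower scores in the figure above are obtained.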

Sparse Update

We propose to update, for each transformer block, a subset of parameters in the first MLP layer that is specialized to the task at hand, selected before training. To perform the selection, we define a scoring function that measures the importance of each parameter in the first MLP layer by accumulating the magnitude of its gradient over the task data. We then select the top 10% of parameters with the highest scores for updating. In practice, we pre-compute a gradient mask for the selected parameters and apply it to the gradients during training.
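The selection step can be sketched as follows (a minimal, hypothetical PyTorch version for a single linear layer; function names and the loss setup are illustrative, not the paper's code): accumulate absolute gradients over the task data to score each weight, keep the top fraction as a binary mask, and zero out the gradients of all unselected weights during training.

```python
import torch
import torch.nn as nn

def compute_gradient_mask(layer, data, loss_fn, top_frac=0.1):
    """Score each weight of `layer` by its accumulated gradient magnitude
    over the task data, then return a binary mask keeping the top fraction
    of weights (1 = trainable, 0 = frozen)."""
    scores = torch.zeros_like(layer.weight)
    for inputs, targets in data:
        layer.weight.grad = None
        loss = loss_fn(layer(inputs), targets)
        loss.backward()
        scores += layer.weight.grad.abs()
    k = max(1, int(top_frac * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return (scores >= threshold).float()

def apply_gradient_mask(layer, mask):
    """Register a backward hook so only selected weights receive updates."""
    layer.weight.register_hook(lambda grad: grad * mask)
```

During training, `apply_gradient_mask` is called once before the optimization loop, so any standard optimizer then leaves the frozen 90% of the layer untouched.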


Quantitative Results

We compare our algorithm with standard fine-tuning, continual learning, and parameter-efficient fine-tuning methods on both fine-grained and coarse-grained tasks. The figure below shows each method's average accuracy gain over all datasets and its average accuracy drop on the held-out dataset (ImageNet-1k). Our method achieves the best performance gain with almost no accuracy drop on the held-out dataset.

More results and ablations

BibTeX


        @inproceedings{zhang2024overcoming,
          title={Overcoming Generic Knowledge Loss with Selective Parameter Update},
          author={Zhang, Wenxuan and Janson, Paul and Aljundi, Rahaf and Elhoseiny, Mohamed},
          booktitle={CVPR},
          year={2024}
        }