Xinghao Wu^1,*, Jianwei Niu^1,2, Xuefeng Liu, Mingjia Shi^3,
Guogang Zhu^1, and Shaojie Tang^4
^1 School of Computer Science and Engineering, Beihang University, Beijing, China
^2 Zhongguancun Laboratory, Beijing, China
^3 Sichuan University, Sichuan, China
^4 Department of Management Science and Systems, University at Buffalo
Abstract
In traditional Federated Learning approaches such as FedAvg, the global model underperforms under data heterogeneity. Personalized Federated Learning (PFL) enables clients to train personalized models that better fit their local data distributions. However, we surprisingly find that the feature extractor in FedAvg is superior to those in most PFL methods. More interestingly, by applying a linear transformation to the local features extracted by the feature extractor so that they align with the classifier, FedAvg can surpass the majority of PFL methods. This suggests that the primary cause of FedAvg's inadequate performance is the mismatch between the locally extracted features and the classifier. While current PFL methods mitigate this issue to some extent, their designs compromise the quality of the feature extractor, thus limiting the full potential of PFL. In this paper, we propose a new PFL framework called FedPFT that addresses the mismatch problem while enhancing the quality of the feature extractor. FedPFT integrates a feature transformation module, driven by personalized prompts, between the global feature extractor and classifier. In each round, clients first train prompts to transform local features to match the global classifier, followed by training model parameters. This approach also aligns the training objectives of clients, reducing the impact of data heterogeneity on model collaboration. Moreover, FedPFT's feature transformation module is highly scalable, allowing different prompts to tailor local features to various tasks. Leveraging this, we introduce a collaborative contrastive learning task to further refine feature extractor quality. Our experiments demonstrate that FedPFT outperforms state-of-the-art methods by up to 7.08%.
1 Introduction
Federated Learning (FL) allows all clients to train a global model collaboratively without sharing their raw data. A key challenge in FL is data heterogeneity, meaning the data across different clients is not independently and identically distributed (non-IID). This issue results in degraded performance of the global model trained in conventional FL methods such as FedAvg [27].
To address this issue, Personalized Federated Learning (PFL) has been proposed, which allows clients to train personalized models that better fit their local data distributions. Many current PFL methods achieve personalization by personalizing some parameters of the global model. For example, FedPer [2] personalizes classifiers, FedBN [20] personalizes BN layers, AlignFed [39] personalizes feature extractors, and FedCAC [35] selects parameters susceptible to non-IID effects for personalization.
Table 1: Linear probe accuracy (Probe Acc.), original model accuracy (Origin Acc.), and accuracy after linearly matching local features to the classifier (Match Acc.) on CIFAR-10 under two Dirichlet non-IID settings (left and right column groups).

| Methods | Probe Acc. | Origin Acc. | Match Acc. | Probe Acc. | Origin Acc. | Match Acc. |
|---|---|---|---|---|---|---|
| FedAvg | 72.52% | 59.66% | 72.60% | 68.38% | 60.33% | 68.37% |
| FedPer | 71.07% | 68.86% | 71.03% | 66.51% | 64.83% | 66.75% |
| FedBN | 70.15% | 66.20% | 70.60% | 66.51% | 62.97% | 66.80% |
| FedCAC | 71.56% | 68.71% | 71.63% | 66.98% | 64.90% | 67.11% |
| Ours | 77.25% | 77.06% | 77.68% | 74.02% | 73.88% | 74.75% |
Although the above methods demonstrate significant performance improvements over the global model, an interesting observation emerges from our experiments: the feature extractor derived from FedAvg outperforms those in most PFL methods. Specifically, we conduct linear probe experiments in which each client attaches a randomly initialized linear classifier (probe) to the feature extractor and retrains only this classifier. As evident from Table 1, the Probe Acc. of FedAvg exceeds that of the PFL methods, indicating that the features extracted by FedAvg exhibit superior linear separability. This suggests that FedAvg has greater potential to outperform PFL methods.
These findings prompt us to explore further why FedAvg underperforms on client-local data compared to PFL methods. To unveil this puzzle, we introduce a linear layer between the global feature extractor and the classifier on each client, training this layer with local data to align the features with the classifier. According to the Match Acc. in Table 1, applying a linear transformation to local features significantly improves accuracy over the original model (Origin Acc.), even exceeding the Origin Acc. of current PFL methods. This indicates that the fundamental reason for the global model's inadequate performance lies in the mismatch between local features and the global classifier.
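Both diagnostics are simple to reproduce. Below is a minimal PyTorch sketch of one training step for each: the linear probe (Probe Acc.) and the feature-classifier matching layer (Match Acc.). The dimensions, module names, and dummy tensors are illustrative assumptions; in the actual experiments the features come from each client's frozen feature extractor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy stand-ins: `feats` are d-dimensional features from a frozen
# feature extractor; `y` are the local labels.
d, num_classes = 128, 10
feats, y = torch.randn(64, d), torch.randint(0, num_classes, (64,))

# Probe Acc.: retrain only a randomly initialized linear classifier
# on top of the frozen features.
probe = nn.Linear(d, num_classes)
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
loss = F.cross_entropy(probe(feats), y)
opt.zero_grad(); loss.backward(); opt.step()

# Match Acc.: freeze the global classifier and train only a linear
# transformation that aligns local features with it.
classifier = nn.Linear(d, num_classes)
for p in classifier.parameters():
    p.requires_grad = False
match = nn.Linear(d, d)
opt = torch.optim.SGD(match.parameters(), lr=0.1)
loss = F.cross_entropy(classifier(match(feats)), y)
opt.zero_grad(); loss.backward(); opt.step()
```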
Further experiments with PFL methods demonstrate that while they somewhat mitigate the mismatch issue, their design inadvertently degrades the quality of the feature extractor, leading to a lower Match Acc. compared to FedAvg. More importantly, current PFL methods still face issues of mismatch. This problem not only diminishes model accuracy during inference but also affects the synergy between the feature extractor and the classifier during training, ultimately impacting the feature extractor’s quality. These observations suggest that significant untapped potential remains within PFL.
In PFL, targeted designs are imperative to tackle the mismatch problem during training and improve the quality of the feature extractor. Hence, we introduce a novel PFL method called FedPFT. Drawing inspiration from prompt technology [11], which utilizes prompts as inputs to guide a model’s behavior, FedPFT integrates a vision-prompt-driven feature transformation module between the global feature extractor and classifier. In each iteration, FedPFT initially trains prompts to guide local feature transformation to align with the global classifier. This process aligns the local features with the global feature space partitioned by the classifier, thereby achieving alignment of training objectives among clients. Subsequently, training the model parameters based on this alignment can alleviate the impact of non-IID data on client collaboration and enhance the quality of the feature extractor.
Furthermore, our proposed framework exhibits notable scalability. Clients’ local features can be transformed by different task-specific prompts to accommodate various tasks. Leveraging this capability, we introduce a collaborative contrastive learning task among clients to further enhance the quality of the feature extractor. As evidenced in Table 1, our method not only resolves the mismatch issue but also significantly improves the quality of the feature extractor.
Our main contributions can be summarized as follows:
- We identify that the root cause of the global model's inadequate performance is the mismatch between local features and the classifier. Personalizing some parameters improves performance precisely because it alleviates this mismatch. This provides a new perspective for future PFL approaches to better address the non-IID problem.
- We propose a new PFL framework, FedPFT, which incorporates a feature transformation module to align local features with the global classifier. This approach not only resolves the mismatch problem but also significantly enhances the performance of the feature extractor.
- Our experiments on multiple datasets and non-IID scenarios demonstrate the superiority of FedPFT, which outperforms state-of-the-art methods by up to 7.08%.
2 Related Work
PFL is an effective approach to addressing the challenges of non-IID data in FL. There has been a surge of methodologies within PFL, with parameter decoupling methods gaining significant attention due to their simplicity and effectiveness, making them one of the mainstream research directions in PFL. For a more detailed summary of other categories of PFL methods, please refer to Appendix A.
Parameter decoupling methods aim to decouple a subset of parameters from the global model for personalization. Approaches such as FedPer [2], FedRep [5], and GPFL [38] focus on personalizing the classifier. In contrast, methods like LG-FedAvg [22], and AlignFed [39] advocate for the personalization of the feature extractor. Additionally, FedBN [20] and MTFL [28] propose personalizing batch normalization (BN) layers within the feature extractor. Techniques employing deep reinforcement learning [31] or hypernetworks [26] have been used to determine which specific layers to personalize. The recent FedCAC [35] method advances this by introducing a metric for parameter-wise selection.
These decoupling methods help alleviate the mismatch issue within the global model by allowing local parameter adjustments. For instance, personalizing the classifier locally adjusts it to match the local features extracted by the global feature extractor. However, these methods do not completely resolve the mismatch issue during training. Personalizing parameters often reduces the extent of client information exchange, which can diminish the overall quality of the feature extractor, thus limiting the potential benefits of PFL.
3 Methodology
3.1 Overview of FedPFT
Before delving into the details of FedPFT, we first provide an overview, as illustrated in Figure 1(a). Each training round in FedPFT includes five key steps:
(1) Clients download the global models, which include the feature extractor $f$, feature transformation module $T$, classifier $h$, and feature projection head $\rho$. These serve to initialize the corresponding local models $f_i$, $T_i$, $h_i$, and $\rho_i$.
(2) Each client updates $f_i$, $T_i$, and $\rho_i$ to minimize the contrastive learning loss $\mathcal{L}_{CL}$, aiming to enhance the generalization of the feature extractor. It also updates its classification prompts $p_i^{cls}$ with the cross-entropy loss $\mathcal{L}_{CE}$ to align local features with the global classifier.
(3) Each client freezes the prompts $p_i^{cls}$ and trains $f_i$, $T_i$, and $h_i$ using $\mathcal{L}_{CE}$ to adapt the model to the classification task. It also makes the contrastive learning prompts $p_i^{con}$ trainable to align features with the contrastive learning task.
(4) Clients upload $f_i$, $T_i$, $h_i$, and $\rho_i$ to the server while retaining $p_i^{cls}$ and $p_i^{con}$ locally.
(5) The server aggregates the models uploaded by the clients.
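The round structure can be summarized in a few lines of schematic Python. The client API below (`load`, `step_contrastive`, `step_classification`, `export`) is hypothetical and only mirrors the five steps above; it is a sketch, not the actual implementation.

```python
def fedpft_round(server, clients, e_fl, e_ta):
    """One FedPFT communication round (schematic sketch)."""
    state = server.broadcast()                      # (1) send f, T, h, rho
    updates = []
    for client in clients:
        client.load(state)                          # init f_i, T_i, h_i, rho_i
        for _ in range(e_fl):                       # (2) feature learning phase
            client.step_contrastive(train=("f", "T", "rho"))   # L_CL updates backbone
            client.step_classification(train=("p_cls",))       # L_CE updates prompts only
        for _ in range(e_ta):                       # (3) task adaptation phase
            client.step_classification(train=("f", "T", "h"))  # p_cls stays frozen
            client.step_contrastive(train=("p_con",))          # align p_con with L_CL
        updates.append(client.export("f", "T", "h", "rho"))    # (4) prompts stay local
    server.aggregate(updates)                       # (5) FedAvg-style averaging
```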
3.2 Problem Formulation
In PFL, $N$ clients train their personalized models $\{w_i\}_{i=1}^{N}$ under the coordination of a server, aiming for each $w_i$ to perform well on the client's data distribution $\mathcal{D}_i$. This objective can be formalized as $\min_{\{w_i\}} \sum_{i=1}^{N} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\,\mathcal{L}_i(w_i; x, y)$, where $\mathcal{L}_i$ represents the loss function of the $i$-th client.
In this paper, our goal is to enhance personalized models by addressing the mismatch problem between local features and the classifier in the global model and by improving the quality of the feature extractor. Thus, the training objective of FedPFT can be formulated as:

$$\min_{f,\,T,\,h,\,\rho,\,\{p_i^{cls}\}} \; \sum_{i=1}^{N} \mathbb{E}_{(x,y)\sim\mathcal{D}_i} \Big[ \mathcal{L}_{CE}\big(h\big(T(f(x);\,p_i^{cls})\big),\,y\big) + \mathcal{L}_{CL} \Big], \tag{1}$$

where $f$ and $h$ represent the feature extractor and classifier of the global model, respectively. $T$ is the newly introduced global feature transformation module. This module, along with the classification prompts $p_i^{cls}$, transforms local features to align with the global classifier. $\mathcal{L}_{CE}$ denotes the cross-entropy loss for classification tasks, while $\mathcal{L}_{CL}$ represents the contrastive learning loss designed to enhance the feature extractor's quality. $\mathcal{D}_i$ denotes the local data of client $i$.
3.3 Feature Transformation Module
In FedPFT, we introduce a global feature transformation module $T$, along with a set of prompts $p_i$ for each client $i$, to align the features extracted by the feature extractor with the global classifier.
Formally, given a sample $x$, the feature obtained from the feature extractor $f$ is $z = f(x) \in \mathbb{R}^{d}$, where $d$ is the feature dimension. A collection of $n$ prompts is denoted as $p = \{p^{1}, \dots, p^{n}\}$ with each $p^{j} \in \mathbb{R}^{d}$. The operation of the feature transformation module is formulated as

$$\tilde{z} = T\big(\mathrm{stack}(z, p^{1}, \dots, p^{n})\big)_{0}, \tag{2}$$

where $\mathrm{stack}(\cdot)$ signifies stacking and concatenation along the sequence length dimension, yielding an input in $\mathbb{R}^{(n+1)\times d}$. The output $\tilde{z} \in \mathbb{R}^{d}$, taken at the position of $z$, represents the transformed feature. An example of the feature transformation module is illustrated in Figure 1(b).
The feature transformation module essentially adapts local features for downstream tasks, providing good scalability. We can introduce tasks beneficial for client collaboration using different task-specific prompts. Leveraging this, FedPFT additionally introduces a contrastive learning task and utilizes contrastive learning prompts $p_i^{con}$ for feature transformation. We denote by $n^{cls}$ and $n^{con}$ the numbers of prompts contained in $p_i^{cls}$ and $p_i^{con}$, respectively.
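A minimal PyTorch sketch of such a module is given below, assuming (as in Appendix F) that self-attention integrates the prompts with the sample feature. The single-block architecture, layer sizes, and class name are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class FeatureTransform(nn.Module):
    """Prompt-driven feature transformation (sketch)."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z, prompts):
        # z: (B, d) extracted features; prompts: (n, d) task-specific prompts.
        seq = torch.cat(
            [z.unsqueeze(1), prompts.unsqueeze(0).expand(z.size(0), -1, -1)],
            dim=1,
        )                                   # (B, n + 1, d) token sequence
        out, _ = self.attn(seq, seq, seq)   # prompts modulate the feature token
        out = self.norm(out + seq)          # residual connection + layer norm
        return out[:, 0]                    # transformed feature at z's position

# Usage: one shared module, different prompts per task.
T = FeatureTransform(dim=128)
p_cls = nn.Parameter(torch.randn(10, 128))  # e.g., n_cls = 10 classification prompts
z = torch.randn(32, 128)
z_tilde = T(z, p_cls)                       # features aligned to the classifier
```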
3.4 Classification Task with Personalized Prompts
The classifier is highly susceptible to the influence of non-IID data, leading to a mismatch between the global classifier and local features. Unlike previous methods, which personalize the classifier to match local features, we find that keeping a global classifier provides clients with a unified feature partition space. Aligning client features with this space not only solves the mismatch problem but also aligns training objectives among clients, reducing the impact of non-IID data on collaboration.
To implement this, we retain the global feature extractor and classifier while employing a set of personalized classification prompts $p_i^{cls}$ to transform each client $i$'s local features to better align with the global classifier. Specifically, the classification loss on each client is defined as:

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c, \qquad \hat{y} = \mathrm{softmax}\big(h\big(T(f(x);\,p_i^{cls})\big)\big), \tag{3}$$

where $C$ is the number of classes and $\hat{y}$ represents the predicted probabilities, with $\hat{y}_c$ being the one of class $c$. Details on coordinating the training of the model and prompts to achieve feature-classifier alignment are discussed in Section 3.6.
3.5 Contrastive Learning Task
Contrastive learning tasks have shown robustness to the challenges posed by non-IID data distributions [34]. To further enhance the quality of the model's feature extractor, we introduce a contrastive learning task using the Momentum Contrast (MoCo) [8] framework. The associated contrastive loss function is defined as:

$$\mathcal{L}_{CL} = -\log \frac{\exp(q \cdot k^{+} / \tau)}{\exp(q \cdot k^{+} / \tau) + \sum_{j=1}^{K} \exp(q \cdot k_j / \tau)}, \tag{4}$$

where $\rho$ is the projection head used for contrastive learning. In this formula, $q$ represents the query vector, and $k^{+}$ denotes the positive key vector. Here, $x^{q}$ and $x^{k}$ are augmented versions of the sample $x$, and $f'$ and $\rho'$ refer to the momentum-updated encoder and projection head, respectively. $\tau$ is a temperature hyperparameter, and $K$ is the number of negative samples drawn from MoCo's queue, comprising the set $\{k_1, \dots, k_K\}$.
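For concreteness, a sketch of this InfoNCE-style loss in PyTorch follows. The tensor shapes are assumptions, and all vectors are assumed to be L2-normalized, as in MoCo.

```python
import torch
import torch.nn.functional as F

def moco_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE loss in the MoCo style (sketch).

    q:     (B, d) queries from the online encoder + projection head
    k_pos: (B, d) positive keys from the momentum encoder
    queue: (K, d) negative keys from MoCo's queue
    """
    l_pos = torch.einsum("bd,bd->b", q, k_pos).unsqueeze(1)  # (B, 1) positive logits
    l_neg = torch.einsum("bd,kd->bk", q, queue)              # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)        # positive is index 0
    return F.cross_entropy(logits, labels)

# Example with random (normalized) vectors.
q = F.normalize(torch.randn(32, 128), dim=1)
k = F.normalize(torch.randn(32, 128), dim=1)
queue = F.normalize(torch.randn(4096, 128), dim=1)
loss = moco_loss(q, k, queue)
```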
3.6 Alternating Training Strategy
To effectively coordinate the training of different modules in FedPFT, we propose an alternating training strategy, which partitions each local training round into two distinct phases: the feature learning phase and the task adaptation phase.
Feature Learning Phase.
In this phase, the training objective can be formulated as

$$\min_{f_i,\,T_i,\,\rho_i} \mathcal{L}_{CL} \;+\; \min_{p_i^{cls}} \mathcal{L}_{CE}. \tag{5}$$

$\mathcal{L}_{CE}$ trains the classification prompts $p_i^{cls}$, aligning local features with the global classifier, while $\mathcal{L}_{CL}$ is aimed at training the feature extractor to derive general feature representations and mitigate the impact of non-IID data. Notably, during this phase, the feature extractor is exclusively updated by $\mathcal{L}_{CL}$.
Task Adaptation Phase.
Following the above phase, this phase refines the previously learned features for the classification task and further enhances the classifier. The training objective is

$$\min_{f_i,\,T_i,\,h_i} \mathcal{L}_{CE} \;+\; \min_{p_i^{con}} \mathcal{L}_{CL}. \tag{6}$$

The second term in Eq. (6) focuses on training the contrastive learning prompts $p_i^{con}$, aiming to align the features with the contrastive learning task and mitigate the impact of the classification task on the contrastive learning task. In this phase, the feature extractor is updated solely by $\mathcal{L}_{CE}$.
Let $E$ represent the total number of local epochs in one training round. We divide $E$ into $E_{fl}$ epochs for the feature learning phase and $E_{ta}$ epochs for the task adaptation phase, where $E = E_{fl} + E_{ta}$. It is crucial that $E_{fl}$ is always larger than $E_{ta}$ to ensure: 1) the feature extractor is predominantly trained by the contrastive learning task, reducing the impact of the non-IID problem on collaboration in feature extraction; 2) improved alignment of the features used for classification on the client side with the global classifier, thereby achieving better alignment of training objectives across clients.
Upon completing local training, the parameters of $f_i$, $T_i$, $h_i$, and $\rho_i$ are aggregated at the server to foster client collaboration, while $p_i^{cls}$ and $p_i^{con}$ remain local. We simply adopt the aggregation method used in FedAvg. The pseudo-code of FedPFT is summarized in Algorithm 1.
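Since the aggregation is FedAvg-style, a short sketch suffices. Weighting by local sample counts is the usual FedAvg convention and an assumption here.

```python
import torch

def fedavg_aggregate(client_states, weights):
    """Weighted average of the shared modules (f, T, h, rho).

    `client_states`: list of state_dicts containing only the shared
    parameters; personalized prompts are excluded before upload.
    `weights`: e.g., each client's number of local samples.
    """
    total = float(sum(weights))
    return {
        key: sum(w * s[key].float() for s, w in zip(client_states, weights)) / total
        for key in client_states[0]
    }
```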
4 Experiments
4.1 Experimental Setup
Datasets.
We employ three datasets for experimental validation: CIFAR-10 [14], CIFAR-100 [13], and Tiny ImageNet [15]. We utilize two scenarios: Dirichlet non-IID and Pathological non-IID.
In our experiments, each client is assigned 500 training samples. For CIFAR-10 and CIFAR-100 datasets, each client has 100 test samples; for the Tiny ImageNet dataset, each client has 200 test samples. Both training and test data have the same label distribution.
Baseline Methods. We compare our method against nine state-of-the-art (SOTA) methods: FedAMP [10], FedPer [2], FedRep [5], FedBN [20], FedRoD [4], pFedSD [12], pFedGate [3], FedCAC [35], and pFedPT [17]. These methods cover the advancements in mainstream PFL research directions.
Hyperparameter Settings. For the general FL hyperparameters, we fix the number of clients $N$, the batch size $B$, and the number of local update epochs $E$ across all experiments. Across all datasets, we set the total number of communication rounds large enough to ensure convergence and select the highest average accuracy achieved by all clients across all rounds as the result. Each experiment is repeated with three random seeds, and the mean and standard deviation are reported. We employ the ResNet [9] model architecture, specifically ResNet-8 for CIFAR-10 and ResNet-10 for CIFAR-100 and Tiny ImageNet.
For more details on the experimental setup, please refer to Appendix B.
4.2 Comparison with State-of-the-art Methods
Table 2: Test accuracy (%) on CIFAR-100 and Tiny ImageNet in the Dirichlet non-IID scenario under three $\alpha$ settings ($\alpha_1$-$\alpha_3$; mean ± std over three seeds).

| Methods | CIFAR-100 ($\alpha_1$) | CIFAR-100 ($\alpha_2$) | CIFAR-100 ($\alpha_3$) | Tiny ImageNet ($\alpha_1$) | Tiny ImageNet ($\alpha_2$) | Tiny ImageNet ($\alpha_3$) |
|---|---|---|---|---|---|---|
| FedAvg | 34.91±0.86 | 32.78±0.23 | 33.94±0.39 | 21.26±1.28 | 20.32±0.91 | 17.20±0.54 |
| Local | 47.61±0.96 | 22.65±0.51 | 18.76±0.63 | 24.07±0.62 | 8.75±0.30 | 6.87±0.28 |
| FedAMP | 46.68±1.06 | 24.74±0.58 | 18.22±0.41 | 27.85±0.71 | 10.70±0.32 | 7.13±0.21 |
| FedPer | 51.38±0.94 | 28.25±1.03 | 21.53±0.50 | 32.33±0.31 | 12.69±0.42 | 8.67±0.40 |
| FedRep | 51.25±1.37 | 26.97±0.33 | 20.63±0.42 | 30.83±1.05 | 12.14±0.28 | 8.37±0.25 |
| FedBN | 54.35±0.63 | 36.94±0.94 | 33.67±0.12 | 33.34±0.71 | 19.61±0.35 | 16.57±0.44 |
| FedRoD | 60.17±0.48 | 39.88±1.18 | 36.80±0.56 | 41.06±0.77 | 25.63±1.11 | 22.32±1.13 |
| pFedSD | 54.14±0.77 | 41.06±0.83 | 38.27±0.20 | 39.31±0.19 | 19.25±1.80 | 15.91±0.33 |
| pFedGate | 48.54±0.39 | 27.47±0.79 | 22.98±0.03 | 37.59±0.39 | 24.09±0.67 | 19.69±0.14 |
| FedCAC | 57.22±1.52 | 38.64±0.63 | 32.59±0.32 | 40.19±1.20 | 23.70±0.28 | 18.58±0.62 |
| pFedPT | 43.21±1.66 | 35.23±0.87 | 36.25±0.37 | 23.55±0.68 | 22.35±0.49 | 21.69±0.24 |
| FedPFT w/o $\mathcal{L}_{CL}$ | 60.98±0.39 | 44.87±0.76 | 41.83±0.37 | 41.49±0.10 | 28.61±0.40 | 25.10±0.59 |
| FedPFT | 62.03±1.41 | 47.98±0.78 | 44.29±0.74 | 43.42±1.62 | 32.44±0.58 | 27.84±0.41 |
In this section, we compare our proposed FedPFT with two baseline methods and nine SOTA methods across three datasets and two non-IID scenarios. We also introduce 'FedPFT w/o $\mathcal{L}_{CL}$,' which solely addresses the mismatch problem without contrastive learning. The experimental results on CIFAR-100 and Tiny ImageNet in the Dirichlet non-IID scenario are presented in Table 2. Please refer to Appendix C for experimental results in Pathological non-IID scenarios and on the CIFAR-10 dataset.
Results in Dirichlet non-IID scenario.
In this setting, by varying $\alpha$, we can evaluate the performance of methods under different non-IID degrees. The results, as detailed in Table 2, demonstrate that performance varies significantly depending on the underlying design principles of each method. Among all methods, FedRoD demonstrates robust performance across all datasets and non-IID degrees. This is attributed to its design of two classifiers: a personalized classifier for local feature alignment and a global classifier that enlists assistance from other clients to improve generalization. 'FedPFT w/o $\mathcal{L}_{CL}$' addresses the mismatch issue specifically and achieves competitive or superior results across all scenarios. FedPFT further improves feature extractor quality and outperforms SOTA methods significantly across all scenarios, achieving up to a 7.08% improvement.
4.3 Ablation Study
In this section, we validate the effectiveness of each component of FedPFT on the CIFAR-100 dataset under two non-IID degrees. The experimental results are illustrated in Table 3.
Table 3: Ablation study on CIFAR-100 under two non-IID degrees ($\alpha_1$ and $\alpha_2$, as in Table 2).

| Setting | $p^{cls}$ | Alter. | $\mathcal{L}_{CL}$ | $p^{con}$ | Accuracy (%), $\alpha_1$ | Accuracy (%), $\alpha_2$ |
|---|---|---|---|---|---|---|
| I | | | | | 33.87±1.35 | 30.09±0.31 |
| II | ✓ | | | | 40.97±1.28 | 31.45±1.35 |
| III | ✓ | ✓ | | | 60.98±0.39 | 44.87±0.76 |
| IV | ✓ | ✓ | ✓ | | 61.13±0.50 | 47.67±1.42 |
| V | ✓ | ✓ | ✓ | ✓ | 62.03±1.41 | 47.98±0.78 |
| VI | | | ✓ | | 36.24±1.10 | 34.70±1.33 |
| VII | ✓ | | ✓ | | 53.17±0.58 | 38.90±0.91 |
| VIII | ✓ | | ✓ | ✓ | 53.76±0.35 | 39.29±1.00 |
Setting I represents FedAvg. Setting II incorporates classification prompts $p^{cls}$, allowing each client to individually adjust the global model to obtain a personalized model, resulting in a performance improvement. Setting III incorporates alternating training, where prompts are first updated to align local features with the global classifier, followed by training the model parameters. This approach essentially aligns training objectives among clients, effectively mitigating the impact of non-IID data on model collaboration and thus further enhancing the quality of the global model.
Setting IV adds the contrastive learning loss $\mathcal{L}_{CL}$ to Setting III, focusing primarily on enhancing the feature extractor's performance through contrastive learning. Setting V incorporates specific prompts $p^{con}$ for the contrastive learning task. This reduces mutual interference between the two tasks during training and is especially effective when the degree of non-IID is strong.
Setting VI illustrates that adding contrastive learning alone brings very limited improvements. Settings VII and VIII partially achieve feature-classifier alignment by introducing $p^{cls}$, greatly enhancing model performance. However, without alternating training, local features cannot adapt well to the classifier. This leads to a significant performance gap between Settings VII and VIII and Setting V.
This ablation study underlines the importance of each module in FedPFT. It confirms that aligning local features with the global classifier and enhancing the feature extractor’s quality are both crucial for optimizing model performance, aligning with the core motivations behind our methodology.
4.4 Separability of Features
In this section, we assess the effectiveness of FedPFT in enhancing the quality of the feature extractor by conducting linear probing experiments on CIFAR-10 and CIFAR-100. The results are shown in Table 4. Higher accuracy means that the extracted features have better linear separability.
Compared to FedAvg, features extracted by 'FedPFT w/o $\mathcal{L}_{CL}$' demonstrate superior linear separability. This improvement is attributable to the alignment of local features with the global classifier in the feature learning phase, which synchronizes client training objectives and mitigates the adverse effects of non-IID data. FedPFT further improves the quality of the feature extractor by integrating a collaborative contrastive learning task. For more experimental results, please refer to Appendix D.
Table 4: Linear probe accuracy of FedPFT variants on CIFAR-10 and CIFAR-100 under three Dirichlet $\alpha$ settings.

| Methods | CIFAR-10 ($\alpha_1$) | CIFAR-10 ($\alpha_2$) | CIFAR-10 ($\alpha_3$) | CIFAR-100 ($\alpha_1$) | CIFAR-100 ($\alpha_2$) | CIFAR-100 ($\alpha_3$) |
|---|---|---|---|---|---|---|
| FedAvg | 85.01% | 72.52% | 68.38% | 59.50% | 37.40% | 32.33% |
| FedPFT w/o $\mathcal{L}_{CL}$ | 85.52% | 72.59% | 69.57% | 61.60% | 43.14% | 38.47% |
| FedPFT | 87.83% | 77.25% | 74.02% | 64.12% | 46.43% | 40.95% |
4.5 Learned Features of Different Methods
In this subsection, we visually compare the quality of features extracted by different methods and highlight the impact of different modules in FedPFT on feature extraction. We conduct experiments on the CIFAR-10 dataset with 10 clients, each allocated 1000 training images and 500 testing images. The data distribution is shown in Figure 2(a). For each method, we visualize the feature vectors of testing data from different clients using t-SNE [33]. The visualization results are depicted in Figure 2(b)-(h), where colors represent different data classes and markers represent different clients, as detailed in Figure 2(a).
FedAvg and FedCAC exhibit noticeable cluster structures of features but lack strong discriminative boundaries. FedPer displays overlapping features across various classes, attributable to the use of personalized classifiers that create different local feature spaces for each client. Consequently, data from different classes across different clients are mapped to similar positions. This interference between clients reduces the quality of the global feature extractor.
'FedPFT w/o $\mathcal{L}_{CL}$' shows clearer discriminative boundaries, which is attributed to the alignment of local features with the global classifier achieved during local training. We also observe that data from the same class across different clients are mapped to the same positions in the feature space, indicating that the global classifier provides a unified feature space for all clients. Adapting local features to this space essentially aligns the training objectives among clients in non-IID scenarios, promoting collaboration. FedPFT further enhances feature separability by incorporating $\mathcal{L}_{CL}$.
'FedPFT w/o Alter.' represents FedPFT without alternating training. While it shows better clustering than FedAvg, the discriminative quality of its boundaries is weaker compared to 'FedPFT w/o $\mathcal{L}_{CL}$.' This configuration shows increased interference among client models, lacking alignment to the common global feature space. 'FedPFT_p_classifier' indicates using personalized classifiers. In this case, the feature space becomes highly scattered, similar to FedPer's issue. Since prompts are trained to adapt to the personalized classifiers first, this exacerbates the variability in feature spaces across clients.
4.6 Effect of Different Prompts
In this section, we delve into the role of prompts in FedPFT. We visualize the features transformed by different prompts using t-SNE. The experimental setup is consistent with Section 4.5. The results are depicted in Figure 3. Larger markers in the figures represent feature centroids of the corresponding classes for each client.
It is evident that features obtained from classification prompts are not significantly correlated with image similarity but rather with the distribution of client data; for example, the centroids of two classes belonging to the same client lie close together. Conversely, features transformed by contrastive learning prompts are more related to image similarity. For instance, in Figure 3(b), the feature centroids of 'cat' and 'dog' are closer, as are 'truck' and 'automobile,' which aligns with the principles of contrastive learning.
Table 5: Linear probe accuracy with features transformed by different prompt types.

| Prompt Type | CIFAR-10 ($\alpha_1$) | CIFAR-10 ($\alpha_2$) | CIFAR-10 ($\alpha_3$) | CIFAR-100 ($\alpha_1$) | CIFAR-100 ($\alpha_2$) | CIFAR-100 ($\alpha_3$) |
|---|---|---|---|---|---|---|
| None | 87.69% | 77.12% | 73.93% | 64.08% | 46.50% | 40.79% |
| $p^{cls}$ | 87.83% | 77.25% | 74.02% | 64.12% | 46.43% | 40.95% |
| $p^{con}$ | 87.82% | 77.25% | 74.02% | 64.18% | 46.40% | 40.95% |
We also investigate whether different types of prompts influence feature separability. We conduct linear probe experiments on the CIFAR-10 and CIFAR-100 datasets; the results are detailed in Table 5. In these experiments, we compare three conditions: 'None' (no prompts used), '$p^{cls}$' (using classification prompts), and '$p^{con}$' (using contrastive learning prompts). Interestingly, the accuracies across the prompt conditions are generally similar, suggesting that the use of either type of prompt does not significantly impact the overall quality of the extracted features.
The above experiments demonstrate that prompts work by transforming features into the form required by each downstream task. This also indicates the scalability and adaptability of our feature transformation module: it can incorporate various client-collaborative tasks beneficial for enhancing the performance of personalized models through task-specific prompts.
5 Conclusion and Discussion
We observe that the feature extractor from FedAvg surpasses those in most PFL methods, yet FedAvg suffers from inadequate performance due to a mismatch between the local features and the classifier. This mismatch not only impacts performance during model inference but also affects the synergy between the feature extractor and the classifier during training. We propose a new PFL method called FedPFT with a prompt-driven feature transformation module to address these issues during training. Our experiments demonstrate that FedPFT not only resolves the mismatch issue but also significantly improves the quality of the feature extractor, achieving substantial performance gains compared to state-of-the-art methods. We discuss the limitations and our future work in Appendix J.
References
- [1] D. A. E. Acar, Y. Zhao, R. Zhu, R. Matas, M. Mattina, P. Whatmough, and V. Saligrama. Debiasing model updates for improving personalized federated training. In International Conference on Machine Learning, pages 21–31. PMLR, 2021.
- [2] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
- [3] D. Chen, L. Yao, D. Gao, B. Ding, and Y. Li. Efficient personalized federated learning via sparse model-adaptation. arXiv preprint arXiv:2305.02776, 2023.
- [4] H.-Y. Chen and W.-L. Chao. On bridging generic and personalized federated learning for image classification. In International Conference on Learning Representations, 2022.
- [5] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pages 2089–2099. PMLR, 2021.
- [6] W. Deng, C. Thrampoulidis, and X. Li. Unlocking the potential of prompt-tuning in bridging generalized and personalized federated learning. arXiv e-prints, 2023.
- [7] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
- [8] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [10] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang. Personalized cross-silo federated learning on non-IID data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
- [11] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim. Visual prompt tuning. In European Conference on Computer Vision, 2022.
- [12] H. Jin, D. Bai, D. Yao, Y. Dai, L. Gu, C. Yu, and L. Sun. Personalized edge intelligence via federated self-knowledge distillation. IEEE Transactions on Parallel and Distributed Systems, 34(2):567–580, 2022.
- [13] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [14] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/kriz/cifar.html, 2010.
- [15] Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
- [16] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- [17] G. Li, W. Wu, Y. Sun, L. Shen, B. Wu, and D. Tao. Visual prompt based personalized federated learning. Transactions on Machine Learning Research, 2023.
- [18] H. Li, W. Huang, J. Wang, and Y. Shi. Global and local prompts cooperation via optimal transport for federated learning. arXiv preprint arXiv:2403.00041, 2024.
- [19] T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
- [20] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou. FedBN: Federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations, 2021.
- [21] Z. Li, X. Shang, R. He, T. Lin, and C. Wu. No fear of classifier biases: Neural collapse inspired federated learning with synthetic and fixed classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5319–5329, 2023.
- [22] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
- [23] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [24] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
- [25] J. Luo and S. Wu. Adapt to adaptation: Learning personalization for cross-silo federated learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2166–2173, 2022.
- [26] X. Ma, J. Zhang, S. Guo, and W. Xu. Layer-wised model aggregation for personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10092–10101, 2022.
- [27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
- [28] J. Mills, J. Hu, and G. Min. Multi-task federated learning for personalised deep neural networks in edge computing. IEEE Transactions on Parallel and Distributed Systems, 33(3):630–641, 2021.
- [29] M. Shi, Y. Zhou, K. Wang, H. Zhang, S. Huang, Q. Ye, and J. Lv. PRIOR: Personalized prior for reactivating the information overlooked in federated learning. In Advances in Neural Information Processing Systems, volume 36, pages 28378–28392. Curran Associates, Inc., 2023.
- [30] S. Su, M. Yang, B. Li, and X. Xue. Federated adaptive prompt tuning for multi-domain collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15117–15125, 2024.
- [31] B. Sun, H. Huo, Y. Yang, and B. Bai. PartialFed: Cross-domain personalized federated learning via partial initialization. Advances in Neural Information Processing Systems, 34:23309–23320, 2021.
- [32] C. T Dinh, N. Tran, and J. Nguyen. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
- [33] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- [34] L. Wang, K. Zhang, Y. Li, Y. Tian, and R. Tedrake. Does learning from decentralized non-IID unlabeled data benefit from self supervision? In The Eleventh International Conference on Learning Representations, 2023.
- [35] X. Wu, X. Liu, J. Niu, G. Zhu, and S. Tang. Bold but cautious: Unlocking the potential of personalized federated learning through cautiously aggressive collaboration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19375–19384, October 2023.
- [36] X. Wu, J. Niu, X. Liu, T. Ren, Z. Huang, and Z. Li. pFedGF: Enabling personalized federated learning via gradient fusion. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 639–649. IEEE, 2022.
- [37] F.-E. Yang, C.-Y. Wang, and Y.-C. F. Wang. Efficient model personalization in federated learning via client-specific prompt generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19159–19168, 2023.
- [38] J. Zhang, Y. Hua, H. Wang, T. Song, Z. Xue, R. Ma, J. Cao, and H. Guan. GPFL: Simultaneously learning global and personalized feature information for personalized federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5041–5051, October 2023.
- [39] G. Zhu, X. Liu, S. Tang, and J. Niu. Aligning before aggregating: Enabling communication efficient cross-domain federated learning via consistent feature extraction. IEEE Transactions on Mobile Computing, 23(5):5880–5896, 2024.
Appendix A Related Work
Current PFL methods can primarily be categorized into several major types: meta-learning-based methods [7; 1], model-regularization-based methods [32; 19], fine-tuning-based methods [12; 3; 21], personalized-weight-aggregation-based methods [10; 25], and parameter-decoupling-based methods. This paper delves into the issues inherent in the global model of FedAvg and primarily discusses parameter-decoupling methods that rely on the global model.
In addition to the aforementioned methods, a new category based on prompts has recently emerged.
Prompt-based methods.
Recently, prompt technology has garnered widespread attention in the fields of computer vision [11; 23] and natural language processing [16; 24]. This technology involves using prompts as inputs to guide the behavior or output of models, typically for fine-tuning purposes. The domain of PFL has also seen the emergence of prompt-based approaches. Most of these are based on pre-trained models, aiming to train prompts to fine-tune the pre-trained models to fit client-local data, as seen in pFedPG [37], SGPT [6], FedOTP [18], and FedAPT [30]. pFedPT [17] trains both the model and prompts, using prompts at the input level to learn personalized knowledge for fine-tuning the global model to adapt to the client’s local distributions. Our FedPFT fundamentally differs from these methods in its training objective. Rather than fine-tuning, we introduce prompts to guide feature transformations to align with the global classifier, thereby addressing the mismatch issue inherent in the global model during the training process.
Appendix B Experiment Setup
B.1 Introduction to non-IID Scenarios
Pathological non-IID.
In this setting, each client is randomly assigned data from a subset of classes with equal data volume per class. For the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, we assign 2, 20, and 40 classes of data to each client, respectively.
Dirichlet non-IID.
This is a commonly used setting in current FL research [36; 35; 29]. In this scenario, the data for each client is generated from a Dirichlet distribution $\mathrm{Dir}(\alpha)$. As the value of $\alpha$ increases, the class imbalance within each client's dataset progressively decreases. This Dirichlet non-IID setting enables the evaluation of different methods across a broad spectrum of non-IID conditions, reflecting various degrees of data heterogeneity; a common construction is sketched below.
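The following NumPy sketch shows the standard per-class Dirichlet partition that this setting typically refers to; the function name and seeding are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with a per-class Dir(alpha) prior."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's client proportions, then split accordingly.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for i, part in enumerate(np.split(idx, cuts)):
            client_idx[i].extend(part.tolist())
    return client_idx

# Smaller alpha -> more skewed per-client class distributions.
labels = np.random.randint(0, 10, size=5000)
parts = dirichlet_partition(labels, n_clients=20, alpha=0.1)
```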
For a clearer, more intuitive understanding, we use 20 clients with 10-class and 50-class datasets to visualize the data distribution among clients under varying values of $\alpha$. As depicted in Figure 4, the horizontal axis labels the data class indices, while the vertical axis lists the client IDs. Each red dot indicates the class data assigned to a client, with larger dots signifying a higher volume of data in that class.
B.2 Introduction to Comparative Methods
FedAMP [10] is a weighted-aggregation-based method in which clients with similar data distributions are given higher aggregation weights during model aggregation. Because it mainly encourages collaboration among clients with similar data distributions, it is, by design, a method that focuses on clients' local data distributions. FedPer [2], FedRep [5], FedBN [20], FedRoD [4], and FedCAC [35] are parameter-decoupling-based methods, which personalize the global model by retaining certain parameters locally based on FedAvg. FedRoD additionally introduces a balanced global classifier to obtain assistance from other clients, alleviating the overfitting issue caused by personalized classifiers alone. pFedSD [12] and pFedGate [3] are fine-tuning-based methods that adapt the global model to local data through fine-tuning: pFedSD fine-tunes the global model by distilling local models, while pFedGate trains an additional gating network and applies it to the global model. pFedPT [17], a prompt-based method, can also be viewed as a fine-tuning approach, enhancing the global model's adaptation to local data distributions by adding prompts to images.
B.3 Hyperparameter Settings in Different Methods
For the unique hyperparameters of each baseline method, we utilize the optimal parameter combinations reported in their respective papers. For learning rates, we adjust within {1e-1, 1e-2, 1e-3}.
In FedPFT, to simplify the hyperparameter tuning process and enhance the method's usability, we provide a default set of hyperparameters shared across all scenarios. We use the SGD optimizer, with a learning rate of 0.01 for the feature transformation module and 0.1 for all other modules. The epoch split between the two training phases uses one default value in the Dirichlet non-IID scenario and another in the remaining scenarios. For the contrastive learning algorithm, we adopt the default settings from MoCo. In 'FedPFT w/o $\mathcal{L}_{CL}$,' we set the learning rate of the feature transformation module to 0.05 while keeping other hyperparameters the same as FedPFT. Unless otherwise specified, our experiments use these hyperparameter settings, although fine-tuning them for different scenarios may yield better performance.
B.4 Compute Resources
All experiments are implemented using PyTorch and conducted on NVIDIA V100 GPUs. For the methods we compare, as well as 'FedPFT w/o $\mathcal{L}_{CL}$,' a single training session requires 24–48 hours. For FedPFT, training takes longer due to the use of the MoCo algorithm, whose data augmentation can only be executed on the CPU. Consequently, a single training session for FedPFT requires 48–72 hours.
Appendix C Comparison with State-of-the-art Methods
We present the comparative results of FedPFT against established methods on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets under Pathological non-IID scenarios, as well as on CIFAR-10 under Dirichlet non-IID scenarios, in Tables 6 and 7.
Table 6: Test accuracy (%) in the Pathological non-IID scenario.

| Methods | CIFAR-10 | CIFAR-100 | Tiny ImageNet |
|---|---|---|---|
| FedAvg | 54.33±3.03 | 34.27±0.44 | 18.05±0.23 |
| Local | 85.85±0.93 | 38.40±0.69 | 16.20±0.30 |
| FedAMP | 88.88±0.83 | 38.36±0.79 | 16.13±0.55 |
| FedPer | 87.51±0.95 | 41.54±0.74 | 20.25±0.65 |
| FedRep | 87.10±0.91 | 40.63±0.74 | 19.24±0.33 |
| FedBN | 87.02±1.41 | 47.75±1.03 | 24.91±0.48 |
| FedRoD | 88.06±1.70 | 52.55±0.92 | 32.25±0.80 |
| pFedSD | 89.97±1.45 | 52.30±1.18 | 30.27±0.78 |
| pFedGate | 89.15±0.76 | 43.73±0.14 | 22.42±0.83 |
| FedCAC | 89.77±1.14 | 49.07±0.87 | 30.83±0.42 |
| pFedPT | 86.29±1.11 | 39.92±0.33 | 21.38±0.98 |
| FedPFT w/o $\mathcal{L}_{CL}$ | 89.67±1.96 | 57.62±1.18 | 36.13±1.32 |
| FedPFT | 90.55±1.35 | 58.14±0.71 | 37.59±0.39 |
Table 7: Test accuracy (%) on CIFAR-10 in the Dirichlet non-IID scenario under three $\alpha$ settings.

| Methods | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| FedAvg | 60.39±1.46 | 60.41±1.36 | 60.91±0.72 |
| Local | 81.91±3.09 | 60.15±0.86 | 52.24±0.41 |
| FedAMP | 84.99±1.82 | 68.26±0.79 | 64.87±0.95 |
| FedPer | 84.43±0.47 | 68.80±0.49 | 64.92±0.66 |
| FedRep | 84.59±1.58 | 67.69±0.86 | 60.52±0.72 |
| FedBN | 83.55±2.32 | 66.79±1.08 | 62.20±0.67 |
| FedRoD | 86.23±2.12 | 72.34±1.77 | 68.45±1.94 |
| pFedSD | 86.34±2.61 | 71.97±2.07 | 67.21±1.89 |
| pFedGate | 87.25±1.91 | 71.98±1.61 | 67.85±0.87 |
| FedCAC | 86.82±1.18 | 69.83±0.46 | 65.39±0.51 |
| pFedPT | 82.38±2.91 | 67.33±1.33 | 64.37±1.22 |
| FedPFT w/o $\mathcal{L}_{CL}$ | 87.23±2.69 | 74.10±1.95 | 69.23±0.76 |
| FedPFT | 88.60±2.19 | 77.54±1.88 | 74.81±0.77 |
Results in Pathological non-IID scenario.
This is an extreme setting where each client has data from only a subset of classes. The effect is particularly pronounced on the CIFAR-10 dataset, where each client essentially performs a simple binary classification task. Here, clients can achieve decent performance by solely focusing on their local tasks ('Local'), even without collaborating with other clients. As such, methods that prioritize local data distribution, such as FedAMP, pFedSD, and pFedGate, perform well. In contrast, on the CIFAR-100 and Tiny ImageNet datasets, where clients have more local classes with fewer samples per class, local tasks become more challenging and effective collaboration with other clients becomes crucial. Consequently, methods such as FedRoD, which emphasize client collaboration, perform increasingly well, while FedAMP and pFedGate show considerable performance degradation. FedPer, FedRep, FedBN, and FedCAC, by personalizing certain parameters of FedAvg, enhance local performance by indirectly aligning local features with classifiers to some extent. However, as they do not address the mismatch issue, they compromise the performance of the feature extractor, limiting their performance to a moderate level across the three datasets. 'FedPFT w/o $\mathcal{L}_{CL}$' aligns local features with the global feature space using classification prompts, enhancing both local feature-classifier alignment and inter-client collaboration. It achieves competitive performance on CIFAR-10 and surpasses existing SOTA methods on CIFAR-100 and Tiny ImageNet. FedPFT further incorporates the contrastive learning task to enhance feature extractor performance, outperforming SOTA methods significantly across all datasets.
Appendix D Feature Separability of Different Methods
Table 8: Linear probe accuracy of different methods on CIFAR-10 and CIFAR-100 under three Dirichlet $\alpha$ settings.

| Methods | CIFAR-10 ($\alpha_1$) | CIFAR-10 ($\alpha_2$) | CIFAR-10 ($\alpha_3$) | CIFAR-100 ($\alpha_1$) | CIFAR-100 ($\alpha_2$) | CIFAR-100 ($\alpha_3$) |
|---|---|---|---|---|---|---|
| FedAvg | 85.01% | 72.52% | 68.38% | 59.50% | 37.40% | 32.33% |
| FedPer | 84.44% | 71.07% | 66.51% | 52.09% | 26.61% | 20.51% |
| FedBN | 84.52% | 70.15% | 66.51% | 57.86% | 35.24% | 30.28% |
| FedCAC | 85.22% | 71.56% | 66.98% | 56.86% | 34.64% | 29.35% |
| FedRoD | 82.79% | 67.07% | 63.12% | 56.88% | 33.99% | 29.22% |
| pFedSD | 85.86% | 72.42% | 68.12% | 60.07% | 37.33% | 31.99% |
| FedPFT w/o $\mathcal{L}_{CL}$ | 85.52% | 72.59% | 69.57% | 61.60% | 43.14% | 38.47% |
| FedPFT | 87.83% | 77.25% | 74.02% | 64.12% | 46.43% | 40.95% |
In this section, we delve deeper into the linear separability of features extracted by various PFL methods. Linear separability is a critical measure of feature quality, indicating the ability of a model to distinguish between classes using simple linear classifiers. We conduct linear probing experiments on the CIFAR-10 and CIFAR-100 datasets to assess this metric, with results detailed in Table 8.
It can be observed that the linear separability of features from most PFL methods is inferior to that of FedAvg. This indicates that although they partially alleviate the mismatch issue and achieve better model performance, the quality of the feature extractor is inevitably compromised by their design, constraining the full potential of PFL.
In stark contrast, FedPFT significantly improves the linear separability of features compared to FedAvg. Our method accomplishes this by fundamentally addressing the mismatch issue during the training process rather than merely adapting the model post hoc. This proactive approach ensures that the feature extractor not only aligns more closely with the global classifier but also preserves its ability to generalize across diverse data distributions. Consequently, FedPFT enhances both the performance and the utility of the feature extractor.
Appendix E Comparison with Two-stage Approach
In FedPFT, we propose using a feature transformation module to coordinate the joint training of the contrastive learning and classification tasks. To illustrate the superiority of this design, we introduce a baseline called 'Two-stage,' similar to [34], in which contrastive learning is conducted first and classification training follows after convergence. For fairness, in the Two-stage method we first perform 1000 rounds of contrastive learning training, followed by 1000 rounds of classification task training. The experimental results are depicted in Figure 5.
Figure 5: Comparison between FedPFT and the Two-stage baseline: (a) contrastive learning loss curves; (b) accuracy (%) — Two-stage: 53.43 / 43.87, Ours: 62.03 / 47.98; (c) classification training loss curves.
Firstly, from the perspective of the contrastive learning loss ($\mathcal{L}_{CL}$), FedPFT registers lower loss values than the Two-stage approach, suggesting that simultaneous training with the classification task enhances the efficacy of contrastive learning. Secondly, considering both Figure 5(b) and Figure 5(c), our method exhibits significantly higher accuracy than the Two-stage approach, yet its classification loss converges to a higher training value, suggesting that in our design the contrastive learning task alleviates overfitting of the classification task during training. These experiments demonstrate that our proposed approach can effectively coordinate both tasks, allowing them to assist each other. Importantly, they also indicate that the significant performance improvement brought by contrastive learning in our method is largely attributable to the design of our feature transformation module and training approach.
Appendix F Attention Weight Visualization
In the feature transformation module of FedPFT, self-attention mechanisms are employed to facilitate the integration of prompts with sample features. This section visualizes the attention weights to reveal how prompts influence the transformation process. We analyze 20 test samples from a single client on the CIFAR-10 dataset, with results depicted in Figure 6. Each row in the figure corresponds to the attention weights for the output feature of a single sample. Columns represent the input dimensions of the transformation module: the first column corresponds to the original input feature $z$, while subsequent columns relate to the different prompts in $p^{cls}$ or $p^{con}$.
It can be observed that when $\alpha$ is small, indicating severe local class imbalance, each client has data from only a few classes. In this case, the feature transformation task is relatively simple, and the influence of different prompts on a sample is similar. As $\alpha$ increases, indicating more complex local tasks, the influence of prompts becomes more intricate. In particular, at the largest $\alpha$, each sample is affected differently by different prompts. This also indicates that our approach performs sample-level feature transformation.
Appendix G Partial Client Participation
In FL, challenges such as offline clients and unstable communication may result in only a subset of clients participating in training each round, posing a challenge to the robustness of FL algorithms. In this section, we investigate whether FedPFT is robust to this issue. We conduct experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet, considering scenarios where only a random 50%, 70%, or 90% of clients participate in training each round. The experimental results are presented in Table 9.
Table 9: FedPFT accuracy (%) under partial client participation (difference from full participation in parentheses).

| Datasets | 100% | 90% | 70% | 50% |
|---|---|---|---|---|
| CIFAR-10 | 88.60±2.19 | 88.50±2.01 (-0.10) | 88.57±2.53 (-0.03) | 88.69±1.83 (+0.09) |
| CIFAR-100 | 62.03±1.41 | 61.65±0.41 (-0.38) | 63.54±0.55 (+1.51) | 63.97±0.10 (+1.94) |
| Tiny ImageNet | 43.42±1.62 | 43.03±2.09 (-0.39) | 44.59±1.19 (+1.17) | 45.81±1.02 (+2.39) |
It can be observed that compared to scenarios where all clients participate in training, FedPFT’s accuracy is not significantly reduced when only a subset of clients participate. Furthermore, in CIFAR-100 and Tiny ImageNet, the performance of FedPFT may even be improved. This is because reducing the number of participating clients each round may mitigate the impact of non-IID data distribution on the global model. These experiments demonstrate the robustness of FedPFT to scenarios where only a subset of clients participate.
Appendix H Effect of Hyperparameters
In the previous experiments, we utilize the default hyperparameter combination. In this section, we verify how variations in these hyperparameters influence the performance of FedPFT.
H.1 Effect of $n^{cls}$ and $n^{con}$
$n^{cls}$ and $n^{con}$ respectively denote the number of prompts in $p^{cls}$ and $p^{con}$ for each client. We examine the impact of these two hyperparameters on the performance of FedPFT on the CIFAR-10 and CIFAR-100 datasets. When assessing the effect of $n^{cls}$, we hold $n^{con}$ constant; similarly, when evaluating the impact of $n^{con}$, $n^{cls}$ is fixed at 10. The experimental results are depicted in Figures 7 and 8.
FedPFT shows considerable robustness to variations in these hyperparameters. On the CIFAR-10 dataset, changes in $n^{cls}$ and $n^{con}$ have minimal impact on performance, suggesting that the model can effectively handle simpler data distributions even with few prompts. In contrast, on the more complex CIFAR-100 dataset, performance is initially limited by a small number of prompts, which may not sufficiently cover the diverse feature space required for effective feature transformation. As the number of prompts increases, the model's ability to transform and adapt features improves, leading to enhanced performance.
H.2 Effect of $E_{fl}$ and $E_{ta}$
$E_{fl}$ and $E_{ta}$ control the number of training epochs in the two training stages. Since we fix the total number of local epochs $E = E_{fl} + E_{ta}$, in this experiment we only adjust $E_{fl}$ to examine the impact of these two hyperparameters on model performance. The experimental results are illustrated in Figure 9.
When $E_{fl} = 0$, contrastive learning is not used to train the feature extractor, and local features are not aligned with the global classifier before the model parameters are trained. It can be observed that model performance is very poor under this condition. As $E_{fl}$ gradually increases, model performance first increases and then decreases. This suggests that $E_{fl}$ essentially balances the trade-off between the two training stages. When $E_{fl}$ is small, the feature extractor is predominantly trained by $\mathcal{L}_{CE}$, and the classifier undergoes more training epochs. The model then pays more attention to the local data distribution of clients, but collaboration among clients is also more susceptible to non-IID effects. Conversely, when $E_{fl}$ is large, the feature extractor is primarily trained by $\mathcal{L}_{CL}$, focusing more on general features, and collaboration among clients is less affected by non-IID issues. However, because the model is rarely trained with $\mathcal{L}_{CE}$, it pays less attention to the local data distribution of clients, resulting in poorer performance on local data.
In general, $E_{fl}$ and $E_{ta}$ are two hyperparameters that need to be carefully adjusted, as they have a significant impact on the performance of FedPFT. Typically, in scenarios where the local tasks of clients are simple, it may be appropriate to decrease $E_{fl}$. In other cases, we recommend a larger $E_{fl}$ to enhance the degree of collaboration among clients.
Appendix I Communication Cost
In this section, we calculate the communication overhead of one client in FedAvg and FedPFT in each communication round.
Table 10: Per-client communication cost in each round (parameter counts).

| Model | $f$ | $T$ | $h$ | $\rho$ | FedAvg total | FedPFT total | Incre. Ratio |
|---|---|---|---|---|---|---|---|
| ResNet-8 | 1.24M | 0.26M | 25.70K | 32.90K | 1.27M | 1.56M | 23.14% |
| ResNet-10 | 4.91M | 1.05M | 51.30K | 65.66K | 4.96M | 6.08M | 22.49% |
In FedAvg, each communication round involves uploading the feature extractor $f$ and the classifier $h$. FedPFT adds the feature transformation module $T$ and the feature projection head $\rho$, thereby increasing the volume of parameters transmitted per round. According to the results presented in Table 10, the communication overhead of FedPFT with the ResNet-8 and ResNet-10 architectures increases by 23.14% and 22.49%, respectively, relative to FedAvg.
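These ratios follow directly from parameter counts; a tiny sketch (with stand-in modules, since the exact architectures are not reproduced here) shows the computation.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Number of parameters in a module (e.g., f, T, h, or rho)."""
    return sum(p.numel() for p in module.parameters())

# Stand-in modules; the reported increase ratio is
# (|T| + |rho|) / (|f| + |h|), i.e., added upload over the FedAvg upload.
f, T = nn.Linear(512, 512), nn.Linear(512, 512)
h, rho = nn.Linear(512, 10), nn.Linear(512, 128)
ratio = (count_params(T) + count_params(rho)) / (count_params(f) + count_params(h))
```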
While FedPFT incurs additional communication cost, it is important to weigh this against the performance enhancements and scalability offered by $T$, as discussed in earlier sections of this paper. The improved model accuracy and robustness to non-IID data may justify the additional cost in scenarios where model performance is critical.
Moving forward, since the increase in communication cost is primarily due to the additional components $T$ and $\rho$, we aim to develop a more efficient and lightweight feature transformation module to reduce communication demands without compromising model effectiveness in our future work.
Appendix J Limitations and Future Work
In this paper, we primarily investigate PFL methods that derive personalized models based on a global model. We analyze the essential reasons these methods enhance performance from the perspective of mismatches between local features and classifiers. Although such methods occupy the mainstream in the current PFL field, it is necessary to admit that there are some PFL methods that are not based on global models, such as personalized-weight-aggregation-based methods, which are not explored in this study. Additionally, while this paper observes that personalizing a subset of parameters degrades the quality of the feature extractor, the underlying reasons for this phenomenon require further investigation.
Appendix K Theoretical Analysis
Since the main problem in Eq. (1) is non-convex, we focus on the factors affecting convergence in the non-convex setting.
| Implication | Notation |
|---|---|
| Global / Local loss | $\mathcal{L}$ / $\mathcal{L}_i$ |
| Global / Local problem | $F$ / $F_i$ |
| Local dataset on client $i$ | $\mathcal{D}_i$ |
| Feature extractor | $f$ |
| Feature transformation module | $T$ |
| Classification / Contrastive learning prompts | $p^{cls}$ / $p^{con}$ |
| Feature extractor & feature transformation module & classifier | $\theta$ |
| Classification / Contrastive learning task head | $h$ / $\rho$ |
| Global / Local problem's gradient | $\nabla F$ / $\nabla F_i$ |
| Local gradient approximation | $\tilde{\nabla} F_i$ |
| Client number | $N$ |
| Local update epochs | $E$ |
| Number of clients sampled at each global epoch | $S$ |
| Set of clients sampled at global epoch $t$ | $\mathcal{S}_t$ |
| Actual learning rate of the global problem | $\eta$ |
| Learning rate of the local problem | $\eta_l$ |
| Upper bound on the approximated local gradient error | $\epsilon$ |
| Upper bound on the local-global gradient error | $\sigma$ |
| Index of client / local epoch / global epoch | $i$ / $e$ / $t$ |
K.1 Problem Setup
Non-convex case analyses are as follows. By Lagrange duality, the main problem is transformed into an unconditional bi-level optimization problem:

$$\min_{\theta} F(\theta) := \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N} F_i(\theta)\Big], \qquad F_i(\theta) := \mathbb{E}_{\mathcal{D}_i}\big[\mathcal{L}_i\big(\theta;\, p_i^{*}(\theta)\big)\big], \quad p_i^{*}(\theta) \in \arg\min_{p} \mathcal{L}_i(\theta;\, p),$$

where $\mathbb{E}$ represents the expectation over all random variables, $\mathbb{E}_{\mathcal{S}_t}$ the expectation over client sampling, and $\mathbb{E}_{\mathcal{D}_i}$ the expectation over local data sampling; we write $F_i$ for the local problem for simplification, based on the equivalence of block coordinate descent and gradient descent.
K.2 Propositions
Proposition K.1 ($L$-smooth).
If $F$ is $L$-smooth, we have:
$$\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|, \qquad F(y) \le F(x) + \langle \nabla F(x),\, y - x\rangle + \tfrac{L}{2}\|y - x\|^{2}.$$
Proposition K.2 (Jensen's inequality).
If $f$ is convex, we have the following inequality:
$$f(\mathbb{E}[x]) \le \mathbb{E}[f(x)].$$
A variant of the general one shown above, given a group $\{x_i\}_{i=1}^{n}$:
$$\Big\|\frac{1}{n}\sum_{i=1}^{n} x_i\Big\|^{2} \le \frac{1}{n}\sum_{i=1}^{n}\|x_i\|^{2}.$$
Proposition K.3 (Triangle inequality).
The triangle inequality, where $\|\cdot\|$ is the norm and $x$, $y$ are elements of the corresponding normed space:
$$\|x + y\| \le \|x\| + \|y\|.$$
Proposition K.4 (Matrix norm compatibility).
The matrix norm compatibility, for $A \in \mathbb{R}^{m\times n}$ and $x \in \mathbb{R}^{n}$:
$$\|Ax\| \le \|A\|\,\|x\|.$$
Proposition K.5 (Peter-Paul inequality).
For all $\delta > 0$ and all $x$, $y$, we have the following inequality:
$$2\langle x, y\rangle \le \tfrac{1}{\delta}\|x\|^{2} + \delta\|y\|^{2}, \qquad \text{hence} \qquad \|x + y\|^{2} \le \Big(1 + \tfrac{1}{\delta}\Big)\|x\|^{2} + (1 + \delta)\|y\|^{2}.$$
K.3 Assumptions
Assumption K.1 ($L$-smooth local objectives).
For all $i$, $F_i$ is $L$-smooth; the main implication is shown in Prop. K.1. Notice that $F_i$ is assumed to be $L$-smooth and non-convex, which matches the problem and neural network architecture setting in the main paper.
Assumption K.2 (Bounded local variance).
The local problem's gradient is assumed not to be too far from the global problem's gradient: $\|\nabla F_i(\theta) - \nabla F(\theta)\| \le \sigma$.
Assumption K.3 (Bounded approximated gradient).
The first-order approximation $\tilde{\nabla} F_i$ of the local problem's gradient should not be too far from the ground truth $\nabla F_i$: $\|\tilde{\nabla} F_i(\theta) - \nabla F_i(\theta)\| \le \epsilon$. Under this assumption, the approximation error of the block coordinate descent in Algorithm 1 is bounded.
K.4 Lemmas
Lemma K.1 (Bounded local approximation error).
If the local learning rate $\eta_l$ is chosen sufficiently small, we have the following bound on the client drift error:
Proof.
The client drift error on a given client $i$ and its upper bound are as follows:
(7)
where the first inequality is by Proposition K.3 and the second is by Assumption K.1.
For the last term in the upper bound, we have the iterative formulation as follows:
where the two inequalities are by Proposition K.3, Proposition K.5, and Eq. (7).
Taking the step size as stated, we recursively unroll the inequality as follows:
where the inequality is unrolled recursively. Thus, we have:
∎
K.5 Theorem and Discussion
Theorem K.2 (Non-convex and smooth convergence of FedPFT).
Let Assumption K.1, Assumption K.2, and Assumption K.3 hold; then, for a suitably chosen learning rate $\eta$, we have the following bound:
Proof.
where the four inequalities follow, respectively, from the $L$-smoothness of $F$, Proposition K.5, Lemma K.1, and an argument similar to the classic Lemma 4 in [29].
Rearranging the inequality above and accumulating over global epochs, we have:
Let $F^{*}$ denote the minimum of the main problem $F$. To measure the exact terms of the bounds, we consider the following cases:
- If either boundary condition holds, choosing the learning rate accordingly, we have:
- If both conditions fail, choosing the learning rate accordingly, we have:
Uniformly sampling a global epoch $t$, we have the upper bound as follows:
∎
Remark K.2.1.
According to Theorem K.2, our proposed FedPFT converges at a sub-linear rate. The linear term is affected by the learning rate and the initialization gap. The sub-linear term is affected by the local approximation error, especially when the number of local epochs $E$ is large, due to the exponential factor. As the local approximation error of the gradient grows, both the convergence radius and the sub-linear term are significantly affected by the choice of local optimizer. Another sub-linear term is eliminated when all the clients are sampled ($S = N$); otherwise, the sub-linear rate is mainly affected by the client sampling.
FedPFT aligns the training objectives across clients by introducing $p^{cls}$ and reduces the impact of non-IID data on the feature extractor through contrastive learning. Both of these designs effectively reduce differences in local gradients among clients during training, thereby reducing $\sigma$ and subsequently lowering the upper bound. During training,