Tackling Feature-Classifier Mismatch in Federated Learning via Prompt-Driven Feature Transformation (2024)

Table of Contents
1 Introduction
2 Related Work
3 Methodology
  3.1 Overview of FedPFT
  3.2 Problem Formulation
  3.3 Feature Transformation Module
  3.4 Classification Task with Personalized Prompts
  3.5 Contrastive Learning Task
  3.6 Alternating Training Strategy
4 Experiments
  4.1 Experimental Setup
  4.2 Comparison with State-of-the-art Methods
  4.3 Ablation Study
  4.4 Separability of Features
  4.5 Learned Features of Different Methods
  4.6 Effect of Different Prompts
5 Conclusion and Discussion
References
Appendix A Related Work
Appendix B Experiment Setup
  B.1 Introduction to non-IID Scenarios
  B.2 Introduction to Comparative Methods
  B.3 Hyperparameter Settings in Different Methods
  B.4 Compute Resources
Appendix C Comparison with State-of-the-art Methods
Appendix D Feature Separability of Different Methods
Appendix E Comparison with Two-stage Approach
Appendix F Attention Weight Visualization
Appendix G Partial Client Participation
Appendix H Effect of Hyperparameters
  H.1 Effect of $n_{\kappa}$ and $n_{\rho}$
  H.2 Effect of $R_{f}$ and $R_{a}$
Appendix I Communication Cost
Appendix J Limitations and Future Work
Appendix K Theoretical Analysis
  K.1 Problem Setup
  K.2 Propositions
  K.3 Assumptions
  K.4 Lemmas
  K.5 Theorem and Discussion

Xinghao Wu^{1,*}, Jianwei Niu^{1,2}, Xuefeng Liu^{1,2,†}, Mingjia Shi^{3}, Guogang Zhu^{1}, and Shaojie Tang^{4}

^{1}School of Computer Science and Engineering, Beihang University, Beijing, China
^{2}Zhongguancun Laboratory, Beijing, China
^{3}Sichuan University, Sichuan, China
^{4}Department of Management Science and Systems, University at Buffalo

Abstract

In traditional Federated Learning approaches like FedAvg, the global model underperforms when faced with data heterogeneity. Personalized Federated Learning (PFL) enables clients to train personalized models to better fit their local data distributions. However, we surprisingly find that the feature extractor in FedAvg is superior to those in most PFL methods. More interestingly, by applying a linear transformation to the local features extracted by the feature extractor so that they align with the classifier, FedAvg can surpass the majority of PFL methods. This suggests that the primary cause of FedAvg's inadequate performance is the mismatch between the locally extracted features and the classifier. While current PFL methods mitigate this issue to some extent, their designs compromise the quality of the feature extractor, thus limiting the full potential of PFL. In this paper, we propose a new PFL framework called FedPFT to address the mismatch problem while enhancing the quality of the feature extractor. FedPFT integrates a feature transformation module, driven by personalized prompts, between the global feature extractor and classifier. In each round, clients first train prompts to transform local features to match the global classifier, followed by training the model parameters. This approach also aligns the training objectives of clients, reducing the impact of data heterogeneity on model collaboration. Moreover, FedPFT's feature transformation module is highly scalable, allowing different prompts to tailor local features to various tasks. Leveraging this, we introduce a collaborative contrastive learning task to further refine feature extractor quality. Our experiments demonstrate that FedPFT outperforms state-of-the-art methods by up to 7.08%.

* wuxinghao@buaa.edu.cn    † Corresponding author

1 Introduction

Federated Learning (FL) allows all clients to train a global model collaboratively without sharing their raw data. A key challenge in FL is data heterogeneity, meaning the data across different clients is not independently and identically distributed (non-IID). This issue results in degraded performance of the global model trained in conventional FL methods such as FedAvg [27].

To address this issue, Personalized Federated Learning (PFL) has been proposed, which allows clients to train personalized models to fit their local data distribution better. Many current PFL methods achieve personalization by personalizing some parameters of the global model. For example, FedPer [2] personalizes classifiers, FedBN [20] personalizes BN layers, AlignFed [39] personalizes feature extractors, and FedCAC [35] selects parameters susceptible to non-IID effect for personalization.

Table 1: Linear probe accuracy (Probe Acc.), original model accuracy (Origin Acc.), and accuracy after a linear feature transformation (Match Acc.) on CIFAR-10.

                  CIFAR-10, α=0.5                        CIFAR-10, α=1.0
Methods   Probe Acc.  Origin Acc.  Match Acc.    Probe Acc.  Origin Acc.  Match Acc.
FedAvg      72.52%      59.66%      72.60%         68.38%      60.33%      68.37%
FedPer      71.07%      68.86%      71.03%         66.51%      64.83%      66.75%
FedBN       70.15%      66.20%      70.60%         66.51%      62.97%      66.80%
FedCAC      71.56%      68.71%      71.63%         66.98%      64.90%      67.11%
Ours        77.25%      77.06%      77.68%         74.02%      73.88%      74.75%

Although the above methods demonstrate significant performance improvements over the global model, an interesting observation emerges from our experiments: the feature extractor derived from FedAvg outperforms those in most PFL methods. Specifically, we conduct linear probe experiments in which each client attaches a randomly initialized linear classifier (probe) on top of the feature extractor and retrains only this classifier on local data. As evident from Table 1, the Probe Acc. of FedAvg exceeds that of the PFL methods, indicating that the features extracted by FedAvg exhibit superior linear separability. This suggests that FedAvg has greater potential to outperform PFL methods.

These findings prompt us to further explore why FedAvg underperforms on client-local data compared to PFL methods. To unveil this puzzle, we introduce a linear layer between the global feature extractor and the classifier on each client, training this layer with local data to align the features with the classifier. According to the Match Acc. in Table 1, applying a linear transformation to local features significantly improves accuracy over the original model (Origin Acc.), even exceeding the Origin Acc. of current PFL methods. This indicates that the fundamental reason for the global model's inadequate performance lies in the mismatch between local features and the global classifier.
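To make these two diagnostics concrete, the sketch below shows one way to measure Probe Acc. and Match Acc. in PyTorch. It is an illustrative reconstruction rather than our exact experimental code: `extractor`, `classifier`, and `loader` are assumed to be a client's frozen global modules and its local data loader.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def evaluate(model_fn, loader, device="cpu"):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model_fn(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
    return correct / total

def probe_acc(extractor, loader, feat_dim, num_classes, epochs=20, lr=0.01, device="cpu"):
    """Probe Acc.: train a fresh linear classifier on frozen local features."""
    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    extractor.eval()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                f = extractor(x)                  # frozen features
            loss = F.cross_entropy(probe(f), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return evaluate(lambda z: probe(extractor(z)), loader, device)

def match_acc(extractor, classifier, loader, feat_dim, epochs=20, lr=0.01, device="cpu"):
    """Match Acc.: train a linear transform between frozen extractor and classifier."""
    transform = nn.Linear(feat_dim, feat_dim).to(device)
    opt = torch.optim.SGD(transform.parameters(), lr=lr)
    extractor.eval(); classifier.eval()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                f = extractor(x)
            loss = F.cross_entropy(classifier(transform(f)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return evaluate(lambda z: classifier(transform(extractor(z))), loader, device)
```

In practice the accuracies are computed on each client's test split; the training loader is reused above only for brevity.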

Further experiments with PFL methods demonstrate that while they somewhat mitigate the mismatch issue, their design inadvertently degrades the quality of the feature extractor, leading to a lower Match Acc. compared to FedAvg. More importantly, current PFL methods still face issues of mismatch. This problem not only diminishes model accuracy during inference but also affects the synergy between the feature extractor and the classifier during training, ultimately impacting the feature extractor’s quality. These observations suggest that significant untapped potential remains within PFL.

In PFL, targeted designs are imperative to tackle the mismatch problem during training and improve the quality of the feature extractor. Hence, we introduce a novel PFL method called FedPFT. Drawing inspiration from prompt technology [11], which utilizes prompts as inputs to guide a model’s behavior, FedPFT integrates a vision-prompt-driven feature transformation module between the global feature extractor and classifier. In each iteration, FedPFT initially trains prompts to guide local feature transformation to align with the global classifier. This process aligns the local features with the global feature space partitioned by the classifier, thereby achieving alignment of training objectives among clients. Subsequently, training the model parameters based on this alignment can alleviate the impact of non-IID data on client collaboration and enhance the quality of the feature extractor.

Furthermore, our proposed framework exhibits notable scalability. Clients’ local features can be transformed by different task-specific prompts to accommodate various tasks. Leveraging this capability, we introduce a collaborative contrastive learning task among clients to further enhance the quality of the feature extractor. As evidenced in Table 1, our method not only resolves the mismatch issue but also significantly improves the quality of the feature extractor.

Our main contributions can be summarized as follows:

  • We identify that the root cause of the global model’s inadequate performance is the mismatch between local features and the classifier. Personalizing some parameters improves performance because it alleviates the impact of this issue. This provides a new perspective for future PFL approaches to better address the non-IID problem.

  • We propose a new PFL framework, which incorporates a feature transformation module to align local features with the global classifier. This approach not only resolves the mismatch problem but also significantly enhances the performance of the feature extractor.

  • Our experiments on multiple datasets and non-IID scenarios demonstrate the superiority of FedPFT, outperforming state-of-the-art methods by up to 7.08%.

2 Related Work

PFL is an effective approach to address the challenges of non-IID data in FL. There has been a surge of methodologies within PFL, with parameter decoupling methods gaining significant attention due to their simplicity and effectiveness, making them one of the mainstream research directions in PFL. For a more detailed summary of other categories of PFL methods, please refer to Appendix A.

Parameter decoupling methods aim to decouple a subset of parameters from the global model for personalization. Approaches such as FedPer [2], FedRep [5], and GPFL [38] focus on personalizing the classifier. In contrast, methods like LG-FedAvg [22], and AlignFed [39] advocate for the personalization of the feature extractor. Additionally, FedBN [20] and MTFL [28] propose personalizing batch normalization (BN) layers within the feature extractor. Techniques employing deep reinforcement learning [31] or hypernetworks [26] have been used to determine which specific layers to personalize. The recent FedCAC [35] method advances this by introducing a metric for parameter-wise selection.

These decoupling methods help alleviate the mismatch issue within the global model by allowing local parameter adjustments. For instance, personalizing the classifier adjusts it locally to match the local features extracted by the global feature extractor. However, these methods do not completely resolve the mismatch issue during training. Personalizing parameters often reduces the extent of client information exchange, which can diminish the overall quality of the feature extractor, thus limiting the potential benefits of PFL.

3 Methodology

[Figure 1: Overview of FedPFT (a) and the feature transformation module (b).]


3.1 Overview of FedPFT

Before delving into the details of FedPFT, we first provide an overview, as illustrated in Figure 1(a). Each training round in FedPFT includes five key steps:

(1) Clients download the global models, which include the feature extractor $\phi$, feature transformation module $\tau$, classifier $h_{\kappa}$, and feature projection head $h_{\rho}$. These models serve to initialize the corresponding local models $\{\phi_i, \tau_i, h_{\kappa,i}, h_{\rho,i}\}$.

(2) Each client $i$ updates $\phi_i$, $\tau_i$, and $h_{\rho,i}$ to minimize the contrastive learning loss $L_{\text{Con}}$, aiming to enhance the generalization of the feature extractor. It also updates $\tau_i$ and the classification prompts $p_{\kappa,i}$ with the cross-entropy loss $L_{\text{CE}}$ to align local features with the global classifier.

(3) Each client $i$ freezes the prompts $p_{\kappa,i}$ and trains $\phi_i$, $\tau_i$, and $h_{\kappa,i}$ using $L_{\text{CE}}$ to adapt the model to the classification task. It also makes the contrastive learning prompts $p_{\rho,i}$ trainable to align features with the contrastive learning task.

(4) Clients upload $\{\phi_i, \tau_i, h_{\kappa,i}, h_{\rho,i}\}$ to the server while retaining $\{p_{\kappa,i}, p_{\rho,i}\}$ locally.

(5) The server aggregates the models uploaded by the clients.
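Putting these steps together, one communication round can be outlined as below. This is a minimal sketch under assumed interfaces (a `client.local_update` method implementing the two training phases of Section 3.6, and FedAvg-style sample-weighted averaging for step (5)); it is not our exact implementation.

```python
import copy

def fedpft_round(global_models, clients):
    """One FedPFT communication round (illustrative sketch).

    global_models: dict with modules 'phi', 'tau', 'h_kappa', 'h_rho'.
    clients: objects holding local data, local prompts p_kappa_i / p_rho_i,
             a local_update() method, and a num_samples attribute.
    """
    uploads, weights = [], []
    for client in clients:
        # (1) download: initialize local modules from the global ones
        local_models = copy.deepcopy(global_models)
        # (2)-(3) two-phase local training; prompts never leave the client
        local_models = client.local_update(local_models)
        # (4) upload model parameters only
        uploads.append(local_models)
        weights.append(client.num_samples)

    # (5) server aggregation: FedAvg-style weighted average of each module
    total = sum(weights)
    aggregated = {}
    for name in global_models:
        states = [u[name].state_dict() for u in uploads]
        avg_state = {}
        for key in states[0]:
            summed = sum(w * s[key].float() for w, s in zip(weights, states))
            avg_state[key] = (summed / total).to(states[0][key].dtype)
        module = copy.deepcopy(global_models[name])
        module.load_state_dict(avg_state)
        aggregated[name] = module
    return aggregated
```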

3.2 Problem Formulation

In PFL, $N$ clients train their personalized models $w_i, i \in [N]$, under the coordination of a server, aiming for each $w_i$ to perform well on the client data distribution $\mathbb{D}_i$. This objective can be formalized as $\min_{\{w_i\}_{i\in[N]}} \frac{1}{N}\sum_{i=1}^{N} L_i(w_i; \mathbb{D}_i)$, where $L_i$ represents the loss function of the $i$-th client.

In this paper, our goal is to enhance personalized models by addressing the mismatch problem between local features and the classifier in the global model and improving the quality of the feature extractor. Thus, the training objective of FedPFT can be formulated as:

$$\min_{\phi,\tau,h_{\kappa}}\;\min_{\{p_{\kappa,i}\}_{i\in[N]}}\;\mathbf{E}_{i}\Big\{L_{i}(\phi,\tau,h_{\kappa},p_{\kappa,i};d_{i}) := \mathbf{E}_{d_{i}}\big[L_{\text{CE}}(\phi,\tau,h_{\kappa},p_{\kappa,i};d_{i}) + L_{\text{Con}}(\phi,\tau;d_{i})\big]\Big\}, \quad (1)$$

where $\phi$ and $h_{\kappa}$ represent the feature extractor and classifier of the global model, respectively. $\tau$ is the newly introduced global feature transformation module. This module, along with the classification prompt $p_{\kappa,i}$, transforms local features to align with the global classifier. $L_{\text{CE}}$ denotes the cross-entropy loss for the classification task, while $L_{\text{Con}}$ represents the contrastive learning loss designed to enhance the feature extractor's quality. $d_i$ represents the local data of client $i$.

3.3 Feature Transformation Module

In FedPFT, we introduce a global feature transformation module $\tau$, along with a set of prompts $p_{\kappa,i}$ for each client $i$, to align the features extracted by the feature extractor $\phi$ with the global classifier.

Formally, given a sample $x_j \in d_i$, the feature obtained by the feature extractor $\phi$ is $f_j \in \mathbb{R}^{m}$, where $m$ is the feature dimension. A collection of $n$ prompts is denoted as $p = \{\boldsymbol{p}^{k} \in \mathbb{R}^{m} \mid k \in \mathbb{N}, 1 \leq k \leq n\}$. The operation of the feature transformation module is formulated as

$$[f_j', p'] = \tau([f_j, p]), \quad (2)$$

where $[\cdot,\cdot]$ signifies stacking and concatenation along the sequence length dimension, yielding $[f_j', p'] \in \mathbb{R}^{(1+n)\times m}$. Here $f_j'$ is the transformed feature. An example of the feature transformation module is illustrated in Figure 1(b).

The feature transformation module essentially adapts local features for downstream tasks, providing good scalability. We can introduce tasks beneficial for client collaboration using different task-specific prompts $p$. Leveraging this, FedPFT additionally introduces a contrastive learning task and utilizes contrastive learning prompts $p_{\rho,i}$ for feature transformation. We denote $n_{\kappa}$ and $n_{\rho}$ as the number of prompts contained in $p_{\kappa,i}$ and $p_{\rho,i}$, respectively.
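A minimal PyTorch sketch of such a module is given below. The paper specifies only that $\tau$ operates on the stacked $(1+n)\times m$ token sequence (Figure 1(b)); using a single self-attention encoder layer and four attention heads here is our assumption for illustration.

```python
import torch
import torch.nn as nn

class FeatureTransform(nn.Module):
    """Prompt-driven feature transformation tau (illustrative sketch).

    Stacks the extracted feature f_j with n prompt vectors into a
    (1 + n) x m token sequence and processes it with a self-attention
    layer; the transformed feature f' is read off the first token.
    """

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)

    def forward(self, f: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # f: (batch, m) features; prompts: (n, m) client-specific prompts
        batch = f.size(0)
        tokens = torch.cat(
            [f.unsqueeze(1), prompts.unsqueeze(0).expand(batch, -1, -1)], dim=1)
        out = self.layer(tokens)          # Eq. (2): [f', p'] = tau([f, p])
        return out[:, 0]                  # transformed feature f'


# usage sketch: transform features with n_kappa = 4 classification prompts
# tau = FeatureTransform(feat_dim=512)
# p_kappa = nn.Parameter(torch.randn(4, 512))
# f_prime = tau(features, p_kappa)
```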

3.4 Classification Task with Personalized Prompts

The classifier is highly susceptible to the influence of non-IID data, leading to a mismatch between the global classifier and local features. Unlike previous methods, which personalize the classifier to match local features, we find that a global classifier provides clients with a unified feature partition space. Aligning local features with this space not only solves the mismatch problem but also aligns training objectives among clients, reducing the impact of non-IID data on collaboration.

To implement this, we retain the global feature extractor and classifier while employing a set of personalized classification prompts $p_{\kappa,i}$ to transform each client $i$'s local features to better align with the global classifier. Specifically, the classification loss of each client $i$ is defined as:

$$L_{\text{CE}}(\phi,\tau,p_{\kappa,i},h_{\kappa};x,y) = -\sum_{c=1}^{C} y_{c}\log(o_{i,c}), \quad \text{where } x,y \sim d_{i}. \quad (3)$$

$C$ is the number of classes, and $o_i = \text{Softmax}(h_{\kappa}\circ\tau([\phi(x), p_{\kappa,i}]))$ represents the predicted probabilities, with $o_{i,c}$ being the probability of class $c$. Details on coordinating the training of the model and prompts to achieve feature-classifier alignment are discussed in Section 3.6.
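For clarity, the forward pass behind Eq. (3) can be sketched as follows; `FeatureTransform` refers to the illustrative module above, and the softmax is folded into PyTorch's `cross_entropy`.

```python
import torch.nn.functional as F

def classification_loss(phi, tau, h_kappa, p_kappa_i, x, y):
    """Cross-entropy loss of Eq. (3) (illustrative sketch).

    phi: feature extractor; tau: feature transformation module (e.g. the
    FeatureTransform sketch above); h_kappa: global linear classifier;
    p_kappa_i: this client's classification prompts of shape (n_kappa, m).
    """
    f = phi(x)                          # local features, shape (batch, m)
    f_prime = tau(f, p_kappa_i)         # transform features toward h_kappa
    logits = h_kappa(f_prime)           # o_i = Softmax(logits); the softmax
    return F.cross_entropy(logits, y)   # is applied inside cross_entropy
```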

3.5 Contrastive Learning Task

Contrastive learning tasks have shown robustness to the challenges posed by non-IID data distributions [34]. To further enhance the quality of the model’s feature extractor, we introduce a contrastive learning task using the Momentum Contrast (MoCo) [8] framework. The associated contrastive loss function is defined as:

$$L_{\text{Con}}(\phi,\tau,p_{\rho,i},h_{\rho};x) = -\log\frac{\exp(q\cdot k_{+}/\beta)}{\sum_{j=0}^{K}\exp(q\cdot k_{j}/\beta)}, \quad \text{where } x \sim d_{i}. \quad (4)$$

$h_{\rho}$ is the projection head used for contrastive learning. In this formula, $q = h_{\rho}\circ\tau([\phi(x'), p_{\rho,i}])$ represents the query vector, and $k_{+} = \tilde{h}_{\rho}\circ\tau([\tilde{\phi}(x''), p_{\rho,i}])$ denotes the positive key vector. Here, $x'$ and $x''$ are augmented versions of the sample $x$; $\tilde{\phi}$ and $\tilde{h}_{\rho}$ refer to the momentum-updated encoder and projection head, respectively. $\beta$ is a temperature hyperparameter, and $K$ is the number of negative samples drawn from MoCo's queue, comprising the set $\{k_{j}\}_{j=0}^{K}$.
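The sketch below shows how this MoCo-style loss could look in PyTorch, with both augmented views passed through $\tau$ using the contrastive prompts. The queue update and other MoCo bookkeeping are omitted; the variable names and wrapper structure are our assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(online, momentum_net, m=0.999):
    """MoCo-style momentum update of the key encoder / projection head."""
    for p_o, p_m in zip(online.parameters(), momentum_net.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

def contrastive_loss(phi, tau, h_rho, phi_m, h_rho_m, p_rho_i, queue, x1, x2, beta=0.07):
    """InfoNCE loss of Eq. (4) (illustrative sketch).

    x1, x2: two augmentations of the same batch; queue: (K, d) tensor of
    negative keys maintained as in MoCo; phi_m / h_rho_m: momentum-updated
    encoder and projection head.
    """
    q = F.normalize(h_rho(tau(phi(x1), p_rho_i)), dim=1)           # queries
    with torch.no_grad():
        k = F.normalize(h_rho_m(tau(phi_m(x2), p_rho_i)), dim=1)   # positive keys
    l_pos = (q * k).sum(dim=1, keepdim=True)                       # (B, 1)
    l_neg = q @ queue.t()                                          # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / beta
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)   # the positive key sits at index 0
```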

3.6 Alternating Training Strategy

To effectively coordinate the training of different modules in FedPFT, we propose an alternating training strategy, which partitions each local training round into two distinct phases: the feature learning phase and the task adaptation phase.

Feature Learning Phase.

In this phase, the training objective can be formulated as

$$\min_{p_{\kappa,i},\,\phi_{i},\,\tau_{i},\,h_{\rho,i}}\big\{L_{\text{CE}}(\phi_{i},\tau_{i},p_{\kappa,i},h_{\kappa};d_{i}) + L_{\text{Con}}(\phi_{i},\tau_{i},p_{\rho,i},h_{\rho,i};d_{i})\big\}. \quad (5)$$

$L_{\text{CE}}$ trains the classification prompts $p_{\kappa,i}$, aligning local features with the global classifier $h_{\kappa}$, while $L_{\text{Con}}$ trains the feature extractor $\phi_{i}$ to derive general feature representations and mitigate the impact of non-IID data. Notably, during this phase, $\phi_{i}$ is updated exclusively by $L_{\text{Con}}$.

Task Adaptation Phase.

Following the above phase, this phase refines the previously learned features for the classification task and further enhances the classifier. The training objective is

$$\min_{\phi_{i},\,\tau_{i},\,h_{\kappa,i},\,p_{\rho,i}}\big\{L_{\text{CE}}(\phi_{i},\tau_{i},p_{\kappa,i},h_{\kappa,i};d_{i}) + L_{\text{Con}}(\phi_{i},\tau_{i},p_{\rho,i},h_{\rho,i};d_{i})\big\}. \quad (6)$$

The second term in Eq. (6) trains the contrastive learning prompts $p_{\rho,i}$, aiming to align the features with the contrastive learning task and to mitigate the impact of the classification task on the contrastive learning task. In this phase, $\phi_{i}$ is updated solely by $L_{\text{CE}}$.

Let $R$ represent the total number of local epochs in one training round. We divide these into $R_{f}$ epochs for the feature learning phase and $R_{a}$ epochs for the task adaptation phase, where $R_{f} + R_{a} = R$. It is crucial that $R_{f}$ is always larger than $R_{a}$ to ensure: 1) $\phi_{i}$ is predominantly trained by the contrastive learning task, reducing the impact of the non-IID problem on collaboration in feature extraction; and 2) the features used for classification on the client side are better aligned with the global classifier, thereby achieving better alignment of training objectives across clients.

Upon completing local training, the parameters $\phi_{i}$, $\tau_{i}$, $h_{\kappa,i}$, and $h_{\rho,i}$ are aggregated at the server to foster client collaboration, while $p_{\kappa,i}$ and $p_{\rho,i}$ remain local. We simply adopt the aggregation method used in FedAvg. The pseudo-code of FedPFT is summarized in Algorithm 1.

Algorithm 1: FedPFT

Input: Each client's initial personalized prompts $p_{\kappa,i}^{(0)}$ and $p_{\rho,i}^{(0)}$; the initial global models $\{\phi^{(0)}, \tau^{(0)}, h_{\kappa}^{(0)}, h_{\rho}^{(0)}\}$; client number $N$; total rounds $T$; epochs of the two learning phases $R_{f}$ and $R_{a}$.
Output: Personalized model $\{\phi^{(T)}, \tau^{(T)}, h_{\kappa}^{(T)}, p_{\kappa,i}^{(T)}\}$ for each client.

for $t = 0$ to $T-1$ do
  Client-side:
  for $i = 1$ to $N$ in parallel do
    Initialize $\{\phi_i^{(t)}, \tau_i^{(t)}, h_{\kappa,i}^{(t)}, h_{\rho,i}^{(t)}\}$ with $\{\phi^{(t)}, \tau^{(t)}, h_{\kappa}^{(t)}, h_{\rho}^{(t)}\}$.
    Update $\{\phi_i^{(t)}, \tau_i^{(t)}, p_{\kappa,i}^{(t)}, h_{\rho,i}^{(t)}\}$ by Eq. (5) for $R_{f}$ epochs to obtain $\{\phi_i^{(t')}, \tau_i^{(t')}, p_{\kappa,i}^{(t+1)}, h_{\rho,i}^{(t+1)}\}$.
    Update $\{\phi_i^{(t')}, \tau_i^{(t')}, p_{\rho,i}^{(t)}, h_{\kappa,i}^{(t)}\}$ by Eq. (6) for $R_{a}$ epochs to obtain $\{\phi_i^{(t+1)}, \tau_i^{(t+1)}, p_{\rho,i}^{(t+1)}, h_{\kappa,i}^{(t+1)}\}$.
    Send $\{\phi_i^{(t+1)}, \tau_i^{(t+1)}, h_{\kappa,i}^{(t+1)}, h_{\rho,i}^{(t+1)}\}$ to the server.
  end for
  Server-side:
  Aggregate the uploaded models into the global models $\{\phi^{(t+1)}, \tau^{(t+1)}, h_{\kappa}^{(t+1)}, h_{\rho}^{(t+1)}\}$.
  Send $\{\phi^{(t+1)}, \tau^{(t+1)}, h_{\kappa}^{(t+1)}, h_{\rho}^{(t+1)}\}$ to each client $i$.
end for
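As a companion to Algorithm 1, the client-side alternating update can be sketched as below. `classification_loss` is the helper from Section 3.4, while `augment` and `moco_loss` stand for an assumed data-augmentation function and a wrapper around the contrastive loss of Section 3.5 (momentum encoders and queue omitted); restricting which loss updates $\phi_i$ is expressed by detaching its output.

```python
import torch

def local_update(phi, tau, h_kappa, h_rho, p_kappa, p_rho, loader,
                 R_f, R_a, lr=0.01, device="cpu"):
    """Two-phase local update of FedPFT (illustrative sketch).

    p_kappa / p_rho are nn.Parameter tensors holding this client's prompts.
    """
    # Feature learning phase, Eq. (5): optimize phi, tau, p_kappa, h_rho.
    opt_f = torch.optim.SGD(list(phi.parameters()) + list(tau.parameters())
                            + list(h_rho.parameters()) + [p_kappa], lr=lr)
    for _ in range(R_f):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # phi receives gradients only from the contrastive loss,
            # so its output is detached inside the classification term
            loss = (classification_loss(lambda z: phi(z).detach(), tau,
                                        h_kappa, p_kappa, x, y)
                    + moco_loss(phi, tau, h_rho, p_rho, augment(x), augment(x)))
            opt_f.zero_grad(); loss.backward(); opt_f.step()

    # Task adaptation phase, Eq. (6): optimize phi, tau, h_kappa, p_rho.
    opt_a = torch.optim.SGD(list(phi.parameters()) + list(tau.parameters())
                            + list(h_kappa.parameters()) + [p_rho], lr=lr)
    for _ in range(R_a):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # here phi is updated solely by the classification loss
            loss = (classification_loss(phi, tau, h_kappa, p_kappa, x, y)
                    + moco_loss(lambda z: phi(z).detach(), tau, h_rho, p_rho,
                                augment(x), augment(x)))
            opt_a.zero_grad(); loss.backward(); opt_a.step()
    return phi, tau, h_kappa, h_rho, p_kappa, p_rho
```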

4 Experiments

4.1 Experimental Setup

Datasets.

We employ three datasets for experimental validation: CIFAR-10 [14], CIFAR-100 [13], and Tiny ImageNet [15]. We utilize two scenarios: Dirichlet non-IID and Pathological non-IID.

In our experiments, each client is assigned 500 training samples. For CIFAR-10 and CIFAR-100 datasets, each client has 100 test samples; for the Tiny ImageNet dataset, each client has 200 test samples. Both training and test data have the same label distribution.
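For reference, the common Dirichlet label-skew protocol can be generated as in the generic sketch below (smaller $\alpha$ yields more heterogeneous clients); the exact settings we use are detailed in Appendix B.1, so this code is an illustration rather than our data pipeline.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with a Dirichlet(alpha) label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        # proportion of class c assigned to each client
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```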

Baseline Methods. We compare our method against nine state-of-the-art (SOTA) methods: FedAMP [10], FedPer [2], FedRep [5], FedBN [20], FedRoD [4], pFedSD [12], pFedGate [3], FedCAC [35], and pFedPT [17]. These methods cover the advancements in mainstream PFL research directions.

Hyperparameter Settings. For the general hyperparameters of FL, we set the number of clients $N = 40$, batch size $B = 100$, and local epochs $R = 5$. Across all datasets, we set the total number of rounds $T = 1000$ in each experiment to ensure convergence and select the highest average accuracy achieved by all clients across all rounds as the result. Each experiment is repeated with three random seeds, and the mean and standard deviation are reported. We employ the ResNet [9] architecture, specifically ResNet-8 for CIFAR-10 and ResNet-10 for CIFAR-100 and Tiny ImageNet.

For more details on the experimental setup, please refer to Appendix B.

4.2 Comparison with State-of-the-art Methods

Table 2: Test accuracy (%) on CIFAR-100 and Tiny ImageNet in the Dirichlet non-IID scenario.

                                 CIFAR-100                               Tiny ImageNet
Methods                 α=0.1        α=0.5        α=1.0        α=0.1        α=0.5        α=1.0
FedAvg               34.91±0.86   32.78±0.23   33.94±0.39   21.26±1.28   20.32±0.91   17.20±0.54
Local                47.61±0.96   22.65±0.51   18.76±0.63   24.07±0.62    8.75±0.30    6.87±0.28
FedAMP               46.68±1.06   24.74±0.58   18.22±0.41   27.85±0.71   10.70±0.32    7.13±0.21
FedPer               51.38±0.94   28.25±1.03   21.53±0.50   32.33±0.31   12.69±0.42    8.67±0.40
FedRep               51.25±1.37   26.97±0.33   20.63±0.42   30.83±1.05   12.14±0.28    8.37±0.25
FedBN                54.35±0.63   36.94±0.94   33.67±0.12   33.34±0.71   19.61±0.35   16.57±0.44
FedRoD               60.17±0.48   39.88±1.18   36.80±0.56   41.06±0.77   25.63±1.11   22.32±1.13
pFedSD               54.14±0.77   41.06±0.83   38.27±0.20   39.31±0.19   19.25±1.80   15.91±0.33
pFedGate             48.54±0.39   27.47±0.79   22.98±0.03   37.59±0.39   24.09±0.67   19.69±0.14
FedCAC               57.22±1.52   38.64±0.63   32.59±0.32   40.19±1.20   23.70±0.28   18.58±0.62
pFedPT               43.21±1.66   35.23±0.87   36.25±0.37   23.55±0.68   22.35±0.49   21.69±0.24
FedPFT w/o L_Con     60.98±0.39   44.87±0.76   41.83±0.37   41.49±0.10   28.61±0.40   25.10±0.59
FedPFT               62.03±1.41   47.98±0.78   44.29±0.74   43.42±1.62   32.44±0.58   27.84±0.41

In this section, we compare our proposed FedPFT with two baseline methods and nine SOTA methods across three datasets and two non-IID scenarios. We also introduce 'FedPFT w/o $L_{\text{Con}}$', which solely addresses the mismatch problem without contrastive learning. The experimental results on CIFAR-100 and Tiny ImageNet in the Dirichlet non-IID scenario are presented in Table 2. Please refer to Appendix C for experimental results in the Pathological non-IID scenario and on the CIFAR-10 dataset.

Results in Dirichlet non-IID scenario.

In this setting, varying $\alpha$ lets us evaluate the performance of each method under different degrees of non-IID data. The results, detailed in Table 2, demonstrate that performance varies significantly depending on the underlying design principles of each method. Among all methods, FedRoD demonstrates robust performance across all datasets and non-IID degrees. This is attributed to its design of two classifiers: a personalized classifier for local feature alignment and a global classifier that draws on other clients to improve generalization. 'FedPFT w/o $L_{\text{Con}}$' addresses the mismatch issue specifically and achieves competitive or superior results across all scenarios. FedPFT further improves feature extractor quality and outperforms SOTA methods significantly across all scenarios, achieving up to a 7.08% improvement.

4.3 Ablation Study

In this section, we validate the effectiveness of each component of FedPFT on the CIFAR-100 dataset under two non-IID degrees. The experimental results are shown in Table 3.

Table 3: Ablation study on CIFAR-100 under two non-IID degrees. A "✓" indicates the component is enabled (component assignments follow the setting descriptions below).

Settings   p_κ   Alter.   L_Con   p_ρ   Accuracy, α=0.1   Accuracy, α=0.5
I           –      –        –      –     33.87±1.35        30.09±0.31
II          ✓      –        –      –     40.97±1.28        31.45±1.35
III         ✓      ✓        –      –     60.98±0.39        44.87±0.76
IV          ✓      ✓        ✓      –     61.13±0.50        47.67±1.42
V           ✓      ✓        ✓      ✓     62.03±1.41        47.98±0.78
VI          –      –        ✓      –     36.24±1.10        34.70±1.33
VII         ✓      –        ✓      –     53.17±0.58        38.90±0.91
VIII        ✓      –        ✓      ✓     53.76±0.35        39.29±1.00

Setting I represents FedAvg. Setting II incorporates classification prompts $p_{\kappa}$, allowing each client to adjust the global model individually to obtain a personalized model, which yields a performance improvement. Setting III incorporates alternating training, where prompts are first updated to align local features with the global classifier, followed by training the model parameters. This essentially aligns training objectives among clients, effectively mitigating the impact of non-IID data on model collaboration and further enhancing the quality of the global model.

Setting IV adds the contrastive learning loss to Setting III, focusing primarily on enhancing the feature extractor's performance through contrastive learning. Setting V incorporates dedicated prompts $p_{\rho}$ for the contrastive learning task. This reduces mutual interference between the two tasks during training and is especially effective when the non-IID degree is strong (e.g., $\alpha=0.1$).

Setting VI illustrates that adding contrastive learning alone brings very limited improvements. Settings VII and VIII partially achieve feature-classifier alignment by introducing $p_{\kappa}$, greatly enhancing model performance. However, without alternating training, local features cannot adapt well to the classifier, leading to a significant performance gap between Settings VII and VIII and Setting V.

This ablation study underlines the importance of each module in FedPFT. It confirms that aligning local features with the global classifier and enhancing the feature extractor’s quality are both crucial for optimizing model performance, aligning with the core motivations behind our methodology.

4.4 Separability of Features

In this section, we assess the effectiveness of FedPFT in enhancing the quality of the feature extractor by conducting linear probing experiments on CIFAR-10 and CIFAR-100. The results are shown in Table 4. Higher accuracy means that the extracted features have better linear separability.

Compared to FedAvg, features extracted by 'FedPFT w/o $L_{\text{Con}}$' demonstrate superior linear separability. This improvement is attributable to the alignment of local features with the global classifier during the feature learning phase, which synchronizes client training objectives and mitigates the adverse effects of non-IID data. FedPFT further improves the quality of the feature extractor by integrating a collaborative contrastive learning task. For more experimental results, please refer to Appendix D.

Table 4: Linear probe accuracy of different methods on CIFAR-10 and CIFAR-100.

                           CIFAR-10                      CIFAR-100
Methods             α=0.1    α=0.5    α=1.0      α=0.1    α=0.5    α=1.0
FedAvg             85.01%   72.52%   68.38%     59.50%   37.40%   32.33%
FedPFT w/o L_Con   85.52%   72.59%   69.57%     61.60%   43.14%   38.47%
FedPFT             87.83%   77.25%   74.02%     64.12%   46.43%   40.95%

4.5 Learned Features of Different Methods

In this subsection, we visually compare the quality of features extracted by different methods and highlight the impact of different modules in FedPFT on feature extraction. We conduct experiments on the CIFAR-10 dataset with 10 clients, each allocated 1000 training images and 500 testing images. The data distribution is shown in Figure 2(a). For each method, we visualize the feature vectors of testing data from different clients using t-SNE [33]. The visualization results are depicted in Figure 2(b)-(h), where colors represent different data classes and markers represent different clients, as detailed in Figure 2(a).
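A visualization of this kind can be produced with standard tools; the sketch below uses scikit-learn's t-SNE and matplotlib, with `extractor` and the per-client test loaders assumed to be given, and is only an illustration of the procedure.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_client_features(extractor, client_loaders, device="cpu"):
    """t-SNE of test-set features from several clients (illustrative sketch)."""
    feats, labels, client_ids = [], [], []
    for cid, loader in enumerate(client_loaders):
        for x, y in loader:
            feats.append(extractor(x.to(device)).cpu().numpy())
            labels.append(y.numpy())
            client_ids.append(np.full(len(y), cid))
    feats = np.concatenate(feats)
    labels = np.concatenate(labels)
    client_ids = np.concatenate(client_ids)

    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    markers = "o^sv*DPX+x"                      # one marker per client
    for cid in np.unique(client_ids):
        mask = client_ids == cid
        plt.scatter(emb[mask, 0], emb[mask, 1], c=labels[mask],
                    cmap="tab10", marker=markers[cid % len(markers)], s=8)
    plt.title("t-SNE of client test features")
    plt.show()
```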

[Figure 2: Client data distribution (a) and t-SNE visualizations of features learned by different methods (b)-(h).]

FedAvg and FedCAC exhibit noticeable cluster structures of features but lack strong discriminative boundaries. FedPer displays overlapping features across various classes, attributable to the use of personalized classifiers that create different local feature spaces for each client. Consequently, data from different classes across different clients are mapped to similar positions. This interference between clients reduces the quality of the global feature extractor.

‘FedPFT w/o $L_{\text{Con}}$’ shows clearer discriminative boundaries, which is attributed to the alignment of local features with the global classifier achieved during local training. We also observe that data from the same class across different clients are mapped to the same positions in the feature space, indicating that the global classifier provides a unified feature space for all clients. Adapting local features to this space essentially aligns the training objectives among clients in non-IID scenarios, promoting collaboration among clients. FedPFT further enhances feature separability by incorporating $L_{\text{Con}}$.

‘FedPFT w/o Alter’ denotes FedPFT without alternating training. While it shows better clustering than FedAvg, the discriminative quality of its boundaries is weaker than that of ‘FedPFT w/o $L_{\text{Con}}$’. This configuration shows increased interference among client models, lacking alignment to the common global feature space. ‘FedPFT_p_classifier’ denotes FedPFT with personalized classifiers. In this case, the feature space becomes highly scattered, similar to FedPer's issue. Since the prompts $p_{\kappa}$ are first trained to adapt to the personalized classifiers, this exacerbates the variability in feature spaces across clients.

4.6 Effect of Different Prompts

In this section, we delve into the role of prompts in FedPFT. We visualize the features transformed by different prompts using t-SNE. The experimental setup is consistent with Section 4.5. The results are depicted in Figure 3. Larger markers in the figures represent the feature centroids of the corresponding classes for each client.

[Figure 3: t-SNE visualizations of features transformed by the classification prompts $p_\kappa$ and the contrastive learning prompts $p_\rho$; larger markers denote class-wise feature centroids.]

It is evident that features obtained from the classification prompts $p_\kappa$ are not strongly correlated with image similarity but rather with the distribution of client data. For example, the feature centroids of the classes held by the same client lie close together. Conversely, features transformed by the contrastive learning prompts are more related to image similarity. For instance, in Figure 3(b), the feature centroids of ‘cat’ and ‘dog’ are closer, as are ‘truck’ and ‘automobile,’ which aligns with the principles of contrastive learning.

Table 5: Linear probe accuracy with different types of prompts.

Prompt Type | CIFAR-10 α=0.1 | CIFAR-10 α=0.5 | CIFAR-10 α=1.0 | CIFAR-100 α=0.1 | CIFAR-100 α=0.5 | CIFAR-100 α=1.0
None | 87.69% | 77.12% | 73.93% | 64.08% | 46.50% | 40.79%
$p_\kappa$ | 87.83% | 77.25% | 74.02% | 64.12% | 46.43% | 40.95%
$p_\rho$ | 87.82% | 77.25% | 74.02% | 64.18% | 46.40% | 40.95%

We also investigate whether different types of prompts influence feature separability. We conduct linear probe experiments using the CIFAR-10 and CIFAR-100 datasets. The results are detailed in Table 5. In these experiments, we compare three conditions: ‘None’ (no prompts used), ‘$p_\kappa$’ (using classification prompts), and ‘$p_\rho$’ (using contrastive learning prompts). Interestingly, the accuracies across different prompt conditions are generally similar, suggesting that the use of either type of prompt does not significantly impact the overall quality of the features extracted.

The above experiments demonstrate that prompts work by transforming features into the form required by each downstream task. This also indicates the scalability and adaptability of our feature transformation module: through task-specific prompts, it can incorporate various client-collaborative tasks that benefit the performance of personalized models.

5 Conclusion and Discussion

We observe that the feature extractor from FedAvg surpasses those in most PFL methods, yet it suffers from inadequate performance due to a mismatch between the local features and the classifier. This mismatch not only impacts performance during model inference but also affects the synergy between the feature extractor and the classifier during training. We propose a new PFL method called FedPFT with a prompt-driven feature transformation module to address these issues during training. Our experiments demonstrate that FedPFT not only resolves the mismatch issue but also significantly improves the quality of the feature extractor, achieving substantial performance gains compared to state-of-the-art methods. We discuss the limitations and our future work in Appendix J.

References

  • [1] D. A. E. Acar, Y. Zhao, R. Zhu, R. Matas, M. Mattina, P. Whatmough, and V. Saligrama. Debiasing model updates for improving personalized federated training. In International Conference on Machine Learning, pages 21–31. PMLR, 2021.
  • [2] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  • [3] D. Chen, L. Yao, D. Gao, B. Ding, and Y. Li. Efficient personalized federated learning via sparse model-adaptation. arXiv preprint arXiv:2305.02776, 2023.
  • [4] H.-Y. Chen and W.-L. Chao. On bridging generic and personalized federated learning for image classification. In International Conference on Learning Representations, 2022.
  • [5] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai. Exploiting shared representations for personalized federated learning. pages 2089–2099, 2021.
  • [6] W. Deng, C. Thrampoulidis, and X. Li. Unlocking the potential of prompt-tuning in bridging generalized and personalized federated learning. arXiv e-prints, pages arXiv–2310, 2023.
  • [7] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
  • [8] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [10] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang. Personalized cross-silo federated learning on non-IID data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • [11] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim. Visual prompt tuning. In European Conference on Computer Vision, 2022.
  • [12] H. Jin, D. Bai, D. Yao, Y. Dai, L. Gu, C. Yu, and L. Sun. Personalized edge intelligence via federated self-knowledge distillation. IEEE Transactions on Parallel and Distributed Systems, 34(2):567–580, 2022.
  • [13] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [14] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/kriz/cifar.html, 5, 2010.
  • [15] Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • [16] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • [17] G. Li, W. Wu, Y. Sun, L. Shen, B. Wu, and D. Tao. Visual prompt based personalized federated learning. Transactions on Machine Learning Research, 2023.
  • [18] H. Li, W. Huang, J. Wang, and Y. Shi. Global and local prompts cooperation via optimal transport for federated learning. arXiv preprint arXiv:2403.00041, 2024.
  • [19] T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
  • [20] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou. FedBN: Federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations, 2021.
  • [21] Z. Li, X. Shang, R. He, T. Lin, and C. Wu. No fear of classifier biases: Neural collapse inspired federated learning with synthetic and fixed classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5319–5329, 2023.
  • [22] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
  • [23] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [24] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
  • [25] J. Luo and S. Wu. Adapt to adaptation: Learning personalization for cross-silo federated learning. In L. De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2166–2173. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track.
  • [26] X. Ma, J. Zhang, S. Guo, and W. Xu. Layer-wised model aggregation for personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10092–10101, 2022.
  • [27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • [28] J. Mills, J. Hu, and G. Min. Multi-task federated learning for personalised deep neural networks in edge computing. IEEE Transactions on Parallel and Distributed Systems, 33(3):630–641, 2021.
  • [29] M. Shi, Y. Zhou, K. Wang, H. Zhang, S. Huang, Q. Ye, and J. Lv. Prior: Personalized prior for reactivating the information overlooked in federated learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 28378–28392. Curran Associates, Inc., 2023.
  • [30] S. Su, M. Yang, B. Li, and X. Xue. Federated adaptive prompt tuning for multi-domain collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15117–15125, 2024.
  • [31] B. Sun, H. Huo, Y. Yang, and B. Bai. PartialFed: Cross-domain personalized federated learning via partial initialization. Advances in Neural Information Processing Systems, 34:23309–23320, 2021.
  • [32] C. T. Dinh, N. Tran, and J. Nguyen. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
  • [33] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  • [34] L. Wang, K. Zhang, Y. Li, Y. Tian, and R. Tedrake. Does learning from decentralized non-IID unlabeled data benefit from self supervision? In The Eleventh International Conference on Learning Representations, 2023.
  • [35] X. Wu, X. Liu, J. Niu, G. Zhu, and S. Tang. Bold but cautious: Unlocking the potential of personalized federated learning through cautiously aggressive collaboration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19375–19384, October 2023.
  • [36] X. Wu, J. Niu, X. Liu, T. Ren, Z. Huang, and Z. Li. pFedGF: Enabling personalized federated learning via gradient fusion. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 639–649. IEEE, 2022.
  • [37] F.-E. Yang, C.-Y. Wang, and Y.-C. F. Wang. Efficient model personalization in federated learning via client-specific prompt generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19159–19168, 2023.
  • [38] J. Zhang, Y. Hua, H. Wang, T. Song, Z. Xue, R. Ma, J. Cao, and H. Guan. GPFL: Simultaneously learning global and personalized feature information for personalized federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5041–5051, October 2023.
  • [39] G. Zhu, X. Liu, S. Tang, and J. Niu. Aligning before aggregating: Enabling communication efficient cross-domain federated learning via consistent feature extraction. IEEE Transactions on Mobile Computing, 23(5):5880–5896, 2024.

Appendix A Related Work

Current PFL methods can primarily be categorized into several major types: meta-learning-based methods [7; 1], model-regularization-based methods [32; 19], fine-tuning-based methods [12; 3; 21], personalized-weight-aggregation-based methods [10; 25], and parameter-decoupling-based methods. This paper delves into the issues inherent in the global model of FedAvg and primarily discusses parameter-decoupling methods that rely on the global model.

In addition to the aforementioned methods, a new category based on prompts has recently emerged.

Prompt-based methods.

Recently, prompt technology has garnered widespread attention in the fields of computer vision [11; 23] and natural language processing [16; 24]. This technology involves using prompts as inputs to guide the behavior or output of models, typically for fine-tuning purposes. The domain of PFL has also seen the emergence of prompt-based approaches. Most of these are based on pre-trained models, aiming to train prompts to fine-tune the pre-trained models to fit client-local data, as seen in pFedPG [37], SGPT [6], FedOTP [18], and FedAPT [30]. pFedPT [17] trains both the model and prompts, using prompts at the input level to learn personalized knowledge for fine-tuning the global model to adapt to the client’s local distributions. Our FedPFT fundamentally differs from these methods in its training objective. Rather than fine-tuning, we introduce prompts to guide feature transformations to align with the global classifier, thereby addressing the mismatch issue inherent in the global model during the training process.

Appendix B Experiment Setup

B.1 Introduction to non-IID Scenarios

Pathological non-IID.

In this setting, each client is randomly assigned data from a subset of classes with equal data volume per class. For the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, we assign 2, 20, and 40 classes of data to each client, respectively.

[Figure 4: Client data distributions under the Dirichlet non-IID setting for varying values of α.]

Dirichlet non-IID.

This is a commonly used setting in current FL research [36; 35; 29]. In this scenario, the data for each client is generated from a Dirichlet distribution denoted as $Dir(\alpha)$. As the value of $\alpha$ increases, the class imbalance within each client's dataset progressively decreases. This Dirichlet non-IID setting enables the evaluation of different methods across a broad spectrum of non-IID conditions, reflecting various degrees of data heterogeneity.

For a clearer, more intuitive understanding, we involve 20 clients with 10-class and 50-class datasets to visualize the data distribution among clients with varying $\alpha$ values. As depicted in Figure 4, the horizontal axis labels the data class indices, while the vertical axis lists the client IDs. Each red dot indicates the class data assigned to a client, with larger dots signifying a higher volume of data in that class.
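For illustration, the sketch below shows one common way to generate such a Dirichlet partition; the function and variable names are ours and are not taken from the released code.

```python
# Sketch of Dirichlet non-IID partitioning: for each class, split its samples
# across clients according to proportions drawn from Dir(alpha).
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)  # split points for this class
        for cid, part in enumerate(np.split(idx, splits)):
            client_indices[cid].extend(part.tolist())
    return client_indices

# Example: smaller alpha yields a more skewed class distribution per client.
# parts = dirichlet_partition(np.array(cifar10_train.targets), num_clients=20, alpha=0.1)
```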

B.2 Introduction to Comparative Methods

FedAMP [10] is a weighted-aggregation-based method in which clients with similar data distributions are given higher aggregation weights during model aggregation. Because it mainly encourages collaboration among clients with similar data distributions, it is, by design, a method that focuses more on clients' local data distributions. FedPer [2], FedRep [5], FedBN [20], FedRoD [4], and FedCAC [35] are parameter-decoupling-based methods, which personalize the global model by retaining certain parameters locally based on FedAvg. FedRoD additionally introduces a balanced global classifier to obtain assistance from other clients, alleviating the overfitting issue caused by personalized classifiers alone. pFedSD [12] and pFedGate [3] are fine-tuning-based methods that adapt the global model to local data through fine-tuning. pFedSD directly fine-tunes the global model by distilling local models, while pFedGate trains an additional gating network and applies it to the global model. pFedPT [17], a prompt-based method, can also be viewed as a fine-tuning approach, enhancing the global model's adaptation to local data distributions by adding prompts to images.

B.3 Hyperparameter Settings in Different Methods

For the unique hyperparameters of each baseline method, we utilize the optimal parameter combinations reported in their respective papers. For learning rates, we adjust within {1e-1, 1e-2, 1e-3}.

In FedPFT, to simplify the hyperparameter tuning process and enhance the method's usability, we provide a default set of hyperparameters: for all scenarios, we set $(n_\kappa, n_\rho) = (10, 20)$. We use the SGD optimizer, with a learning rate of 0.01 for the feature transformation module and 0.1 for the other components. In the Dirichlet non-IID scenario with $\alpha = 0.1$, we set $(R_f, R_a) = (3, 2)$, while in other scenarios, we set $(R_f, R_a) = (4, 1)$. For the contrastive learning algorithm, we adopt the default settings from MoCo. In ‘FedPFT w/o $L_{\text{Con}}$,’ we set the learning rate of the feature transformation module to 0.05 while keeping other hyperparameters the same as FedPFT. Unless otherwise specified, our experiments use the above hyperparameter settings, although fine-tuning these parameters for different scenarios may yield better performance.
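For clarity, the sketch below shows one way to express these defaults and the per-module learning rates as PyTorch parameter groups; the module layout (`backbone`, `transform`, `classifier`) is a placeholder and not the actual FedPFT implementation.

```python
# Sketch: default FedPFT hyperparameters and per-module SGD learning rates
# (module structure is a placeholder, not the real architecture).
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "backbone":   nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten()),
    "transform":  nn.Linear(16, 16),     # stands in for the prompt-driven transformation module
    "classifier": nn.Linear(16, 10),
})

defaults = {"n_kappa": 10, "n_rho": 20, "R_f": 4, "R_a": 1}   # (R_f, R_a) = (3, 2) when alpha = 0.1

# Learning rate 0.01 for the transformation module, 0.1 for the other components.
optimizer = torch.optim.SGD([
    {"params": model["transform"].parameters(), "lr": 0.01},
    {"params": model["backbone"].parameters()},
    {"params": model["classifier"].parameters()},
], lr=0.1)
```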

B.4 Compute Resources

All the experiments are implemented using PyTorch and conducted on NVIDIA V100 GPUs. For the methods we compared, as well as ‘FedPFT w/o LConsubscript𝐿ConL_{\text{Con}}italic_L start_POSTSUBSCRIPT Con end_POSTSUBSCRIPT,’ a single training session requires 24-48 hours. For FedPFT, the training process takes longer due to the use of the MoCo algorithm, which requires data augmentation that can only be executed on the CPU. Consequently, a single training session for FedPFT requires 48-72 hours.

Appendix C Comparison with State-of-the-art Methods

We present the comparative results of FedPFT against established methods on CIFAR-10, CIFAR-100, and Tiny ImageNet under the Pathological non-IID scenario, as well as on CIFAR-10 under Dirichlet non-IID scenarios, in Tables 6 and 7.

Table 6: Comparison with state-of-the-art methods under the Pathological non-IID scenario (accuracy %).

Methods | CIFAR-10 | CIFAR-100 | Tiny ImageNet
FedAvg | 54.33 ± 3.03 | 34.27 ± 0.44 | 18.05 ± 0.23
Local | 85.85 ± 0.93 | 38.40 ± 0.69 | 16.20 ± 0.30
FedAMP | 88.88 ± 0.83 | 38.36 ± 0.79 | 16.13 ± 0.55
FedPer | 87.51 ± 0.95 | 41.54 ± 0.74 | 20.25 ± 0.65
FedRep | 87.10 ± 0.91 | 40.63 ± 0.74 | 19.24 ± 0.33
FedBN | 87.02 ± 1.41 | 47.75 ± 1.03 | 24.91 ± 0.48
FedRoD | 88.06 ± 1.70 | 52.55 ± 0.92 | 32.25 ± 0.80
pFedSD | 89.97 ± 1.45 | 52.30 ± 1.18 | 30.27 ± 0.78
pFedGate | 89.15 ± 0.76 | 43.73 ± 0.14 | 22.42 ± 0.83
FedCAC | 89.77 ± 1.14 | 49.07 ± 0.87 | 30.83 ± 0.42
pFedPT | 86.29 ± 1.11 | 39.92 ± 0.33 | 21.38 ± 0.98
FedPFT w/o $L_{\text{Con}}$ | 89.67 ± 1.96 | 57.62 ± 1.18 | 36.13 ± 1.32
FedPFT | 90.55 ± 1.35 | 58.14 ± 0.71 | 37.59 ± 0.39

Table 7: Comparison with state-of-the-art methods on CIFAR-10 under the Dirichlet non-IID scenario (accuracy %).

Methods | α=0.1 | α=0.5 | α=1.0
FedAvg | 60.39 ± 1.46 | 60.41 ± 1.36 | 60.91 ± 0.72
Local | 81.91 ± 3.09 | 60.15 ± 0.86 | 52.24 ± 0.41
FedAMP | 84.99 ± 1.82 | 68.26 ± 0.79 | 64.87 ± 0.95
FedPer | 84.43 ± 0.47 | 68.80 ± 0.49 | 64.92 ± 0.66
FedRep | 84.59 ± 1.58 | 67.69 ± 0.86 | 60.52 ± 0.72
FedBN | 83.55 ± 2.32 | 66.79 ± 1.08 | 62.20 ± 0.67
FedRoD | 86.23 ± 2.12 | 72.34 ± 1.77 | 68.45 ± 1.94
pFedSD | 86.34 ± 2.61 | 71.97 ± 2.07 | 67.21 ± 1.89
pFedGate | 87.25 ± 1.91 | 71.98 ± 1.61 | 67.85 ± 0.87
FedCAC | 86.82 ± 1.18 | 69.83 ± 0.46 | 65.39 ± 0.51
pFedPT | 82.38 ± 2.91 | 67.33 ± 1.33 | 64.37 ± 1.22
FedPFT w/o $L_{\text{Con}}$ | 87.23 ± 2.69 | 74.10 ± 1.95 | 69.23 ± 0.76
FedPFT | 88.60 ± 2.19 | 77.54 ± 1.88 | 74.81 ± 0.77

Results in Pathological non-IID scenario.

This is an extreme setting where each client has data from only a subset of classes. The effect is particularly pronounced on the CIFAR-10 dataset, where each client essentially performs a simple binary classification task. Here, clients can achieve decent performance by solely focusing on their local tasks (‘Local’), even without collaboration with other clients. As such, methods that prioritize local data distribution, such as FedAMP, pFedSD, and pFedGate, perform well. In contrast, on the CIFAR-100 and Tiny ImageNet datasets, as clients have more local classes with fewer samples per class, local tasks become more challenging and effective collaboration with other clients becomes crucial. Consequently, methods that emphasize client collaboration, such as FedRoD, show increasingly strong performance, while FedAMP and pFedGate show considerable performance degradation. FedPer, FedRep, FedBN, and FedCAC, by personalizing certain parameters of FedAvg, enhance local performance by indirectly aligning local features with classifiers to some extent. However, as they do not address the mismatch issue, they compromise the performance of the feature extractor to some extent, thereby limiting their performance to a moderate level across the three datasets. ‘FedPFT w/o $L_{\text{Con}}$’ aligns local features with the global feature space using classification prompts, enhancing both local feature-classifier alignment and inter-client collaboration effectiveness. It achieves competitive performance on CIFAR-10 and surpasses existing SOTA methods on CIFAR-100 and Tiny ImageNet. FedPFT further incorporates contrastive learning tasks to enhance feature extractor performance, outperforming SOTA methods significantly across all datasets.

Appendix D Feature Separability of Different Methods

Table 8: Linear probe accuracy of features extracted by different methods.

Methods | CIFAR-10 α=0.1 | CIFAR-10 α=0.5 | CIFAR-10 α=1.0 | CIFAR-100 α=0.1 | CIFAR-100 α=0.5 | CIFAR-100 α=1.0
FedAvg | 85.01% | 72.52% | 68.38% | 59.50% | 37.40% | 32.33%
FedPer | 84.44% | 71.07% | 66.51% | 52.09% | 26.61% | 20.51%
FedBN | 84.52% | 70.15% | 66.51% | 57.86% | 35.24% | 30.28%
FedCAC | 85.22% | 71.56% | 66.98% | 56.86% | 34.64% | 29.35%
FedRoD | 82.79% | 67.07% | 63.12% | 56.88% | 33.99% | 29.22%
pFedSD | 85.86% | 72.42% | 68.12% | 60.07% | 37.33% | 31.99%
FedPFT w/o $L_{\text{Con}}$ | 85.52% | 72.59% | 69.57% | 61.60% | 43.14% | 38.47%
FedPFT | 87.83% | 77.25% | 74.02% | 64.12% | 46.43% | 40.95%

In this section, we delve deeper into the linear separability of features extracted by various PFL methods. Linear separability is a critical measure of feature quality, indicating the ability of a model to distinguish between classes using simple linear classifiers. We conduct linear probing experiments on the CIFAR-10 and CIFAR-100 datasets to assess this metric, with results detailed in Table 8.
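Concretely, a linear probe freezes the feature extractor and fits a linear classifier on the extracted features; the sketch below uses a logistic-regression probe as one common realization, and the loader and extractor names are illustrative assumptions rather than our exact evaluation script.

```python
# Sketch of a linear probe: freeze the feature extractor, fit a linear classifier
# on training features, and report accuracy on test features.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract(feature_extractor, loader, device="cpu"):
    feature_extractor.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(feature_extractor(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe_accuracy(feature_extractor, train_loader, test_loader):
    x_tr, y_tr = extract(feature_extractor, train_loader)
    x_te, y_te = extract(feature_extractor, test_loader)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return probe.score(x_te, y_te)
```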

It can be observed that the feature linear separability of most PFL methods is inferior to FedAvg. This indicates that although they partially alleviate the mismatch issue and achieve better model performance, the quality of the feature extractor is inevitably compromised due to their design, constraining the full potential of PFL.

In stark contrast, FedPFT significantly improves the linear separability of features compared to FedAvg. Our method accomplishes this by fundamentally addressing the mismatch issue during the training process rather than merely adapting the model post hoc. This proactive approach ensures that the feature extractor not only aligns more closely with the global classifier but also preserves its ability to generalize across diverse data distributions. Consequently, FedPFT enhances both the performance and the utility of the feature extractor.

Appendix E Comparison with Two-stage Approach

In FedPFT, we propose using a feature transformation module to coordinate the joint training of contrastive learning and classification tasks. To illustrate the superiority of this design, we introduce a baseline called ‘Two-stage,’ similar to [34], where contrastive learning training is conducted first, followed by classification task training after convergence. For fairness, in the Two-stage method, we first perform 1000 rounds of contrastive learning training, followed by 1000 rounds of classification task training. The experimental results are depicted in Figure 5.

[Figure 5(a): $L_{\text{Con}}$ training curves.]

(b) Accuracy (%):
Methods | α=0.1 | α=0.5
Two-stage | 53.43 | 43.87
Ours | 62.03 | 47.98

[Figure 5(c): $L_{\text{CE}}$ training curves.]

Firstly, from the perspective of the contrastive learning loss ($L_{\text{Con}}$), FedPFT registers lower loss values compared to the Two-stage approach, suggesting that simultaneous training with the classification task enhances the efficacy of contrastive learning. Secondly, considering both Figure 5(b) and Figure 5(c), our method exhibits significantly higher accuracy compared to the Two-stage approach even though $L_{\text{CE}}$ converges to a higher training loss value, suggesting that in our design the contrastive learning task alleviates overfitting of the classification task during training. These experiments demonstrate that our proposed approach can effectively coordinate both tasks, allowing them to assist each other. Importantly, they also indicate that the significant performance improvement brought by contrastive learning in our method is largely attributed to the design of our feature transformation module and training approach.

Appendix F Attention Weight Visualization

In the feature transformation module of FedPFT, self-attention mechanisms are employed to integrate prompts with sample features. This section visualizes the attention weights to reveal how prompts influence the transformation process. We analyze 20 test samples from a single client on the CIFAR-10 dataset, with results depicted in Figure 6. Each row in the figure corresponds to the attention weights for the output feature $f'$ of a single sample. Columns represent the inputs of the transformation module: the first column corresponds to the original input feature $f$, while subsequent columns correspond to the different prompts in $p_{\kappa,i}$ or $p_{\rho,i}$.
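As a rough illustration of how such attention weights can be read out, the sketch below uses a single self-attention layer over the token sequence $[f, p_1, \dots, p_n]$; the dimensions and layer choices are assumptions for illustration, not the exact FedPFT architecture.

```python
# Sketch of a prompt-driven feature transformation step with one self-attention layer,
# returning the attention weights of the output position of f over [f, p_1, ..., p_n].
import torch
import torch.nn as nn

class PromptFeatureTransform(nn.Module):
    def __init__(self, feat_dim=512, num_prompts=10, num_heads=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, feat_dim))  # stands in for p_{kappa,i} or p_{rho,i}
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, f):                                   # f: (batch, feat_dim)
        batch = f.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat([f.unsqueeze(1), prompts], dim=1)   # [f, p_1, ..., p_n]
        out, weights = self.attn(tokens, tokens, tokens, need_weights=True)
        f_prime = out[:, 0]                                 # transformed feature f'
        attn_to_inputs = weights[:, 0]                      # attention of f' over [f, p_1, ..., p_n]
        return f_prime, attn_to_inputs

# Rows of `attn_to_inputs` for a batch of samples can then be plotted as in Figure 6.
```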

[Figure 6: Attention weights between the output feature $f'$ and the module inputs ($f$ and the prompts) for 20 test samples under different values of $\alpha$.]

It can be observed that when $\alpha = 0.1$, indicating severe local class imbalance, each client has data from only a few classes. In this case, the feature transformation task is relatively simple, and the influence of different prompts on a sample is similar. As $\alpha$ increases, indicating more complex local tasks, the influence of prompts becomes more intricate. Particularly at $\alpha = 1.0$, each sample is affected differently by different prompts. This also indicates that our approach performs sample-level feature transformation.

Appendix G Partial Client Participation

In FL, challenges such as offline clients and unstable communication may result in only a subset of clients participating in training each round, posing a challenge to the robustness of FL algorithms. In this section, we investigate whether FedPFT is robust to this issue. We conduct experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet, considering scenarios where only a random 50%, 70%, or 90% of clients participate in training each round. The experimental results are presented in Table 9.

Table 9: Accuracy (%) of FedPFT with partial client participation; numbers in parentheses are changes relative to full participation.

Datasets | 100% | 90% | 70% | 50%
CIFAR-10 | 88.60 ± 2.19 | 88.50 ± 2.01 (-0.10) | 88.57 ± 2.53 (-0.03) | 88.69 ± 1.83 (+0.09)
CIFAR-100 | 62.03 ± 1.41 | 61.65 ± 0.41 (-0.38) | 63.54 ± 0.55 (+1.51) | 63.97 ± 0.10 (+1.94)
Tiny ImageNet | 43.42 ± 1.62 | 43.03 ± 2.09 (-0.39) | 44.59 ± 1.19 (+1.17) | 45.81 ± 1.02 (+2.39)

It can be observed that compared to scenarios where all clients participate in training, FedPFT’s accuracy is not significantly reduced when only a subset of clients participate. Furthermore, in CIFAR-100 and Tiny ImageNet, the performance of FedPFT may even be improved. This is because reducing the number of participating clients each round may mitigate the impact of non-IID data distribution on the global model. These experiments demonstrate the robustness of FedPFT to scenarios where only a subset of clients participate.
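Partial participation in these experiments corresponds to sampling a random subset of clients each round; a minimal sketch with assumed names follows.

```python
# Sketch: simulating partial client participation by sampling a random subset
# of clients each communication round (names are illustrative).
import random

def sample_clients(num_clients, participation_rate, seed=None):
    rng = random.Random(seed)
    k = max(1, int(num_clients * participation_rate))
    return rng.sample(range(num_clients), k)

# Example: with 40 clients and 70% participation, 28 clients train each round.
# selected = sample_clients(40, 0.7)
```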

Appendix H Effect of Hyperparameters

In the previous experiments, we utilize the default hyperparameter combination. In this section, we verify how variations in these hyperparameters influence the performance of FedPFT.

H.1 Effect of $n_\kappa$ and $n_\rho$

[Figures 7 and 8: Effect of $n_\kappa$ and $n_\rho$ on the accuracy of FedPFT on the CIFAR-10 and CIFAR-100 datasets.]

$n_\kappa$ and $n_\rho$ respectively denote the number of prompts in $p_{\kappa,i}$ and $p_{\rho,i}$ for each client. We examine the impact of these two hyperparameters on the performance of FedPFT on the CIFAR-10 and CIFAR-100 datasets. When assessing the effect of $n_\kappa$, we hold $n_\rho = 20$ constant. Similarly, when evaluating the impact of $n_\rho$, $n_\kappa$ is fixed at 10. The experimental results are depicted in Figure 7 and Figure 8.

FedPFT shows considerable robustness to variations in these hyperparameters. On the CIFAR-10 dataset, changes in nκsubscript𝑛𝜅n_{\kappa}italic_n start_POSTSUBSCRIPT italic_κ end_POSTSUBSCRIPT and nρsubscript𝑛𝜌n_{\rho}italic_n start_POSTSUBSCRIPT italic_ρ end_POSTSUBSCRIPT have minimal impact on performance, suggesting that the model can effectively handle simpler data distributions even with fewer prompts. In contrast, on the more complex CIFAR-100 dataset, performance is initially limited by a smaller number of prompts, which may not sufficiently cover the diverse feature space required for effective feature transformation. As the number of prompts increases, the model’s ability to transform and adapt features improves, leading to enhanced performance.

H.2 Effect of $R_f$ and $R_a$

$R_f$ and $R_a$ control the number of training epochs in the two training stages. Since we set $R_f + R_a = R$, in this experiment we only adjust $R_f$ to examine the impact of these two hyperparameters on model performance. The experimental results are illustrated in Figure 9.

[Figure 9: Effect of $R_f$ (with $R_a = R - R_f$) on the accuracy of FedPFT.]

When $R_f = 0$, contrastive learning is not used to train the feature extractor, and local features are not aligned with the global classifier before training the model parameters. It can be observed that model performance is very poor under this condition. As $R_f$ gradually increases, model performance first increases and then decreases. This suggests that $R_f$ essentially balances the trade-off between the two training stages. When $R_f$ is small, the feature extractor is predominantly trained by $L_{\text{CE}}$, and the classifier undergoes more training epochs. At this point, the model pays more attention to the local data distribution of clients, but collaboration among clients is also more susceptible to non-IID effects. Conversely, when $R_f$ is large, the feature extractor is primarily trained by $L_{\text{Con}}$, focusing more on general features, and collaboration among clients is less affected by non-IID issues. However, because the model is rarely trained with $L_{\text{CE}}$, it pays less attention to the local data distribution of clients, resulting in poorer performance on local data.
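To make the schedule concrete, a minimal sketch of how $R_f$ and $R_a$ split the local epochs is given below; the two stage-specific update routines are placeholders standing in for the losses and parameter groups defined in the main paper.

```python
# Hedged sketch of the alternating local training schedule controlled by R_f and R_a.
# Stage 1 (R_f epochs): align local features with the global classifier via prompts
# and train the feature extractor with the contrastive loss.
# Stage 2 (R_a epochs): train the model parameters with the classification loss.
def local_update(loader, R_f, R_a, train_alignment_and_contrastive, train_model_with_ce):
    for _ in range(R_f):              # stage 1
        for batch in loader:
            train_alignment_and_contrastive(batch)
    for _ in range(R_a):              # stage 2
        for batch in loader:
            train_model_with_ce(batch)
```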

In general, $R_f$ and $R_a$ are two hyperparameters that need to be adjusted carefully, as they have a significant impact on the performance of FedPFT. Typically, in scenarios where the local tasks of clients are simple, it may be appropriate to decrease the value of $R_f$. In other cases, we recommend using a larger value of $R_f$ to enhance the degree of collaboration among clients.

Appendix I Communication Cost

In this section, we calculate the communication overhead of one client in FedAvg and FedPFT in each communication round.

Table 10: Per-client communication volume (number of parameters) transmitted each round.

Model | $\phi_i$ | $\tau_i$ | $h_{\kappa,i}$ | $h_{\rho,i}$ | FedAvg | FedPFT | Increase Ratio
ResNet-8 | 1.24M | 0.26M | 25.70K | 32.90K | 1.27M | 1.56M | 23.14%
ResNet-10 | 4.91M | 1.05M | 51.30K | 65.66K | 4.96M | 6.08M | 22.49%

In FedAvg, each communication round involves uploading the feature extractor $\phi_i$ and the classifier $h_{\kappa,i}$. FedPFT adds the feature transformation module $\tau_i$ and the feature projection layer $h_{\rho,i}$, thereby increasing the volume of parameters transmitted per round. According to the results presented in Table 10, the communication overhead of FedPFT using the ResNet-8 and ResNet-10 architectures is increased by 23.14% and 22.49%, respectively, relative to FedAvg.
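These figures can be reproduced by counting the parameters of the transmitted modules; the sketch below assumes each component is exposed as a separate module object and is illustrative rather than the exact accounting script.

```python
# Sketch: per-round upload cost as a parameter count (module names are placeholders).
def num_params(module):
    return sum(p.numel() for p in module.parameters())

def communication_cost(phi, tau, h_kappa, h_rho):
    fedavg = num_params(phi) + num_params(h_kappa)               # FedAvg: extractor + classifier
    fedpft = fedavg + num_params(tau) + num_params(h_rho)        # FedPFT additionally sends tau and h_rho
    increase_pct = (fedpft - fedavg) / fedavg * 100.0            # relative increase in %
    return fedavg, fedpft, increase_pct
```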

While FedPFT brings additional communication cost, it is important to weigh this cost against the performance enhancements and scalability offered by $\tau_i$, as discussed in earlier sections of this paper. The improved model accuracy and robustness to non-IID data might justify the additional costs in scenarios where model performance is critical.

Moving forward, considering that the increase in communication cost is primarily due to the additional component $\tau_i$, we aim to develop a more efficient and lightweight feature transformation module in future work to reduce communication demands without compromising model effectiveness.

Appendix J Limitations and Future Work

In this paper, we primarily investigate PFL methods that derive personalized models based on a global model. We analyze the essential reasons these methods enhance performance from the perspective of mismatches between local features and classifiers. Although such methods occupy the mainstream in the current PFL field, it is necessary to admit that there are some PFL methods that are not based on global models, such as personalized-weight-aggregation-based methods, which are not explored in this study. Additionally, while this paper observes that personalizing a subset of parameters degrades the quality of the feature extractor, the underlying reasons for this phenomenon require further investigation.

Appendix K Theoretical Analysis

Since the main problem in Eq.(1) is non-convex, we focus on the factors affecting convergence in the non-convex setting.

Notation used in the theoretical analysis:

Implication | Notation
Global / local loss | $L$ / $L_i$
Global / local problem | $F$ / $F_i$
Local dataset on the $i$-th client | $\tilde{d}_i \in d_i$
Feature extractor | $\phi$
Feature transformation module | $\tau$
Classification / contrastive learning prompts | $p_\kappa$ / $p_\rho$
Feature extractor, feature transformation module, and classifier | $w$
Classification / contrastive learning task head | $h_\kappa$ / $h_\rho$
Global / local problem's gradient | $\nabla F(w)$ / $\nabla F_i(w)$
Local gradient approximation | $g_{i,r}^t$
Number of clients | $N$
Number of local update epochs | $R$
Number of clients sampled at each global epoch | $S$
Set of clients sampled at global epoch $t$ | $\mathcal{S}_t$
Actual learning rate of the global problem | $\tilde{\eta}$
Learning rate of the local problem | $\eta$
Upper bound on the approximated local gradient error | $\delta$
Upper bound on the local-global gradient error | $\sigma_F$
Indices of client, local epoch, and global epoch | $i \in [N]$, $r \in [R]$, $t \in [T]$

K.1 Problem Setup

We analyze the non-convex case as follows. By Lagrange duality, the main problem is transformed as follows:

$$\min_{\phi,\tau,h_\kappa} \; \min_{\{p_{\kappa,i}\}_{i\in[N]}} \; \mathbf{E}_i \mathbf{E}_{d_i} L_{\text{CE}}(\phi,\tau,h_\kappa,p_{\kappa,i}; d_i) \quad \text{s.t.} \;\; \mathbf{E}_i \mathbf{E}_{d_i} L_{\text{Con}}(\phi,\tau; d_i) \le H_{\text{Con}}$$

We further transform it into an unconstrained bi-level optimization problem:

$$\min_w \mathbf{E}\, F(w) = \mathbf{E}_i \Big\{ F_i(w) := \min_{p_{\kappa,i}} \mathbf{E}_{d_i} L_{\text{CE}}(w, p_{\kappa,i}; d_i) \Big\}$$

where $\mathbf{E}$ denotes the expectation over all random variables, $\mathbf{E}_i$ the expectation over client sampling, and $\mathbf{E}_{d_i}$ the expectation over local data sampling. We write $w = \{\phi, \tau, h_\kappa\}$ for simplicity, based on the equivalence of block coordinate descent and gradient descent.

K.2 Propositions

Proposition K.1 ($L$-smooth).

If $f$ is $L$-smooth, then for all $x, y$ we have:

$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \le L \|x - y\|^2$$
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$$
$$\|\nabla f(x) - \nabla f(y)\|^2 \le 2L [f(x) - f(y)]$$
$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{L}{2} \|y - x\|^2$$
Proposition K.2 (Jensen’s inequality).

If $f$ is convex, we have the following inequality:

$$\mathbf{E}_X f(X) \ge f(\mathbf{E}_X X).$$

A variant of the general form above, given a group $\{x_i\}_{i\in[N]}$:

$$\Big\| \sum_{i\in[N]} x_i \Big\|^2 \le N \sum_{i\in[N]} \|x_i\|^2.$$
Proposition K.3 (Triangle inequality).

The triangle inequality, where $\|\cdot\|$ is a norm and $A$, $B$ are elements of the corresponding normed space:

$$\|A + B\| \le \|A\| + \|B\|$$
Proposition K.4 (Matrix norm compatibility).

The matrix norm compatibility, for $A \in \mathbf{R}^{a \times b}$, $B \in \mathbf{R}^{b \times c}$, $v \in \mathbf{R}^{b}$:

$$\|AB\|_m \le \|A\|_m \|B\|_m$$
$$\|Av\|_m \le \|A\|_m \|v\|$$
Proposition K.5 (Peter-Paul inequality).

For all $x, y$ and all $\epsilon > 0$, we have the following inequality:

$$2\langle x, y \rangle \le \frac{1}{\epsilon} \|x\|^2 + \epsilon \|y\|^2$$

K.3 Assumptions

Assumption K.1 (L-smooth local objectives).

$\forall i$, $F_i$ is $L_F$-smooth; the main proposition is shown in Prop. K.1. Notice that $F_i$ is assumed to be L-smooth and non-convex, which matches the problem and neural network architecture setting in the main paper.

Assumption K.2 (Bounded local variance).

The local problem’s gradient is assumed not to be too far from the global problem’s gradient.

$$\forall w, \quad \mathbf{E}_i \|\nabla F_i(w) - \nabla F(w)\| \le \sigma_F$$
Assumption K.3 (Bounded approximated gradient).

The first-order approximation of the local problem's gradient, $g_{i,r}^t$, should not be too far from the ground truth $\nabla F_i(w_{i,r}^t)$. Under this assumption, the approximation error of the block coordinate descent in Algorithm 1 is bounded.

$$\forall (i, r, t), \quad \|g_{i,r}^t - \nabla F_i(w_{i,r}^t)\| \le \delta$$

K.4 Lemmas

Lemma K.1 (Bounded local approximation error).

If $\tilde{\eta} := \eta R \le \frac{1}{2L_F}$, we have the following bound on the client drift error:

$$\frac{1}{NR} \sum_{i,r}^{N,R} \mathbf{E} \|g_{i,r}^{(t)} - \nabla F_i(w^{(t)})\|^2 \le 2\delta^2 + 2^{R+3} L_F \Big[ 3\tilde{\eta}^2 \sum_i^N \mathbf{E} \|\nabla F_i(w^{(t)})\|^2 + \frac{2\tilde{\eta}^2 \delta^2}{R} \Big]$$
Proof.

The client drift error on a given $i$-th client and its upper bound are as follows:

$$\begin{aligned}
\mathbf{E} \|g_{i,r}^{(t)} - \nabla F_i(w^{(t)})\|^2
&\le 2\mathbf{E} \|g_{i,r}^{(t)} - \nabla F_i(w_{i,r}^{(t)})\|^2 + 2\mathbf{E} \|\nabla F_i(w^{(t)}) - \nabla F_i(w_{i,r}^{(t)})\|^2 \\
&\le 2\delta^2 + 2L_F \mathbf{E} \|w_{i,r}^{(t)} - w^{(t)}\|^2
\end{aligned} \tag{7}$$

where the first inequality is by Proposition K.3 and the second one is by Assumption K.1.

For the last term in the upper bound, we have the iterative formulation as follows:

\begin{align*}
\mathbf{E}\|w_{i,r}^{(t)}-w^{(t)}\|^{2}
={}&\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}-\eta g_{i,r-1}^{(t)}\|^{2}\\
\leq{}&2\,\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}-\eta\nabla F_{i}(w^{(t)})\|^{2}+2\eta^{2}\,\mathbf{E}\|g_{i,r-1}^{(t)}-\nabla F_{i}(w^{(t)})\|^{2}\\
\leq{}&2\Big(1+\frac{1}{2R}\Big)\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}\|^{2}+2(1+2R)\eta^{2}\,\mathbf{E}\|\nabla F_{i}(w^{(t)})\|^{2}\\
&+4\eta^{2}\big[\delta^{2}+L_{F}^{2}\,\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}\|^{2}\big]\\
={}&2\Big(1+\frac{1}{2R}+2\eta^{2}L_{F}^{2}\Big)\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}\|^{2}+4\eta^{2}\delta^{2}\\
&+2(1+2R)\eta^{2}\,\mathbf{E}\|\nabla F_{i}(w^{(t)})\|^{2}
\end{align*}

where the two inequalities are by Proposition K.3, Proposition K.5, and Eq. (7).

Taking $\tilde{\eta}:=\eta R\leq\frac{1}{2L_{F}}$, we recursively unroll the inequality as follows:

\begin{align*}
\mathbf{E}\|w_{i,r}^{(t)}-w^{(t)}\|^{2}
\leq{}&2\Big(1+\frac{1}{R}\Big)\mathbf{E}\|w_{i,r-1}^{(t)}-w^{(t)}\|^{2}+4\eta^{2}\delta^{2}+2(1+2R)\eta^{2}\,\mathbf{E}\|\nabla F_{i}(w^{(t)})\|^{2}\\
\leq{}&\Big[3\tilde{\eta}^{2}\,\mathbf{E}\|\nabla F_{i}(w^{(t)})\|^{2}+\frac{2\tilde{\eta}^{2}\delta^{2}}{R}\Big]2^{R+2}
\end{align*}

where the inequality is unrolled recursively and we use $\frac{1}{R}\leq 1$. Thus, we have:

\[
\mathbf{E}\|g_{i,r}^{(t)}-\nabla F_{i}(w^{(t)})\|^{2}\leq 2\delta^{2}+2^{R+4}\tilde{\eta}^{2}L_{F}\Big[3\sigma_{F}^{2}+3\,\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+\frac{\delta^{2}}{R}\Big]
\]
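As a quick numerical sanity check of the unrolled recursion above, the following Python sketch simulates the worst-case scalar recursion for the client drift and compares it against the claimed $2^{R+2}$ bound. The constants (`L_F`, `delta`, `G`, `R`) are illustrative placeholders chosen for the sketch, not values used in the paper.

```python
def drift_recursion_check(L_F=1.0, delta=0.5, G=2.0, R=8):
    """Simulate the scalar recursion behind the client-drift lemma and
    compare it with the claimed 2^{R+2} bound (illustrative constants)."""
    eta = 1.0 / (2.0 * L_F * R)          # so that eta * R = 1 / (2 L_F)
    eta_tilde = eta * R
    a = 0.0                               # a_r ~ E||w_{i,r} - w||^2, with a_0 = 0
    for _ in range(R):
        a = 2.0 * (1.0 + 1.0 / (2.0 * R) + 2.0 * eta**2 * L_F**2) * a \
            + 4.0 * eta**2 * delta**2 \
            + 2.0 * (1.0 + 2.0 * R) * eta**2 * G**2
    bound = (3.0 * eta_tilde**2 * G**2
             + 2.0 * eta_tilde**2 * delta**2 / R) * 2**(R + 2)
    return a, bound

drift, bound = drift_recursion_check()
print(f"simulated drift: {drift:.4f}  <=  claimed bound: {bound:.4f}: {drift <= bound}")
```

With these placeholder constants the simulated drift stays well below the claimed bound, reflecting the looseness introduced by the geometric unrolling.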

K.5 Theorem and Discussion

Theorem K.2 (Non-convex and smooth convergence of FedPFT).

Let Assumption K.1, Assumption K.2 and Assumption K.3 hold. If $\tilde{\eta}:=\eta R\leq\min\{\frac{1}{2L_{F}},\hat{\eta}\}$ is taken, where $\hat{\eta}:=\frac{N/S-1}{24(N-1)2^{R}}\sigma_{F}^{2}-1$, we have the following bound:

\[
\mathcal{O}(\mathbf{E}\|\nabla F(w^{(\bar{t})})\|^{2}):=\mathcal{O}\Big(\frac{\Delta_{F}}{\hat{\eta}T}+\frac{2^{R/3}L_{F}^{1/3}(R\sigma_{F}^{2}+\delta^{2})^{1/3}\Delta_{F}^{2/3}}{T^{2/3}R^{1/3}}+\frac{\sigma_{F}\sqrt{L_{F}(N/S-1)\Delta_{F}}}{\sqrt{TN}}+\delta^{2}\Big)
\]
Proof.
\begin{align*}
&\mathbf{E}F(w^{(t+1)})-\mathbf{E}F(w^{(t)})\\
&\leq \mathbf{E}\langle\nabla F(w^{(t)}),\,w^{(t+1)}-w^{(t)}\rangle+\frac{L_{F}}{2}\mathbf{E}\|w^{(t+1)}-w^{(t)}\|^{2}\\
&= -\tilde{\eta}\,\mathbf{E}\langle\nabla F(w^{(t)}),\,g^{(t)}\rangle+\frac{\tilde{\eta}^{2}L_{F}}{2}\mathbf{E}\|g^{(t)}\|^{2}\\
&= -\tilde{\eta}\,\mathbf{E}\|\nabla F(w^{(t)})\|^{2}-\tilde{\eta}\,\mathbf{E}\langle\nabla F(w^{(t)}),\,g^{(t)}-\nabla F(w^{(t)})\rangle+\frac{\tilde{\eta}^{2}L_{F}}{2}\mathbf{E}\|g^{(t)}\|^{2}\\
&\leq -\frac{\tilde{\eta}}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+\frac{\tilde{\eta}}{2}\mathbf{E}\Big\|\frac{1}{NR}\sum_{i,r}^{N,R}g_{i,r}^{(t)}-\nabla F_{i}(w^{(t)})\Big\|^{2}+\frac{\tilde{\eta}^{2}L_{F}}{2}\mathbf{E}\|g^{(t)}\|^{2}\\
&\leq -\frac{\tilde{\eta}}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+\frac{\tilde{\eta}}{2}\mathbf{E}\Big\|\frac{1}{NR}\sum_{i,r}^{N,R}g_{i,r}^{(t)}-\nabla F_{i}(w^{(t)})\Big\|^{2}\\
&\qquad+\frac{3\tilde{\eta}^{2}L_{F}}{2}\mathbf{E}\Big[\|g^{(t)}-\nabla F_{i}(w^{(t)})\|^{2}+\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}^{(t)}}\nabla F_{i}(w^{(t)})-\nabla F(w^{(t)})\Big\|^{2}+\|\nabla F(w^{(t)})\|^{2}\Big]\\
&= -\frac{\tilde{\eta}(1-3\tilde{\eta}L_{F})}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+\frac{\tilde{\eta}(1+3\tilde{\eta}L_{F})}{2}\cdot\frac{1}{NR}\sum_{i,r}^{N,R}\mathbf{E}\|g_{i,r}^{(t)}-\nabla F_{i}(w^{(t)})\|^{2}\\
&\qquad+\frac{3\tilde{\eta}^{2}L_{F}}{2}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}^{(t)}}\nabla F_{i}(w^{(t)})-\nabla F(w^{(t)})\Big\|^{2}\\
&\leq -\frac{\tilde{\eta}(1-3\tilde{\eta}L_{F})}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+3\tilde{\eta}^{2}L_{F}\,\frac{N/S-1}{N-1}\big[\sigma_{F}^{2}+\|\nabla F(w^{(t)})\|^{2}\big]\\
&\qquad+\tilde{\eta}(1+3\tilde{\eta}L_{F})\Big[\delta^{2}+2^{R+3}\tilde{\eta}^{2}L_{F}\Big(3\sigma_{F}^{2}+3\,\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+\frac{\delta^{2}}{R}\Big)\Big]
\end{align*}

where the four inequalities follow, respectively, from the $L_{F}$-smoothness of $F:=\mathbf{E}_{i}F_{i}$, Proposition K.5, Lemma K.1, and the classic Lemma 4 in [29].

Let $c_{1}:=3\delta^{2}$, $c_{2}:=3L_{F}\sigma_{F}^{2}\frac{N/S-1}{N-1}$, and $c_{3}:=2^{R+3}L_{F}\big[3\sigma_{F}^{2}+\frac{\delta^{2}}{R}\big]$. Then,

\begin{align*}
\mathbf{E}F(w^{(t+1)})-\mathbf{E}F(w^{(t)})
\leq{}&-\frac{\tilde{\eta}}{2}\Big\{1-\Big[\frac{3}{2}-3\,\frac{N/S-1}{N-1}\sigma_{F}^{2}+72\times 2^{R}\tilde{\eta}\Big]\Big\}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}\\
&+c_{3}\tilde{\eta}^{3}+c_{2}\tilde{\eta}^{2}+c_{1}\tilde{\eta}\\
\leq{}&-\frac{\tilde{\eta}}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}+c_{3}\tilde{\eta}^{3}+c_{2}\tilde{\eta}^{2}+c_{1}\tilde{\eta}
\end{align*}

where we let $\tilde{\eta}\leq\min\{\frac{1}{2L_{F}},\hat{\eta}\}$, with $\hat{\eta}:=\frac{2}{3\times 2^{R+4}}\frac{N/S-1}{N-1}\sigma_{F}^{2}-1$. Re-arranging the inequality above and accumulating over $t$, we have:

\begin{align*}
\frac{1}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}&\leq \frac{\mathbf{E}F(w^{(t)})-\mathbf{E}F(w^{(t+1)})}{\tilde{\eta}}+c_{3}\tilde{\eta}^{2}+c_{2}\tilde{\eta}+c_{1}\\
\frac{1}{2T}\sum_{t=0}^{T-1}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}&\leq \frac{\mathbf{E}F(w^{(0)})-\mathbf{E}F(w^{(T)})}{\tilde{\eta}T}+c_{3}\tilde{\eta}^{2}+c_{2}\tilde{\eta}+c_{1}
\end{align*}

Let $\Delta_{F}=F(w^{0})-F(w^{*})$, where $w^{*}$ is the minimizer of the main problem $\arg\min_{w}\mathbf{E}F(w)$. To determine the exact form of the bound, we consider the following cases (a step-size selection sketch is given after the case list):

  • If $\frac{\Delta_{F}}{c_{3}T}\leq\tilde{\eta}^{3}$ or $\frac{\Delta_{F}}{c_{2}T}\leq\tilde{\eta}^{2}$, let $\tilde{\eta}=\min\{(\frac{\Delta_{F}}{c_{3}T})^{1/3},(\frac{\Delta_{F}}{c_{2}T})^{1/2}\}$; then we have:
\[
\frac{1}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}\leq\frac{c_{3}^{1/3}\Delta_{F}^{2/3}}{T^{2/3}}+\Big(\frac{c_{2}\Delta_{F}}{T}\Big)^{1/2}+c_{1}
\]
  • If $\frac{\Delta_{F}}{c_{3}T}\geq\tilde{\eta}^{3}$ and $\frac{\Delta_{F}}{c_{2}T}\geq\tilde{\eta}^{2}$, let $\tilde{\eta}=\hat{\eta}$; then we have:
\[
\frac{1}{2}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}\leq\frac{\Delta_{F}}{\hat{\eta}T}+\frac{c_{3}^{1/3}\Delta_{F}^{2/3}}{T^{2/3}}+\Big(\frac{c_{2}\Delta_{F}}{T}\Big)^{1/2}+c_{1}
\]
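The sketch below makes the case split concrete: it selects the effective step size $\tilde{\eta}$ as the minimum of the two case candidates and the theorem's caps, then evaluates the resulting bound terms. All numeric constants here are illustrative assumptions, and the trailing additive constant in $\hat{\eta}$ is omitted to keep the cap positive; this is a sketch of the selection logic, not the paper's implementation.

```python
def effective_step_size_and_bound(T, N=100, S=10, R=5, L_F=1.0,
                                  sigma_F=1.0, delta=0.1, Delta_F=10.0):
    """Pick eta_tilde = eta * R following the case analysis of Theorem K.2
    and evaluate the bound terms (illustrative constants only)."""
    # Constants c1, c2, c3 as defined in the proof.
    c1 = 3.0 * delta**2
    c2 = 3.0 * L_F * sigma_F**2 * (N / S - 1.0) / (N - 1.0)
    c3 = 2**(R + 3) * L_F * (3.0 * sigma_F**2 + delta**2 / R)

    # Cap from the theorem's condition (trailing additive constant dropped
    # here so the cap stays positive -- sketch assumption).
    eta_hat = (N / S - 1.0) / (24.0 * (N - 1.0) * 2**R) * sigma_F**2

    # Candidate step sizes from the two cases plus the smoothness cap.
    candidates = [(Delta_F / (c3 * T)) ** (1.0 / 3.0),
                  (Delta_F / (c2 * T)) ** 0.5,
                  1.0 / (2.0 * L_F),
                  eta_hat]
    eta_tilde = min(candidates)

    bound = (Delta_F / (eta_tilde * T)
             + c3 ** (1.0 / 3.0) * Delta_F ** (2.0 / 3.0) / T ** (2.0 / 3.0)
             + (c2 * Delta_F / T) ** 0.5
             + c1)
    return eta_tilde, bound

for T in (100, 1000, 10000):
    eta_tilde, bound = effective_step_size_and_bound(T)
    print(f"T={T:6d}  eta_tilde={eta_tilde:.2e}  bound={bound:.4f}")
```

Running the sketch for increasing $T$ shows the bound decaying toward the residual $c_{1}=3\delta^{2}$ term, matching the sub-linear behavior stated in the theorem.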

Uniformly sampling $\bar{t}\in\{0,\dots,T-1\}$, we have the upper bound as follows:

\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbf{E}\|\nabla F(w^{(t)})\|^{2}&=\mathcal{O}(\mathbf{E}\|\nabla F(w^{(\bar{t})})\|^{2})\\
&:=\mathcal{O}\Big(\frac{\Delta_{F}}{\hat{\eta}T}+\frac{2^{R/3}L_{F}^{1/3}(R\sigma_{F}^{2}+\delta^{2})^{1/3}\Delta_{F}^{2/3}}{T^{2/3}R^{1/3}}+\frac{\sigma_{F}\sqrt{L_{F}(N/S-1)\Delta_{F}}}{\sqrt{TN}}+\delta^{2}\Big)
\end{align*}

Remark K.2.1.

According to Theorem K.2, our proposed FedPFT converges at a sub-linear rate. The linear term $\mathcal{O}(\frac{\Delta_{F}}{\hat{\eta}T})$ is governed by $\hat{\eta}$ and the initialization gap $\Delta_{F}$. The sub-linear term $\mathcal{O}(1/T^{2/3})$ is affected by $R$, especially when $R$ is large, due to the exponential factor $2^{R}$. As the local gradient-approximation error $\delta$ grows, both the convergence radius $\mathcal{O}(\delta^{2})$ and the sub-linear term $\mathcal{O}(1/T^{2/3})$ are significantly affected by the choice of local optimizer. The remaining sub-linear term $\mathcal{O}(1/\sqrt{T})$ vanishes when $N/S-1=0$, i.e., when all clients are sampled; otherwise, its rate is mainly governed by $\sigma_{F}$.
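To illustrate the remark's point about the exponential factor, the short sketch below evaluates the $\mathcal{O}(1/T^{2/3})$ term of the bound for a few local-step counts $R$. The constants are arbitrary placeholders; only the relative growth with $R$ driven by the $2^{R/3}$ factor is meaningful.

```python
# Illustrative evaluation of the 2^{R/3}-scaled sub-linear term in Theorem K.2
# (placeholder constants; only the relative growth in R is meaningful here).
L_F, sigma_F, delta, Delta_F, T = 1.0, 1.0, 0.1, 10.0, 10_000

for R in (1, 2, 4, 8, 16):
    term = (2 ** (R / 3)
            * L_F ** (1 / 3)
            * (R * sigma_F**2 + delta**2) ** (1 / 3)
            * Delta_F ** (2 / 3)
            / (T ** (2 / 3) * R ** (1 / 3)))
    print(f"R={R:2d}  O(1/T^(2/3)) term ~ {term:.4f}")
```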

FedPFT aligns the training objectives across clients by introducing $p_{\kappa,i}$ and reduces the impact of non-IID data on the feature extractor $\phi$ through contrastive learning. Both designs effectively reduce the differences among clients' local gradients during training, thereby reducing $\sigma_{F}$ and in turn lowering the upper bound. During training,