A Privacy Preserving System for Movie Recommendations Using Federated Learning

David Neumann (ORCID 0000-0003-1907-8329), Scientific Researcher, Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Department of Artificial Intelligence, Efficient Deep Learning Group, Einsteinufer 37, 10587 Berlin, Germany, [email protected]
Andreas Lutz (ORCID 0000-0002-2973-0096), Student Research Assistant, Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Department of Artificial Intelligence, Efficient Deep Learning Group, Einsteinufer 37, 10587 Berlin, Germany, [email protected]
Karsten Müller (ORCID 0000-0001-8611-7864), Head of Efficient Deep Learning Group, Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Department of Artificial Intelligence, Efficient Deep Learning Group, Einsteinufer 37, 10587 Berlin, Germany, [email protected]
Wojciech Samek (ORCID 0000-0002-6283-3265), Head of Department of Artificial Intelligence and Head of Explainable AI Group, Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Department of Artificial Intelligence, Explainable AI Group, Einsteinufer 37, 10587 Berlin, Germany; Professor, Technical University of Berlin, Department of Electrical Engineering and Computer Science, Marchstraße 23, 10587 Berlin, Germany, [email protected]
(2023)
Abstract.

Recommender systems have become ubiquitous in the past years. They solve the tyranny of choice problem faced by many users, and are utilized by many online businesses to drive engagement and sales. Besides other criticisms, like creating filter bubbles within social networks, recommender systems are often criticized for collecting considerable amounts of personal data. However, to personalize recommendations, personal information is fundamentally required. A recent distributed learning scheme called federated learning has made it possible to learn from personal user data without its central collection. Consequently, we present a recommender system for movie recommendations, which provides privacy and thus trustworthiness on multiple levels: First and foremost, it is trained using federated learning and is thus, by its very nature, privacy-preserving, while still enabling users to benefit from global insights. Furthermore, a novel federated learning scheme, called FedQ, is employed, which not only addresses the problem of non-i.i.d.-ness and small local datasets, but also prevents input data reconstruction attacks by aggregating client updates early. Finally, to reduce the communication overhead, compression is applied, which reduces the exchanged neural network parametrizations to a fraction of their original size. We conjecture that this may also improve data privacy through its lossy quantization stage.

federated learning, distributed learning, federated recommender systems, neural network compression
copyright: rightsretained
journal: TORS
journalyear: 2023
journalvolume: 1
journalnumber: 1
article: 1
publicationmonth: 1
doi: 10.1145/3634686
ccs: Information systems → Recommender systems
ccs: Computing methodologies → Distributed artificial intelligence
ccs: Security and privacy → Privacy-preserving protocols
ccs: Computing methodologies → Machine learning approaches
ccs: Security and privacy → Privacy protections
ccs: Computer systems organization → Client-server architectures
Acknowledgements.
This work was created as part of the COPA EUROPE project (COllaborative Platform for trAnsmedia storytelling and cross channel distribution of EUROPEan sport events), which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme (https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en) under Grant Agreement No. 957059.

1. Introduction

Due to the ever-increasing sizes of corpora of items such as movies, articles, games, and non-digital goods, the task of finding novel and engaging content or products for each individual user or customer becomes increasingly difficult, even with the help of search engines. This problem is known as the tyranny of choice (Schwartz, 2004). Therefore, well-engineered recommender systems (RecSys’s) are one of the most important pieces of technology for the success of many digital enterprises, providing them with the required engagement and sales. Harvard Business Review even calls RecSys’s the single most important algorithmic distinction between “born digital” enterprises and legacy companies (Schrage, 2017). 80% of the content people watch on Netflix originates from a RecSys, and Netflix estimates that recommendations and personalization save it 1 billion USD per year (Gomez-Uribe and Hunt, 2016). 35% of what customers purchase at Amazon comes from a RecSys (MacKenzie et al., 2013). At Airbnb, search ranking and similar listing recommendations drive 99% of all booking conversions (Grbovic and Cheng, 2018).

Accordingly, a growing number of online businesses are adopting RecSys’s to expand customer engagement and sales. This causes a worrying trend of companies gathering and storing continuously increasing amounts of personal customer data. Even with data protection legislation like the European Union’s General Data Protection Regulation (GDPR) (European Parliament, 2016), it is opaque to users what data is collected, and it is arduous for them to take agency over their personal data. All this gathered and derived personal information is at risk of being misused or leaked.

On one hand, in order to improve the personalization of customer recommendations, personal information is indispensable. On the other hand, the principles of data economy and data avoidance are essential to preserve users’ privacy and provide them with control over their own personal data. Recently, federated learning (FL) was introduced as a distributed machine learning (ML) method, which avoids the centralized accumulation of user data entirely and thus provides data privacy. Unlike regular ML training algorithms with centrally collected data, FL is designed to leave the data at its origin and instead train many models or variants of one model on each of these local datasets. The clients only share the training updates, which are then aggregated into an updated global model. As a result, all participating clients benefit from a model that was trained on all data in a distributed fashion, without ever sharing the data itself. Accordingly, this scheme, first introduced by Konečný et al. (2016), is aimed at scenarios in which the local data is privacy-sensitive and thus owners do not want to disclose it.

While classical RecSys approaches usually only require user interaction data as input signals, modern approaches can use more privacy-sensitive input signals, such as age, gender, country of origin, and device information. This has the potential to further improve the predictive power of RecSys’s. The privacy-preserving nature of FL makes it a perfect fit for training RecSys models without users having to give up their personal data. Furthermore, FL helps to distribute the burden of data storage and the computational overhead of training among many clients. However, FL also has the following disadvantages: (1) training time increases compared to traditional central training, because client devices are less capable and not always available, (2) non-independent and identically distributed (non-i.i.d.) data can hinder convergence and result in a model with lower performance than its centrally trained counterpart, (3) battery usage of mobile client devices increases due to the complex computations required to train the model, resulting in shorter battery lives, and (4) continuously exchanging training updates between the clients and the central server incurs a communication overhead, which is especially problematic when clients are on a metered mobile connection.

The combination of FL and RecSys’s into the subfield of federated recommender systems (FedRec’s) has only recently begun to be explored, and only a few publications on this topic are available. Therefore, this work introduces an end-to-end, high-performance, scalable FedRec solution for movie recommendations, which is entirely driven by FL and addresses common issues of FedRec’s. System scalability is verified through experiments conducted on more than 162,000 FL clients. To the best of our knowledge, this is the first work with this client range.

The proposed system inherently provides privacy and thus trustworthiness on several levels: First, through the federated training that only transmits neural network (NN) parameters, while every participating client’s personal data remains private. Second, early aggregation of client updates prevents input data reconstruction attacks. Third, we apply lossy neural network coding (NNC) compression methods, which provide a significant reduction in communication; we also conjecture that their quantization acts as a parameter obfuscation and thus may further strengthen the FL setup against input data reconstruction attacks.

A common challenge among RecSys’s is that most users only produce extraordinarily little training data, while a tiny fraction of highly engaged users produce a lot of training data. In classical RecSys’s this is primarily an issue because these few users dominate the RecSys and suppress the interests of less frequent users. In FedRec’s this poses the additional problem of small noisy updates, which can hinder global model convergence. To counteract this problem, we introduce a technique to chain client trainings together in a privacy-preserving manner, in order to produce more stable model updates. To summarize, our contributions are as follows:

  (1) A privacy-preserving movie RecSys trained end-to-end using FL

  (2) Extreme scalability with experimental evidence for more than 162,000 clients

  (3) Compressed communication between central server and clients with state-of-the-art NNC

  (4) Novel queue-based federated training to address non-i.i.d. and imbalanced local datasets

2. Related Work

2.1. Recommender Systems

Initially RecSys’s for collaborative filtering tasks were often modeled using matrix factorization techniques. The general idea is to embed input signals, like users and items, in a joint latent space, and quantify the similarity between them using an interaction function, which is the dot product in the simplest case (Koren et al., 2009). Several approaches in the literature were introduced to enhance the predictive power of the model, e.g., incorporating additional features (Chen et al., 2011) or combining it with neighborhood models (Koren, 2008). Since matrix factorization relies on linear dependencies between the input signals, substituting any arbitrary function for the inner product led to promising results. He et al. utilized a deep neural network (DNN) for this task, which proved to be better suited for capturing the latent structures in the data, resulting in higher prediction accuracy (He et al., 2017; Covington et al., 2016). Several architectures in the context of deep learning (DL) were proposed to further improve the baseline models for RecSys’s. Choe et al. utilized a recurrent neural network (RNN) to include time series data from the previous items the user has interacted with (Choe et al., 2021). To address the limits of RNNs for sequential recommendations Tang and Wang proposed a convolutional neural network (CNN) to incorporate the fact that dependency relations were not necessarily the consequence of consecutive user-item interactions (Tang and Wang, 2018). Sedhain et al. used an item-based autoencoder to reconstruct ratings received as an input (Sedhain et al., 2015). Wu et al. enhanced this approach by employing a denoising autoencoder, which can handle corrupted data (Wu et al., 2016). Ying et al. used a graph convolutional network (GCN) to combine graph convolutions and efficient random walks to solve scalability issues faced in web-scale recommendation tasks (Ying et al., 2018).
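To make the notion of an interaction function concrete, the following minimal sketch contrasts the dot product used in matrix factorization with a small learned, non-linear interaction. It is an illustration only and not the architecture of any of the cited works; the function names, the latent dimension of 4, the single hidden layer, and the random (untrained) weights are all assumptions made for this example.

```python
import math
import random


def dot_interaction(user_vec, item_vec):
    """Simplest interaction function: inner product of the two embeddings."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def mlp_interaction(user_vec, item_vec, hidden_weights, output_weights):
    """Learned interaction: one ReLU hidden layer over the concatenated embeddings."""
    x = list(user_vec) + list(item_vec)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    logit = sum(w * h for w, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the logit to (0, 1)


# Hypothetical, randomly initialized embeddings and weights (nothing is trained here).
rng = random.Random(0)
latent_dim = 4
user = [rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
item = [rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
hidden_w = [[rng.gauss(0.0, 0.1) for _ in range(2 * latent_dim)] for _ in range(8)]
output_w = [rng.gauss(0.0, 0.1) for _ in range(8)]

mf_score = dot_interaction(user, item)                        # linear in each embedding
ncf_score = mlp_interaction(user, item, hidden_w, output_w)   # non-linear interaction
```

The dot product can only express interactions that are linear in each embedding, whereas the hidden layer allows the model to learn non-linear relations between user and item features.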

2.2. Federated Learning

FL is a recently proposed distributed learning scheme, which was originally proposed by Konečný et al. (2016), where a set of client devices $C$ jointly train an ML model $M_{Global}$ on their private datasets $\mathcal{D}_i$. Usually, FL is performed under the supervision of a central coordinating server. In traditional ML the local client datasets would be accumulated into a central dataset $\mathcal{D}_{Central} = \bigcup_{i=1}^{\left|C\right|} \mathcal{D}_i$ on which a central model $M_{Central}$ is trained. In FL, the local datasets are never disclosed by the clients. Instead, the central server initializes a global model $M_{Global}$ parameterized by a vector $\boldsymbol{\theta} \in \mathbb{R}^d$, that is sent to all clients $c_i$ which train the global model on their local datasets $\mathcal{D}_i$, effectively optimizing their local objective $\mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta})$, which results in a local update $U_i$. (Depending on the specific FL algorithm that is being used, these local updates are of different type, e.g., in FedSGD an update is represented by the gradient (Konečný et al., 2016), in FedAvg the update is represented by the parametrization of the updated local model (McMahan et al., 2017), and in federated distillation the update is represented by the soft labels that were produced by the updated local model on a central training dataset (Jeong et al., 2023).) The local update is then sent back to the central server, which uses an aggregation operator to combine the updates into an updated global model $M'_{Global} = \text{Agg}\left\{U_i \mid i \in \left\{1, 2, \dots, |C|\right\}\right\}$. This process is repeated until a suitable convergence metric is met. The objective of FL can therefore be stated as the following minimization problem (McMahan et al., 2017):

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \sum_{i=1}^{\left|C\right|} \frac{\left|\mathcal{D}_i\right|}{\left|\bigcup_{j=1}^{\left|C\right|} \mathcal{D}_j\right|} \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}), \quad \text{where} \quad \mathcal{L}_i(\mathcal{D}_i; \boldsymbol{\theta}) = \frac{1}{\left|\mathcal{D}_i\right|} \sum_{x, y \in \mathcal{D}_i} \ell(x, y; \boldsymbol{\theta}), \tag{1}$$

with $\ell(x, y; \boldsymbol{\theta})$ denoting the loss of the client model on input $x$ with ground-truth $y$, given the model parametrization $\boldsymbol{\theta}$. FL allows the global model $M_{Global}$ to train on significantly more data than if each client had only trained on its private data. Thus, under ideal conditions, given a performance metric $P$, the performance of the global model $P_{Global}$ should be better than that of each individual client: $\forall i \in \left\{1, 2, \dots, |C|\right\}: P_{Global} > P_i$. FL permits a certain degree of deviation from the performance of an equivalent centrally trained model but provides data security and privacy protection in return. Still, the goal is to minimize the deviation $|P_{Central} - P_{Global}|$.
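The weighting in Equation (1) can be checked numerically with a small worked example. The sketch below assumes a one-dimensional linear model with squared-error loss and pairwise disjoint client datasets, so that the size of the union equals the sum of the local dataset sizes. The data values and function names are hypothetical and chosen only for illustration.

```python
def local_objective(dataset, theta):
    """Mean squared-error loss of the 1-D linear model y ~ theta * x on one dataset."""
    return sum((theta * x - y) ** 2 for x, y in dataset) / len(dataset)


def federated_objective(client_datasets, theta):
    """Weighted sum of local objectives with weights |D_i| / |union of all D_j|."""
    # Assumption: local datasets are disjoint, so the union size is the sum of sizes.
    total = sum(len(d) for d in client_datasets)
    return sum(len(d) / total * local_objective(d, theta) for d in client_datasets)


# Three hypothetical clients with imbalanced local datasets of (x, y) pairs.
clients = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(3.0, 6.2)],
    [(4.0, 7.8), (5.0, 10.1), (6.0, 12.0)],
]
pooled = [sample for dataset in clients for sample in dataset]

theta = 1.5
fed_value = federated_objective(clients, theta)
central_value = local_objective(pooled, theta)
```

Under these assumptions, `fed_value` and `central_value` agree up to floating-point rounding: weighting each local objective by its share of the data makes the federated objective equal to the average loss over the pooled dataset, which is why a centrally trained model is the natural reference point for the deviation discussed above.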

In the original FL scheme, federated stochastic gradient descent (FedSGD), proposed by Konečný et al. (2016), the clients perform a training step and send the computed gradient back to the central server, which averages the gradient across all clients and applies it to the global model. Since then, several other methods have been proposed in the literature. McMahan et al. proposed federated averaging (FedAvg), where the clients train for multiple local epochs and send their updated local model to the central server instead of the gradient. The updated parameters are then weighted proportionally by the number of local training samples available to each client and averaged by the central server (McMahan et al., 2017). Furthermore, they employ client sub-sampling, a technique where only a random subset of clients is selected for each communication round (Fraboni et al., 2023; Chen et al., 2022). FedAvg can be seen as a generalization of FedSGD, which only executes a single iteration of gradient descent in each round of communication (Shokri and Shmatikov, 2015; McMahan et al., 2017). Although there were theoretical guarantees for the convergence of FedAvg in cases of heterogeneous data, impractical assumptions such as strong convexity or smoothness of the objective function needed to hold (Li et al., 2020a). Chai et al. showed experimentally that FedAvg could lose up to 9% accuracy in comparison to FedSGD when dealing with non-i.i.d. data (Chai et al., 2022). Li et al. tackled this problem and presented a generalization of FedAvg. They introduced a surrogate objective to constrain the locally updated parameters to be close to the current global model. This helped to stabilize convergence behavior, resulting in a significant increase in test accuracy of 22% on average (Li et al., 2019). Li et al. proposed to only share the trainable parameters of batch normalization (BatchNorm) with the central server without communicating their running averages of the batch statistics to the server. Aggregating the trainable parameters from all clients but keeping the running averages local helps to alleviate the problem of feature shift in non-i.i.d. training scenarios (Li et al., 2021). Karimireddy et al. utilize control variates as a variance reduction technique to approximate the update direction of the server model and each client model. The client drift, which naturally arises from training on different local data distributions, can be estimated by the difference between these update directions and is corrected by adding it in the local training of each client (Karimireddy et al., 2020). Cao et al. rely on clustering the clients according to the classes of data they possess. They only average parameters from the same group while updating the central server model, guaranteeing that parameters are only averaged on a set of clients with a comparable data distribution (Cao et al., 2022). Seol and Kim propose a two-step approach. First, they use data oversampling to eliminate data class imbalances among clients. In the second step, the clients are selected in such a way that their data distribution is nearly uniform. Furthermore, the central server constantly adjusts the amount of data for local training, the batch size, and the learning rate of the clients to avoid performance degradation (Seol and Kim, 2023). We also address data heterogeneity and introduce our own generalization of FedAvg, named federated learning with client queuing (FedQ).
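The difference between plain FedAvg-style local training and a proximally regularized variant can be sketched in a few lines. The example below is a toy illustration under strong assumptions: a scalar linear model, full-batch gradient descent, all clients participating in every round, and a proximal term of the form mu/2 * (theta - theta_global)^2 as the surrogate objective. The function names, hyperparameters, and data are hypothetical; this is neither the FedQ scheme of this work nor a faithful reimplementation of any cited method.

```python
def local_training(theta_global, dataset, epochs=5, lr=0.01, mu=0.0):
    """Run a few full-batch gradient steps on one client's (x, y) pairs.

    With mu == 0 this is a plain FedAvg-style local update. With mu > 0, the
    proximal term mu/2 * (theta - theta_global)^2 pulls the local model back
    towards the current global model.
    """
    theta = theta_global
    for _ in range(epochs):
        grad = sum(2.0 * (theta * x - y) * x for x, y in dataset) / len(dataset)
        grad += mu * (theta - theta_global)  # gradient of the proximal term
        theta -= lr * grad
    return theta


def aggregate(local_thetas, sample_counts):
    """Average local models, weighted by the number of local training samples."""
    total = sum(sample_counts)
    return sum(n / total * t for t, n in zip(local_thetas, sample_counts))


def run_round(theta_global, client_datasets, **train_kwargs):
    """One communication round: local training on every client, then aggregation."""
    local_thetas = [local_training(theta_global, d, **train_kwargs) for d in client_datasets]
    return aggregate(local_thetas, [len(d) for d in client_datasets])


# Hypothetical heterogeneous clients: two follow y = 2x, one follows y = 3x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 3.0)],
    [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0)],
]
theta = 0.0
for _ in range(50):
    theta = run_round(theta, clients, epochs=5, lr=0.01, mu=0.1)
```

In this toy setting the global parameter settles between the two client optima, closer to the slope shared by the clients holding more data. Increasing `mu` shrinks each local step away from the global model, which is the stabilizing effect the surrogate objective is meant to provide.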

Although FL operates in a decentralized environment, the participating clients’ privacy may be compromised by merely transmitting the training update. Geiping et al. reconstructed high-resolution images by examining the data present in each client’s communicated gradients (Geiping et al., 2020). Dimitrov et al. were also able to extract sensitive information contained in the weights obtained by the FedAvg procedure. Therefore, the concept of differential privacy (Dwork and Roth, 2014) is often applied in the setting of FL. When working with aggregated data, differential privacy can be utilized to protect the private information contained in individual data points. Differential privacy achieves this data protection by perturbing the data points with random noise. This exploits the fact that a single data point has relatively little impact on the aggregated data as a whole, but adding random noise alters the individual data points to a degree that no useful information can be extracted from them (Dwork, 2008). Wei et al. proposed to add specific noise to the parameters of each client before aggregation by the central server (Wei et al., 2020a). This ensures a decent training accuracy while a certain level of privacy is maintained, provided that a sufficiently large number of clients is involved (Wei et al., 2020a). Phong et al. (2018) proposed using homomorphic encryption in the more general setting of distributed training, and Fang and Quan (2021) suggested using it in the setting of FL. Homomorphic encryption is a specialized encryption scheme that allows performing certain mathematical operations on the data without decrypting it.
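The intuition that noise hides individual contributions while largely cancelling out in the aggregate can be illustrated with a short simulation. The sketch below clips each client update to a maximum L2 norm and adds Gaussian noise before averaging. It is a simplified illustration only: the clipping bound, the noise scale, and the function names are assumptions, the noise is not calibrated to any formal privacy budget, and it does not reproduce the specific mechanism of Wei et al.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [v * scale for v in update]


def privatize_update(update, max_norm, noise_std, rng):
    """Clip the update, then perturb every coordinate with Gaussian noise."""
    clipped = clip_update(update, max_norm)
    return [v + rng.gauss(0.0, noise_std) for v in clipped]


# Hypothetical setting: 1000 clients that all hold the same true update.
rng = random.Random(42)
num_clients, dim = 1000, 3
true_update = [0.5, -0.2, 0.1]
noisy_updates = [
    privatize_update(true_update, max_norm=1.0, noise_std=1.0, rng=rng)
    for _ in range(num_clients)
]

# Server-side aggregation: coordinate-wise mean of the noisy updates.
mean_update = [sum(u[k] for u in noisy_updates) / num_clients for k in range(dim)]
```

Each individual noisy update is dominated by noise, since the noise standard deviation exceeds the norm of the true update, yet the mean over many clients stays close to `true_update`. This also shows why such schemes depend on a sufficiently large number of participating clients.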

2.3. Communication-Efficient Federated Learning

When dealing with mobile clients, internet connections may be inconsistent and potentially have high latency. Even when FL clients are connected via reliable network connections, mobile connections are usually still bandwidth-constrained and, in many cases, even metered. During the course of FL, training updates must be exchanged a multitude of times. Therefore, a central goal in FL is communication minimization. When communicating model parametrizations, possible solutions to this include several size reduction techniques: Sparsification/Pruning excludes single neurons (unstructured) or entire layers of neurons (structured) from an NN. While sparsification only sets excluded neurons to 0, pruning actually removes them (LeCun et al., 1990). Sparsified models are more amenable to compression, but still have their original size when uncompressed. Pruned models, on the other hand, already have a reduction in size even without compression. The disadvantage of pruned networks is that they may require specialized software and/or hardware to be used, while sparsified models can run on regular software and hardware. Distillation is a technique for transferring the knowledge of a teacher model into a smaller student model. This is done by minimizing the difference between the output of the student model and the output of the teacher model (also known as soft labels) on data points from a separate dataset (Hinton et al., 2015). In quantization the weights of an NN are constrained to a discrete set of values so that they can be represented with fewer bits (Gholami et al., 2022). Lossless compression techniques encode the NN data in a way that removes redundancy and thus reduces its size (Han et al., 2016).
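As an illustration of how quantization reduces the number of bits per weight, the following sketch implements plain uniform quantization: weights are mapped to integer indices on an evenly spaced grid between the smallest and largest weight. This is a generic textbook scheme chosen for brevity, not the NNC codec used in this work; the function names and the 4-bit setting are assumptions.

```python
def quantize(weights, num_bits):
    """Map each weight to the index of the nearest of 2**num_bits grid points."""
    levels = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    step = (w_max - w_min) / levels if w_max > w_min else 1.0
    indices = [round((w - w_min) / step) for w in weights]
    return indices, w_min, step


def dequantize(indices, w_min, step):
    """Reconstruct approximate weights from the grid indices."""
    return [w_min + i * step for i in indices]


# Hypothetical weights; with 4 bits, each index fits in half a byte
# instead of the 4 bytes of a 32-bit float.
weights = [0.12, -0.53, 0.98, -0.07, 0.33]
indices, w_min, step = quantize(weights, num_bits=4)
restored = dequantize(indices, w_min, step)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per weight is bounded by half the step size, so the exact original values cannot be recovered from the transmitted indices. The integer indices can additionally be passed through a lossless entropy coder, which is where sparsified models with many identical values benefit most.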

There are many works that have developed communication efficient FL solutions using the above-mentioned techniques or combinations of them (Sattler et al., 2019; Konečný et al., 2018; Sattler et al., 2020), and even some with specialized techniques, such as federated dropout (Caldas et al., 2019a). Konečný et al. propose employing quantization, random rotations, and sub-sampling to compress the updated model parameters of the clients before sending them to the central server (Konečný et al., 2018). Wu et al. adopt an orthogonal strategy: The clients train a teacher model on their local data and distill it into a smaller student model. Instead of communicating the gradients of the teacher models, the clients compress and send the gradients of the smaller student models (Wu et al., 2022). Sattler et al. introduce a compression framework combining communication delay methods, gradient sparsification, binarization, and optimal weight update encoding to reduce the upstream communication cost in distributed learning scenarios (Sattler et al., 2019). To adapt it to the FL setting, Sattler et al. enhance this approach, taking the compression of the downstream communication and the non-i.i.d. local data distribution of the clients into account. They construct a framework combining a novel top-$k$ gradient sparsification method with ternarization and optimal Golomb encoding of updated client model parameters (Sattler et al., 2020). Another emerging field of research considers combinations of differential privacy and quantization methods in order to reduce communication costs. Lang and Shlezinger demonstrated that, within their framework, it is possible to quantize data at a given bit rate without sacrificing a specified level of privacy or degrading model performance (Lang and Shlezinger, 2022). They enhanced methods proposed by Reisizadeh et al. and Konečný et al., which solely use quantization and do not include privacy-related considerations.
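The combination of top-k sparsification and ternarization mentioned above can be sketched as follows. The example keeps the k entries of an update with the largest magnitude and replaces each of them by a shared magnitude carrying the original sign. This is a simplified illustration loosely following the idea described for Sattler et al. (2020); the choice of the mean absolute value as the shared magnitude is an assumption of this sketch, and the Golomb encoding of the positions as well as any error accumulation are omitted.

```python
def sparse_ternarize(update, k):
    """Keep the k largest-magnitude entries and map them to {-m, 0, +m}.

    m is the mean absolute value of the kept entries (an assumption of this sketch).
    """
    by_magnitude = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    kept = by_magnitude[:k]
    magnitude = sum(abs(update[i]) for i in kept) / k
    ternary = [0.0] * len(update)
    for i in kept:
        ternary[i] = magnitude if update[i] > 0 else -magnitude
    return ternary


# Hypothetical client update with two dominant entries.
update = [0.9, -0.05, 0.02, -1.1, 0.3, 0.0]
compressed = sparse_ternarize(update, k=2)
```

After this step, only the positions of the non-zero entries, one sign bit per kept entry, and a single shared magnitude need to be transmitted, which is what makes the representation well suited to a subsequent position encoding.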

2.4. Federated Recommender Systems

The current public discussion of RecSys’s (often just referred to as the algorithm or AI personalization) focuses, among other topics, on their invasive behavior concerning personal data collection (Hermann, 2022; Kozyreva et al., 2021; Golbeck, 2016; Lam et al., 2006). This might create a negative relationship between user and RecSys potentially resulting in anything from user discontent to “algorithmic hate” (Smith et al., 2022). RecSys’s are arguably a vital part of the user experience on the internet since, without them, the flood of content would be barely manageable. Therefore, FL may be part of the solution to the privacy problem of RecSys’s by training the recommender models directly on user devices and thereby entirely circumventing the need for gathering private information.

FL has already been proven to work well in many other domains, e.g., cancer research (Rønn Hansen et al., 2022), natural language processing (Lin et al., 2022), graph NNs (He et al., 2021), image classification (Luo et al., 2021), transfer learning (Liu et al., 2020), language models (Brendan McMahan et al., 2018), mobile keyboard prediction (Hard et al., 2019), and keyword spotting (Leroy et al., 2019), so it is reasonable to anticipate that it is likewise effective in the domain of RecSys’s. In fact, there are numerous methods in the literature to incorporate current RecSys frameworks into FL. They can be classified as either focusing on learning algorithms (Ammad-ud-din et al., 2019), security (Ribero et al., 2022), or optimization models (Muhammad et al., 2020), depending on the task’s objective (Alamgir et al., 2022). Matrix factorization is a commonly utilized approach in the first scenario. Ammad-ud-din et al. were among the pioneers in this emerging field by introducing this model to address collaborative filtering tasks in the context of FL. They constructed a RecSys that gives personalized recommendations based on users’ implicit feedback (Ammad-ud-din et al., 2019). Lin et al. designed a new federated rating prediction mechanism for explicit responses. They employed user averaging and hybrid filling in order to keep the system computationally efficient and the communication costs moderately low (Lin et al., 2021a).

To increase the model capabilities for each client, Jia and Lei incorporated a bias term for the input signals. Additionally, weights on the local devices were adjusted, so that any unreasonable user rating is removed (Jia and Lei, 2021). Flanagan et al. employed a similar strategy, enhancing the model’s capacity by incorporating input from other data sources (Flanagan et al., 2021). Wang et al. introduced a new algorithmic approach by combining matrix factorization with FedAvg. They demonstrated that the cost of communication with the central server for non-i.i.d. data was decreased by limiting the number of local training iterations (Wang et al., 2021).

As previously shown, private information can be reconstructed from the clients’ transmitted parameters. In order to remedy this, a variety of privacy-preserving techniques based on encryption, obfuscation, or masking can be utilized (Asad et al., 2023). Communication of encrypted data between the central server and its clients is made possible through the use of homomorphic encryption, allowing for intermediate calculations without the need to first decrypt the data. As a result, the central server is unable to infer the data it is working with (Kim et al., 2018). For this reason, Chai et al. propose a secure matrix factorization framework to handle data leakage. They showed how privacy could be compromised by intercepting the clients’ gradient updates sent in two consecutive communication rounds to the central server. To address this problem, they encrypted the clients’ gradients before sending them to the central server (Chai et al., 2021). Zhang and Jiang enhanced the approach by clustering the encrypted user embeddings to reduce the dimension of the user-item matrix, improving the recommendation’s accuracy (Zhang and Jiang, 2021). Lin et al. utilized a different cryptographic technique: they applied secret sharing, wherein a group of clients can only reconstruct sensitive information if they collaborate by combining their shares (Shamir, 1979). By applying this concept to the clients’ locally computed gradients, the authors managed to construct a FedRec framework that provides strong privacy guarantees on the clients’ individual data (Lin et al., 2021b). Another technique is secure multi-party computation, which refers to a protocol for computing a function based on the data of a group of clients without disclosing private information to one another (Cramer et al., 2015). Perifanis and Efraimidis utilized this approach in the setting of federated neural collaborative filtering (NCF). They demonstrated that employing a secure multi-party computation protocol for FedAvg protects privacy when dealing with an honest but curious entity without compromising the quality of the RecSys (Perifanis and Efraimidis, 2022).

Differential privacy falls in the category of privacy preservation techniques that use obfuscation. Ribero et al. added differential privacy to FL utilizing a matrix factorization technique. They succeeded in balancing the privacy loss posed by the repetitive nature of the FL process by only requiring a few rounds of communication (Ribero et al., 2022). Yang et al. designed a matrix factorization-based RecSys that adds Laplacian random noise to the users’ encrypted item embeddings, ensuring a high level of security (Yang et al., 2021). Minto et al. proposed a system combining differential privacy and implicit user feedback. They constrained the number of local gradient updates sent by the users by the level of privacy each user tries to maintain (Minto et al., 2021). We also address the problem of privacy preservation by obfuscation: Instead of applying random noise to the weight updates that are sent to the central server, the weights are quantized, which is conducive to both privacy preservation and a reduced communication overhead. We later provide a detailed attack analysis of the exchanged model parameters, which could potentially leak information about the underlying datasets of the participating clients. We present specific attacks applicable to our scenario and examine how their requirements and assumptions do not apply to our approach to privacy preservation, thus rendering them ineffective.

Another method of achieving data security is to introduce pseudo interactions in order to mask user behavior in FedRec’s. This protection mechanism is implemented by adding artificial interactions with randomly selected items to users. As a result, the central server is unable to determine the real set of items a user has interacted with, as the uploaded gradient was computed with respect to both real and artificial interactions (Lin et al., 2021a). Since this method produces noisy gradients, which degrade the model performance, Liang et al. introduced denoising clients in the training process (Liang et al., 2021). Another approach that pursues the same goal, but entirely forgoes FL, was presented by Wainakh et al. (2019). They employ a random walk-based approach to decentralized optimization, where a randomly chosen client trains its local model for one or multiple epochs before sending its updated parameters to a randomly selected neighboring client according to the underlying graph structure (Sun et al., 2022; Triastcyn et al., 2022). Wainakh et al. adapt this approach to account for privacy by introducing the anonymous random walk technique where clients, instead of training a model, can choose to add their own data to an existing dataset that was sent by a neighboring client in a prior round. The accumulated data can then be uploaded to the central server for centralized training. Due to the nature of the random walk, neither the clients nor the central server know where the individual samples of the accumulated dataset originate from, thus effectively masking the users’ identities.

Dealing with the statistical heterogeneity of the clients’ local data in the context of FedRec’s is a different area of research. There are various proposed strategies for addressing this issue, which primarily include clustering and meta learning (Sun et al., 2023). Jie et al. designed a FedRec utilizing a clustering approach based on historical parameters to form homogeneous groups of clients, in which a personalized model can be trained. These parameters are retrieved by averaging the model parameters from the clients’ last communication rounds with the central server (Jie et al., 2022). Chen et al. proposed a different method based on model-agnostic meta-learning, which is a training paradigm where a meta-learner is employed to rapidly train models on new tasks. The meta-learner itself is a trainable algorithm that trains a model on a task, which consists of a support set and a query set. The model is trained using the support set and then evaluated on the query set. Based on this evaluation, a loss is computed, which reflects the ability of the meta-learner to train the model. The meta-learner is then updated to minimize this loss. For example, the meta-learner in the model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017) is used to provide an initial set of parameters for the model that is trained on the task. Meta-learning algorithms are known to generalize effectively to new tasks, which makes them well-suited for tackling the non-i.i.d. problem in FL. For this reason, Chen et al. adapted MAML, as well as another meta-learning algorithm called Meta-SGD, to the FL setting, which enabled them to reach higher model performance than the FedAvg baseline (Chen et al., 2019). Our FedRec was not only affected by heterogeneous client data but also by exceedingly small local datasets. Our approach to non-i.i.d.-ness, FedQ, therefore differs greatly from the two above-mentioned approaches, as neither clustering nor meta-learning is capable of handling truly small local datasets.

The clients’ potentially constrained resources are the subject of another line of research. Therefore, Muhammad et al. utilized a simple DNN with small embedding sizes to balance the number of learnable parameters and the accuracy of the resulting recommendations. Additionally, they presented a new sampling technique coupled with an active aggregation method, which reduced communication costs and produced more accurate models even at an early stage of training (Muhammad et al., 2020). Zhang et al. addressed related problems and developed a new framework that effectively integrates a novel matrix factorization technique with privacy via a federated discrete optimization algorithm. Although the model’s RAM, storage, and communication bandwidth requirements were modest, performance was not affected and was even superior to related state of the art techniques (Zhang et al., 2022). Our suggested approach combines all three of the aforementioned sorts of objectives: We balance the model complexity and capacity by opting for a simple, yet scalable DNN architecture. This results in remaining resource-efficient on the client side, while still maintaining the possibility of scaling up. Additionally, we anticipate that applying quantization will provide a specific amount of privacy while also lowering the burden associated with exchanging parameters with the central server via potentially bandwidth-constrained network connections.

3. Method

In this work, we propose a framework for a RecSys that is trained end-to-end using FL. Before examining the design of the FedRec and its components, we want to motivate our decisions with a problem statement. Then, we will explore the general architecture of many complex information retrieval systems on which the architecture of our RecSys is based and show how each of these components is constructed. Finally, we will demonstrate how all of this translates into an FL setting and how we alleviate the problems that arise from such a setup.

3.1. Problem Statement

The research documented in this work was conducted as part of the COPA EUROPE project, which is a beneficiary of the EU’s Horizon 2020 Research and Innovation Programme. The project aims to create a live-streaming and video-on-demand (VoD) platform that provides users with sports and esports content. To keep users engaged, discoverability of the content is key; therefore, one part of the project aims at developing a RecSys. Specifically, the objective was to develop a RecSys in an FL setting to provide high-quality recommendations while preserving the user’s privacy. From the project’s goals and objectives, the following requirements for the FedRec can be derived:

  • Large Client Population – A live-streaming and VoD platform for sports and esports may build a large user base, which results in an FL client population that comprises hundreds of thousands or even millions of clients.

  • Large Video Catalog – With dozens of types of sports and esports games covered, and hundreds of leagues, tournaments and events, the catalog of live streams and VoD content may grow substantially over time.

  • Increased Personalization – The FL setup is meant to enable the RecSys to leverage more personal user data in addition to user-item interactions for higher personalization without requiring the data to ever leave the user’s device. The requirement to take advantage of more personal user data implies that the employed ML model must be able to handle multiple data modalities and learn complex, non-linear dependencies between features contained in this data.

  • Substantial Communication Overhead – The potentially large client population leads to a very significant communication overhead for the central server. Furthermore, the clients are expected to use mobile devices that may lack a reliable, high-bandwidth internet connection. Therefore, it is of paramount importance to reduce the communication overhead incurred by the constant communication between the central server and its clients.

The following sections will detail how these requirements were translated into the architecture of the FedRec and the design of its components. All decisions concerning architecture and design, as well as the research into privacy-preservation, scalability, NNC, and the handling of non-i.i.d. and imbalanced local datasets were motivated and informed by these requirements.

3.2. Recommender System Architecture

As the RecSys is required to handle a large user base and movie catalog, we decided to follow the well-known three-stage funnel-like architecture, which is also employed by other forms of information retrieval systems. The three phases are candidate generation, ranking, and re-ranking (cf. Figure 1). The candidate generation phase takes the entire corpus of movies and narrows it down to usually a couple hundred movies that are somewhat relevant to the user. This phase must be fast because it must sift through possibly millions of movies, which in turn means that not all of the resulting elements are 100% relevant to the user. The ranking phase has a more complex model of the user’s interest. It scores each of the candidate movies and ranks them by their scores. This two-step approach to the generation of recommendations greatly expedites the retrieval process. If each item in the corpus had to be ranked individually, this process would not scale well to large item corpora. Finally, the re-ranking phase is an optional phase, which can implement hand-crafted rules to improve recommendations. This can include rules such as removing click-bait content, enforcing age restrictions, ensuring freshness, and promoting predefined content. These phases will be further explored in the following sections.
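The control flow of the three phases can be illustrated with a minimal Python sketch. The function `recommend` as well as the toy generator, ranker, and rule are hypothetical placeholders chosen purely for illustration; they are not the models used in this work.

```python
from typing import Callable, List, Sequence


def recommend(
    history: Sequence[int],
    generate_candidates: Callable[[Sequence[int]], List[int]],
    rank_score: Callable[[Sequence[int], int], float],
    rules: Sequence[Callable[[int], bool]],
    num_results: int,
) -> List[int]:
    """Runs candidate generation, ranking, and re-ranking in sequence."""
    # Phase 1: a fast model narrows the whole corpus down to a few candidates.
    candidates = generate_candidates(history)
    # Phase 2: a more complex model scores only the candidates.
    ranked = sorted(candidates, key=lambda movie: rank_score(history, movie), reverse=True)
    # Phase 3: hand-crafted rules filter the ranked list.
    kept = [movie for movie in ranked if all(rule(movie) for rule in rules)]
    return kept[:num_results]


# Toy stand-ins (assumptions for illustration only).
corpus = list(range(1000))
history = [10, 20, 30]


def toy_generator(watched: Sequence[int]) -> List[int]:
    # Pretend that movies with IDs close to the mean watched ID are relevant.
    center = sum(watched) / len(watched)
    return sorted(corpus, key=lambda movie: abs(movie - center))[:50]


def toy_ranker(watched: Sequence[int], movie: int) -> float:
    # Pretend that movies close to the most recent watch score highest.
    return -abs(movie - watched[-1])


def not_yet_watched(movie: int) -> bool:
    return movie not in history


result = recommend(history, toy_generator, toy_ranker, [not_yet_watched], num_results=4)
```

Only the candidate generator touches the full corpus; the ranker is invoked on 50 items in this toy setup, which is the property that lets the funnel scale.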

Figure 1. Flow diagram of the “funnel-like” three-stage RecSys architecture of the proposed RecSys, consisting of candidate generation, ranking, and re-ranking stages (inspired by Figure 2 in (Covington et al., 2016)).

3.3. Candidate Generation

Candidate generation consists of an algorithm that is trained to select a small number (usually in the order of hundreds) of items from a vast corpus of items (usually in the order of millions) that are generally relevant to the user. One classical approach to candidate generation is matrix factorization. Non-linear models, such as NNs, however, are capable of forming a much deeper “understanding” of the latent structures in the data and NNs have been used in RecSys’s since at least 2016 (Covington et al., 2016). Although there have been attempts to adapt classical ML algorithms for use in FL, e.g., matrix factorization (Ammad-ud-din et al., 2019), gradient-based learning algorithms are much better suited and well-researched within the framework of FL. Furthermore, NNs allow for much more fine-grained control over model architecture decisions and are capable of handling a diverse set of input data modalities, which is one of the project’s requirements. For this reason, we decided to use a DNN architecture for our candidate generation model.

Prior to choosing a specific design, the training objective must be formulated. For RecSys’s there are many different objectives that are commonly used, e.g., rating prediction, watch time prediction, click-through-rate prediction, and watch prediction. Since the algorithm has to be able to sift through millions of items, the underlying model must be simple and, most importantly, fast. Therefore, we decided to train the candidate generation model on next watch prediction. This means that it receives a list of past movie watches of a user as input and predicts a probability distribution over all movies in the corpus. The top-$k$ movies can then be interpreted as the movies that the user will most likely watch next. So instead of performing inference on all movies in the corpus, the model only has to be invoked once to retrieve a list of candidate recommendations.
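As a small illustration of how candidates can be read off a single model invocation, the following Python sketch converts a vector of output scores into a probability distribution and selects the top-$k$ movie indices. The helper names `softmax` and `top_k` and the example scores are assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Converts raw scores into a probability distribution."""
    largest = max(logits)  # subtracted for numerical stability
    exponentials = [math.exp(value - largest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def top_k(probabilities: Sequence[float], k: int) -> List[int]:
    """Returns the indices of the k most probable movies, best first."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return order[:k]


# Hypothetical model outputs for a corpus of five movies.
logits = [0.1, 2.0, -1.0, 1.5, 0.3]
probabilities = softmax(logits)
candidates = top_k(probabilities, k=3)
```

For a corpus with millions of movies, a full sort would be replaced by a partial selection, but the principle of one forward pass followed by a top-$k$ selection stays the same.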

The chosen architecture for the candidate generator model is shown in Figure 2 and inspired by the architecture used in (Covington et al., 2016). An experiment using various recurrent architectures was conducted, but the chosen DNN architecture is the best tradeoff between model performance and size. The results of this experiment can be found in Appendix B.1. The first layer of the model is an embedding layer, which takes the sparse one-hot encoded movie watches and embeds them into a 64-dimensional dense vector space. The size of the embedding vectors was experimentally determined. The experiment results can be found in Appendix B.2. In contrast to recurrent NNs, non-recurrent NNs require inputs of a fixed size. However, the watch histories have variable length and can consist of any number between 1 and window size movie watches. To provide the required fixed-length input for the model, the embedded movie watches are then averaged. In practice, other input features could be added here and concatenated to the watch history vector. For example, user-level information could be utilized to improve predictions, if past movie watches are not available or a user only has a few of them, thereby solving the cold-start problem for new users. Unfortunately, we are restrained by the lack of a suitable dataset, which includes user-level information.
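The averaging step that turns a variable-length watch history into a fixed-size input can be sketched as follows. The randomly initialized table stands in for a trained embedding layer, and the function names are hypothetical; only the embedding size of 64 is taken from the text.

```python
import random
from typing import List, Sequence

EMBEDDING_DIM = 64  # embedding size stated in the text


def make_embedding_table(num_movies: int, dim: int = EMBEDDING_DIM, seed: int = 0) -> List[List[float]]:
    """Creates a randomly initialized table (a stand-in for trained embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_movies)]


def embed_history(table: Sequence[Sequence[float]], watch_history: Sequence[int]) -> List[float]:
    """Averages the embeddings of all watched movies into one fixed-size vector."""
    if not watch_history:
        raise ValueError("the watch history must contain at least one movie")
    summed = [0.0] * len(table[0])
    for movie_id in watch_history:
        for j, value in enumerate(table[movie_id]):
            summed[j] += value
    return [value / len(watch_history) for value in summed]


table = make_embedding_table(num_movies=100)
short_input = embed_history(table, [3])
long_input = embed_history(table, [1, 2, 3, 4, 5])
```

Both inputs have the same length regardless of the number of watched movies, which is what the subsequent fully-connected layers require.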

The inputs are then fed into a funnel, or tower-like architecture of multiple fully-connected layers with rectified linear unit (ReLU) activations. The final fully-connected layer prior to the output layer is of size 256 and each preceding layer doubles this number, i.e., for a three-layer architecture, the first fully-connected layer is of size 1024, the second of size 512, and the final layer of size 256. As already mentioned, the size of the model has a substantial impact in an FL setting. Consequently, an experiment was conducted to determine the optimal number of hidden layers. The results of this experiment are presented in Appendix B.3.
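The layer-size rule described above can be written down in a few lines of Python. The helper `dense_parameter_count` is an illustrative addition that counts only the weights and biases of the hidden tower, assuming the 64-dimensional averaged embedding as input; it excludes the embedding layer and the output layer.

```python
from typing import List


def tower_layer_sizes(num_hidden_layers: int, last_layer_size: int = 256) -> List[int]:
    """Returns layer sizes where each preceding layer doubles the next one."""
    if num_hidden_layers < 1:
        raise ValueError("at least one hidden layer is required")
    return [last_layer_size * 2 ** (num_hidden_layers - 1 - i) for i in range(num_hidden_layers)]


def dense_parameter_count(input_size: int, layer_sizes: List[int]) -> int:
    """Counts weights and biases of a stack of fully-connected layers."""
    count = 0
    previous = input_size
    for size in layer_sizes:
        count += previous * size + size  # weight matrix plus bias vector
        previous = size
    return count


sizes = tower_layer_sizes(3)
tower_parameters = dense_parameter_count(64, sizes)
```

Because every additional hidden layer doubles the width of the first layer, the parameter count, and with it the amount of data exchanged per communication round, grows quickly with the depth of the tower.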

Finally, the next-watch prediction is realized as a classification task; therefore, the output layer of the candidate generator model has as many outputs as there are movies in the corpus. The model is then trained using the softmax cross-entropy loss. A detailed breakdown of the layers that comprise the NN architecture of the candidate generator model is presented in Table 3 in Appendix B.4.

Figure 2. DNN candidate generator model architecture of the RecSys (Covington et al., 2016).

3.4. Ranking

The ranking phase of the RecSys receives the candidate recommendations from the candidate generation phase and ranks them by user relevance. Since the ranker only has to be invoked for a small subset of all movies in the corpus, processing speed is less crucial than for the candidate generator model. Therefore, a more precise and complex representation of the user’s interests can be learned. Note, however, that the model must be trained within the FL environment and thus should not be too large.

Learning to rank is a well-studied problem within ML (Cao et al., 2007), and there are numerous approaches, ranging from simple point-wise models, which directly predict a rank, through pair-wise models, which learn to rank two items relative to each other, to more elaborate list-wise models, which learn to rank items in a list (Liu, 2009). In the case of a movie RecSys, the ranker model can be implemented as a rating predictor, where the predicted rating is used to sort the items. We decided on this simple approach. It turned out that a simple regression model tended to learn to predict the mean rating if trained without any constraints. Therefore, we decided to re-formulate the problem as a classification task, as the dataset being used contains a discrete set of possible ratings between 0.5 and 5.0 in steps of 0.5, resulting in 10 distinct classes. This approach performs considerably better.
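The mapping between ratings and classes, and the use of the predicted class for sorting, can be sketched as follows. Taking the most probable class as the predicted rating is an assumption of this sketch, and the function names and example probabilities are hypothetical.

```python
from typing import Dict, List, Sequence

# The ten possible ratings: 0.5, 1.0, ..., 5.0
RATINGS = [0.5 * step for step in range(1, 11)]


def rating_to_class(rating: float) -> int:
    """Maps a rating between 0.5 and 5.0 to a class index between 0 and 9."""
    index = round(rating * 2) - 1
    if not 0 <= index < len(RATINGS) or RATINGS[index] != rating:
        raise ValueError(f"{rating} is not a valid rating")
    return index


def predicted_rating(class_probabilities: Sequence[float]) -> float:
    """Returns the rating of the most probable class (an illustrative choice)."""
    best = max(range(len(class_probabilities)), key=lambda i: class_probabilities[i])
    return RATINGS[best]


def rank_by_predicted_rating(predictions: Dict[int, Sequence[float]]) -> List[int]:
    """Sorts movie IDs by their predicted rating, highest first."""
    return sorted(predictions, key=lambda movie: predicted_rating(predictions[movie]), reverse=True)


def peaked(index: int) -> List[float]:
    """Builds a toy distribution that puts most of its mass on one class."""
    return [0.82 if i == index else 0.02 for i in range(len(RATINGS))]


predictions = {101: peaked(7), 102: peaked(9), 103: peaked(2)}
ranking = rank_by_predicted_rating(predictions)
```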

The base architecture of the ranker model is almost equivalent to the design of the candidate generator. The input features, user ID, movie ID, and movie genres are embedded using embedding layers. The optimal embedding sizes were experimentally determined to be 32 for users, 128 for movies, and 16 for genres. A detailed description of these experiments can be found in Appendix C.1. The genre embeddings are then averaged and the resulting vectors are concatenated to form the input of a tower-like classifier, which consists of a single fully-connected layer that outputs a probability distribution over the set of possible ratings. Just like in the case of the candidate generator, we considered adding multiple hidden layers, but experiments with varying numbers of hidden layers determined that a single layer is sufficient. The hidden layer experiments are described in Appendix C.2. Again, more movie-level or user-level information could be added as input features here. Only rating timestamps are provided in the dataset which can be utilized as additional user-level information. By correlating the movies in the dataset with an online movie database, further movie-level information can be retrieved. Therefore, we decided to add the age of the rating and the age of the movie as further input features to determine the efficacy of adding more input signals to the model. A detailed discussion of this can be found in Section 4.1.2. The architecture of the ranker model is shown in Figure 3.

Figure 3. DNN ranker model architecture of the RecSys.

Since the classes, distinguished by the ranker model, have a hierarchical relation to each other, we considered using other loss functions than softmax cross-entropy. We have experimentally tested other loss functions, but in practice, softmax cross-entropy provides the best results. The results of the experiment can be found in Appendix C.3. A detailed breakdown of the layers that comprise the NN architecture of the ranker model is presented in Table 4 in Appendix C.4.

3.5. Re-ranking

The re-ranking phase is an optional step that is often overlooked in RecSys research, but plays a crucial role in real-world applications. It implements hand-crafted rules to improve recommendations. Examples are the removal of click-bait content, enforcing age restrictions, ensuring freshness, and promoting predefined content. Here, ensuring freshness is probably one of the most important aspects. The candidate generation and ranking phases do not take freshness of the recommended content into consideration, as the ratio between novel and more established content is often hand-tuned (also described as exploration vs. exploitation trade-off). Age restrictions are also important, as the candidate generator model has no filter in place to prevent recommending age-restricted movies to underage users. Both the candidate generator and the ranker models are static, i.e., given the same input, they will always produce the same output (unless further trained in the meantime). Therefore, the re-ranking phase should also randomly select a subset of the final recommendations, e.g., weighted by the rank predicted by the ranker model, in order to ensure that the user will see something different every time they are presented with recommendations. Mixing in some predefined content, for example movies that have just been released, is an effective way of overcoming the cold-start problem for new content. This would increase the chances of new movies being watched and thus generating training data that can be used to recommend the movies later. Finally, the topic of click-bait detection is an interesting one, but it is considered out-of-scope in this work. As the re-ranking phase only consists of hand-crafted rules and thus does not affect the proposed method, we will abstain from delving deeper into its implementation.
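Although the implementation of the re-ranking phase is not part of the proposed method, a few of the rules mentioned above can be illustrated with a short Python sketch. Everything in it is hypothetical: the function `re_rank`, the inverse-position weighting, and the example data were chosen only to show how age filtering, promotion of predefined content, and rank-weighted random selection could interact.

```python
import random
from typing import Dict, List, Sequence


def re_rank(
    ranked_movies: Sequence[int],
    user_age: int,
    min_age: Dict[int, int],
    promoted: Sequence[int],
    num_results: int,
    rng: random.Random,
) -> List[int]:
    """Applies age filtering, content promotion, and rank-weighted sampling."""

    def allowed(movie: int) -> bool:
        # Movies without an entry are assumed to have no age restriction.
        return min_age.get(movie, 0) <= user_age

    # Promoted content (e.g., new releases) is placed first.
    result = [movie for movie in promoted if allowed(movie)][:num_results]
    pool = [movie for movie in ranked_movies if allowed(movie) and movie not in result]
    # Higher-ranked movies receive larger sampling weights.
    weights = [1.0 / (position + 1) for position in range(len(pool))]
    # Sample without replacement so that repeated calls yield varying lists.
    while pool and len(result) < num_results:
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        result.append(pool.pop(index))
        weights.pop(index)
    return result


ranked = [1, 2, 3, 4, 5, 6, 7, 8]
restrictions = {2: 18, 5: 16}
final = re_rank(ranked, user_age=12, min_age=restrictions, promoted=[99],
                num_results=4, rng=random.Random(0))
```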

3.6. Federated Recommender Systems at Scale Using Queue-Based Federated Learning

Many variants and adaptations were introduced to FL, among which FedAvg (McMahan et al., 2017) is one of the most prevalent. In FedAvg, the server initializes a global model, which is sent to all clients. The clients then proceed to train the model on their local data and send the updated model back to the central server. The central server then aggregates the client models into a new global model by averaging them (usually the mean weighted by the number of samples that the clients trained on is used). The process can be seen in Figure 4 and the algorithm is detailed in Algorithm 1. FedAvg has proven successful in many FL tasks despite theoretical predictions suggesting otherwise (Wang et al., 2022).

Figure 4. The typical FedAvg scenario with a central coordinating server and several clients with their local data. The central server sends a global model to the clients, which then perform training on local data. The resulting updated local models are sent back to the central server, which aggregates them into a new global model by averaging the model weights.

Input: $C$ is the set of all clients, $\mathcal{D}_i$ is the local dataset of client $c_i \in C$, $T$ is the number of communication rounds, $N$ is the number of clients per communication round, $B$ is the batch size, $E$ is the number of local epochs, and $\eta$ is the learning rate
Output: Global model parametrization $\boldsymbol{\theta}_T$

/* Runs on the central server */
Initialize $\boldsymbol{\theta}_0$
for each communication round $t = 1, \dots, T$ do
    $C_t \leftarrow$ $N$ random clients sub-sampled from $C$
    for each client $c_i \in C_t$ in parallel do
        $\boldsymbol{\theta}^i_t \leftarrow$ UpdateClient$(\boldsymbol{\theta}_{t-1}, i)$
    $\boldsymbol{\theta}_t \leftarrow \sum_{c_i \in C_t} \frac{\left|\mathcal{D}_i\right|}{\left|\bigcup_{c_j \in C_t} \mathcal{D}_j\right|} \boldsymbol{\theta}^i_t$
return $\boldsymbol{\theta}_T$

/* Runs on client $i$ */
UpdateClient($\boldsymbol{\theta}$, $i$):
    for each local epoch $e = 1, \dots, E$ do
        split $\mathcal{D}_i$ into $\left\lceil \frac{\left|\mathcal{D}_i\right|}{B} \right\rceil$ batches of size $B$
        for each batch $b = 1, \dots, \left\lceil \frac{\left|\mathcal{D}_i\right|}{B} \right\rceil$ do
            $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla \mathcal{L}_i(\mathcal{D}_{i,b}; \boldsymbol{\theta})$
    return $\boldsymbol{\theta}$

Algorithm 1. Federated Averaging (McMahan et al., 2017)
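For readers who prefer executable code, the following single-process Python simulation mirrors the structure of Algorithm 1. It is a minimal sketch under strong simplifying assumptions: the model is a toy linear regression $y = wx + b$ with a mean squared error loss, clients are simulated sequentially instead of in parallel, and all function names and hyperparameter values are chosen for illustration only.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def update_client(theta: Sequence[float], data: Sequence[Sample],
                  batch_size: int, epochs: int, lr: float) -> List[float]:
    """Local mini-batch SGD on the toy model y = w * x + b."""
    w, b = theta
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return [w, b]


def weighted_average(models: Sequence[Sequence[float]], sizes: Sequence[int]) -> List[float]:
    """Averages model parameters, weighted by the local dataset sizes."""
    total = sum(sizes)
    return [sum(size * model[j] for model, size in zip(models, sizes)) / total
            for j in range(len(models[0]))]


def fed_avg(clients: Sequence[Sequence[Sample]], rounds: int, clients_per_round: int,
            batch_size: int, epochs: int, lr: float, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # initial global model
    for _ in range(rounds):
        selected = rng.sample(range(len(clients)), clients_per_round)
        # Every selected client starts from the same global model.
        local_models = [update_client(theta, clients[i], batch_size, epochs, lr)
                        for i in selected]
        theta = weighted_average(local_models, [len(clients[i]) for i in selected])
    return theta


# Toy data: 20 clients with 3 to 10 noise-free samples of y = 2x + 1.
data_rng = random.Random(1)
clients = []
for _ in range(20):
    xs = [data_rng.uniform(-1.0, 1.0) for _ in range(data_rng.randint(3, 10))]
    clients.append([(x, 2.0 * x + 1.0) for x in xs])

theta = fed_avg(clients, rounds=100, clients_per_round=5, batch_size=2, epochs=1, lr=0.1)
```

In this toy setting all clients share the same underlying function, so the global model approaches $w = 2$ and $b = 1$; the difficulties discussed next arise when the local data distributions differ.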

A significant challenge of FedAvg lies, however, in dealing with non-i.i.d. client data. The data generating distribution may be different for each client, i.e., the data is not independent and identically distributed between the clients. This means that the local objective of each client may differ, sometimes even significantly, from the global training objective, which may lead to conflicting model updates being sent to the central server that hinder the convergence of the global model. There are different types of non-i.i.d.-ness, which include:

  • Covariate Shift – Local samples may have a different statistical distribution compared to the samples of other clients

  • Prior Probability Shift – The labels of the local samples may have a different statistical distribution compared to the samples of other clients

  • Concept Shift – Local samples have the same labels as other clients, but they correspond to different features, or local samples have the same features as other clients, but they correspond to different labels

  • Imbalanced Data – The data available at the clients may vary significantly in size

Many different techniques have been proposed to alleviate the problems associated with non-i.i.d. data, cf. Zhu et al. (2021) for a timely overview of different techniques.

Clients with limited local data are another issue that has a comparable effect to non-i.i.d.-ness. In the case of movie RecSys’s, it is common that most users have only watched a few dozen or maybe a few hundred movies. This can lead to exceedingly small, noisy updates of the local model, which result in the global model not converging. Both the problem of imbalanced data and small local datasets can be attenuated by weighting the local model updates during aggregation by the local dataset size of the client. But this also has the unwanted effect of suppressing the interests of many users with little training data and amplifying the interests of a few users with a lot of training data.

We address both problems of non-i.i.d.-ness and small local datasets by chaining client trainings together. The central server selects a random subset of the client population for each communication round before further subdividing them into small queues of a specified size. The clients constituting a specific queue are assigned uniformly at random from the client subset. The first client in each queue receives the global model for local training, while each consecutive client receives the local model of the client prior to it. The local models of the last client in each of these queues are then aggregated by the central server, similar to FedAvg. The goal of chaining multiple client trainings is that the resulting model updates are less noisy because they were not only exposed to more data but also to data from multiple different distributions, in contrast to what would normally be possible. Since no client in a queue has information about the origin of its local model nor about its position in the queue, this method is still at least as privacy-preserving as regular FL. We call this technique FedQ. Algorithm 2 shows the exact training protocol that we follow.
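The queue-based training described above can be simulated in a single process as follows. This is a sketch under simplifying assumptions, not the implementation used in this work: the model is a toy linear regression $y = wx + b$, each client holds only three samples drawn from its own narrow input range to mimic small, non-i.i.d. local datasets, the hand-over of models between clients is modeled by a plain loop, and all names and hyperparameter values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def update_client(theta: Sequence[float], data: Sequence[Sample],
                  batch_size: int, epochs: int, lr: float) -> List[float]:
    """Local mini-batch SGD on the toy model y = w * x + b."""
    w, b = theta
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return [w, b]


def fed_q(clients: Sequence[Sequence[Sample]], rounds: int, clients_per_round: int,
          queue_length: int, batch_size: int, epochs: int, lr: float,
          seed: int = 0) -> List[float]:
    if clients_per_round % queue_length != 0:
        raise ValueError("the queue length must divide the number of clients per round")
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # initial global model
    for _ in range(rounds):
        # The sample is returned in random order, so consecutive slices of it
        # form queues whose members are assigned uniformly at random.
        selected = rng.sample(range(len(clients)), clients_per_round)
        queue_models, queue_sizes = [], []
        for start in range(0, clients_per_round, queue_length):
            model = list(theta)  # the first client receives the global model
            size = 0
            for i in selected[start:start + queue_length]:
                # Each client continues training the model of its predecessor.
                model = update_client(model, clients[i], batch_size, epochs, lr)
                size += len(clients[i])
            queue_models.append(model)
            queue_sizes.append(size)
        # Only the last model of each queue is aggregated, weighted by the
        # total number of samples seen within the queue.
        total = sum(queue_sizes)
        theta = [sum(s * m[j] for m, s in zip(queue_models, queue_sizes)) / total
                 for j in range(len(theta))]
    return theta


# Toy data: 40 clients, each with 3 samples of y = 2x + 1 from a narrow x-range.
data_rng = random.Random(1)
clients = []
for _ in range(40):
    center = data_rng.uniform(-1.0, 1.0)
    xs = [center + data_rng.uniform(-0.1, 0.1) for _ in range(3)]
    clients.append([(x, 2.0 * x + 1.0) for x in xs])

theta = fed_q(clients, rounds=100, clients_per_round=8, queue_length=4,
              batch_size=2, epochs=1, lr=0.1)
```

With a queue length of 1, this procedure reduces to FedAvg, since every queue then consists of a single client that starts from the global model.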

Input: $C$ is the set of all clients, $\mathcal{D}_i$ is the local dataset of client $c_i \in C$, $T$ is the number of communication rounds, $N$ is the number of clients per communication round, $L$ is the client queue length, where $L$ divides $N$, $B$ is the batch size, $E$ is the number of local epochs, and $\eta$ is the learning rate
Output: Global model parametrization $\boldsymbol{\theta}_T$

/* Runs on the central server */
Initialize $\boldsymbol{\theta}_0$
for each communication round $t = 1, \dots, T$ do
    $C_t \leftarrow N$ random clients sub-sampled from $C$
    for $k = 1, \dots, \frac{N}{L}$ in parallel do
        /* First client in the $k^{\text{th}}$ queue receives the global model */
        $\boldsymbol{\theta}^k_t \leftarrow \boldsymbol{\theta}_{t-1}$
        /* Dataset sizes of the clients in the queue are aggregated for the weighted mean */
        $s_k \leftarrow 0$
        for each client $c_i$ in $C_t[(k-1)L+1 : kL]$ do
            $\boldsymbol{\theta}^k_t \leftarrow$ UpdateClient$(\boldsymbol{\theta}^k_t, i)$
            $s_k \leftarrow s_k + \left|\mathcal{D}_i\right|$
    $\boldsymbol{\theta}_t \leftarrow \sum_{k=1}^{\frac{N}{L}} \frac{s_k}{\left|\bigcup_{c_i \in C_t} \mathcal{D}_i\right|} \boldsymbol{\theta}^k_t$
return $\boldsymbol{\theta}_T$

/* Runs on client $i$ */
UpdateClient($\boldsymbol{\theta}, i$):
    for each local epoch $e = 1, \dots, E$ do
        Split $\mathcal{D}_i$ into $\left\lceil\frac{\left|\mathcal{D}_i\right|}{B}\right\rceil$ batches of size $B$
        for each batch $b = 1, \dots, \left\lceil\frac{\left|\mathcal{D}_i\right|}{B}\right\rceil$ do
            $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla \mathcal{L}_i(\mathcal{D}_{i,b}; \boldsymbol{\theta})$
    return $\boldsymbol{\theta}$
Algorithm 2 FedQ
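The training protocol of Algorithm 2 can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation used for the experiments: the model is a hypothetical one-parameter linear model with squared loss, the queues are processed sequentially rather than in parallel, and the names `update_client` and `fedq_round` as well as the toy data are introduced here purely for illustration.

```python
import random

def update_client(theta, dataset, batch_size, epochs, lr):
    """Local mini-batch SGD on a toy model y = theta * x with squared loss."""
    for _ in range(epochs):
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            # Gradient of the batch mean of 0.5 * (theta * x - y)^2 w.r.t. theta.
            grad = sum((theta * x - y) * x for x, y in batch) / len(batch)
            theta -= lr * grad
    return theta

def fedq_round(theta_global, client_datasets, n_clients, queue_len,
               batch_size, epochs, lr, rng):
    """One communication round: sample N clients and chain them in queues of length L."""
    assert n_clients % queue_len == 0, "L must divide N"
    sampled = rng.sample(range(len(client_datasets)), n_clients)
    total_size = sum(len(client_datasets[i]) for i in sampled)
    theta_next = 0.0
    for k in range(n_clients // queue_len):
        queue = sampled[k * queue_len:(k + 1) * queue_len]
        theta_k, s_k = theta_global, 0  # first client starts from the global model
        for i in queue:
            # Each client continues training the model of its predecessor.
            theta_k = update_client(theta_k, client_datasets[i], batch_size, epochs, lr)
            s_k += len(client_datasets[i])
        # Only the model of the last client in each queue enters the weighted mean.
        theta_next += s_k / total_size * theta_k
    return theta_next

rng = random.Random(0)
# Hypothetical toy data: 8 clients with 4 noiseless samples of y = 3 * x each.
clients = [[(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(4))]
           for _ in range(8)]
theta = 0.0
for _ in range(50):
    theta = fedq_round(theta, clients, n_clients=4, queue_len=2,
                       batch_size=2, epochs=1, lr=0.5, rng=rng)
```

Setting `queue_len=1` reduces the sketch to a FedAvg-style round, which makes the relationship between the two schemes explicit.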

For the complexity analysis, we compare FedQ to its baseline, FedAvg, with respect to the expected time the central server needs to wait before it can aggregate the updated model parameters of the clients in each communication round. The number of local update steps on the $i$th client is given by $E \cdot \frac{\left|\mathcal{D}_i\right|}{B}$. This implies that each client performs $E \cdot \left[\frac{\mathbb{E}[\left|\mathcal{D}_i\right|]}{B}\right] = E \cdot \left[\frac{\sum_{i=1}^{|C|} \left|\mathcal{D}_i\right|}{|C| \cdot B}\right]$ steps on average, where the expectation is over the random selection of a client, which follows the uniform distribution (McMahan et al., 2017). Therefore, the expected time complexity for a single communication round, depending on the utilized algorithm, can be expressed as:

FedAvg: $\mathcal{O}\left(P \cdot E \cdot \left[\frac{\sum_{i=1}^{|C|} \left|\mathcal{D}_i\right|}{|C| \cdot B}\right]\right)$
FedQ: $\mathcal{O}\left(\mathbf{L} \cdot P \cdot E \cdot \left[\frac{\sum_{i=1}^{|C|} \left|\mathcal{D}_i\right|}{|C| \cdot B}\right]\right)$,

where $P$ denotes the time of a forward and backward pass on the client’s local model on a batch of data (Dimitrov et al., 2022). Furthermore, it was assumed that the communication time with the central server is dominated by the average local training time for each client. In summary, FedQ requires $L$ times as much time as FedAvg for each communication round.
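As a small worked example of the expressions above, the following sketch evaluates the expected waiting time per communication round for both schemes. The dataset sizes, the pass time, and the function name `expected_round_time` are hypothetical values chosen for illustration only; constant factors hidden by the $\mathcal{O}$ notation are ignored.

```python
def expected_round_time(dataset_sizes, batch_size, epochs, pass_time, queue_len=1):
    """Expected time per round: L * P * E * mean(|D_i|) / B (queue_len=1 is FedAvg)."""
    mean_size = sum(dataset_sizes) / len(dataset_sizes)
    steps_per_client = epochs * mean_size / batch_size
    return queue_len * pass_time * steps_per_client

sizes = [20, 100, 180]  # hypothetical local dataset sizes |D_i|
t_fedavg = expected_round_time(sizes, batch_size=10, epochs=2, pass_time=0.01)
t_fedq = expected_round_time(sizes, batch_size=10, epochs=2, pass_time=0.01, queue_len=4)
```

With these assumed values, a queue length of 4 makes a FedQ round take four times as long as a FedAvg round, in line with the factor $L$ stated above.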

During the development of FedQ, further techniques for addressing non-i.i.d.-ness and small local datasets in FL that are partially comparable to FedQ have emerged, for which the similarities with and differences to FedQ are discussed in Appendix F.

3.7. Achieving Communication Efficiency

Besides the problems of data heterogeneity and clients having very little local data, constantly communicating model parametrizations can also lead to a significant overhead. The candidate generator and the ranker models are, depending on the sizes of the embeddings and the number of hidden layers, between 60 MB and 120 MB in size. Given the massive scale of the user base of a typical movie RecSys, using FL can result in multiple gigabytes of data that must be communicated in each communication round, even at low client sub-sampling rates. Furthermore, the clients are relatively resource-constrained, so communication reduction techniques that require complex processing, such as pruning or learned quantization, are not an option.
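To make the order of magnitude concrete, the following back-of-the-envelope sketch multiplies the model sizes stated above by an assumed number of clients per round. The value of 100 clients per round is a hypothetical example, not a figure from our experiments, and the function `round_traffic_gb` is introduced only for this illustration; it assumes that every selected client receives one model and sends one model back.

```python
def round_traffic_gb(clients_per_round, model_size_mb, directions=2):
    """Uncompressed traffic per round in GB (download and upload per client)."""
    return clients_per_round * model_size_mb * directions / 1000.0

low = round_traffic_gb(clients_per_round=100, model_size_mb=60)    # smallest model
high = round_traffic_gb(clients_per_round=100, model_size_mb=120)  # largest model
```

Under these assumptions, a single round already moves between 12 GB and 24 GB of uncompressed parameters.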

A recent standard for NNC, ISO/IEC 15938-17:2022 (MPEG-7 part 17) (Moving Picture Experts Group (MPEG) working group of ISO/IEC, 2021; International Organization for Standardization (ISO), 2022; Haase et al., 2021; Kirchhoffer et al., 2022)², which is based on the Deep Context-Adaptive Binary Arithmetic Coding (DeepCABAC) NN compression algorithm (Wiedemann et al., 2020a), has shown excellent compression results and requires little or no preprocessing. Furthermore, it has already been shown to exhibit remarkably high performance in an FL setup (Neumann et al., 2020). In its coding core, NNC combines specific quantization methods that are adapted to the NN layers, followed by a context-adaptive binary arithmetic coding method, which reduces data redundancy.

² A standards-compliant implementation of the NNC standard, under a permissive license, is available on GitHub: https://github.com/fraunhoferhhi/nncodec.
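The two-stage idea of quantization followed by entropy coding can be illustrated with a deliberately simplified sketch. It is not DeepCABAC and does not use the NNC codec: it applies plain uniform quantization with a fixed, assumed step size and then reports the empirical Shannon entropy as a rough estimate of what a memoryless entropy coder could achieve. The layer weights are synthetic, and all names are hypothetical.

```python
import math
import random
from collections import Counter

def quantize(weights, step):
    """Uniform quantization: map each weight to the nearest integer multiple of step."""
    return [round(w / step) for w in weights]

def entropy_bits_per_symbol(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.05) for _ in range(10_000)]  # synthetic layer weights
step = 0.02                                              # assumed quantization step size
levels = quantize(weights, step)
bits = entropy_bits_per_symbol(levels)
ratio = bits / 32.0  # estimated size relative to 32-bit floating point storage
```

A context-adaptive coder such as the one in NNC additionally exploits dependencies between neighboring weights, which this memoryless estimate ignores.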

Employing the NNC standard to compress the upstream and downstream communication in our proposed FedRec is motivated by the fact that the coding engine at its core, DeepCABAC, permits higher compression performance on a variety of NN architectures than comparable techniques in the literature (Wiedemann et al., 2020b). Wiedemann et al. showed that NNs can be compressed by a factor of 50.6 on average with negligible loss in performance. Comparable coders based on the weighted Lloyd algorithm (Lloyd, 1982; Choi et al., 2017) or uniform quantization (Lin et al., 2016; Choi et al., 2017) only managed to compress the models by factors of 13.6 and 5.7, respectively. For example, the authors obtained a compression ratio of 1.58% with an accuracy of 69.43% for the VGG16 architecture, whereas comparable literature reports only a compression ratio of 2.05% with an accuracy of 68.83%. Similar results are obtained for the MobileNet-v1 and MixNet architectures, resulting in a compression ratio gain of 3.6 and 92.1 percentage points, respectively, without affecting the model performance. These results are obtained by simply applying DeepCABAC; they do not require the use of any optimization techniques, such as bias correction, distillation, or fine-tuning, rendering the NNC standard a straightforward plug-and-play procedure (Wiedemann et al., 2020b; Neumann et al., 2020).

There are more specialized techniques for reducing the communication overhead in FL that are, however, less comparable to NNC as they are not based on entropy coding. For example, FedFast (Muhammad et al., 2020) is an alternative to FedAvg, which increases the convergence speed of the models and thus reduces the number of times updates have to be communicated between server and client. Muhammad et al. provide an experimental evaluation of their method on MovieLens 1M (Harper and Konstan, 2015), MovieLens 100K (Harper and Konstan, 2015), TripAdvisor hotel reviews (Alam et al., 2016), and the Yelp dataset (Yelp, 2021). On MovieLens 100K, FedFast required ~24.2%³,⁴ of the communication rounds to achieve the highest performance of FedAvg, which corresponds to around 4 times less data communicated. On MovieLens 1M, FedFast reached the best performance of FedAvg even faster, i.e., after only ~1.13%⁴ of the communication rounds that FedAvg required, which means that approximately 88 times less data was communicated. For TripAdvisor, FedFast only required ~7.5%⁴,⁵ of the time that FedAvg required to reach its highest performance, which resulted in around 13 times less communication cost. Finally, on the Yelp dataset, FedFast required only ~17.8%⁴ of the communication rounds that FedAvg needed to reach its highest performance. This reduces the communication cost of FedFast by almost a factor of 6. These results are, however, not comparable to the compression performance of other methods, as they measure the communication cost required to reach the highest accuracy of FedAvg, which performs very poorly as compared to FedFast and does not even converge in the case of the TripAdvisor and Yelp datasets. Under realistic conditions, one would not stop the training there, but train the model until convergence, which in some cases happened much later. For example, the training curves presented in Figure 3 (Muhammad et al., 2020) seem to suggest that for MovieLens 100K and the Yelp dataset FedFast only reached its own highest accuracy at the very end of the training, after 1,000 communication rounds.

³ Muhammad et al. claim that for MovieLens 100K FedFast already reached the same performance as FedAvg at communication round 30, which would correspond to approximately 33 times less communicated data, but their own training graphs suggest that this only happened at approximately communication round 196, which corresponds to the factor of approximately 4 that we reported here.
⁴ Muhammad et al. do not publish communication cost savings, so the values presented here were read from the training curves in Figure 3 (Muhammad et al., 2020) and are therefore only approximations.
⁵ Again, Muhammad et al. claim that FedFast was 20 times faster than FedAvg, although their own training curves suggest it was closer to the factor of 13 reported here.
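The conversion from the fraction of communication rounds to the reported reduction factors is simply the reciprocal, as the following sketch shows. The fractions are the approximate values quoted above; the label of the fourth entry assumes that it refers to the Yelp dataset, which is the only one of the four evaluated datasets not named explicitly at that point.

```python
def reduction_factor(fraction_of_rounds):
    """If only a fraction of the rounds is needed, traffic shrinks by its reciprocal."""
    return 1.0 / fraction_of_rounds

fractions = {
    "MovieLens 100K": 0.242,
    "MovieLens 1M": 0.0113,
    "TripAdvisor": 0.075,
    "Yelp (assumed)": 0.178,
}
factors = {name: reduction_factor(f) for name, f in fractions.items()}
```

The resulting factors are roughly 4, 88, 13, and slightly below 6, matching the values given in the text.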

Another interesting approach is that of FedKD (Wu et al., 2022), where the clients train a teacher model, which is then distilled into a smaller student model. FL clients communicate the compressed gradients of the student models, which substantially reduces the communication overhead. Wu et al. report that they accrued 18.6 times less communication cost per client on the MIND (Wu et al., 2020) dataset and 19.9 times less communication cost per client on the ADR (Weissenbacher et al., 2018) dataset as compared to directly using the larger teacher model, with no loss in performance. Both FedFast and FedKD, however, require substantial changes to the FL pipeline, whereas DeepCABAC consistently offers a high, in many cases even the highest, reduction in size, while being a plug-and-play solution that only needs to be applied to the NN model. This justifies our choice of utilizing the NNC framework for our FedRec, since we can expect to achieve higher compression performance than the coding techniques previously proposed in the literature, without having to integrate any complex optimization techniques.

3.8. Data Security & Privacy Protection

To achieve the goal of data security and privacy protection, FL incorporates the principles of data minimization, i.e., processing the data as early as possible (data processing is carried out on the client’s device), only collecting data that is absolutely necessary (e.g., in FedAvg only model parametrizations are transmitted), and discarding any obtained data as soon as possible (after the client models were aggregated into an updated global model, the local models are discarded). Furthermore, FL employs the principle of anonymization, i.e., no conclusions about the originator shall be drawn from the respective data. In terms of FL, this implies that, ideally, only sending training updates should prevent the central server from deriving any further information about its clients. In practice, however, it has been shown that local samples can be reverse-engineered from the gradients (Geiping et al., 2020) in FedSGD. To alleviate this problem, anonymization techniques, such as differential privacy, where random noise is added to client data communication (Wei et al., 2020a), or homomorphic encryption, where encrypted client updates can be aggregated without decrypting them (Phong et al., 2018; Fang and Quan, 2021), can be utilized.

Generally, these kinds of attacks are performed by the central server, who has access to the gradient updates sent by the clients. The attacks reconstruct the client’s input data by starting with some arbitrary, e.g., randomly initialized input data, and adapting this dummy data in such a way that the distance between its gradient and the actual gradient received from the client is minimized, for example, by solving the following optimization problem (Geiping et al., 2020; Dimitrov et al., 2022):

(2) $\underset{\widetilde{x}}{\text{argmin}}\;\text{dist}(\nabla_{\boldsymbol{\theta}}\,\ell(\widetilde{x},y;\boldsymbol{\theta}),\nabla_{\boldsymbol{\theta}}\,\ell(x,y;\boldsymbol{\theta}))\text{,}$

where $\widetilde{x}$ denotes the dummy input data, $x$ the unseen training sample of client $c$, $\nabla_{\boldsymbol{\theta}}\,\ell(x,y;\boldsymbol{\theta})$ the intercepted gradient of client $c$, $\nabla_{\boldsymbol{\theta}}\,\ell(\widetilde{x},y;\boldsymbol{\theta})$ the gradient computed on the dummy input data $\widetilde{x}$ with the ground-truth $y$, which can, for example, be extracted from the gradient of the output layer (Zhao et al., 2020), $\boldsymbol{\theta}$ the parametrization of the updated local model of client $c$, and $\text{dist}(\cdot)$ a distance function. For example, Geiping et al. (2020) show that in many cases it is possible to use such a technique to reconstruct training images almost perfectly from the gradient, thus demonstrating that FedSGD is not as privacy-preserving as thought. A related method, proposed by Chai et al., is able to reverse-engineer a user’s rating information from two consecutive gradient updates in a FedRec based on matrix factorization, where the factorization is learned by the users using stochastic gradient descent (SGD) (Chai et al., 2021). The attack proposed by Chai et al. is specifically tailored towards federated matrix factorization and is therefore not applicable to our scenario. Furthermore, both methods need to intercept the client’s gradient updates and are therefore only pertinent to FedSGD. FedQ, on the other hand, which is employed by us, does not share the gradient but the updated local model and is thus not vulnerable to these kinds of attacks.
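The gradient-matching objective of Equation (2) can be illustrated on a toy problem. The sketch below is not the attack of Geiping et al. or Dimitrov et al.; it assumes a hypothetical linear model with squared loss, a squared Euclidean distance, a known label $y$, and plain gradient descent with a backtracking step size. All function names and numerical values are chosen for illustration only, and the sketch makes no claim about recovering the input exactly.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def model_gradient(theta, x, y):
    """Gradient w.r.t. theta of the loss 0.5 * (theta . x - y)^2."""
    residual = dot(theta, x) - y
    return [residual * xi for xi in x]

def distance(theta, x_dummy, y, target_grad):
    """Squared Euclidean distance between the dummy and the intercepted gradient."""
    g = model_gradient(theta, x_dummy, y)
    return sum((a - b) ** 2 for a, b in zip(g, target_grad))

def distance_gradient(theta, x_dummy, y, target_grad):
    """Analytic gradient of `distance` w.r.t. the dummy input (valid for this toy model)."""
    residual = dot(theta, x_dummy) - y
    diff = [residual * xi - gi for xi, gi in zip(x_dummy, target_grad)]
    coupling = dot(x_dummy, diff)
    return [2.0 * (t * coupling + residual * d) for t, d in zip(theta, diff)]

def reconstruct(theta, y, target_grad, x_init, steps=200):
    """Minimize the gradient distance over the dummy input with backtracking steps."""
    x_dummy = list(x_init)
    for _ in range(steps):
        grad = distance_gradient(theta, x_dummy, y, target_grad)
        current = distance(theta, x_dummy, y, target_grad)
        lr = 1.0
        while lr > 1e-12:
            candidate = [xi - lr * gi for xi, gi in zip(x_dummy, grad)]
            if distance(theta, candidate, y, target_grad) < current:
                x_dummy = candidate  # accept only steps that reduce the distance
                break
            lr *= 0.5
    return x_dummy

theta = [0.5, -1.0, 2.0]   # hypothetical model parameters known to the attacker
x_true = [1.0, 2.0, -1.0]  # private training sample, never seen by the attacker
y = 1.0                    # label, assumed to be known
intercepted = model_gradient(theta, x_true, y)
x_start = [0.0, 0.0, 0.0]
initial_distance = distance(theta, x_start, y, intercepted)
x_reconstructed = reconstruct(theta, y, intercepted, x_start)
final_distance = distance(theta, x_reconstructed, y, intercepted)
```

Because only steps that decrease the distance are accepted, the objective cannot increase during the optimization; whether the minimizer coincides with the true sample depends on the model and on the initialization.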

Nevertheless, Dimitrov et al. (2022) showed that it is possible to reconstruct training images in realistic FedAvg settings. Although the method succeeds with a single client relying on many local training rounds, attacking aggregated parameter updates from multiple clients, even if only a few of them are used, significantly degrades the reconstruction performance. Using the Federated EMNIST (FEMNIST) dataset for demonstration, they specifically showed that attacking the averaged updates of just four clients instead of one significantly lowers the average reconstruction performance of images with peak signal-to-noise ratios of 20 or above by 35.8 percentage points, evaluated on 100 randomly selected clients from the training set. When conducting this experiment, they chose an optimal configuration of local epochs and batch sizes for the clients. In addition, they rely on the unrealistic assumption that the label counts are known. Having to estimate them degraded the attack performance by 17 percentage points using the updated parameters of only a single client. In FedAvg, an attacker can easily retrieve the parameter updates of individual clients, thus making this kind of attack highly effective. By the very nature of FedQ, however, a potential attacker usually only receives aggregated parameter updates from multiple clients. Thus, using a reasonably large queue of clients should guarantee a high level of data security.

Some recent works have tried to employ gradient/parameter obfuscation to counteract these kinds of attacks. For example, differential privacy is an obfuscation method, where random noise is added to the client updates. While differential privacy is one of the most prevalent obfuscation schemes, others, like gradient quantization and magnitude pruning, have been proposed. For example, Wei et al. (2020b) and Zhu et al. (2019) showed that gradient sparsification is a well-functioning approach to mitigate data reconstruction attacks. Ovi et al. (2023) demonstrated the efficacy of using mixed precision quantization to counteract gradient leakage attacks. They quantized the model gradients of the clients after local training to 16-bit and 8-bit integers before sending them to the central server and showed experimentally that no information was leaked. They ran the attack for 450 iterations both on FL setups where the communicated gradients were quantized, as well as baseline FL setups without gradient quantization using the Modified NIST Database (MNIST), Fashion-MNIST, and CIFAR-10 datasets. In the baseline experiments, training images could be extracted from the gradients after 20 iterations for MNIST, 20 iterations for Fashion-MNIST, and 40 iterations for CIFAR-10. In the experiments that applied gradient quantization they were not able to extract any training samples, even after 450 iterations of their attack. Our NNC module is capable of using an arbitrary number of quantization points for quantizing model updates, where the number of quantization points is fine-tuned for each layer. This results in a quantization that goes well below 16-bit and in many cases even below 8-bit quantization, which should result in better obfuscation.

These works are good indicators that gradient obfuscation techniques can be successfully employed to counteract attacks such as those proposed by Geiping et al. (2020) and Dimitrov et al. (2022). Yue et al. (2023), however, call the effectiveness of gradient obfuscation into question by proposing a novel data reconstruction attack scheme. So far, they have only shown their attack to be effective in the domain of image classification, which is a special case: even reconstructed images that diverge considerably from the actual input image may contain enough visual information for human observers, whereas tabular data, as used in FedRecs, would not be usable with the same amount of reconstruction error. Also, they have only tried small batch sizes with a small number of local epochs and have only shown uniform quantization. Therefore, we still conjecture that the error induced by the mixed-precision quantization of our NNC module may successfully obfuscate the information contained in the parameter updates sent to the central server or at least make it much harder for attackers to recover any useful information. This, however, remains to be tested in future work.

Finally, most of these attacks assume the central server to be the culprit who wants to reconstruct the input data of its clients. We want to note that outside attackers are usually incapable of intercepting any data from the FL process as simple techniques, such as employing SSL/TLS, can effectively mitigate these kinds of attacks.

This concludes the description of our newly proposed FedRec, which consists of a three-staged recommendation architecture, including a candidate generation, ranking, and re-ranking stage. Furthermore, the RecSys was extended to use FL, applying the developed FedQ method to effectively operate with extremely high numbers of heterogeneous clients. The communication overhead introduced by constantly communicating parameter updates between the central server and the clients is alleviated by compressing the model parametrizations using a state-of-the-art NN compression scheme. Finally, we have discussed the data security and privacy protection capabilities of the proposed architecture. In the following Section 4, we will evaluate the performance of the FedRec system experimentally.

4. Experiments

In this section, we describe the experiments performed using our FedRec and evaluate its performance. We first describe how the dataset was acquired and processed. Then, we lay out a non-FL baseline to which we compare the performance of the FL system. Next, we demonstrate that standard FedAvg only yields a moderate performance before applying the FedQ algorithm to improve performance to equal or even exceed the performance of the non-FL baseline. Finally, we show how the new NNC standard can be utilized to significantly decrease the communication overhead. All experiments were performed using PyTorch (Paszke et al., 2019).

4.1. Dataset

Among the datasets suitable for a movie RecSys, the MovieLens dataset by Harper and Konstan (2015) is one of the most widely known and used. MovieLens comes in multiple different flavors, among which the 25M variant is the latest stable benchmark dataset. It contains more than 25 million ratings across almost 60,000 movies made by more than 162,000 users. The MovieLens datasets consist of users, movies, ratings, and tags. As the 25M flavor of the MovieLens dataset is a stable benchmark dataset, it was chosen for our experiments.

4.1.1. Dataset Analysis

For the candidate generation model, we treat the ratings of the MovieLens dataset as movie watches to predict future watches from past viewing behavior. Therefore, the temporal cohesion of the ratings is particularly important. During an initial screening of the dataset, we observed that the data was inconsistent with “normal” viewing behavior, at least for a small number of random samples. For example, some users rated an infeasible number of movies in a single day, while other users had an impossibly high number of total ratings. Therefore, the MovieLens 25M dataset was inspected more closely in terms of four different metrics: (1) average times between ratings of all users in the dataset, i.e., the speed at which users have rated movies, (2) number of ratings per user, (3) number of ratings per movie, and (4) number of ratings cast by rating value. The results are shown in Figure 5.

The MovieLens 25M dataset contains 59,047 movies that have been rated 25,000,095 times by 162,541 users. 87.1% of users have an average time of less than 1 minute and 97.3% have an average time of less than 1 hour between two ratings. On average, there are 32.7 minutes between two ratings. The smallest number of ratings per user is 20 and the highest number of ratings of any user is 32,202. On average, each user has 153.8 ratings. 58.8% of movies have less than 10 ratings and 82.5% have less than 100 ratings. On average, each movie has 423.4 ratings. The smallest number of ratings per movie is 1 and the highest number of ratings of any movie is 81,491. The top-10 most-rated movies have amassed 2.8% of all ratings.
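The per-user metrics used in this analysis can be computed along the following lines. The sketch assumes that the ratings are available as tuples of user ID, movie ID, rating, and Unix timestamp; the function name `per_user_stats` and the toy records are hypothetical. The last line merely re-derives the reported mean number of ratings per user from the totals stated above.

```python
from collections import defaultdict

def per_user_stats(ratings):
    """Return {user_id: (number of ratings, mean time between consecutive ratings)}."""
    timestamps = defaultdict(list)
    for user_id, _movie_id, _rating, ts in ratings:
        timestamps[user_id].append(ts)
    stats = {}
    for user_id, ts_list in timestamps.items():
        ts_list.sort()
        gaps = [later - earlier for earlier, later in zip(ts_list, ts_list[1:])]
        mean_gap = sum(gaps) / len(gaps) if gaps else None  # undefined for one rating
        stats[user_id] = (len(ts_list), mean_gap)
    return stats

# Hypothetical toy records: (user_id, movie_id, rating, unix_timestamp)
toy = [(1, 10, 4.0, 0), (1, 11, 3.5, 30), (1, 12, 5.0, 90), (2, 10, 2.0, 1000)]
stats = per_user_stats(toy)

mean_ratings_per_user = 25_000_095 / 162_541  # totals as reported in the text
```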

Figure 5. In-depth analysis histograms of the MovieLens 25M dataset: (a) average times between ratings of all users in the dataset, (b) number of ratings per user, (c) number of ratings per movie, and (d) number of ratings of a specific value that were cast by the users.

These findings suggest that most of the ratings were performed in a way that indicates that the users of the MovieLens website have mass-rated movies, rather than individually casting the ratings after watching each movie. The ratings per movie are also highly imbalanced, as most movies have few ratings and a few movies have a large number of ratings. This is actually somewhat expected, as there are only a few “blockbuster” movies that many people watch, while most movies are only watched by very few people. Finally, the ratings are heavily skewed towards more positive evaluations: ratings of 3.0 and higher are significantly more prevalent than those of 2.5 and below.

The in-depth analysis suggests that the MovieLens dataset may not be suitable for next watch predictions, as the mass-ratings imply that the temporal order does not necessarily coincide with the order in which the movies were watched. To avoid an ill-posed task from the start, a set of experiments was performed, where the user ratings were sorted in multiple ways: by timestamp in ascending order, by timestamp in descending order, by rating in ascending order, by rating in descending order, and in random order. The results of this experiment are shown in Figure 6. The findings reveal that ordering the movie watches by timestamp yields a higher prediction performance, which is measured in terms of top-100 accuracy, than any other ordering⁶. Ordering by rating already gives a lower prediction performance, but it is still higher than the performance for random order. This means that, despite the ratings not conforming to “normal” viewing behavior, the dataset is actually suitable for the purposes of training the candidate generator, because the assumption of temporal cohesion holds.

⁶ Surprisingly, ordering the ratings from future to present, i.e., predicting past movie watches from future watch behavior, yields a slightly higher performance than the regular temporal ordering. A statistical fluke can be ruled out, as the experiment was repeated five times and the plot shows the minimum, maximum, and mean top-100 accuracy. We have no explanation for this interesting result.
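The five orderings compared in this experiment can be produced with a small helper such as the one below. It is only a sketch: the tuple layout of movie ID, rating, and timestamp, the mode names, and the function `order_ratings` are assumptions made for illustration.

```python
import random

def order_ratings(ratings, mode, rng=None):
    """Reorder one user's ratings; each rating is (movie_id, rating, timestamp)."""
    if mode == "random":
        shuffled = list(ratings)
        (rng or random.Random()).shuffle(shuffled)
        return shuffled
    # Modes: "timestamp_asc", "timestamp_desc", "rating_asc", "rating_desc"
    key_index = {"timestamp": 2, "rating": 1}[mode.split("_")[0]]
    descending = mode.endswith("_desc")
    return sorted(ratings, key=lambda r: r[key_index], reverse=descending)

user_ratings = [(10, 4.0, 300), (11, 2.5, 100), (12, 5.0, 200)]  # hypothetical user
by_time = order_ratings(user_ratings, "timestamp_asc")
by_rating_desc = order_ratings(user_ratings, "rating_desc")
shuffled = order_ratings(user_ratings, "random", rng=random.Random(0))
```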

Figure 6. Validation top-100 accuracy results vs. number of epochs for different temporal orderings in the dataset.

4.1.2. Dataset Preprocessing

The candidate generator and the ranker models each have different inputs and outputs, and therefore require a custom dataset that has to be derived from MovieLens. We refer to the dataset for the candidate generation model as the watch history dataset, and the dataset for the ranker model as the rating dataset.

The samples of the former consist of a list of previous movies that a user has watched and a single future movie as prediction target. Since the movie prediction is performed on a per-user basis, the dataset is first grouped by users. Watch histories are made from consecutive movie watches; therefore, the ratings are then ordered by their timestamp. A sliding window is used to extract watch history samples from the movie watches of the users. The preprocessing of the dataset is visualized in Figure 7. The created samples are then stored in a suitable data format for the training, validation, and testing of the candidate generator model.
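The preprocessing steps described above (grouping by user, ordering by timestamp, and sliding a window over the watches) can be sketched as follows. The input layout and the function name `watch_history_samples` are hypothetical, and the sketch assumes that histories shorter than the window are kept as they are; whether the original pipeline pads or discards such short histories is not specified here.

```python
from collections import defaultdict

def watch_history_samples(ratings, window_size):
    """ratings: iterable of (user_id, movie_id, timestamp) tuples.

    Returns (user_id, history, target) samples, where history holds at most
    window_size previously watched movies and target is the next movie.
    """
    by_user = defaultdict(list)
    for user_id, movie_id, ts in ratings:
        by_user[user_id].append((ts, movie_id))
    samples = []
    for user_id, watches in by_user.items():
        movies = [movie_id for _ts, movie_id in sorted(watches)]  # temporal order
        for end in range(1, len(movies)):
            history = movies[max(0, end - window_size):end]
            samples.append((user_id, history, movies[end]))
    return samples

# Hypothetical toy ratings: (user_id, movie_id, timestamp)
toy_ratings = [(1, 30, 3), (1, 10, 1), (1, 20, 2), (1, 40, 4), (2, 50, 1)]
samples = watch_history_samples(toy_ratings, window_size=2)
```

Users with a single rating, such as user 2 in the toy data, yield no sample because there is no future movie to predict.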

Figure 7. Preprocessing the MovieLens 25M dataset for the training of the candidate generator model.

The use of a sliding window with a defined upper limit for the number of movies in a watch history is based on the premise that the users’ tastes change over time. This implies that a watch becomes less predictive of subsequent watches the longer it lies in the past. Furthermore, depending on the dataset size and the number of trainable parameters, the candidate generator model has an upper capacity limit for learning structure. For window sizes that are too large, the candidate generator model performs worse, as it is unable to learn the complex correlations in the input data. To determine the optimal window size, multiple datasets with different window sizes were created and used for training the models. The results suggest an optimal window size of 7 (cf. Figure 8).

Figure 8. Determining the optimal window size for the MovieLens 25M dataset.

The rating dataset is much simpler, as the MovieLens samples do not have to be reinterpreted. Instead, the rating samples can be directly inferred from each MovieLens sample. Each of them consists of user ID, movie ID, genres of the movie, and user rating. Optionally, the age of the movie and the rating age can be added. The rating age is computed from the rating timestamps, while the movie age is derived from the movie release date, which was retrieved by cross-referencing the MovieLens movies with their corresponding entries in The Movie Database (TMDb). The movie age and the rating age are both normalized between -1 and 1. Adding the movie age should encourage the model to learn that certain users prefer older or newer movies. The rating age is used to provide the model with an understanding of the temporal component of ratings. During inference, the rating age can be set to 1 to ensure that the model does not take old information about the user into consideration, and thus makes predictions right at the end of the training window. A similar technique has been proposed by Covington et al. (2016). In order to determine the efficacy of adding these two features, experiments were performed, whose results are presented in Figure 9.
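One plausible way to obtain features in the range from -1 to 1 is min-max scaling, sketched below. This is an assumption for illustration: the exact normalization is not spelled out above, and the function name `normalize_to_unit_range` and the toy timestamps are hypothetical. Under this assumption, the most recent rating timestamp maps to 1, which matches the value used for the rating age during inference.

```python
def normalize_to_unit_range(values):
    """Min-max scale values linearly to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values identical
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

rating_timestamps = [100, 150, 200]  # hypothetical Unix timestamps
rating_age = normalize_to_unit_range(rating_timestamps)
```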

Figure 9. Validation accuracy and MSE results vs. number of epochs for movie age, rating age, both, and none.

Using the movie age yields the best overall accuracy, closely followed by using neither movie nor rating age. Utilizing either the rating age alone or the rating age and the movie age together results in slower convergence of the model, as well as lower overall accuracy. In terms of MSE, using the movie age, using the rating age, and using neither yield almost the same overall performance, while using both performs slightly worse. For this reason, we decided to only use the movie age and discard the rating age.

In order to perform FL experiments, both the watch history and the rating datasets were split into much smaller subsets for each FL client. Since the movie IDs, user IDs, and genres are fed into embedding layers, the datasets were not simply split randomly, but in such a way that the training data still contained all possible IDs. Otherwise, the validation and test subsets may end up containing IDs that the model was not trained on. For testing the FL pipeline, the datasets were randomly split into equal-sized subsets for all FL clients, which ensures that the client datasets are balanced and somewhat i.i.d. As the MovieLens dataset also contains user IDs, the samples could be split such that each FL client receives samples of a single MovieLens user. This allows for properly simulating real-world conditions with non-i.i.d. data.
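The two client partitioning strategies can be sketched as follows. This is a minimal illustration under the assumption that each sample is a dictionary with a `user_id` field; the function names are hypothetical, leftover samples are simply dropped in the random split, and the additional constraint that the training data must contain all IDs is not modeled.

```python
import random
from collections import defaultdict


def split_iid(samples, num_clients, seed=0):
    """Shuffle the samples and deal them into equal-sized client datasets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    size = len(shuffled) // num_clients  # ASSUMPTION: leftovers are dropped
    return [shuffled[c * size:(c + 1) * size] for c in range(num_clients)]


def split_by_user(samples):
    """Give each client the samples of exactly one user (non-i.i.d.)."""
    clients = defaultdict(list)
    for sample in samples:
        clients[sample["user_id"]].append(sample)
    return dict(clients)


# Toy dataset: user u has u + 2 samples, so users are imbalanced.
samples = [{"user_id": u, "movie_id": m} for u in range(4) for m in range(u + 2)]

iid_clients = split_iid(samples, num_clients=4)
user_clients = split_by_user(samples)
print([len(c) for c in iid_clients])
print({u: len(c) for u, c in user_clients.items()})
```

The random split yields balanced clients that each mix samples from several users, whereas the per-user split yields imbalanced clients whose data stem from a single user.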

4.2. Baseline Experiments

We first conducted baseline experiments using the hyperparameters that were selected based on the experiments described in Appendices B and C, as well as in Section 4.1. These experiments are used as a baseline for the FL experiments. We trained the candidate generator and the ranker models five times each and present their minimum, maximum, and mean performance in Figure 10.

(a) Candidate generator
(b) Ranker
Figure 10. Non-FL baseline experiment results for (a) the candidate generator and (b) the ranker.

The candidate generator outputs a probability distribution over the entire corpus of movies in the MovieLens dataset, which means that it has to distinguish between almost 60,000 classes. Therefore, we report top-100 accuracy (also sometimes referred to as hit-ratio@$k$, where $k=100$), which rates a classification result as “correct” if the ground-truth next watched movie is among the 100 movies with the highest classification scores. For the ranker, we report accuracy, as well as MSE, which measures how much the predicted rating differs from its ground-truth. The performance was measured on a validation subset of the dataset, which is distinct from the training subset. The highest final top-100 accuracy that was achieved by any of the five trained candidate generators was 47.26%, with an average top-100 accuracy of 47.15%. The best performing ranker model achieved a final accuracy of 38.43% and a final MSE of 0.91, with a mean final accuracy of 38.31% and a mean final MSE of 0.93 across all five tries.
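For reference, the two kinds of metrics can be computed as in the following sketch. It is a plain-Python illustration with hypothetical function names and toy data, not the evaluation code used for the experiments.

```python
def top_k_accuracy(scores, targets, k=100):
    """Fraction of samples whose target class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k classes with the highest classification scores.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


def mean_squared_error(predictions, ground_truth):
    """Average squared difference between predicted and true ratings."""
    return sum((p - t) ** 2 for p, t in zip(predictions, ground_truth)) / len(ground_truth)


# Toy example with 3 classes instead of almost 60,000 movies.
scores = [
    [0.1, 0.6, 0.3],  # classes ranked 1, 2, 0
    [0.5, 0.2, 0.3],  # classes ranked 0, 2, 1
]
targets = [2, 1]
print(top_k_accuracy(scores, targets, k=2))
print(mean_squared_error([4.0, 3.0], [3.0, 5.0]))
```

With a large number of classes, top-1 accuracy would be very low even for a useful candidate generator, which is why a larger $k$ is the more informative choice for this model.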

4.3. Federated Learning Experiments

We subsequently performed FL experiments by simulating the FL process. A detailed description of how this FL simulator operates can be found in Appendix A. The FL experiments use the same hyperparameter configuration as the baseline experiments, except for the learning rates of the candidate generator, which had to be decreased by one order of magnitude to stabilize the training. A broad range of different numbers of clients in the dataset were selected in order to simulate the impact of varied local data distributions on the performance of the global model. Different client sub-sampling rates were employed to determine the optimal number of clients per communication round for the individual scale of the experiments, ensuring accurate client updates to be aggregated by the central server. With a range of 1k to 150k clients, the underlying datasets were randomly split into equal-sized local datasets for the clients and randomly distributed among them, assuring that they have approximately the same local data-generating distribution, especially in the low scale experiments. As the number of clients grows, the sizes of the local datasets shrink, which in turn reduces the likelihood of receiving an i.i.d. subset of the underlying dataset and gradually increases the non-i.i.d.-ness of the client data. The 162k experiments split the underlying dataset using the user IDs provided by MovieLens, thereby ensuring that each FL client receives the samples from exactly one real-world user. As a result, each client’s local dataset has a unique data-generating distribution. Additionally, in the 162k setup, the local datasets are imbalanced, as the users of the MovieLens dataset have varying numbers of samples. Furthermore, with an increasing number of clients the local datasets become smaller, thus increasing the negative effects from noisy updates. This setup allows us to clearly identify the effects of small local datasets and non-i.i.d.-ness to be compared to our FedQ method. 
The results of these experiments are shown in Figure 11.

(a) Candidate generator
(b) Ranker
Figure 11. FL experiment results. For increased legibility, instead of showing the complete training graphs, only (a) the final validation top-100 accuracy, and (b) the final validation accuracy and MSE are shown. For reference, the full training graphs are depicted in Appendix D.

As can be seen in Figure 11(a), the candidate generator is strongly affected by non-i.i.d.-ness and small local datasets, as even the setup with only 1k clients already performs much worse compared to the non-FL baseline, and increasing the number of clients decreases the performance significantly. The ranker, which can be seen in Figure 11(b), is not as much affected by non-i.i.d.-ness and small local datasets. Additionally, the performance drop from increasing the number of clients is not as pronounced as with the candidate generator. Still, the performance is significantly lower than the non-FL baseline. The ranker performs better than the candidate generator, since the watch behavior varies more between users than rating behavior, e.g., two users with different watch histories may still rate the same movie similarly. Since the rating data is much more homogeneous, the data-generating distributions of the users do not differ as much as in the case of the watch history data.

Reasonably, one might expect that, as the number of clients grows and the sizes of the local datasets shrink, the performance of the candidate generator should gradually decline. The experiments, however, reveal that the performance declines between 1k and 100k clients, before increasing again with 150k and 162k clients. We believe this can be explained by viewing the performance penalty incurred by FL compared to centralized training as a compound error. One of the components of this error arises from the non-i.i.d.-ness of the clients, as the different local data generating distributions cause the clients to have disparate local objectives. This leads to contradicting client updates that, when averaged by the central server, can cancel out some of the training progress of other clients and result in an update to the global model that does not minimize the global objective. Another component of the error is caused by noisy client updates: The smaller the local dataset of a client is, the worse its estimation of the empirical loss becomes, which results in a noisy gradient and unstable training. Even with a homogeneous client population, this can lead to contradicting client updates, which cause the global model to not properly converge. In the setups with 1k and 10k clients, each client has a large local dataset, which results in stable local training and good client updates. This means that the compound error causing the decrease in performance is dominated by the increase in non-i.i.d.-ness. As the number of clients increases, the sizes of the local datasets decrease, which in turn increases the heterogeneity of the clients. But this decrease in the size of the local datasets also causes the client updates to become noisier and the error induced by noisy client updates to become a significant component of the compound error. This increase in both error components causes the sharp decline in performance between the setups with 10k and 100k clients. 
Between the setups with 100k, 150k, and 162k clients, the non-i.i.d.-ness and noisy client update error components do not significantly change, because the increase in the number of clients is not as large and the sizes of the local datasets do not change as much. At this point, a new error component comes into play: With the decrease of the local dataset sizes also comes a decrease in the number of different MovieLens users represented in the local datasets. In the setup with 100k clients the number of different MovieLens users represented in the local datasets becomes small enough that the global model starts to be negatively impacted by the heterogeneity of the local dataset samples. The 1k and 10k setups do not suffer from this, as their local datasets have samples from so many MovieLens users that the effect averages out. In the 150k setup the number of MovieLens users decreases and in the 162k setup it is even guaranteed that each client only has samples from a single MovieLens user, thus gradually decreasing the negative impact of the heterogeneity of the local dataset samples. Again, the ranker model is not as affected by this, since the rating data is much more homogeneous than the watch history data, as described above.

4.4. FedQ Experiments

As described in the previous section, the effects of non-i.i.d.-ness and small local datasets result in a significant decrease in performance. Therefore, we employed the FedQ technique, described in Section 3.6. We fixed the client sub-sampling rate at 1,000 clients per communication round and used varying queue lengths. In order to stabilize the training, the learning rate applied to the candidate generator was once again lowered in comparison to the non-FL baseline experiments. The experiments were also conducted using the FL simulator described in Appendix A. Their results can be seen in Figure 12.

(a) Candidate generator
(b) Ranker
Figure 12. FedQ experiment results for (a) the candidate generator and (b) the ranker. For increased legibility, instead of showing the complete training graphs, only the final validation top-100 accuracy, and the final validation accuracy and MSE are shown. For reference, the full training graphs are depicted in Appendix D.

As shown in Figure 12(a), the candidate generator now performs much better as compared to standard FedAvg and in particular for the setups with 1k and 10k clients, FedQ even outperforms its baseline. The latter may be caused by a regularizing effect. In addition, the setups with large numbers of clients not only perform much better, but also in the expected way, as the performance slightly decreases with an increased number of clients. This provides evidence for our hypothesis that the candidate generator started to perform slightly better with increasing numbers of clients due to noisy updates induced by the decreased heterogeneity of the local dataset samples. Since the clients now train the global model sequentially, the number of samples adding to the local model update has drastically increased, thus resulting in much higher quality local updates. The ranker likewise shows a comparable improvement in performance and outperformed its non-FL baseline, though with a smaller margin than the candidate generator, as shown in Figure 12(b). Tables 1 and 2 compile the results of both the FL and FedQ experiments and clearly show that FedQ outperforms FedAvg in every single experiment.
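The effect of sequential training on the amount of data behind each aggregated update can be illustrated with a toy sketch. It assumes that the sub-sampled clients are chained into queues, that each client continues training from the model of its predecessor, and that the server averages the models leaving the queues; the precise procedure is the one given in Section 3.6 and may differ in its details. The scalar model, the function names, and the toy data are hypothetical, and all clients deliberately hold identical data, so that only the number of samples per update changes.

```python
def local_train(w, data, lr=0.05):
    """One pass of SGD on the squared error of the scalar model y = w * x."""
    for x, y in data:
        w -= lr * 2.0 * (w * x - y) * x
    return w


def queued_round(w_global, clients, queue_length):
    """One communication round with clients chained into queues.

    ASSUMPTION: each queue starts from the global model, every client
    continues from its predecessor's model, and the server averages the
    models that leave the queues. A queue length of 1 reduces to plain
    (unweighted) federated averaging.
    """
    queues = [clients[i:i + queue_length]
              for i in range(0, len(clients), queue_length)]
    queue_models = []
    for queue in queues:
        w = w_global
        for data in queue:
            w = local_train(w, data)
        queue_models.append(w)
    return sum(queue_models) / len(queue_models)


# Eight toy clients with two samples each of the relation y = 3 * x.
clients = [[(1.0, 3.0), (0.5, 1.5)] for _ in range(8)]
w_avg = w_queue = 0.0
for _ in range(3):
    w_avg = queued_round(w_avg, clients, queue_length=1)
    w_queue = queued_round(w_queue, clients, queue_length=4)
print(w_avg, w_queue)
```

After the same number of communication rounds, the queued variant ends up closer to the optimum of 3, because each aggregated model has been trained on four local datasets instead of one. The toy setup does not capture non-i.i.d. clients.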

Method  Setting               1k Clients  10k Clients  100k Clients  150k Clients  162k Clients
FedAvg  Sub-sampling 10       27.93%      -            -             -             -
FedAvg  Sub-sampling 100      28.29%      17.98%       5.18%         7.83%         10.14%
FedAvg  Sub-sampling 1,000    28.11%      17.86%       6.34%         8.53%         10.52%
FedAvg  Sub-sampling 10,000   -           17.91%       6.39%         6.68%         11.64%
FedQ    Queue length 10       49.61%      26.41%       17.59%        16.68%        17.01%
FedQ    Queue length 100      51.57%      48.80%       25.35%        22.66%        23.72%
FedQ    Queue length 1,000    52.12%      51.47%       42.34%        39.22%        38.67%
Table 1. Comparison of the candidate generator FL and FedQ experiment results. The table reports the final validation top-100 accuracies after 300 communication rounds.

Each cell reports accuracy / MSE.
Method  Setting               1k Clients     10k Clients    100k Clients   150k Clients   162k Clients
FedAvg  Sub-sampling 10       27.78% / 1.22  -              -              -              -
FedAvg  Sub-sampling 100      27.91% / 1.22  26.32% / 1.32  24.27% / 1.42  24.80% / 1.42  23.14% / 1.33
FedAvg  Sub-sampling 1,000    27.94% / 1.21  26.47% / 1.33  24.47% / 1.42  24.43% / 1.43  23.51% / 1.34
FedAvg  Sub-sampling 10,000   -              26.36% / 1.33  24.74% / 1.42  24.61% / 1.42  22.77% / 1.36
FedQ    Queue length 10       30.38% / 1.09  27.85% / 1.22  26.10% / 1.34  26.10% / 1.34  25.56% / 1.34
FedQ    Queue length 100      39.45% / 0.88  30.37% / 1.1   27.43% / 1.26  27.33% / 1.26  27.15% / 1.28
FedQ    Queue length 1,000    40.14% / 0.83  39.07% / 0.91  30.10% / 1.12  29.98% / 1.12  29.69% / 1.13
Table 2. Comparison of the ranker FL and FedQ experiment results. The table reports the final validation accuracies and MSEs after 300 communication rounds.

To further investigate the efficacy of FedQ, we have evaluated it using the LEAF benchmark (Caldas et al., 2019b), which is a benchmark for testing FL algorithms. The LEAF benchmark includes multiple different datasets that can naturally be partitioned into local datasets for FL clients, as well as accompanying NN models and metrics. We have benchmarked FedQ on four of LEAF’s datasets and NN architectures, which range from image classification using CNNs, to text classification and next word prediction using long short-term memory networks. The results of these experiments and a detailed evaluation can be found in Appendix E.

4.5. Communication Compression Experiments

As described in Section 3.7, FL has a significant communication overhead due to the continuous exchange of local updates between the clients and the central server. We employed the recent NNC standard to compress the NN parametrizations communicated between the clients and the central server. The coding engine uses parameter quantization as a lossy preprocessing step and DeepCABAC as arithmetic coder. The quantization of the parameters requires a hyperparameter called quantization parameter (QP), which controls the step size $\delta$ between quantization points and thus the rate-performance trade-off. A lower QP results in a smaller step size and therefore in more quantization points and lower compression performance, while a higher QP results in a larger step size and therefore in fewer quantization points and higher compression performance. To compute $\delta$ as demonstrated by Algorithm 3, it is necessary to provide an additional parameter $f_{QP}$, which incorporates the dependency between QPs and the quantization step sizes. Lower values of $f_{QP}$ result in larger neighboring quantization step sizes (see https://github.com/fraunhoferhhi/nncodec/wiki/usage).

Input: $QP$ is the quantization parameter and $f_{QP}$ is the regulating parameter for mapping between QPs and quantization step sizes
Output: Quantization step size $\delta$
$m \leftarrow (1 \ll f_{QP}) + (QP + ((1 \ll f_{QP}) - 1))$
$s \leftarrow QP \gg f_{QP}$
$\delta \leftarrow m \cdot 2.0^{s - f_{QP}}$
return $\delta$
Algorithm 3. Quantization step size
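The mapping from QP to step size can be sketched in a few lines of Python. The sketch deviates from the listing of Algorithm 3 in one clearly labeled assumption: it combines the QP with the mask $(1 \ll f_{QP}) - 1$ via a bitwise AND instead of a sum, since a sum would yield a negative multiplier $m$ for the negative QPs used in this section. The default value of 2 for $f_{QP}$ and the function name are likewise assumptions made for illustration.

```python
def quantization_step_size(qp: int, f_qp: int = 2) -> float:
    """Map a quantization parameter (QP) to a quantization step size.

    ASSUMPTION: the QP is combined with the mask via bitwise AND, which
    keeps the multiplier m positive for negative QPs.
    ASSUMPTION: f_qp = 2 is only an illustrative default.
    """
    mask = (1 << f_qp) - 1
    m = (1 << f_qp) + (qp & mask)  # multiplier in [2^f_qp, 2^(f_qp + 1) - 1]
    s = qp >> f_qp                 # arithmetic shift, i.e., floor(qp / 2^f_qp)
    return m * 2.0 ** (s - f_qp)


for qp in (-48, -38, -30, 0):
    print(qp, quantization_step_size(qp))
```

Under these assumptions, the step size grows monotonically with the QP and doubles every $2^{f_{QP}}$ QP values, so a smaller $f_{QP}$ leads to larger ratios between neighboring step sizes.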

Besides influencing the compression performance, the QP also impacts the performance of the NN model after decompression, i.e., if the QP is chosen too large, the resulting performance decreases significantly. In order to determine the optimal value of the QP for the candidate generator and the ranker models, we performed an experiment, testing QP values between -48 and 0. The results are shown in Figure 13.

(a) Candidate generator
(b) Ranker
Figure 13. Compression vs. accuracy experiment results for (a) the candidate generator and (b) the ranker.

For FL, the QP value should be chosen so as to optimize the rate-distortion trade-off. As can be seen in Figure 13, a QP between -38 and -30 for the candidate generator, and between -43 and -35 for the ranker, results in compression rates with no or marginal performance degradation. Since the compression performance (per client) in our setting is independent of the number of clients, we only performed experiments with 100 clients and no client sub-sampling, i.e., all clients were included in every communication round. The experiments also perform FedQ with a queue length of 10. Besides the number of clients, the client sub-sampling rate, and the FedQ queue length, the other hyperparameters of the experimental setup are identical to the FL experiments in Section 4.3. The results of this experiment are depicted in Figure 14.

(a) Candidate generator
(b) Ranker
Figure 14. FL with communication compression experiment results for (a) the candidate generator and (b) the ranker.

The plots on the left of each sub-figure show the model performance, while the plots on the right of each sub-figure demonstrate the compression performance for different QP values. According to the compression performance plots, the initial number of communicated MiB is slightly higher, as in the beginning weights are initialized with random values. During the course of training, the entropy of the weights decreases, resulting in better compression performance. After a few communication rounds, the compression performance saturates at an almost constant value. For the candidate generator, the space saving, as compared to uncompressed communication, varies between 92.97% for QP -38 and 95.37% for QP -30. For the ranker, the space saving, as compared to uncompressed communication, varies between 85.88% for QP -43 and 86.17% for QP -35. The space savings are lower in comparison to the non-FL baseline, where the candidate generator achieved 97.04% for QP -38 and 98.39% for QP -30, and the ranker 91.85% for QP -43 and 96.4% for QP -35. This seems to be an effect that is inherent to FL. In centralized training, regularization methods produce small magnitude weights, which results in higher sparsity when applying quantization. The exact weights that are going towards zero can, however, differ between several training runs, which means that in an FL setting each client can have different weights of small magnitude. Due to the averaging of the weights in FedAvg, the produced global model will most likely be less sparse than its constituent local models. For example, the overall entropy of the candidate generator that was trained using FL with compression is 4.41 bits, while the overall entropy of the candidate generator that was trained using FL without compression is 7.5 bits, and the baseline candidate generator which was trained centrally without compression has an entropy of 1.26 bits. 
The overall entropy of the ranker that was trained using FL with compression is 11.44 bits, while the overall entropy of the ranker that was trained using FL without compression is 12.59 bits and the baseline ranker, which was trained centrally without compression, has an entropy of 3.67 bits. This shows that models that are trained using FL do, in fact, have a higher entropy and are thus less amenable to compression. Quantization on the other hand seems to induce more sparsity, thus lowering the resulting entropy for models trained with compression. Furthermore, this also shows why the ranker performs much worse in terms of space saving, as it has much higher entropy in general.
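The relationship between quantization, entropy, and space saving can be made concrete with a small sketch. It assumes uniform quantization to integer levels and measures the empirical zeroth-order Shannon entropy of these levels; how the overall entropies reported above were computed is not specified here, so this is only an illustration. The function names and the toy weights are hypothetical.

```python
import math
from collections import Counter


def quantize(weights, step):
    """Uniformly quantize weights to integer levels with the given step size."""
    return [round(w / step) for w in weights]


def entropy_bits(levels):
    """Empirical (zeroth-order) Shannon entropy in bits per symbol."""
    counts = Counter(levels)
    n = len(levels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def space_saving(compressed_size, uncompressed_size):
    """Relative reduction in size achieved by compression."""
    return 1.0 - compressed_size / uncompressed_size


# Toy weights: mostly small magnitudes, as encouraged by regularization.
weights = [0.01, -0.02, 0.0, 0.4, -0.4, 0.03, 0.0, 0.02]
fine = entropy_bits(quantize(weights, step=0.01))
coarse = entropy_bits(quantize(weights, step=0.1))
print(fine, coarse)
print(space_saving(compressed_size=5.0, uncompressed_size=100.0))
```

With the coarser step size, all small-magnitude weights collapse to the zero level, which lowers the entropy and makes the parametrization more amenable to entropy coding.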

The candidate generator has excellent model performance, even for higher QPs, well outperforming the non-FL baseline and showing the same performance characteristics as the FedQ experiments presented in Section 4.4. As the loss in performance and the increase in compression performance are very small, any of the tested QPs are well-suited to be used, so we selected a QP of -30, which offers the best overall space saving of 95.37% and a top-100 accuracy of 50.5%, which is only one percentage point smaller than the best accuracy and well above the non-FL baseline of 47.15%.

The training of the ranker model, however, seems to be much more affected by the compression than that of the candidate generator, although the increase in compression performance with increasing QPs is exceedingly small. Only QPs -43 and -41 manage to meet the non-FL baseline and none of the QPs achieve a performance that is in line with the results of the FedQ experiments. This is, however, to be expected since the lossy compression of NNC may hurt the performance of the models. In this case, the best performance reached with compression is only slightly lower than the best performance of FedQ without compression. Therefore, QP -43 was selected, as it slightly outperforms the non-FL baseline with 38.85% accuracy and an MSE of 0.91, while reaching almost the same space saving as QP -35 (85.88% as compared to 86.17%).

5. Conclusion & Outlook

Modern RecSys’s, especially the ones based on DL, benefit from increasing amounts of personal information about their users. This has resulted in the collection of substantial amounts of personal data on many platforms in recent years, leading to a data privacy problem. Here, FL has emerged as a technique that intrinsically provides privacy and is therefore used in many scenarios where data privacy is of high priority. Consequently, we presented a movie RecSys, which is trained end-to-end using FL and scales well to exceptionally large numbers of users. We have identified major problems in such systems and proposed solutions to them. In particular, we have shown that the non-i.i.d.-ness of the clients’ local datasets, as well as small local datasets, can significantly degrade the federated training of a RecSys, and developed a novel technique, called FedQ, which satisfactorily counteracts this problem. Furthermore, the substantial overhead of constantly communicating NN parametrizations between server and clients in FL poses a problem, especially when clients are connected via mobile internet connections. For this, we have shown that the most recent NNC compression technology can considerably reduce this communication overhead to a fraction of the uncompressed communication.

Beyond the proposed significant improvements to the overall RecSys, additional improvements can be achieved through further research. In the area of data privacy, differential privacy methods could be further investigated and combined with the quantization-induced privacy by NNC communication compression. Another topic of interest is the learning of embeddings in an FL setting, which is known to be problematic. Solutions proposed in the literature seem to all depend on the partial disclosure of client data. Here, future work could investigate possibilities of learning embeddings in an FL setting without disclosing private information. The space savings of the compression can be further improved by differential compression, i.e., only the difference between the global model and the updated local model is compressed, which is sparser and thus more amenable to compression. Finally, the non-i.i.d.-ness in the FedRec scenario originates from different user preferences. The local datasets within user groups of similar preference should be much more homogeneous, leading the way to further model performance improvements for FedRec’s.

References

  • Alam et al. (2016) Md. Hijbul Alam, Woo-Jong Ryu, and SangKeun Lee. 2016. Joint Multi-Grain Topic Sentiment. Information Sciences 339, C (April 2016), 206–223. https://doi.org/10.1016/j.ins.2016.01.013
  • Alamgir et al. (2022) Zareen Alamgir, Farwa K. Khan, and Saira Karim. 2022. Federated Recommenders: Methods, Challenges and Future. Cluster Computing 25, 6 (June 2022), 4075–4096. https://doi.org/10.1007/s10586-022-03644-w
  • Ammad-ud-din et al. (2019) Muhammad Ammad-ud-din, Elena Ivannikova, Suleiman A. Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. 2019. Federated Collaborative Filtering for Privacy-Preserving Personalized Recommendation System. arXiv e-prints abs/1901.09888 (Jan. 2019), 12 pages. arXiv:1901.09888 [cs.IR]
  • Asad et al. (2023) Muhammad Asad, Saima Shaukat, Ehsan Javanmardi, Jin Nakazato, and Manabu Tsukada. 2023. A Comprehensive Survey on Privacy-Preserving Techniques in Federated Recommendation Systems. Applied Sciences 13, 10 (2023), 26 pages. https://doi.org/10.3390/app13106201
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media Inc., Sebastopol, California, United States of America.
  • Brendan McMahan et al. (2018) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2018. Learning Differentially Private Recurrent Language Models. In International Conference on Learning Representations. OpenReview.net, Vancouver, British Columbia, Canada, 14 pages. https://openreview.net/forum?id=BJ0hF1Z0b
  • Caldas et al. (2019a) Sebastian Caldas, J. Konečný, H. Brendan McMahan, and Ameet Talwalkar. 2019a. Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. arXiv e-prints abs/1812.07210 (Jan. 2019). arXiv:1812.07210 [cs.LG]
  • Caldas et al. (2019b) Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2019b. LEAF: A Benchmark for Federated Settings. CoRR abs/1812.01097 (Dec. 2019). https://doi.org/10.48550/arXiv.1812.01097 arXiv:1812.01097 [cs.LG]
  • Cao et al. (2022) Mei Cao, Yujie Zhang, Zezhong Ma, and Mengying Zhao. 2022. C2S: Class-aware client selection for effective aggregation in federated learning. High-Confidence Computing 2, 3 (2022), 100068. https://doi.org/10.1016/j.hcc.2022.100068
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning (Corvalis, Oregon, USA) (ICML ’07). Association for Computing Machinery, New York, NY, USA, 129–136. https://doi.org/10.1145/1273496.1273513
  • Chai et al. (2021) D. Chai, L. Wang, K. Chen, and Q. Yang. 2021. Secure Federated Matrix Factorization. IEEE Intelligent Systems 36, 05 (Sept. 2021), 11–20. https://doi.org/10.1109/MIS.2020.3014880
  • Chai et al. (2022) Di Chai, Leye Wang, Liu Yang, Junxue Zhang, Kai Chen, and Qiang Yang. 2022. FedEval: A Holistic Evaluation Framework for Federated Learning. arXiv e-prints abs/2011.09655 (Dec. 2022), 14 pages. https://doi.org/10.48550/arXiv.2011.09655 arXiv:2011.09655 [cs.LG]
  • Chen et al. (2019) Fei Chen, Mi Luo, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. 2019. Federated Meta-Learning with Fast Convergence and Efficient Communication. arXiv e-prints abs/1802.07876 (Dec. 2019). https://doi.org/10.48550/arXiv.1802.07876 arXiv:1802.07876 [cs.LG]
  • Chen et al. (2011) Tianqi Chen, Zhao Zheng, Qiuxia Lu, Weinan Zhang, and Yong Yu. 2011. Feature-Based Matrix Factorization. arXiv e-prints abs/1109.2271 (Dec. 2011), 12 pages. https://doi.org/10.48550/arXiv.1109.2271 arXiv:1109.2271 [cs.AI]
  • Chen et al. (2022) Wenlin Chen, Samuel Horváth, and Peter Richtárik. 2022. Optimal Client Sampling for Federated Learning. Transactions on Machine Learning Research 2022, 08 (2022), 32 pages. https://openreview.net/forum?id=8GvRCWKHIL
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012
  • Choe et al. (2021) Byeongjin Choe, Taegwan Kang, and Kyomin Jung. 2021. Recommendation System With Hierarchical Recurrent Neural Network for Long-Term Time Series. IEEE Access 9, 1 (2021), 72033–72039. https://doi.org/10.1109/ACCESS.2021.3079922
  • Choi et al. (2017) Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. 2017. Towards the Limit of Network Quantization. In International Conference on Learning Representations. OpenReview.net, Toulon, France, 14 pages. https://openreview.net/forum?id=rJ8uNptgl
  • Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. 2017. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN) (Anchorage, Alaska, United States of America). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 2921–2926. https://doi.org/10.1109/IJCNN.2017.7966217
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys ’16). Association for Computing Machinery, New York, NY, USA, 191–198. https://doi.org/10.1145/2959100.2959190
  • Cramer et al. (2015) Ronald Cramer, Ivan Bjerre Damgård, and Jesper Buus Nielsen. 2015. Secure Multiparty Computation and Secret Sharing. Cambridge University Press, Cambridge, United Kingdom. https://doi.org/10.1017/CBO9781107337756
  • Dimitrov et al. (2022) Dimitar I. Dimitrov, Mislav Balunović, Nikola Konstantinov, and Martin Vechev. 2022. Data Leakage in Federated Averaging. arXiv e-prints abs/2206.12395 (2022). https://doi.org/10.48550/ARXIV.2206.12395
  • Dwork (2008) Cynthia Dwork. 2008. Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation, Manindra Agrawal, Dingzhu Du, Zhenhua Duan, and Angsheng Li (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–19.
  • Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (Aug. 2014), 211–407. https://doi.org/10.1561/0400000042
  • European Parliament (2016) European Parliament. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
  • Fang and Quan (2021) Haokun Fang and Qian Quan. 2021. Privacy Preserving Machine Learning with Homomorphic Encryption and Federated Learning. Future Internet 13, 4 (2021), 94. https://doi.org/10.3390/fi13040094
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML’17). JMLR.org, 1269 Law Street, San Diego, CA 92109, 1126–1135.
  • Flanagan et al. (2021) Adrian Flanagan, Were Oyomno, Alexander Grigorievskiy, Kuan E. Tan, Suleiman A. Khan, and Muhammad Ammad-Ud-Din. 2021. Federated Multi-view Matrix Factorization for Personalized Recommendations. In Machine Learning and Knowledge Discovery in Databases, Frank Hutter, Kristian Kersting, Jefrey Lijffijt, and Isabel Valera (Eds.). Springer International Publishing, Ghent, Belgium, 324–347. https://doi.org/10.1007/978-3-030-67661-2_20
  • Fraboni et al. (2023) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. 2023. A General Theory for Client Sampling in Federated Learning. In Trustworthy Federated Learning: First International Workshop, FL 2022, Held in Conjunction with IJCAI 2022, Vienna, Austria, July 23, 2022, Revised Selected Papers (Vienna, Austria). Springer-Verlag, Berlin, Heidelberg, 46–58. https://doi.org/10.1007/978-3-031-28996-5_4
  • Geiping et al. (2020) Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. 2020. Inverting Gradients - How Easy is It to Break Privacy in Federated Learning? In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, British Columbia, Canada) (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 1421, 11 pages.
  • Gholami et al. (2022) Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2022. Low-Power Computer Vision (1st ed.). Chapman and Hall/CRC, New York, United States of America, Chapter A Survey of Quantization Methods for Efficient Neural Network Inference, 288–324. https://doi.org/10.1201/9781003162810
  • Go et al. (2009) Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision. CS224N Project Report. Stanford.
  • Golbeck (2016) Jennifer Golbeck. 2016. User Privacy Concerns with Common Data Used in Recommender Systems. In Social Informatics, Emma Spiro and Yong-Yeol Ahn (Eds.). Springer International Publishing, Cham, 468–480.
  • Gomez-Uribe and Hunt (2016) Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Manage. Inf. Syst. 6, 4, Article 13 (Dec. 2016), 19 pages. https://doi.org/10.1145/2843948
  • Grbovic and Cheng (2018) Mihajlo Grbovic and Haibin Cheng. 2018. Real-Time Personalization Using Embeddings for Search Ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York, NY, USA, 311–320. https://doi.org/10.1145/3219819.3219885
  • Grother and Hanaoka (1995) Patrick J. Grother and Kayee K. Hanaoka. 1995. NIST special database 19 handprinted forms and characters database. Technical Report. National Institute of Standards and Technology. https://doi.org/10.18434/T4H01C
  • Haase et al. (2021) Paul Haase, Daniel Becking, Heiner Kirchhoffer, Karsten Müller, Heiko Schwarz, Wojciech Samek, Detlev Marpe, and Thomas Wiegand. 2021. Encoder Optimizations For The NNR Standard On Neural Network Compression. In 2021 IEEE International Conference on Image Processing (ICIP) (Anchorage, Alaska, USA). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 3522–3526. https://doi.org/10.1109/ICIP42928.2021.9506655
  • Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In 4th International Conference on Learning Representations, ICLR, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). ICLR, San Juan, Puerto Rico. https://arxiv.org/abs/1510.00149
  • Hard et al. (2019) Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. 2019. Federated Learning for Mobile Keyboard Prediction. arXiv e-prints abs/1811.03604 (Feb. 2019), 7 pages. arXiv:1811.03604 [cs.CL]
  • Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages. https://doi.org/10.1145/2827872
  • He et al. (2021) Chaoyang He, Keshav Balasubramanian, Emir Ceyani, Carl Yang, Han Xie, Lichao Sun, Lifang He, Liangwei Yang, Philip S. Yu, Yu Rong, Peilin Zhao, Junzhou Huang, Murali Annavaram, and Salman Avestimehr. 2021. FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks. In 9th International Conference on Learning Representations. OpenReview.net, Virtual Only, 17 pages.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182. https://doi.org/10.1145/3038912.3052569
  • Hermann (2022) Erik Hermann. 2022. Artificial intelligence and mass personalization of communication content—An ethical and literacy perspective. New Media & Society 24, 5 (2022), 1258–1277. https://doi.org/10.1177/14614448211022702
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop. Morgan-Kaufmann, Montréal, Québec, Canada. https://arxiv.org/abs/1503.02531
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Hu et al. (2022) Ming Hu, Tian Liu, Zhiwei Ling, Zhihao Yue, and Mingsong Chen. 2022. FedCAT: Towards Accurate Federated Learning via Device Concatenation. arXiv e-prints abs/2202.12751 (Feb. 2022), 12 pages. arXiv:2202.12751 [cs.LG]
  • International Organization for Standardization (ISO) (2022) International Organization for Standardization (ISO). 2022. Information technology - Multimedia content description interface - Part 17: Compression of neural networks for multimedia content description and analysis. Standard. International Organization for Standardization (ISO), Geneva, Switzerland.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (Lille, France) (ICML’15). JMLR.org, 1269 Law Street, San Diego, CA 92109, 448–456.
  • Jeong et al. (2023) Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. 2023. Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data. arXiv e-prints abs/1811.11479 (Oct. 2023), 6 pages. arXiv:1811.11479 [cs.LG]
  • Jia and Lei (2021) Junjie Jia and Zhipeng Lei. 2021. Personalized Recommendation Algorithm for Mobile Based on Federated Matrix Factorization. Journal of Physics: Conference Series 1802, 3 (March 2021), 032021. https://doi.org/10.1088/1742-6596/1802/3/032021
  • Jie et al. (2022) Zhiyong Jie, Shuhong Chen, Junqiu Lai, Muhammad Arif, and Zongyuan He. 2022. Personalized federated recommendation system with historical parameter clustering. Journal of Ambient Intelligence and Humanized Computing 14, 8 (Feb. 2022), 10555–10565. https://doi.org/10.1007/s12652-022-03709-z
  • Kamp et al. (2023) Michael Kamp, Jonas Fischer, and Jilles Vreeken. 2023. Federated Learning from Small Datasets. In The Eleventh International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda, 13 pages. https://openreview.net/forum?id=hDDV1lsRV8
  • Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daumé III and Aarti Singh (Eds.). PMLR, virtual, 5132–5143. https://proceedings.mlr.press/v119/karimireddy20a.html
  • Kiefer and Wolfowitz (1952) J. Kiefer and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23, 3 (1952), 462–466. https://www.jstor.org/stable/2236690
  • Kim et al. (2018) Jinsu Kim, Dongyoung Koo, Yuna Kim, Hyunsoo Yoon, Junbum Shin, and Sungwook Kim. 2018. Efficient Privacy-Preserving Matrix Factorization for Recommendation via Fully Homomorphic Encryption. ACM Trans. Priv. Secur. 21, 4, Article 17 (June 2018), 30 pages. https://doi.org/10.1145/3212509
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). International Conference on Learning Representations, 2710 E Corridor Drive, Appleton, WI 54913. https://arxiv.org/abs/1412.6980
  • Kirchhoffer et al. (2022) Heiner Kirchhoffer, Paul Haase, Wojciech Samek, Karsten Müller, Hamed Rezazadegan-Tavakoli, Francesco Cricri, Emre B. Aksu, Miska M. Hannuksela, Wei Jiang, Wei Wang, Shan Liu, Swayambhoo Jain, Shahab Hamidi-Rad, Fabien Racapé, and Werner Bailer. 2022. Overview of the Neural Network Compression and Representation (NNR) Standard. IEEE Transactions on Circuits and Systems for Video Technology 32, 5 (2022), 3203–3216. https://doi.org/10.1109/TCSVT.2021.3095970
  • Konečný et al. (2016) Jakub Konečný, Hugh Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. CoRR abs/1610.02527 (Oct. 2016), 38 pages. arXiv:1610.02527 https://arxiv.org/abs/1610.02527
  • Konečný et al. (2018) Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Ananda Theertha Suresh, Dave Bacon, and Peter Richtárik. 2018. Federated Learning: Strategies for Improving Communication Efficiency. In 6th International Conference on Learning Representations. OpenReview.net, Vancouver, British Columbia, Canada, 10 pages. https://openreview.net/forum?id=B1EPYJ-C-
  • Koren (2008) Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Las Vegas, Nevada, USA) (KDD ’08). Association for Computing Machinery, New York, NY, USA, 426–434. https://doi.org/10.1145/1401890.1401944
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (Aug. 2009), 30–37. https://doi.org/10.1109/MC.2009.263
  • Kozyreva et al. (2021) Anastasia Kozyreva, Philipp Lorenz-Spreen, Ralph Hertwig, Stephan Lewandowsky, and Stefan M Herzog. 2021. Public attitudes towards algorithmic personalization and use of personal data online: Evidence from Germany, Great Britain, and the United States. Humanities and Social Sciences Communications 8, 1 (2021), 1–11.
  • Lam et al. (2006) Shyong K. “Tony” Lam, Dan Frankowski, and John Riedl. 2006. Do You Trust Your Recommendations? An Exploration of Security and Privacy Issues in Recommender Systems. In Emerging Trends in Information and Communication Security, Günter Müller (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 14–29.
  • Lang and Shlezinger (2022) Natalie Lang and Nir Shlezinger. 2022. Joint Privacy Enhancement and Quantization in Federated Learning. In 2022 IEEE International Symposium on Information Theory (ISIT) (Aalto University, Espoo, Finland). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 2040–2045. https://doi.org/10.1109/ISIT50566.2022.9834551
  • LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791
  • LeCun et al. (1990) Yann LeCun, John Denker, and Sara Solla. 1990. Optimal Brain Damage. In Advances in Neural Information Processing Systems, D. Touretzky (Ed.), Vol. 2. Morgan-Kaufmann, Denver, Colorado, USA. https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf
  • Leroy et al. (2019) David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. 2019. Federated Learning for Keyword Spotting. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Brighton, United Kingdom). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 6341–6345. https://doi.org/10.1109/ICASSP.2019.8683546
  • Li et al. (2019) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2019. Federated Optimization for Heterogeneous Networks. In ICML Workshop on Adaptive & Multitask Learning: Algorithms & Systems. OpenReview.net, Long Beach, California, United States of America, 16 pages. https://openreview.net/forum?id=SkgwE5Ss3N
  • Li et al. (2020b) Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. 2020b. Fair Resource Allocation in Federated Learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, Addis Ababa, Ethiopia. https://openreview.net/forum?id=ByexElSYDr
  • Li et al. (2020a) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2020a. On the Convergence of FedAvg on Non-IID Data. In International Conference on Learning Representations. OpenReview.net, Addis Ababa, Ethiopia, 26 pages. https://openreview.net/forum?id=HJxNAnVtDS
  • Li et al. (2021) Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. 2021. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In International Conference on Learning Representations (ICLR) 2021. OpenReview.net, Vienna, Austria, 27 pages. https://openreview.net/forum?id=6YEQUn0QICG
  • Liang et al. (2021) Feng Liang, Weike Pan, and Zhong Ming. 2021. FedRec++: Lossless Federated Recommendation with Explicit Feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. AAAI Press, Washington, DC, USA, 4224–4231.
  • Lin et al. (2022) Bill Yuchen Lin, Chaoyang He, Zihang Ze, Hulin Wang, Yufen Hua, Christophe Dupuy, Rahul Gupta, Mahdi Soltanolkotabi, Xiang Ren, and Salman Avestimehr. 2022. FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (Eds.). Association for Computational Linguistics, Seattle, United States of America, 157–175. https://doi.org/10.18653/v1/2022.findings-naacl.13
  • Lin et al. (2016) Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. 2016. Fixed Point Quantization of Deep Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML’16). JMLR.org, 1269 Law Street, San Diego, CA 92109, 2849–2858.
  • Lin et al. (2021a) Guanyu Lin, Feng Liang, Weike Pan, and Zhong Ming. 2021a. FedRec: Federated Recommendation With Explicit Feedback. IEEE Intelligent Systems 36, 5 (Sept. 2021), 21–30. https://doi.org/10.1109/MIS.2020.3017205
  • Lin et al. (2021b) Zhaohao Lin, Weike Pan, and Zhong Ming. 2021b. FR-FMSS: Federated Recommendation via Fake Marks and Secret Sharing. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys ’21). Association for Computing Machinery, New York, NY, USA, 668–673. https://doi.org/10.1145/3460231.3478855
  • Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (March 2009), 225–331. https://doi.org/10.1561/1500000016
  • Liu et al. (2020) Yang Liu, Yan Kang, Chaoping Xing, Tianjian Chen, and Qiang Yang. 2020. A Secure Federated Transfer Learning Framework. IEEE Intelligent Systems 35, 4 (2020), 70–82. https://doi.org/10.1109/MIS.2020.2988525
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In 2015 IEEE International Conference on Computer Vision (ICCV) (Santiago, Chile). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 3730–3738. https://doi.org/10.1109/ICCV.2015.425
  • Lloyd (1982) S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137. https://doi.org/10.1109/TIT.1982.1056489
  • Luo et al. (2021) Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. 2021. Real-World Image Datasets for Federated Learning. arXiv e-prints abs/1910.11089 (Jan. 2021), 8 pages. arXiv:1910.11089 [cs.CV]
  • MacKenzie et al. (2013) Ian MacKenzie, Chris Meyer, and Steve Noble. 2013. How retailers can keep up with consumers. McKinsey & Company. https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
  • McMahan et al. (2017) Hugh Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Aarti Singh and Jerry Zhu (Eds.), Vol. 54. JMLR, Inc. and Microtome Publishing, Fort Lauderdale, Florida, USA, 1273–1282.
  • Minto et al. (2021) Lorenzo Minto, Moritz Haller, Benjamin Livshits, and Hamed Haddadi. 2021. Stronger privacy for federated collaborative filtering with implicit feedback. In Proceedings of the 15th ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, 342–350.
  • Moving Picture Experts Group (MPEG) working group of ISO/IEC (2021) Moving Picture Experts Group (MPEG) working group of ISO/IEC. 2021. MPEG-7: Compression of Neural Networks for Multimedia Content Description and Analysis. Standard. Moving Picture Experts Group (MPEG) working group of ISO/IEC, Hannover, DE.
  • Muhammad et al. (2020) Khalil Muhammad, Qinqin Wang, Diarmuid O’Reilly-Morgan, Elias Tragos, Barry Smyth, Neil Hurley, James Geraci, and Aonghus Lawlor. 2020. FedFast: Going Beyond Average for Faster Training of Federated Recommender Systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 1234–1242. https://doi.org/10.1145/3394486.3403176
  • Neumann et al. (2020) David Neumann, Felix Sattler, Heiner Kirchhoffer, Simon Wiedemann, Karsten Müller, Heiko Schwarz, Thomas Wiegand, Detlev Marpe, and Wojciech Samek. 2020. DeepCABAC: Plug&Play Compression of Neural Network Weights and Weight Updates. In IEEE International Conference on Image Processing, ICIP 2020, October 25-28, 2020. IEEE, Abu Dhabi, United Arab Emirates, 21–25. https://doi.org/10.1109/ICIP40778.2020.9190821
  • Ovi et al. (2023) Pretom Roy Ovi, Emon Dey, Nirmalya Roy, and Aryya Gangopadhyay. 2023. Mixed Quantization Enabled Federated Learning to Tackle Gradient Inversion Attacks. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Institute of Electrical and Electronics Engineers (IEEE), Vancouver, British Columbia, Canada, 5046–5054. https://doi.org/10.1109/CVPRW59228.2023.00533
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., Vancouver, British Columbia, Canada, 8024–8035. https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • Perifanis and Efraimidis (2022) Vasileios Perifanis and Pavlos S. Efraimidis. 2022. Federated Neural Collaborative Filtering. Know.-Based Syst. 242, C (April 2022), 16 pages. https://doi.org/10.1016/j.knosys.2022.108441
  • Phong et al. (2018) Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai. 2018. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption. IEEE Transactions on Information Forensics and Security 13, 5 (2018), 1333–1345. https://doi.org/10.1109/TIFS.2017.2787987
  • Reisizadeh et al. (2020) Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. 2020. FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 108), Silvia Chiappa and Roberto Calandra (Eds.). PMLR, Online, 2021–2031. https://proceedings.mlr.press/v108/reisizadeh20a.html
  • Ribero et al. (2022) Mónica Ribero, Jette Henderson, Sinead Williamson, and Haris Vikalo. 2022. Federating Recommendations Using Differentially Private Prototypes. Pattern Recogn. 129, C (Sept. 2022), 14 pages. https://doi.org/10.1016/j.patcog.2022.108746
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407. https://www.jstor.org/stable/2236626
  • Rønn Hansen et al. (2022) Christian Rønn Hansen, Gareth Price, Matthew Field, Nis Sarup, Ruta Zukauskaite, Jørgen Johansen, Jesper Grau Eriksen, Farhannah Aly, Andrew McPartlin, Lois Holloway, David Thwaites, and Carsten Brink. 2022. Larynx cancer survival model developed through open-source federated learning. Radiotherapy and Oncology 176, 1 (Nov. 2022), 179–186. https://doi.org/10.1016/j.radonc.2022.09.023
  • Sattler et al. (2019) Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2019. Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication. In 2019 International Joint Conference on Neural Networks, IJCNN 2019 (Proceedings of the International Joint Conference on Neural Networks). Institute of Electrical and Electronics Engineers Inc., Budapest, Hungary. https://doi.org/10.1109/IJCNN.2019.8852172
  • Sattler et al. (2020) Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2020. Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning Systems 31, 9 (2020), 3400–3413. https://doi.org/10.1109/TNNLS.2019.2944481
  • Schrage (2017) Michael Schrage. 2017. Great Digital Companies Build Great Recommendation Engines. Harvard Business Review. https://hbr.org/2017/08/great-digital-companies-build-great-recommendation-engines
  • Schwartz (2004) Barry Schwartz. 2004. The Tyranny of Choice. Scientific American 290, 4 (April 2004), 70–75. https://doi.org/10.1038/scientificamerican0404-70
  • Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15 Companion). Association for Computing Machinery, New York, NY, USA, 111–112. https://doi.org/10.1145/2740908.2742726
  • Seol and Kim (2023) Mihye Seol and Taejoon Kim. 2023. Performance Enhancement in Federated Learning by Reducing Class Imbalance of Non-IID Data. Sensors 23, 3 (2023), 16 pages. https://doi.org/10.3390/s23031152
  • Shakespeare (1994) William Shakespeare. 1994. The Complete Works of William Shakespeare. Project Gutenberg, Vol. 100. Project Gutenberg, P.O. Box 2782, Champaign, IL 61825-2782, USA. https://www.gutenberg.org/ebooks/100
  • Shamir (1979) Adi Shamir. 1979. How to share a secret. Commun. ACM 22, 11 (1979), 612–613.
  • Sherstinsky (2020) Alex Sherstinsky. 2020. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena 404, 1 (March 2020), 132306. https://doi.org/10.1016/j.physd.2019.132306
  • Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-Preserving Deep Learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (Denver, Colorado, USA) (CCS ’15). Association for Computing Machinery, New York, NY, USA, 1310–1321. https://doi.org/10.1145/2810103.2813687
  • Smith et al. (2022) Jessie J. Smith, Lucia Jayne, and Robin Burke. 2022. Recommender Systems and Algorithmic Hate. In Proceedings of the 16th ACM Conference on Recommender Systems (Seattle, WA, USA) (RecSys ’22). Association for Computing Machinery, New York, NY, USA, 592–597. https://doi.org/10.1145/3523227.3551480
  • Stoll (2022) Julia Stoll. 2022. Devices used to watch online video on demand (VOD) worldwide in 1st quarter 2022 and 2nd quarter 2022. Statista. https://www.statista.com/statistics/1329449/vod-device-usage-share-worldwide/
  • Sun et al. (2022) Tao Sun, Dongsheng Li, and Bao Wang. 2022. Adaptive Random Walk Gradient Descent for Decentralized Optimization. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, Baltimore, Maryland, USA, 20790–20809. https://proceedings.mlr.press/v162/sun22b.html
  • Sun et al. (2023) Zehua Sun, Yonghui Xu, Yong Liu, Wei He, Lanju Kong, Fangzhao Wu, Yali Jiang, and Lizhen Cui. 2023. A Survey on Federated Recommendation Systems. arXiv e-prints abs/2301.00767 (March 2023), 15 pages. https://doi.org/10.48550/arXiv.2301.00767 arXiv:2301.00767 [cs.IR]
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM ’18). Association for Computing Machinery, New York, NY, USA, 565–573. https://doi.org/10.1145/3159652.3159656
  • Triastcyn et al. (2022) Aleksei Triastcyn, Matthias Reisser, and Christos Louizos. 2022. Decentralized Learning with Random Walks and Communication-Efficient Adaptive Optimization. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022). NeurIPS, New Orleans, LA, USA.
  • Wainakh et al. (2019) Aidmar Wainakh, Tim Grube, Jörg Daubert, and Max Mühlhäuser. 2019. Efficient Privacy-Preserving Recommendations Based on Social Graphs. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 78–86. https://doi.org/10.1145/3298689.3347013
  • Wang et al. (2022) Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, and Tong Zhang. 2022. On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data. arXiv e-prints abs/2206.04723 (June 2022), 21 pages. https://doi.org/10.48550/arXiv.2206.04723 arXiv:2206.04723 [cs.LG]
  • Wang et al. (2021) Shuai Wang, Richard Cornelius Suwandi, and Tsung-Hui Chang. 2021. Demystifying Model Averaging for Communication-Efficient Federated Matrix Factorization. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Toronto, Ontario, Canada). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 3680–3684. https://doi.org/10.1109/ICASSP39728.2021.9413927
  • Wang et al. (2023) Yanmeng Wang, Qingjiang Shi, and Tsung-Hui Chang. 2023. Batch Normalization Damages Federated Learning on NON-IID Data: Analysis and Remedy. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095399
  • Wei et al. (2020a) Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H. Yang, Farhad Farokhi, Shi Jin, Tony Q. S. Quek, and H. Vincent Poor. 2020a. Federated Learning With Differential Privacy: Algorithms and Performance Analysis. Trans. Info. For. Sec. 15, 1 (Jan. 2020), 3454–3469. https://doi.org/10.1109/TIFS.2020.2988575
  • Wei et al. (2020b) Wenqi Wei, Ling Liu, Margaret Loper, Ka-Ho Chow, Mehmet Emre Gursoy, Stacey Truex, and Yanzhao Wu. 2020b. A Framework for Evaluating Client Privacy Leakages in Federated Learning. In Computer Security – ESORICS 2020, Liqun Chen, Ninghui Li, Kaitai Liang, and Steve Schneider (Eds.). Springer International Publishing, Cham, 545–566.
  • Weissenbacher et al. (2018) Davy Weissenbacher, Abeed Sarker, Michael J. Paul, and Graciela Gonzalez-Hernandez. 2018. Overview of the Third Social Media Mining for Health (SMM4H) Shared Tasks at EMNLP 2018. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics, Brussels, Belgium, 13–16. https://doi.org/10.18653/v1/W18-5904
  • Wiedemann et al. (2020a) Simon Wiedemann, Heiner Kirchhoffer, Stefan Matlage, Paul Haase, Arturo Marban, Talmaj Marinč, David Neumann, Tung Nguyen, Heiko Schwarz, Thomas Wiegand, Detlev Marpe, and Wojciech Samek. 2020a. DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks. IEEE Journal of Selected Topics in Signal Processing 14, 4 (2020), 700–714. https://doi.org/10.1109/JSTSP.2020.2969554
  • Wiedemann et al. (2020b) Simon Wiedemann, Heiner Kirchhoffer, Stefan Matlage, Paul Haase, Arturo Marban, Talmaj Marinč, David Neumann, Tung Nguyen, Heiko Schwarz, Thomas Wiegand, Detlev Marpe, and Wojciech Samek. 2020b. DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks. IEEE Journal of Selected Topics in Signal Processing 14, 4 (2020), 700–714. https://doi.org/10.1109/JSTSP.2020.2969554
  • Wu et al. (2022) Chuhan Wu, Fangzhao Wu, Lingjuan Lyu, Yongfeng Huang, and Xing Xie. 2022. Communication-efficient federated learning via knowledge distillation. Nature Communications 13, 1 (April 2022), 8 pages. https://doi.org/10.1038/s41467-022-29763-x
  • Wu et al. (2020) Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and Ming Zhou. 2020. MIND: A Large-scale Dataset for News Recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 3597–3606. https://doi.org/10.18653/v1/2020.acl-main.331
  • Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (San Francisco, California, USA) (WSDM ’16). Association for Computing Machinery, New York, NY, USA, 153–162. https://doi.org/10.1145/2835776.2835837
  • Wu and He (2020) Yuxin Wu and Kaiming He. 2020. Group Normalization. International Journal of Computer Vision 128, 3 (01 Mar 2020), 742–755. https://doi.org/10.1007/s11263-019-01198-w
  • Yang et al. (2021) Enyue Yang, Yunfeng Huang, Feng Liang, Weike Pan, and Zhong Ming. 2021. FCMF: Federated collective matrix factorization for heterogeneous collaborative filtering. Knowledge-Based Systems 220, 1 (March 2021), 106946. https://doi.org/10.1016/j.knosys.2021.106946
  • Yelp (2021) Yelp. 2021. Yelp Dataset. Yelp Inc. https://www.yelp.com/dataset
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York, NY, USA, 974–983. https://doi.org/10.1145/3219819.3219890
  • Yue et al. (2023) Kai Yue, Richeng Jin, Chau-Wai Wong, Dror Baron, and Huaiyu Dai. 2023. Gradient Obfuscation Gives a False Sense of Security in Federated Learning. In Proceedings of the 32nd USENIX Conference on Security Symposium (Anaheim, California, United States of America) (SEC ’23). USENIX Association, USA, Article 357, 18 pages.
  • Zaccone et al. (2022) Riccardo Zaccone, Andrea Rizzardi, Debora Caldarola, Marco Ciccone, and Barbara Caputo. 2022. Speeding up Heterogeneous Federated Learning with Sequentially Trained Superclients. In 2022 26th International Conference on Pattern Recognition (ICPR) (Montréal, Québec, Canada). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 3376–3382. https://doi.org/10.1109/ICPR56361.2022.9956084
  • Zhang et al. (2022) Honglei Zhang, Fangyuan Luo, Jun Wu, Xiangnan He, and Yidong Li. 2022. LightFR: Lightweight Federated Recommendation with Privacy-Preserving Matrix Factorization. ACM Trans. Inf. Syst. 41, 2 (Dec. 2022), 1–28. https://doi.org/10.1145/3578361 Just Accepted.
  • Zhang and Jiang (2021) JianFei Zhang and YuChen Jiang. 2021. A vertical federation recommendation method based on clustering and latent factor model. In 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS). Institute of Electrical and Electronics Engineers (IEEE), 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA, 362–366. https://doi.org/10.1109/EIECS53707.2021.9587935
  • Zhao et al. (2020) Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. 2020. iDLG: Improved Deep Leakage from Gradients. arXiv e-prints abs/2001.02610 (Jan. 2020), 5 pages. https://doi.org/10.48550/arXiv.2001.02610 arXiv:2001.02610 [cs.LG]
  • Zhu et al. (2021) Hangyu Zhu, Jinjin Xu, Shiqing Liu, and Yaochu Jin. 2021. Federated Learning on Non-IID Data: A Survey. Neurocomput. 465, C (Nov. 2021), 371–390. https://doi.org/10.1016/j.neucom.2021.07.098
  • Zhu et al. (2019) Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep Leakage from Gradients. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Vancouver, British Columbia, Canada. https://proceedings.neurips.cc/paper/2019/file/60a6c4002cc7b29142def8871531281a-Paper.pdf
ANE: Apple Neural Engine
BatchNorm: batch normalization
BLSTM: bi-directional long short-term memory
BRNN: bi-directional recurrent neural network
CelebA: Large-scale CelebFaces Attributes Dataset
CNN: convolutional neural network
DeepCABAC: Deep Context-Adaptive Binary Arithmetic Coding
DL: deep learning
DNN: deep neural network
EMNIST: Extended MNIST
EU: European Union
FedAvg: federated averaging
FedCat: federated learning via device concatenation
FedDC: federated daisy-chaining
FedQ: federated learning with client queuing
FedRec: federated recommender system
FedSeq: federated learning via sequential superclients training
FedSGD: federated stochastic gradient descent
FEMNIST: Federated EMNIST
FL: federated learning
GCN: graph convolutional network
GDPR: General Data Protection Regulation
GloVe: Global Vectors for Word Representation
GroupNorm: group normalization
GRU: gated recurrent unit
i.i.d.: independent and identically distributed
LSTM: long short-term memory
MAML: model-agnostic meta-learning
ML: machine learning
MNIST: Modified NIST Database
MSE: mean squared error
NCF: neural collaborative filtering
NIST: National Institute of Standards and Technology
NN: neural network
NNC: neural network coding
PSNR: peak signal-to-noise ratio
QP: quantization parameter
RecSys: recommender system
ReLU: rectified linear unit
RNN: recurrent neural network
SGD: stochastic gradient descent
SoC: System on a Chip
TMDb: The Movie Database
UML: Unified Modeling Language
VoD: video-on-demand

Appendix A Federated Learning Simulator

The unique requirements for the FedRec presented in this work, especially the considerable number of FL clients involved, render its experimental evaluation exceedingly difficult. Performing the experiments under real-world conditions, i.e., deploying real devices that communicate with the central server via a network connection, was infeasible. Simulating the FL process is common in research, since most of the time the algorithmic and methodological underpinnings of the process are to be researched. Simulating an FL system with the required number of clients proved to be challenging, however, as simultaneously keeping all clients, their local datasets, and their local models in main memory is impractical. This meant that some concessions had to be made in order to be able to perform the simulations.

The first and most obvious concession is that the clients must be strictly trained sequentially, which increases training times considerably, but allows for the training of the clients on limited computing resources. This enables the simulator to run on hardware whose processing capabilities allow for the training of at least one client.

The second concession is that the clients cannot all remain in main memory at the same time. In fact, because of the potentially high amount of data involved in the training process, intermediate results cannot be stored on hard disks either. FL clients usually comprise the following data: a local dataset, a local model, and the data required by the optimizer. At the very least, an optimizer must store the gradient of the loss function with respect to the weights, whose size is equal to the size of the NN model itself. Furthermore, some optimization algorithms require the storage of additional information. For example, the Adam optimizer (Kingma and Ba, 2015) also stores estimates of the first and second moments of the gradient, which are both equal in size to the gradient. Finally, the central server must store the global model, as well as all client updates, which are each equal in size to the global model. Although a single RecSys NN model is only tens of megabytes in size, this can add up to a prohibitive amount of data. For example, in the federated training of the candidate generator NN model with more than 162,000 clients, the amount of storage required for all clients adds up to approximately 32 terabytes when using the regular SGD optimizer (Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952) and approximately 52 terabytes when using the Adam optimizer. Considering these memory requirements, it becomes obvious that keeping everything in memory or on disk is not a viable option for the simulation.
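As a rough plausibility check of these storage figures, the following back-of-the-envelope sketch multiplies the model size by the number of clients. All inputs are our assumptions rather than details confirmed by the text: 32-bit floating-point parameters, 162,541 clients (the number of rows of the ranker's user embedding layer), the 17,994,852 parameters of the candidate generator, three model-sized tensors per client for SGD (local model, gradient, and the update held by the server), and five for Adam (additionally the first and second moment estimates). Local datasets are ignored, so the result should only land in the same range as the figures quoted above.

```python
# Back-of-the-envelope storage estimate for keeping every client resident.
# All constants below are assumptions made for illustration.
NUM_CLIENTS = 162_541         # assumed: one client per user in the dataset
NUM_PARAMETERS = 17_994_852   # candidate generator parameter count
BYTES_PER_PARAMETER = 4       # assumed: 32-bit floating-point parameters
TIB = 2 ** 40                 # bytes per tebibyte


def naive_storage_tib(tensors_per_client: int) -> float:
    """Storage in TiB if every client keeps `tensors_per_client` model-sized tensors."""
    model_bytes = NUM_PARAMETERS * BYTES_PER_PARAMETER
    return NUM_CLIENTS * tensors_per_client * model_bytes / TIB


# SGD: local model + gradient + client update stored on the server.
sgd_tib = naive_storage_tib(3)
# Adam: the three tensors above + first and second moment estimates.
adam_tib = naive_storage_tib(5)

print(f"SGD:  {sgd_tib:.1f} TiB")
print(f"Adam: {adam_tib:.1f} TiB")
```

Under these assumptions the estimate scales linearly with both the number of clients and the number of model-sized tensors per client, which is why a stateful optimizer such as Adam is noticeably more expensive than plain SGD.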

The FL simulator employs multiple simple improvements to circumvent the need to keep all data in the main memory simultaneously. Since the clients are trained sequentially, they do not have to keep their local datasets and models in memory simultaneously. At the beginning of each training round, the central server sends the parameters of the global model to the current client, which will load its local dataset and instantiate a local model using the parameters of the global model before starting training. After the local training has finished, the client sends the updated parameters of its local model back to the central server and frees up the memory resources occupied by its local dataset and model.

The second improvement is to not store all client updates on the central server and only aggregate them when all client updates have been received. Instead, a cumulative mean is kept in memory, which is updated each time the central server receives an update from a client. Since the client updates are weighted by the amount of data each client has trained on, this means that the central server must ask all participating clients at the beginning of each training round to reveal the size of their local dataset. In a real-world scenario, each client would send this information when it sends its training updates to the central server, but to be able to keep a cumulative mean, the server must know the local dataset sizes of all clients in advance. The central server then computes the percentage of training data every client contributes to the overall training, which it uses as the weights for the cumulative mean. When a client finishes its local training and sends the training updates back to the central server, the central server multiplies the received update by the weight of the respective client and adds it to the cumulative mean. After all clients have finished their training and the central server has aggregated all their contributions, the cumulative mean is equal to the actual mean of the client updates, which is then used as the updated parametrization of the global model.
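The cumulative weighted mean described above can be sketched in a few lines. This is a minimal illustration, not the simulator's actual code: the function names `aggregate_streaming` and `aggregate_batch` are hypothetical, and model parametrizations are represented as flat lists of floats for simplicity.

```python
from typing import Iterable, List, Sequence, Tuple


def aggregate_streaming(
    dataset_sizes: Sequence[int],
    client_updates: Iterable[Tuple[int, List[float]]],
) -> List[float]:
    """Fold client updates into a running weighted mean, one update at a time."""
    total = sum(dataset_sizes)
    # The server must know all local dataset sizes in advance to fix the weights.
    weights = [size / total for size in dataset_sizes]
    running: List[float] = []
    for client_id, update in client_updates:
        if not running:
            running = [0.0] * len(update)
        weight = weights[client_id]
        for i, value in enumerate(update):
            running[i] += weight * value
        # At this point the client's update is no longer needed and can be freed.
    return running


def aggregate_batch(dataset_sizes: Sequence[int], updates: Sequence[List[float]]) -> List[float]:
    """Reference implementation: store all updates, then average them at the end."""
    total = sum(dataset_sizes)
    return [
        sum(size * update[i] for size, update in zip(dataset_sizes, updates)) / total
        for i in range(len(updates[0]))
    ]


sizes = [10, 30, 60]
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
streamed = aggregate_streaming(sizes, enumerate(updates))
stored = aggregate_batch(sizes, updates)
print(streamed, stored)
```

Both functions compute the same weighted average; the streaming variant only ever holds one client update and the running sum in memory, at the cost of needing the dataset sizes before training starts.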

Finally, the last improvement is to only use the SGD optimizer, as it is stateless and requires no memory at all beyond having to store the gradient. However, since the gradient is volatile and only needs to be stored until it has been applied to the weights of the local model, its memory footprint is as low as possible. Other optimizers, such as Adam, need to store further information, which cannot be discarded and must be kept in memory for the entire duration of the federated training, thus rendering these optimizers unusable. Depending on the model and the training objective, the choice of optimizer can influence convergence time, as well as the final performance of the model. In the present case SGD can train the RecSys models to the same level of performance in a similar amount of time as Adam.

Employed in unison, these improvements not only bring the computational requirements down to a manageable level, but they also reduce the memory footprint to a small fraction of the theoretical requirements. In fact, only the validation dataset and model of the central server, the cumulative mean of the client updates, the local training dataset and model of the current client, and the gradient calculated by the current client’s optimizer will ever be in memory at the same time. This reduces the memory requirements of the FL simulator from several tens of terabytes down to a few gigabytes.

The temporal complexity of the FL simulator is still relatively high, however. Although a round of local training of a single client only requires a couple of minutes, this adds up to a substantial amount of training time considering that the federated training procedure must be repeated dozens or even hundreds of times. This is a problem that is also faced by real-world FL systems and is usually solved by client sub-sampling (Fraboni et al., 2023; Chen et al., 2022), i.e., only selecting a small random subset of clients from the client population for each round of training. This is also the solution that we have chosen: For each round of federated training, we only select between 100 and 10,000 from the more than 162,000 clients. In conclusion, all these measures make it possible to train the RecSys models using FL in a matter of a few days.
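Client sub-sampling itself is simple to express. The sketch below assumes uniform sampling without replacement and a fresh draw in every round; the helper name `sample_clients` and the concrete numbers are illustrative assumptions, and the cited works discuss more elaborate sampling strategies.

```python
import random
from typing import List


def sample_clients(num_clients: int, clients_per_round: int, rng: random.Random) -> List[int]:
    """Draw a uniform random subset of client IDs without replacement."""
    if not 0 < clients_per_round <= num_clients:
        raise ValueError("clients_per_round must be between 1 and num_clients")
    return rng.sample(range(num_clients), clients_per_round)


rng = random.Random(0)  # fixed seed so that a simulation run is reproducible
for round_index in range(3):
    selected = sample_clients(num_clients=162_541, clients_per_round=100, rng=rng)
    print(round_index, len(selected), selected[:5])
```

Because only the selected clients are trained in a round, the wall clock time of a simulated round grows with `clients_per_round` rather than with the size of the whole client population.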

Appendix B Candidate Generator Experiments

In this appendix we will perform various experiments to determine the optimal model architecture for the candidate generator model described in Section 3.3, and then provide a detailed explanation of the chosen NN architecture.

B.1. Model Type Experiment

There are many possible architectures for candidate generator models based on NNs, ranging from simple DNNs (Covington et al., 2016) and RNNs (Choe et al., 2021) to more elaborate autoencoder architectures (Wu et al., 2016). Since the NN model will be trained using FL, the size of the model is a crucial factor. Mobile devices, such as smartphones, are likely candidates for training the RecSys, as smartphones are the most used devices to watch online VoD content (Stoll, 2022). Although some modern smartphones even have dedicated hardware for NN training and inference (for example, the Apple Neural Engine (ANE) introduced with the iPhone X’s A11 System on a Chip (SoC) and Google’s Tensor SoC introduced with the Pixel 6 line of smartphones), they are still very resource-constrained compared to contemporary ML hardware. Therefore, only the simplest architectures can be considered for the candidate generator. The most basic NN architectures are feed-forward fully-connected DNNs. However, as the candidate generator will be trained on time-series data, RNNs would be a more appropriate choice. Therefore, an experiment with a simple feed-forward fully-connected architecture and multiple simple recurrent architectures, including plain RNNs (Sherstinsky, 2020), LSTM (Hochreiter and Schmidhuber, 1997) networks, and gated recurrent units (Cho et al., 2014), was conducted. The recurrent architectures were all trained as both unidirectional and bidirectional models. The results of this experiment are shown in Figure 15.

Figure 15. Validation top-100 accuracy results vs. number of epochs for different candidate generator model types.

The LSTM and the GRU have the worst average performance of the tested model architectures. They clearly show the pitfalls of recurrent architectures: Although the best-performing recurrent architectures reach best-in-class performance, they are tricky to train and show a large variance in training performance. Surprisingly, the RNNs are the highest performing among the recurrent architectures. Generally, the bi-directional versions of the recurrent architectures outperform their unidirectional counterparts. The feed-forward fully-connected model (denoted as DNN) reaches an acceptable performance, which is almost as high as that of the bi-directional recurrent neural network (BRNN) or the bi-directional long short-term memory (BLSTM). Considering only the performance of the tested architectures, the BRNN should be favored, but it also has its downsides: (1) it is the slowest to converge, with an average wall clock time of roughly 200 hours as compared to roughly 55 hours for the DNN, which is almost 4 times as long, and (2) the complexity of the two architectures differs significantly: while the DNN has only 17,994,852 trainable parameters, the BRNN has 128,494,436 trainable parameters, which is more than 7 times as many. The same is true for the BLSTM: It is much slower to converge in terms of wall clock time and is significantly larger. Especially considering that the model must be trained on resource-constrained devices, the simpler but also well-performing DNN architecture was selected.

B.2. Movie Embedding Layer Size Experiment

The size of the embedding vectors has a substantial impact on the classification result: when they are too small, they cannot fully capture the latent information from the data. When they are too large, there is a computational cost and a risk of overfitting, which means that more data (or regularization) is needed to properly train the model. We determined the optimal size of the embedding vectors experimentally by testing different sizes, as shown in Figure 16. The results demonstrate that increasing the size of the movie embedding vectors directly results in a performance gain, but the return on investment falls off quickly: While doubling from a size of 32 to a size of 64 results in a sizable performance increase of roughly 0.83 percentage points on average, doubling it again to 128 only yields a rise of roughly 0.12 percentage points on average. This means that a 64-dimensional embedding vector provides the best trade-off between performance and model size.

Figure 16. Validation top-100 accuracy results vs. number of epochs for different numbers of dimensions of the movie embedding layer in the candidate generator model.

B.3. Number of Hidden Layers Experiment

Likewise, the number of hidden layers in the candidate generator model also impacts both the performance, as well as the size of the resulting model. We performed an experiment with varying numbers of hidden layers. The results are shown in Figure 17, giving an optimum of a 3-layer configuration, as both increasing and decreasing the number of hidden layers results in inferior performance.

Figure 17. Validation top-100 accuracy results vs. number of epochs for different numbers of hidden layers of the candidate generator model.

B.4. Candidate Generator Model Architecture

The final NN architecture that was chosen for the candidate generator has a 64-dimensional embedding layer for the movies in the watch history inputs, followed by three hidden fully-connected layers, which are each followed by a normalization layer and a ReLU activation. The hidden layers with their normalization layers and ReLU activations are then followed by an output fully-connected layer, which feeds its logits into a softmax.

Ever since its introduction, BatchNorm (Ioffe and Szegedy, 2015) has been a mainstay in deep learning. Today it is used in a wide variety of NN architectures. Unfortunately, BatchNorm also comes with some drawbacks. First and foremost, BatchNorm normalizes along the batch dimension, which causes problems with small batch sizes: the estimation of the statistics of a batch becomes more error-prone the smaller the batch size becomes, which can make the training unstable. Especially in FL, the clients tend to use small batch sizes because of their limited computing power. Group normalization (GroupNorm) (Wu and He, 2020) was introduced to deal with this problem. Instead of estimating the mean and variance of the data based on the batches, it divides the data into groups and measures the statistics within these groups. This makes GroupNorm independent of the batch size.
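The difference between the two normalization schemes can be illustrated with a minimal pure-Python sketch. It omits the trainable scale and shift parameters and the running statistics, uses flat feature vectors as they occur after a fully-connected layer, and the function names `group_norm` and `batch_norm` are our own; it is meant to show the principle rather than to mirror any particular library implementation.

```python
import math
from typing import List


def group_norm(x: List[float], num_groups: int, eps: float = 1e-5) -> List[float]:
    """Normalize the features of ONE sample within equally sized feature groups."""
    if len(x) % num_groups != 0:
        raise ValueError("number of features must be divisible by num_groups")
    size = len(x) // num_groups
    out: List[float] = []
    for start in range(0, len(x), size):
        group = x[start:start + size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out


def batch_norm(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature across the batch dimension (training-mode statistics)."""
    n, num_features = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(num_features)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(num_features)]
        for row in batch
    ]


sample = [1.0, 2.0, 3.0, 4.0]
gn = group_norm(sample, num_groups=2)
# The same sample normalized by BatchNorm in two different (tiny) batches.
bn_a = batch_norm([sample, [0.0] * 4])[0]
bn_b = batch_norm([sample, [10.0] * 4])[0]
print(gn, bn_a, bn_b)
```

The GroupNorm output of `sample` depends only on `sample` itself, whereas its BatchNorm output changes with whatever other samples happen to share the batch, which is exactly the effect that becomes problematic for small batches.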

Secondly, BatchNorm keeps, besides its two trainable parameters γ and β, a running average of the mean and the variance of the batches. This makes it complicated to use in FL, as the running averages of the mean and the variance cannot be simply averaged. Li et al. (2021) propose to only communicate the trainable parameters of BatchNorm, i.e., γ and β, to the central server for aggregation but keep the running average of the batch statistics local. GroupNorm, however, has the edge over BatchNorm in this case, as it does not keep a running average of the data statistics and instead always estimates them from the current input.

Wang et al. (2023) have performed a convergence analysis and were able to show that, although several schemes have been proposed to remedy the problems of BatchNorm in FL, most of them still suffer a loss in performance. The reason is that a mismatch between the local and global statistics, incurred by non-i.i.d. data distributions, causes a gradient deviation, which in turn leads the model to converge to a biased solution at a slower rate. To avoid all of the above-mentioned problems, we have decided to use GroupNorm for all FL experiments and BatchNorm for all non-FL experiments. A detailed breakdown of the layers that comprise the NN architecture of the candidate generator model is presented in Table 3.

| Type | Shape | Parameters |
|---|---|---|
| Embedding Layer | 53,797 × 64 | 3,443,008 |
| Fully-Connected Layer | Weights: 1,024 × 64; Bias: 1,024 | 66,560 |
| BatchNorm Layer or GroupNorm Layer (32 Groups) | Gamma: 1,024; Beta: 1,024 | 2,048 |
| ReLU | | |
| Fully-Connected Layer | Weights: 512 × 1,024; Bias: 512 | 524,800 |
| BatchNorm Layer or GroupNorm Layer (32 Groups) | Gamma: 512; Beta: 512 | 1,024 |
| ReLU | | |
| Fully-Connected Layer | Weights: 256 × 512; Bias: 256 | 131,328 |
| BatchNorm Layer or GroupNorm Layer (32 Groups) | Gamma: 256; Beta: 256 | 512 |
| ReLU | | |
| Fully-Connected Layer | Weights: 53,796 × 256; Bias: 53,796 | 13,825,572 |
| Softmax | | |
| Total | | 17,994,852 |

Table 3. A detailed breakdown of the layers that make up the architecture of the candidate generator NN model.
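The parameter counts in Table 3 follow directly from the layer shapes and can be re-derived with a few lines of arithmetic. The helper functions below are illustrative and assume the usual conventions: a fully-connected layer has one weight per input-output pair plus one bias per output, and a normalization layer has one gamma and one beta per feature.

```python
def embedding_params(num_embeddings: int, dim: int) -> int:
    """One `dim`-dimensional vector per embedding table row."""
    return num_embeddings * dim


def dense_params(inputs: int, outputs: int) -> int:
    """Weight matrix plus bias vector of a fully-connected layer."""
    return inputs * outputs + outputs


def norm_params(features: int) -> int:
    """Gamma and beta of a BatchNorm or GroupNorm layer."""
    return 2 * features


candidate_generator = {
    "embedding": embedding_params(53_797, 64),
    "fc1": dense_params(64, 1_024),
    "norm1": norm_params(1_024),
    "fc2": dense_params(1_024, 512),
    "norm2": norm_params(512),
    "fc3": dense_params(512, 256),
    "norm3": norm_params(256),
    "output": dense_params(256, 53_796),
}

for name, count in candidate_generator.items():
    print(f"{name:>9}: {count:>12,}")
total = sum(candidate_generator.values())
print(f"{'total':>9}: {total:>12,}")  # 17,994,852
```

The breakdown makes it apparent that the output layer and the embedding layer, both of which scale with the number of movies, account for the vast majority of the parameters, while the hidden layers contribute comparatively little.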

Appendix C Ranker Experiments

In this appendix we will perform various experiments to determine the optimal model architecture for the ranker model described in Section 3.4, as well as the loss function that is used for its training. Furthermore, we will provide a detailed explanation of the chosen NN architecture.

C.1. Embedding Layer Sizes Experiment

Again, the embedding vector sizes for the three embedding layers for the users, the movies, and the movie genres must be fine-tuned. Embeddings that are too large may result in larger model sizes, overfitting, and longer convergence times. Therefore, we experimentally determined the optimal size of the embedding vectors for each embedding layer. As can be seen in Figure 18, the optimal embedding sizes are 32 for users, 128 for movies, and 16 for genres. In the case of the user and the genre embedding vector sizes, the results are clear-cut, as 32 dimensions and 16 dimensions, respectively, outperform all other embedding vector sizes both in terms of best final accuracy and MSE, as well as in best overall accuracy and MSE. The results for the movie embedding vector size are somewhat ambiguous: 16 dimensions outperform the other embedding vector sizes in terms of final accuracy and MSE, whereas both 128 dimensions and 256 dimensions yield the best overall accuracies and MSEs. To balance accuracy and computational complexity, we selected 128 dimensions, which offers a better overall accuracy and MSE than the 16-dimension case and less computational complexity than the 256-dimension case.

Figure 18. Validation accuracy (left) and MSE results (right) vs. number of epochs for different embedding vector sizes for the three embedding layers in the ranker model: (a) user embedding layer, (b) movie embedding layer, and (c) genre embedding layer. Each graph shows the minimum and maximum (given by the transparent region), as well as the mean (given by the solid line) of five repetitions of each experiment.

C.2. Number of Hidden Layers Experiment

Similar to the candidate generator, we also determined the optimal number of hidden layers, as shown in Figure 19, resulting in an optimal ranker model with 1 hidden layer. The ranker model with 2 layers converges faster than the ranker model with 1 layer, however, at the expense of a lower final accuracy. Using 3 hidden layers already introduces overfitting and lowers the accuracy further. A model with only 1 hidden layer also has a lower computational complexity and is thus beneficial in the FL setting.

Figure 19. Validation top-100 accuracy (left) and MSE results (right) vs. number of epochs for different numbers of hidden layers for the ranker model. Each of the graphs shows the minimum and maximum (given by the transparent region), as well as the mean (given by the solid line) of five repetitions of each experiment.

C.3. Loss Function Experiment

As the ranker model is trained to perform a classification task, the softmax cross-entropy loss function can be used. However, unlike in a typical classification problem, we want our prediction to be close to the correct value, even if it is wrong: predicting a rating of 3.5 when the actual ground-truth rating is 4.0 is still better than predicting a rating of 0.5, because the deviation from the true rating is smaller. Therefore, other loss functions such as MSE, which penalize both incorrect predictions and the magnitude of the deviation, may be better suited. To determine this, we conducted experiments using softmax cross-entropy, MSE, and the sum of the two to combine the best of both approaches. The results are shown in Figure 20. Against our expectations, the MSE loss function performs worse than the other two in terms of validation MSE, although one would assume that a model optimized on the MSE loss function should perform best when its performance is measured in terms of MSE. The MSE loss function does prevent the model from overfitting in terms of accuracy, unlike the other loss functions, and it outperforms them in terms of the final accuracy of the model, but the other loss functions converge faster and achieve a better overall accuracy. The softmax cross-entropy loss function and the sum of both yield a similar accuracy; however, the softmax cross-entropy loss function is computationally less complex and was therefore selected.
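The motivation for considering a distance-aware loss can be made concrete with a small sketch. It compares two wrong predictions for a true rating of 4.0, one peaked at 3.5 and one peaked at 0.5. Two things are assumptions on our part and are not specified in the text: the ten rating classes are taken to be 0.5, 1.0, ..., 5.0, and the squared error is computed between the true rating and the expected rating under the softmax distribution, which is only one of several plausible formulations.

```python
import math
from typing import List

# Assumed rating classes: 0.5, 1.0, ..., 5.0 (ten classes).
RATINGS = [0.5 * (i + 1) for i in range(10)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])


def expected_rating_squared_error(probs: List[float], target_index: int) -> float:
    """Squared distance between the expected rating and the true rating (assumed formulation)."""
    expected = sum(p * r for p, r in zip(probs, RATINGS))
    return (expected - RATINGS[target_index]) ** 2


def peaked_logits(peak_index: int, height: float = 4.0) -> List[float]:
    """Logits that are zero everywhere except for a single confident peak."""
    logits = [0.0] * len(RATINGS)
    logits[peak_index] = height
    return logits


target = RATINGS.index(4.0)
near = softmax(peaked_logits(RATINGS.index(3.5)))  # wrong, but close to 4.0
far = softmax(peaked_logits(RATINGS.index(0.5)))   # wrong and far from 4.0

ce_near, ce_far = cross_entropy(near, target), cross_entropy(far, target)
se_near = expected_rating_squared_error(near, target)
se_far = expected_rating_squared_error(far, target)
print(f"cross-entropy: near={ce_near:.4f} far={ce_far:.4f}")
print(f"squared error: near={se_near:.4f} far={se_far:.4f}")
```

In this constructed example the cross-entropy cannot tell the two mistakes apart, because it only looks at the probability assigned to the true class, whereas the squared error penalizes the distant prediction much more strongly.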

Figure 20. Validation top-100 accuracy (left) and MSE results (right) vs. number of epochs for different loss functions for training the ranker model. Each of the graphs shows the minimum and maximum (given by the transparent region), as well as the mean (given by the solid line) of five repetitions of each experiment.

C.4. Ranker Model Architecture

The final NN architecture that was chosen for the ranker has a 32-dimensional embedding layer for the user, a 128-dimensional embedding layer for the movie, and a 16-dimensional embedding layer for the genres. The genres are then averaged and all inputs, including the embeddings and the movie age, are concatenated. This is followed by a single hidden fully-connected layer. The output of the hidden layer is normalized using a normalization layer, which is followed by a ReLU activation. The hidden layer with its normalization layer and ReLU activation is then followed by an output fully-connected layer, which feeds its logits into a softmax. For the reasons described in Appendix B.4, a GroupNorm layer is used for the FL experiments and a BatchNorm layer is used for the non-FL experiments. A detailed breakdown of the layers that comprise the NN architecture of the ranker model is presented in Table 4.
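How the inputs of the ranker are assembled can be sketched as follows. The embedding lookups are replaced by placeholder vectors, the function names are hypothetical, and the order in which the parts are concatenated is an assumption; the point is merely to show how the genre embeddings are averaged and how the input width of the hidden layer arises from the embedding sizes.

```python
from typing import List

USER_DIM, MOVIE_DIM, GENRE_DIM = 32, 128, 16


def average_vectors(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_ranker_input(
    user_vec: List[float],
    movie_vec: List[float],
    genre_vecs: List[List[float]],
    movie_age: float,
) -> List[float]:
    """Concatenate user, movie, averaged genre embeddings, and the scalar movie age."""
    genre_avg = average_vectors(genre_vecs)
    return user_vec + movie_vec + genre_avg + [movie_age]


# Placeholder embeddings standing in for the learned embedding lookups.
user = [0.1] * USER_DIM
movie = [0.2] * MOVIE_DIM
genres = [[1.0] * GENRE_DIM, [3.0] * GENRE_DIM]  # a movie with two genres

features = build_ranker_input(user, movie, genres, movie_age=0.5)
print(len(features))  # 32 + 128 + 16 + 1 = 177
```

Averaging the genre embeddings yields a fixed-size vector regardless of how many genres a movie has, which is what allows the concatenated input to have a constant width.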

| Type | Shape | Parameters |
|---|---|---|
| User Embedding Layer | 162,541 × 32 | 5,201,312 |
| Movie Embedding Layer | 53,796 × 128 | 6,885,888 |
| Genre Embedding Layer | 20 × 16 | 320 |
| Genre Embedding Average | | |
| Input Concatenation | | |
| Fully-Connected Layer | Weights: 256 × 177; Bias: 256 | 45,568 |
| BatchNorm Layer or GroupNorm Layer (32 Groups) | Gamma: 256; Beta: 256 | 512 |
| ReLU | | |
| Fully-Connected Layer | Weights: 10 × 256; Bias: 10 | 2,570 |
| Softmax | | |
| Total | | 12,136,170 |

Table 4. A detailed breakdown of the layers that make up the architecture of the ranker NN model.

Appendix D Extended Federated Learning and FedQ Experiment Results

Due to the broad range of different numbers of clients for the FL and FedQ experiments, the complete training graphs are poorly legible and were therefore omitted from Section 4.3 and Section 4.4. Instead, only the final validation top-100 accuracy for the candidate generator experiments, as well as the final accuracy and MSE for the ranker experiments were presented. For reference, the complete training graphs are included in this appendix as Figure 21 and Figure 22.

Figure 21. FL experiment results for (a) the candidate generator and (b) the ranker.

Figure 22. FedQ experiment results for (a) the candidate generator and (b) the ranker.

Appendix E Validation of FedQ on the LEAF Federated Learning Benchmark

Although FedQ was developed in the context of a FedRec, it is a much more general algorithm that can be employed in other FL pipelines that have to deal with small local datasets. To provide further evidence of FedQ’s efficacy, it was evaluated on LEAF, which is an open-source, modular benchmarking framework for federated settings (Caldas et al., 2019b). It consists of (1) multiple open-source datasets, (2) reference implementations for common FL methods, and (3) several metrics that measure the statistical properties of the models that are being trained (e.g., accuracy), as well as metrics that measure properties of the FL system (e.g., number of communicated bytes and local computation). The reference implementation currently includes scripts for preprocessing the data, the federated optimization algorithms FedSGD and FedAvg, and one or more model architectures for each of the included datasets. We based the evaluation of FedQ on the following datasets contained in LEAF:

  • Federated Extended MNIST (FEMNIST) (Caldas et al., 2019b) is a dataset that was created by the authors of the LEAF benchmark by partitioning the digit and character images of the Extended MNIST (EMNIST) (Cohen et al., 2017) dataset by the person who wrote them. This partitioning makes the dataset more amenable to FL, since writers can be understood as clients. EMNIST is a dataset that was created from the National Institute of Standards and Technology (NIST) Special Database 19 (Grother and Hanaoka, 1995), which is the same database that the popular MNIST (Lecun et al., 1998) is based on. The NIST Special Database 19 contains handwritten digits, uppercase letters, and lowercase letters, which is much more data than what is exposed by MNIST. EMNIST was created in an effort to provide a more challenging benchmark dataset by covering all data contained in the NIST Special Database 19, while employing the same conversion paradigm used for MNIST to stay compatible.

  • The Large-scale CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015) is a dataset that contains images of celebrities that were annotated with 40 attributes, including wearing eyeglasses, wearing a hat, wavy hair, and smiling. For the LEAF benchmark, CelebA was adapted to the federated setting by partitioning it into client datasets based on the celebrity in the image. Furthermore, the classification task was simplified from a multi-label classification task to a binary classification task, which only distinguishes between smiling and not smiling celebrities.

  • Sentiment140 (Go et al., 2009) is an automatically generated dataset that contains Twitter messages that are classified as either positive or negative based on the emoticons contained in them. The dataset therefore presents a binary classification sentiment analysis task, where the input is a sequence of words. For the use in the LEAF benchmark, the messages are partitioned, such that each FL client is represented by a different Twitter user.

  • Reddit (Caldas et al., 2019b) is a dataset that was created by the authors of the LEAF benchmark. They took comments posted on the social network Reddit in December 2017 and preprocessed them by (1) converting all named and numeric HTML character references to their corresponding Unicode characters, (2) removing extraneous whitespace, (3) removing non-ASCII characters, (4) replacing URLs, Reddit user names, and Subreddit names with special tokens, (5) converting the text to lowercase, and (6) tokenizing it using NLTK’s (Bird et al., 2009) tweet tokenizer. Furthermore, users that were determined to be bots, or that had fewer than 5 or more than 1000 comments, were removed along with their comments. Caldas et al. sub-sampled the dataset for their own experiments, as their reference implementation is not yet capable of training on the complete Reddit dataset. The training task of the dataset is next word prediction with a sequence of previous words as input. Each Reddit user is considered to be an FL client.
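
To make the six Reddit preprocessing steps listed above more concrete, the following minimal sketch applies them in order to a single comment. It is not the LEAF preprocessing code: the regular expressions, the spelling of the special tokens, and the function names are assumptions made for illustration, whitespace splitting stands in for NLTK's tweet tokenizer to keep the example dependency-free, and bot detection is omitted.

```python
import html
import re

# Hypothetical patterns and token spellings; the actual LEAF code may differ.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"(?<![\w/])/?u/\w+")
SUBREDDIT_RE = re.compile(r"(?<![\w/])/?r/\w+")


def preprocess_comment(text: str) -> list[str]:
    text = html.unescape(text)                         # (1) HTML character references
    text = " ".join(text.split())                      # (2) extraneous whitespace
    text = text.encode("ascii", "ignore").decode("ascii")  # (3) non-ASCII characters
    text = URL_RE.sub("<url>", text)                   # (4) special tokens
    text = USER_RE.sub("<user>", text)
    text = SUBREDDIT_RE.sub("<subreddit>", text)
    text = text.lower()                                # (5) lowercase
    return text.split()                                # (6) stand-in tokenizer


def keep_user(num_comments: int) -> bool:
    """Users with fewer than 5 or more than 1000 comments are removed."""
    return 5 <= num_comments <= 1000


tokens = preprocess_comment(
    "Check   &amp; see https://example.com/x by /u/Some_User in r/Movies caf\u00e9!"
)
print(tokens)
```

Because lowercasing happens after the token replacement, the special tokens in this sketch are written in lowercase from the start so that step (5) leaves them unchanged.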

The LEAF benchmark provides two more datasets: Shakespeare (McMahan et al., 2017), which is a dataset that is based on “The Complete Works of William Shakespeare” (Shakespeare, 1994), where each speaking role represents an FL client, and a synthetic dataset that is based on the synthetic dataset proposed by Li et al. (2020b). The Shakespeare dataset comprises 4,226,158 samples across 1,129 FL clients (i.e., speaking roles). On average, each client has 3,743.28 samples with a standard deviation of 6,212.26. FedQ is specifically tailored towards federated scenarios with small local datasets. The Shakespeare dataset is therefore inadequate for evaluating FedQ’s potency, as almost all clients have plenty of local data (only 8 clients have fewer than 10 and 114 clients have fewer than 100 samples in their local datasets). The synthetic dataset was specifically designed by Caldas et al. to create a more challenging task for meta-learning methods, which does not apply to FedQ. For these reasons, we decided to only evaluate FedQ on the four above-mentioned datasets.

To conduct the FedQ benchmark experiments, we used the LEAF reference implementation and integrated FedQ as a new federated optimization algorithm. For the FEMNIST dataset, we use a simple CNN, which consists of two convolution layers, each followed by a maximum pooling layer, followed by two fully-connected layers. For the CelebA dataset, we utilize a CNN with four convolution layers, each followed by a BatchNorm and a maximum pooling layer, followed by a single fully-connected layer. For the Sentiment140 dataset, we use a stacked LSTM model with an embedding layer that is initialized with 300-dimensional, pre-trained Global Vectors for Word Representation (GloVe) embeddings, followed by two LSTM cells and two fully-connected layers. For the Reddit dataset, we rely on a stacked LSTM model with an embedding layer that embeds the input words into an 8-dimensional vector space, followed by two LSTM cells with dropout and a single fully-connected layer. All of these models are part of the reference implementation of the LEAF benchmark. The use of pre-trained GloVe embeddings in the stacked LSTM model for the Sentiment140 dataset is an adaptation that we incorporated. The embedding layer of the original reference implementation was randomly initialized and trained on the Sentiment140 dataset using the GloVe vocabulary to embed its input words into a 100-dimensional vector space. Without this adaptation, the model fails to learn anything with the hyperparameters proposed by Caldas et al.: it usually settles at an accuracy of around 50% after the first round of federated training and more or less keeps that accuracy for the entire duration of the training. For a binary classification task, an accuracy of 50% is no better than random chance. Chen et al. also made this adaptation in their LEAF benchmark experiments.

The preprocessed and sub-sampled version of the Reddit dataset used by Caldas et al. was graciously made available for download. All other datasets were preprocessed using the tools provided in the reference implementation of the LEAF benchmark. We used the settings presented in Table 5. The statistics of the resulting datasets can be seen in Table 6.

| Setting | Flag | FEMNIST | CelebA | Sentiment140 |
|---|---|---|---|---|
| Client Sample Distribution | -s | non-i.i.d. | non-i.i.d. | non-i.i.d. |
| Fraction of Data to Sample | --sf | 100% | 100% | 15% |
| Minimum Number of Samples per Client | -k | 0 | 0 | 0 |
| Training/Test Data Split Mode | -t | Sample | Sample | Sample |
| Training Data Fraction | --tf | 90% | 90% | 90% |
| Sampling Seed | --smplseed | 1691607340 | 1691605746 | 1692132357 |
| Split Seed | --spltseed | 1691608842 | 1691605747 | 1692132372 |
Table 5. The settings used to preprocess the FEMNIST, CelebA, and Sentiment140 datasets.
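
For illustration, the FEMNIST column of Table 5 could be turned into a preprocessing call as sketched below. The script name `preprocess.sh`, the double-dash spelling of the long options, and the textual values `niid` and `sample` are assumptions about the LEAF reference implementation and should be checked against its documentation before use.

```python
import shlex

# FEMNIST settings from Table 5. The values "niid" and "sample" and the
# double-dash long options are assumptions about the LEAF preprocessing script.
femnist_settings = [
    ("-s", "niid"),            # client sample distribution: non-i.i.d.
    ("--sf", "1.0"),           # fraction of data to sample: 100%
    ("-k", "0"),               # minimum number of samples per client
    ("-t", "sample"),          # training/test data split mode
    ("--tf", "0.9"),           # training data fraction: 90%
    ("--smplseed", "1691607340"),
    ("--spltseed", "1691608842"),
]

command = ["./preprocess.sh"] + [part for pair in femnist_settings for part in pair]
command_line = shlex.join(command)
print(command_line)
```

Fixing both seeds is what makes the sub-sampling and the training/test split reproducible across runs.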

| Statistic | FEMNIST | CelebA | Sentiment140 | Reddit |
|---|---|---|---|---|
| Number of Clients | 3,597 | 9,343 | 99,149 (of 660,120) | 817 (of 1,660,820) |
| Number of Samples | 817,851 | 200,288 | 240,074 (of 1,600,498) | 55,556 (of 56,587,343) |
| Samples per Client: Minimum | 19 | 5 | 1 | 10 |
| Samples per Client: Maximum | 584 | 35 | 236 | 1,394 |
| Samples per Client: Mean | 227.37 | 21.44 | 2.42 | 68.0 |
| Samples per Client: Standard Deviation | 88.84 | 7.63 | 4.63 | 120.27 |
Table 6. The dataset statistics of the preprocessed FEMNIST, CelebA, Sentiment140, and Reddit datasets.

All models were trained using the default random seeds. Most of the remaining hyperparameters, however, deviate from the hyperparameters suggested by Caldas et al. In particular, the number of clients per communication round was increased to facilitate different queue lengths for FedQ. The hyperparameters used for each dataset are specified in Table 7. For each dataset, three experiments were run: one with FedAvg as a baseline against which FedQ can be compared, one with FedQ and a queue length of 10, and one with FedQ and a queue length of 100. The results of the experiments can be seen in Figure 23.

| Hyperparameter | Flag | FEMNIST | CelebA | Sentiment140 | Reddit |
|---|---|---|---|---|---|
| Communication Rounds | --num-rounds | 400 | 400 | 400 | 100 |
| Clients per Communication Round | --clients-per-round | 1,000 | 1,000 | 1,000 | 500 |
| Learning Rate | -lr | 0.01 | 0.01 | 0.01 | 8.0 |
| Batch Size | --batch-size | 10 | 10 | 10 | 5 |
| Local Epochs | --num-epochs | 5 | 5 | 5 | 1 |
Table 7. The hyperparameters that were used for benchmarking FedQ on LEAF.

Figure 23. FedQ LEAF benchmark experiment results.

In all experiments, FedQ with a queue length of 10 had a higher final accuracy than the other two experiments. In the cases of the Sentiment140 and the Reddit datasets, it even manages to clearly outperform FedQ with a queue length of 100. This is interesting in two ways. First, in the FedQ experiments on our FedRec, there was always a benefit when using a larger queue length, albeit with diminishing returns. The LEAF benchmark experiments show not only that a larger queue length does not always result in a significant increase in performance, but also that it may even make the training unstable and hinder convergence, as is the case for the Sentiment140 and the Reddit datasets. Second, in both cases where a larger queue length causes the training to become unstable, the model is an LSTM. Of course, no trend can be derived from just these experiments, but this interesting behavior could be explored in future work.

The margins with which FedQ outperforms the FedAvg baseline are much smaller than those achieved with the models of our FedRec. Nonetheless, it can be clearly seen that FedQ has a much faster convergence rate. The FedAvg baseline reaches its highest accuracy in all cases at the very end of the training window (communication round 400/400 for FEMNIST, 390/400 for Sentiment140, 400/400 for CelebA, and 96/100 for Reddit). Both FedQ experiments are able to reach or exceed the baseline’s highest accuracy in a much shorter time frame: For FEMNIST, FedQ with a queue length of 10 exceeded FedAvg at communication round 40, while FedQ with a queue length of 100 already outperformed FedAvg at communication round 10. For Sentiment140, FedQ with a queue length of 10 surpasses FedAvg at communication round 180 and FedQ with a queue length of 100 at communication round 50. For CelebA, the communication rounds were 50 and 20, while those for Reddit were 180 and 50, respectively. It should also be noted that, although its training was less stable, FedQ with a queue length of 100 outperformed the other two experiments in terms of highest accuracy reached during training for all datasets except for Reddit. It was also always significantly faster to exceed the highest accuracy of FedAvg than FedQ with a queue length of 10. Table 8 presents the results of our experiments in comparison to the results published by Caldas et al. (2019b). Please be aware that our experiments used different hyperparameters for both the preprocessing of the datasets and the training of the models, so the results are not directly comparable. Particularly notable is the difference in the Sentiment140 dataset, where Caldas et al. report on four experiments with varying minimum numbers of samples per client ranging from 3 to 100. In our experiments, we set the minimum number of samples per client to 0 in the preprocessing of Sentiment140, which means that our experiments had a considerably lower number of samples per client on average. We have still included the results for reference and as another baseline.
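
The convergence comparison above asks, for each method, in which communication round its accuracy first reaches or exceeds the highest accuracy of the FedAvg baseline. A minimal sketch of this bookkeeping is given below. The function name, the `eval_every` parameter, and the accuracy curves are hypothetical and do not reproduce our measurements.

```python
from typing import Optional, Sequence


def first_round_reaching(
    accuracies: Sequence[float], target: float, eval_every: int = 1
) -> Optional[int]:
    """Return the first communication round whose accuracy reaches `target`.

    `accuracies[i]` is assumed to be measured after round (i + 1) * eval_every.
    Returns None if the target is never reached.
    """
    for index, accuracy in enumerate(accuracies):
        if accuracy >= target:
            return (index + 1) * eval_every
    return None


# Hypothetical accuracy curves, evaluated every 10 rounds (not the paper's data).
fedavg_curve = [0.20, 0.40, 0.55, 0.62, 0.66]
fedq_curve = [0.35, 0.60, 0.67, 0.70, 0.71]

baseline_best = max(fedavg_curve)
round_reached = first_round_reaching(fedq_curve, baseline_best, eval_every=10)
print(round_reached)  # 30
```

Comparing against the baseline's best accuracy rather than its final accuracy is the stricter criterion whenever the baseline peaks before the last round, as it does for Sentiment140 and Reddit.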

| Dataset | Method | Result |
|---|---|---|
| FEMNIST | FedAvg (LEAF) | 74.72% |
| | FedAvg (ours) | 82.66% |
| | FedQ (ours), Queue Length 10 | 86.98% |
| | FedQ (ours), Queue Length 100 | 86.65% |
| CelebA | FedAvg (LEAF) | 89.46% |
| | FedAvg (ours) | 91.24% |
| | FedQ (ours), Queue Length 10 | 91.98% |
| | FedQ (ours), Queue Length 100 | 91.58% |
| Sentiment140 | FedAvg (LEAF), ≥ 3 Samples per Client | ~50%* |
| | FedAvg (LEAF), ≥ 10 Samples per Client | ~50%* |
| | FedAvg (LEAF), ≥ 30 Samples per Client | ~60%* |
| | FedAvg (LEAF), ≥ 100 Samples per Client | ~69%* |
| | FedAvg (ours) | 69.91% |
| | FedQ (ours), Queue Length 10 | 71.18% |
| | FedQ (ours), Queue Length 100 | 69.32% |
| Reddit | FedAvg (LEAF) | 13.35% |
| | FedAvg (ours) | 13.23% |
| | FedQ (ours), Queue Length 10 | 14.60% |
| | FedQ (ours), Queue Length 100 | 13.04% |
Table 8. Comparison of the FedQ LEAF benchmark results against the results published by Caldas et al. (2019b). *Please note that Caldas et al. do not publish final accuracies for the Sentiment140 dataset. The accuracies shown in the table were read from the graph in Figure 3 (Caldas et al., 2019b) and are only approximations.

In conclusion, we think these experiments demonstrate that FedQ is capable of outperforming FedAvg on a wide variety of data modalities and training tasks. Although the margin with which FedQ outperforms the baseline varies with dataset and model architecture, its ability to drastically improve convergence speed makes it particularly efficacious.

Appendix F FedQ and Other Client Chaining Techniques

This appendix describes further FL client chaining techniques that were developed in parallel to FedQ, and points out the similarities and differences between them and our method. Kamp et al. (2023), for example, aim to improve FL in scenarios where each client only has a small local dataset. They propose a technique called federated daisy-chaining (FedDC), where the central server, instead of aggregating the updated local models of the clients into a new global model, sends each updated local model $M_i$ to a randomly selected client $c_j$, where $i \neq j$. After a few rounds of this daisy-chaining, the resulting models are aggregated analogously to FedAvg. Hu et al. (2022) tackle the problem of non-i.i.d. data in FL and propose a technique called federated learning via device concatenation (FedCat) that is essentially equivalent to FedDC. The main difference between FedDC and FedCat is that in FedCat each model is trained by each client before the models are aggregated to form a new global model, and only the order of the client updates differs, while in FedDC, depending on the daisy-chaining period, each model is only trained on a random subset of all clients. Zaccone et al. (2022) also try to alleviate the problem of heterogeneous client datasets by proposing a technique called federated learning via sequential superclients training (FedSeq). They perform a pre-training phase, after which they use the resulting model to estimate the data generating distribution of each client. Using the estimated distributions, they generate groups of clients with different local distributions, which they denote as superclients. During FL, the clients within each superclient are trained sequentially, where the first client receives the global model and each subsequent client receives the model of the previous client. The resulting local models of the superclients are then aggregated as in FedAvg.
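
The redistribution step of daisy-chaining described above, in which model $M_i$ is passed to a client $c_j$ with $i \neq j$, can be sketched as follows. This is not the implementation of Kamp et al.: the function name is hypothetical, and the sketch additionally assumes that every client receives exactly one model, i.e., that the assignment is a random permutation without fixed points.

```python
import random


def daisy_chain_assignment(num_clients: int, rng: random.Random) -> list[int]:
    """Return `targets` such that model i is sent to client targets[i] != i.

    Assumption: each client receives exactly one model, so `targets` is a
    permutation without fixed points, drawn by simple rejection sampling.
    """
    if num_clients < 2:
        raise ValueError("daisy-chaining needs at least two clients")
    targets = list(range(num_clients))  # identity, rejected by the first check
    while any(index == target for index, target in enumerate(targets)):
        rng.shuffle(targets)
    return targets


targets = daisy_chain_assignment(5, random.Random(0))
print(targets)
```

Rejection sampling is sufficient here because a uniformly random permutation has no fixed point with a probability of roughly 1/e, so only a few shuffles are needed on average.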

All of the proposed techniques have similar goals and try to achieve them by chaining the local training of multiple clients, but each technique follows a slightly different training protocol. Both FedDC and FedCat train as many different models as there are clients in each communication round. FedSeq and our method FedQ, however, only train $\frac{\#\text{clients}}{\#\text{clients per superclient/queue}}$ models per communication round. In FedDC, FedCat, and FedQ the clients for the sequential training are selected randomly, while in FedSeq they are purposely selected in order to group together clients that have different data generating distributions.
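
The following toy sketch illustrates the general idea of chaining randomly selected clients into queues of a fixed length and shows how the number of models per communication round follows from the queue length. It is a generic illustration under simplifying assumptions and not our FedQ implementation: the model is a single scalar, `local_update` is a hypothetical stand-in for local training, and the queue models are averaged without weighting by dataset size.

```python
import random


def local_update(model: float, client_data: list[float], lr: float = 0.5) -> float:
    """Hypothetical stand-in for local training: move towards the local data mean."""
    target = sum(client_data) / len(client_data)
    return model + lr * (target - model)


def chained_round(global_model, clients, queue_length, rng):
    """One communication round in which clients train sequentially within queues."""
    order = list(range(len(clients)))
    rng.shuffle(order)  # clients are assigned to queues at random
    queues = [order[i:i + queue_length] for i in range(0, len(order), queue_length)]
    queue_models = []
    for queue in queues:
        model = global_model  # the first client of a queue starts from the global model
        for client_index in queue:
            model = local_update(model, clients[client_index])
        queue_models.append(model)
    # Unweighted average of the queue models (a simplification).
    return sum(queue_models) / len(queue_models), len(queue_models)


clients = [[float(i)] for i in range(1000)]
new_model, num_models = chained_round(0.0, clients, queue_length=10, rng=random.Random(0))
print(num_models)  # 100 models for 1,000 clients and a queue length of 10
```

With a queue length of 1, every client trains its own copy of the global model and the round reduces to plain unweighted model averaging, which is why the queue length can be viewed as interpolating between parallel and fully sequential training.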