Distributed, Parallel, and Cluster Computing
See recent articles
Showing new listings for Friday, 1 November 2024
- [1] arXiv:2410.23477 [pdf, html, other]
-
Title: A Committee Based Optimal Asynchronous Byzantine Agreement Protocol W.P. 1Comments: arXiv admin note: text overlap with arXiv:2406.03739Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Multi-valued Byzantine agreement (MVBA) protocols are essential for atomic broadcast and fault-tolerant state machine replication in asynchronous networks. Despite advances, challenges persist in optimizing these protocols for communication and computation efficiency. This paper presents a committee-based MVBA protocol (cMVBA), a novel approach that achieves agreement without extra communication rounds by analyzing message patterns in asynchronous networks with probability 1.
- [2] arXiv:2410.23497 [pdf, html, other]
-
Title: To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/OSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Modern scientific simulations generate massive volumes of data, creating significant challenges for I/O and storage systems. Error-bounded lossy compression (EBLC) offers a solution by reducing dataset sizes while preserving data quality within user-specified limits. This study provides the first comprehensive energy characterization of state-of-the-art EBLC algorithms across various scientific datasets, CPU architectures, and operational modes. We analyze the energy consumption patterns of compression and decompression operations, as well as the energy trade-offs in data I/O scenarios. Our findings demonstrate that EBLC can significantly reduce I/O energy consumption, with savings of up to two orders of magnitude compared to uncompressed I/O for large datasets. In multi-node HPC environments, we observe energy reductions of approximately 25% when using EBLC. We also show that EBLC can achieve compression ratios of 10-100x, potentially reducing storage device requirements by nearly two orders of magnitude.
Our work demonstrates the relationships between compression ratios, energy efficiency, and data quality, highlighting the importance of considering compressors and error bounds for specific use cases. Based on our results, we estimate that large-scale HPC facilities could save nearly two orders of magnitude the energy on data writing and significantly reduce storage requirements by integrating EBLC into their I/O subsystems. This work provides a framework for system operators and computational scientists to make informed decisions about implementing EBLC for energy-efficient data management in HPC environments. - [3] arXiv:2410.23881 [pdf, html, other]
-
Title: DynaSplit: A Hardware-Software Co-Design Framework for Energy-Aware Inference on EdgeSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
The deployment of ML models on edge devices is challenged by limited computational resources and energy availability. While split computing enables the decomposition of large neural networks (NNs) and allows partial computation on both edge and cloud devices, identifying the most suitable split layer and hardware configurations is a non-trivial task. This process is in fact hindered by the large configuration space, the non-linear dependencies between software and hardware parameters, the heterogeneous hardware and energy characteristics, and the dynamic workload conditions. To overcome this challenge, we propose DynaSplit, a two-phase framework that dynamically configures parameters across both software (i.e., split layer) and hardware (e.g., accelerator usage, CPU frequency). During the Offline Phase, we solve a multi-objective optimization problem with a meta-heuristic approach to discover optimal settings. During the Online Phase, a scheduling algorithm identifies the most suitable settings for an incoming inference request and configures the system accordingly. We evaluate DynaSplit using popular pre-trained NNs on a real-world testbed. Experimental results show a reduction in energy consumption up to 72% compared to cloud-only computation, while meeting ~90% of user request's latency threshold compared to baselines.
New submissions (showing 3 of 3 entries)
- [4] arXiv:2410.23317 (cross-list from cs.CV) [pdf, html, other]
-
Title: VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference AccelerationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in prefill and decoding phases. Based on these observations, we introduce a layer-adaptive sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate the token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of KV cache achieves accuracy comparable to that with full cache. In a speed benchmark, our method accelerates end-to-end latency of generating 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the memory footprint of KV cache in GPU by 90%.
- [5] arXiv:2410.23546 (cross-list from cs.CR) [pdf, html, other]
-
Title: EVeCA: Efficient and Verifiable On-Chain Data Query Framework Using Challenge-Based AuthenticationSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
As blockchain applications become increasingly widespread, there is a rising demand for on-chain data queries. However, existing schemes for on-chain data queries face a challenge between verifiability and efficiency. Queries on blockchain databases can compromise the authenticity of the query results, while schemes that utilize on-chain Authenticated Data Structure (ADS) have lower efficiency. To overcome this limitation, we propose an efficient and verifiable on-chain data query framework EVeCA. In our approach, we free the full nodes from the task of ADS maintenance by delegating it to a limited number of nodes, and full nodes verify the correctness of ADS by using challenge-based authentication scheme instead of reconstructing them, which prevents the service providers from maintaining incorrect ADS with overwhelming probability. By carefully designing the ADS verification scheme, EVeCA achieves higher efficiency while remaining resilient against adaptive attacks. Our framework effectively eliminates the need for on-chain ADS maintenance, and allows full nodes to participate in ADS maintenance in a cost-effective way. We demonstrate the effectiveness of the proposed scheme through security analysis and experimental evaluation. Compared to existing schemes, our approach improves ADS maintenance efficiency by about 20*.
- [6] arXiv:2410.23661 (cross-list from cs.OS) [pdf, html, other]
-
Title: Microsecond-scale Dynamic Validation of Idempotency for GPU KernelsSubjects: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC)
We discovered that a GPU kernel can have both idempotent and non-idempotent instances depending on the input. These kernels, called conditionally-idempotent, are prevalent in real-world GPU applications (490 out of 547 from six applications). Consequently, prior work that classifies GPU kernels as either idempotent or non-idempotent can severely compromise the correctness or efficiency of idempotence-based systems. This paper presents PICKER, the first system for instance-level idempotency validation. PICKER dynamically validates the idempotency of GPU kernel instances before their execution, by utilizing their launch arguments. Several optimizations are proposed to significantly reduce validation latency to microsecond-scale. Evaluations using representative GPU applications (547 kernels and 18,217 instances in total) show that PICKER can identify idempotent instances with no false positives and a false-negative rate of 18.54%, and can complete the validation within 5 us for all instances. Furthermore, by integrating PICKER, a fault-tolerant system can reduce the checkpoint cost to less than 4% and a scheduling system can reduce the preemption latency by 84.2%.
- [7] arXiv:2410.23745 (cross-list from cs.LG) [pdf, html, other]
-
Title: Syno: Structured Synthesis for Neural OperatorsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Programming Languages (cs.PL)
The desires for better prediction accuracy and higher execution performance in neural networks never end. Neural architecture search (NAS) and tensor compilers are two popular techniques to optimize these two goals, but they are both limited to composing or optimizing existing manually designed operators rather than coming up with completely new designs. In this work, we explore the less studied direction of neural operator synthesis, which aims to automatically and efficiently discover novel neural operators with better accuracy and/or speed. We develop an end-to-end framework Syno, to realize practical neural operator synthesis. Syno makes use of a novel set of fine-grained primitives defined on tensor dimensions, which ensure various desired properties to ease model training, and also enable expression canonicalization techniques to avoid redundant candidates during search. Syno further adopts a novel guided synthesis flow to obtain valid operators matched with the specified input/output dimension sizes, and leverages efficient stochastic tree search algorithms to quickly explore the design space. We demonstrate that Syno discovers better operators with an average of $2.06\times$ speedup and less than $1\%$ accuracy loss, even on NAS-optimized models.
- [8] arXiv:2410.23761 (cross-list from cs.PL) [pdf, other]
-
Title: Abstract Continuation Semantics for Multiparty Interactions in Process Calculi based on CCSEneia Nicolae Todoran (Dept. of Computer Science, Technical University of Cluj-Napoca), Gabriel Ciobanu (Academia Europaea)Comments: In Proceedings FROM 2024, arXiv:2410.23020Journal-ref: EPTCS 410, 2024, pp. 18-37Subjects: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
We develop denotational and operational semantics designed with continuations for process calculi based on CCS extended with mechanisms offering support for multiparty interactions. We investigate the abstractness of this continuation semantics. We show that our continuation-based denotational models are weakly abstract with respect to the corresponding operational models.
- [9] arXiv:2410.23857 (cross-list from quant-ph) [pdf, html, other]
-
Title: ECDQC: Efficient Compilation for Distributed Quantum Computing with Linear LayoutKecheng Liu, Yidong Zhou, Haochen Luo, Lingjun Xiong, Yuchen Zhu, Eilis Casey, Jinglei Cheng, Samuel Yen-Chi Chen, Zhiding LiangSubjects: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC)
In this paper, we propose an efficient compilation method for distributed quantum computing (DQC) using the Linear Nearest Neighbor (LNN) architecture. By exploiting the LNN topology's symmetry, we optimize quantum circuit compilation for High Local Connectivity, Sparse Full Connectivity (HLC-SFC) algorithms like Quantum Approximate Optimization Algorithm (QAOA) and Quantum Fourier Transform (QFT). We also utilize dangling qubits to minimize non-local interactions and reduce SWAP gates. Our approach significantly decreases compilation time, gate count, and circuit depth, improving scalability and robustness for large-scale quantum computations.
- [10] arXiv:2410.24174 (cross-list from cs.CE) [pdf, other]
-
Title: Novel Architecture for Distributed Travel Data Integration and Service Provision Using MicroservicesComments: 20 pages, 12 figuresSubjects: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
This paper introduces a microservices architecture for the purpose of enhancing the flexibility and performance of an airline reservation system. The architectural design incorporates Redis cache technologies, two different messaging systems (Kafka and RabbitMQ), two types of storages (MongoDB, and PostgreSQL). It also introduces authorization techniques, including secure communication through OAuth2 and JWT which is essential with the management of high-demand travel services. According to selected indicators, the architecture provides an impressive level of data consistency at 99.5% and a latency of data propagation of less than 75 ms allowing rapid and reliable intercommunication between microservices. A system throughput of 1050 events per second was achieved so that the acceptability level was maintained even during peak time. Redis caching reduced a 92% cache hit ratio on the database thereby lowering the burden on the database and increasing the speed of response. Further improvement of the systems scalability was done through the use of Docker and Kubernetes which enabled services to be expanded horizontally to cope with the changes in demand. The error rates were very low, at 0.2% further enhancing the efficiency of the system in handling real-time data integration. This approach is suggested to meet the specific needs of the airline reservation system. It is secure, fast, scalable, all serving to improve the user experience as well as the efficiency of operations. The low latency and high data integration levels and prevaiing efficient usage of the resources demonstrates the architecture ability to offer continued support in the ever growing high demand situations.
Cross submissions (showing 7 of 7 entries)
- [11] arXiv:2401.05868 (replaced) [pdf, other]
-
Title: Efficient N-to-M Checkpointing Algorithm for Finite Element SimulationsComments: author accepted manuscriptSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS)
In this work, we introduce a new algorithm for N-to-M checkpointing in finite element simulations. This new algorithm allows efficient saving/loading of functions representing physical quantities associated with the mesh representing the physical domain. Specifically, the algorithm allows for using different numbers of parallel processes for saving and loading, allowing for restarting and post-processing on the process count appropriate to the given phase of the simulation and other conditions. For demonstration, we implemented this algorithm in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface into Firedrake, a system for solving partial differential equations using finite element methods. We evaluated our new implementation by saving and loading data involving 8.2 billion finite element degrees of freedom using 8,192 parallel processes on ARCHER2, the UK National Supercomputing Service.
- [12] arXiv:2403.10332 (replaced) [pdf, other]
-
Title: GreedyML: A Parallel Algorithm for Maximizing Submodular FunctionsComments: We have learned that we incorrectly applied Lemma 4.1 in Lemma 4.2. Thus, the proof of Lemma 4.2 is incorrect, and the overall approximation guarantee in Theorem 4.4 we claimed is not rightSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
We describe a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed memory multiprocessors. Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical applications in areas such as data summarization, machine learning, and graph sparsification. Our work builds on the randomized distributed RandGreedI algorithm, proposed by Barbosa, Ene, Nguyen, and Ward (2015). This algorithm computes a distributed solution by randomly partitioning the data among all the processors and then employing a single accumulation step in which all processors send their partial solutions to one processor. However, for large problems, the accumulation step could exceed the memory available on a processor, and the processor which performs the accumulation could become a computational bottleneck.
Here, we propose a generalization of the RandGreedI algorithm that employs multiple accumulation steps to reduce the memory required. We analyze the approximation ratio and the time complexity of the algorithm (in the BSP model). We also evaluate the new GreedyML algorithm on three classes of problems, and report results from massive data sets with millions of elements. The results show that the GreedyML algorithm can solve problems where the sequential Greedy and distributed RandGreedI algorithms fail due to memory constraints. For certain computationally intensive problems, the GreedyML algorithm can be faster than the RandGreedI algorithm. The observed approximation quality of the solutions computed by the GreedyML algorithm closely matches those obtained by the RandGreedI algorithm on these problems. - [13] arXiv:2408.10830 (replaced) [pdf, html, other]
-
Title: Single Bridge Formation in Self-Organizing Particle SystemsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Local interactions of uncoordinated individuals produce the collective behaviors of many biological systems, inspiring much of the current research in programmable matter. A striking example is the spontaneous assembly of fire ants into "bridges" comprising their own bodies to traverse obstacles and reach sources of food. Experiments and simulations suggest that, remarkably, these ants always form one bridge -- instead of multiple, competing bridges -- despite a lack of central coordination. We argue that the reliable formation of a single bridge does not require sophistication on behalf of the individuals by provably reproducing this behavior in a self-organizing particle system. We show that the formation of a single bridge by the particles is a statistical inevitability of their preferences to move in a particular direction, such as toward a food source, and their preference for more neighbors. Two parameters, $\eta$ and $\beta$, reflect the strengths of these preferences and determine the Gibbs stationary measure of the corresponding particle system's Markov chain dynamics. We show that a single bridge almost certainly forms when $\eta$ and $\beta$ are sufficiently large. Our proof introduces an auxiliary Markov chain, called an "occupancy chain," that captures only the significant, global changes to the system. Through the occupancy chain, we abstract away information about the motion of individual particles, but we gain a more direct means of analyzing their collective behavior. Such abstractions provide a promising new direction for understanding many other systems of programmable matter.
- [14] arXiv:2409.17446 (replaced) [pdf, html, other]
-
Title: Efficient Federated Learning against Heterogeneous and Non-stationary Client UnavailabilityComments: NeurIPS 2024Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Optimization and Control (math.OC)
Addressing intermittent client availability is critical for the real-world deployment of federated learning algorithms. Most prior work either overlooks the potential non-stationarity in the dynamics of client unavailability or requires substantial memory/computation overhead. We study federated learning in the presence of heterogeneous and non-stationary client availability, which may occur when the deployment environments are uncertain, or the clients are mobile. The impacts of heterogeneity and non-stationarity on client unavailability can be significant, as we illustrate using FedAvg, the most widely adopted federated learning algorithm. We propose FedAPM, which includes novel algorithmic structures that (i) compensate for missed computations due to unavailability with only $O(1)$ additional memory and computation with respect to standard FedAvg, and (ii) evenly diffuse local updates within the federated learning system through implicit gossiping, despite being agnostic to non-stationary dynamics. We show that FedAPM converges to a stationary point of even non-convex objectives while achieving the desired linear speedup property. We corroborate our analysis with numerical experiments over diversified client unavailability dynamics on real-world data sets.
- [15] arXiv:2410.14348 (replaced) [pdf, html, other]
-
Title: TF-DDRL: A Transformer-enhanced Distributed DRL Technique for Scheduling IoT Applications in Edge and Cloud Computing EnvironmentsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
With the continuous increase of IoT applications, their effective scheduling in edge and cloud computing has become a critical challenge. The inherent dynamism and stochastic characteristics of edge and cloud computing, along with IoT applications, necessitate solutions that are highly adaptive. Currently, several centralized Deep Reinforcement Learning (DRL) techniques are adapted to address the scheduling problem. However, they require a large amount of experience and training time to reach a suitable solution. Moreover, many IoT applications contain multiple interdependent tasks, imposing additional constraints on the scheduling problem. To overcome these challenges, we propose a Transformer-enhanced Distributed DRL scheduling technique, called TF-DDRL, to adaptively schedule heterogeneous IoT applications. This technique follows the Actor-Critic architecture, scales efficiently to multiple distributed servers, and employs an off-policy correction method to stabilize the training process. In addition, Prioritized Experience Replay (PER) and Transformer techniques are introduced to reduce exploration costs and capture long-term dependencies for faster convergence. Extensive results of practical experiments show that TF-DDRL, compared to its counterparts, significantly reduces response time, energy consumption, monetary cost, and weighted cost by up to 60%, 51%, 56%, and 58%, respectively.
- [16] arXiv:2410.18772 (replaced) [pdf, other]
-
Title: Search for shortest paths based on a projective description of unweighted graphsComments: 12 pages, 1 figures, 5 tablesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The search is based on the preliminary transformation of matrices or adjacency lists traditionally used in the study of graphs into projections cleared of redundant information (refined) followed by the selection of the desired shortest paths. Each projection contains complete information about all the shortest paths from its base (angle vertex) and is based on an enumeration of reachability relations, more complex than the traditionally used binary adjacency relations. The class of graphs considered was expanded to mixed graphs containing both undirected and oriented edges (arcs). A method for representing graph projections in computer memory and finding shortest paths using them is proposed. The reduction in algorithmic complexity achieved, at the same time, will allow the proposed method to be used in information network applications, scientific and technical, transport and logistics, and economic fields.
- [17] arXiv:2403.03333 (replaced) [pdf, html, other]
-
Title: Federated Learning over Connected ModesSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Statistical heterogeneity in federated learning poses two major challenges: slow global training due to conflicting gradient signals, and the need of personalization for local distributions. In this work, we tackle both challenges by leveraging recent advances in \emph{linear mode connectivity} -- identifying a linearly connected low-loss region in the parameter space of neural networks, which we call solution simplex. We propose federated learning over connected modes (\textsc{Floco}), where clients are assigned local subregions in this simplex based on their gradient signals, and together learn the shared global solution simplex. This allows personalization of the client models to fit their local distributions within the degrees of freedom in the solution simplex and homogenizes the update signals for the global simplex training. Our experiments show that \textsc{Floco} accelerates the global training process, and significantly improves the local accuracy with minimal computational overhead in cross-silo federated learning settings.
- [18] arXiv:2410.22080 (replaced) [pdf, html, other]
-
Title: A New Broadcast Primitive for BFT ProtocolsManu Drijvers, Tim Gretler, Yotam Harchol, Tobias Klenze, Ognjen Maric, Stefan Neamtu, Yvonne-Anne Pignolet, Rostislav Rumenov, Daniel Sharifi, Victor ShoupComments: 14 pages, 8 figures,Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
Byzantine fault tolerant (BFT) protocol descriptions often assume application-layer networking primitives, such as best-effort and reliable broadcast, which are impossible to implement in practice in a Byzantine environment as they require either unbounded buffering of messages or giving up liveness, under certain circumstances. However, many of these protocols do not (or can be modified to not) need such strong networking primitives. In this paper, we define a new, slightly weaker networking primitive that we call abortable broadcast. We describe an implementation of this new primitive and show that it (1) still provides strong delivery guarantees, even in the case of network congestion, link or peer failure, and backpressure, (2) preserves bandwidth, and (3) enforces all data structures to be bounded even in the presence of malicious peers. The latter prevents out-of-memory DoS attacks by malicious peers, an issue often overlooked in the literature. The new primitive and its implementation are not just theoretical. We use them to implement the BFT protocols in the IC (Internet Computer), a publicly available blockchain network that enables replicated execution of general-purpose computation, serving hundreds of thousands of applications and their users.