Search results
(1 - 2 of 2)
- Title
- Workload Interference Analysis and Mitigation on Dragonfly Class Networks
- Creator
- Kang, Yao
- Date
- 2022
- Description
-
Dragonfly-class networks are promising interconnect topologies that support current and next-generation high-performance computing (HPC) systems. Serving as the "central nervous system", Dragonfly tightly couples tens of thousands of compute nodes by providing high-bandwidth, low-latency data exchange for exascale computing capability. Dragonfly can support unprecedented system scale at a reasonable cost thanks to its hierarchical architecture. In Dragonfly systems, network resources such as routers and links are arranged into identical groups. Groups are all-to-all connected through global links, and routers within groups are connected via local links. In contrast to the fully connected inter-group topology, connections between the routers within a group are designed according to the system requirements. For example, a one-dimensional all-to-all connection is favored for higher network bandwidth, a two-dimensional grid arrangement can be constructed to support larger system sizes, and a tree-structured router connection is built for extreme system scale. The hierarchical design with groups enables the topology to support unprecedented system size while maintaining a low-diameter network. Packets can be delivered minimally by traversing the network hierarchy between groups through global links and reaching their destinations through local links. In case of network congestion, packets can be forwarded non-minimally through any intermediate group to increase system throughput. As a result, all network resources are shared, such that links and routers are not dedicated to any node pair. While link utilization is increased, shared network resources lead to inevitable network contention among different traffic flows, especially on systems that host multiple workloads at the same time. This network contention is observed as workload interference, which degrades system performance and delays workload execution.
In this thesis, we first model and analyze the workload interference effect on the Dragonfly+ topology through extensive system simulation. Based on the comprehensive interference study, we propose Q-adaptive routing, a multi-agent reinforcement-learning-based solution for Dragonfly systems. Compared with existing routing solutions, the proposed Q-adaptive routing can learn to forward packets more efficiently, with smaller packet latency and higher system throughput. Next, we demonstrate that intelligent routing algorithms such as Q-adaptive routing can greatly mitigate workload interference and optimize overall system performance. Subsequently, we propose a dynamic job placement strategy for workload interference prevention. When combined with Q-adaptive routing, dynamic job placement gives users the flexibility to either reduce workload interference from communication-intensive applications or protect target applications for higher performance stability.
- Title
- Heterogeneous Workloads Study towards Large-scale Interconnect Network Simulation
- Creator
- Wang, Xin
- Date
- 2023
- Description
-
High-bandwidth, low-latency interconnect networks play a key role in the design of modern high-performance computing (HPC) systems. The ever-increasing need for higher bandwidth and higher message rates has driven the design of low-diameter interconnect topologies such as variants of dragonfly. As these hierarchical networks become increasingly dominant, interference caused by resource sharing can lead to significant network congestion and performance variability. Meanwhile, with the rapid growth of machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. However, little work has been conducted to understand the performance implications of co-running heterogeneous workloads on large-scale dragonfly systems. There is a growing need to study how different interconnect technologies affect workload performance, and how conventional scientific applications interact with emerging big data applications at the underlying interconnect level. In this work, we first present a comparative analysis exploring communication interference for traditional HPC applications by analyzing the trade-off between localizing communication and balancing network traffic. We conduct trace-based simulations for applications with different communication patterns, using multiple job placement policies and routing mechanisms. Then we develop a scalable workload manager that provides an automatic framework to facilitate hybrid workload simulation. We investigate various hybrid workloads and explore various application-system configurations for a deeper understanding of the performance implications of a diverse mix of workloads on current and future supercomputers. Finally, we propose a scalable framework, Union+, that enables simultaneous simulation of communication and I/O.
By combining different levels of abstraction, Union+ is able to efficiently co-model the communication and I/O traffic on HPC systems equipped with flash-based storage. We conduct experiments with different system configurations, showing how Union+ can help system designers assess the usefulness of future technologies in next-generation HPC machines.