- Title
- Workload Interference Analysis and Mitigation on Dragonfly Class Networks
- Creator
- Kang, Yao
- Date
- 2022
- Description
-
Dragonfly-class networks are promising interconnect topologies that support current and next-generation high-performance computing (HPC) systems. Serving as the "central nervous system", Dragonfly tightly couples tens of thousands of compute nodes by providing high-bandwidth, low-latency data exchange for exascale computing capability. Dragonfly can support unprecedented system scale at a reasonable cost thanks to its hierarchical architecture. In Dragonfly systems, network resources such as routers and links are arranged into identical groups. Groups are all-to-all connected through global links, and routers within groups are connected via local links. In contrast to the fully connected inter-group topology, connections among the routers within a group are designed according to system requirements. For example, a one-dimensional all-to-all connection is favored for higher network bandwidth, a two-dimensional grid arrangement can be constructed to support larger system sizes, and a tree-structured router connection is built for extreme system scale. The hierarchical design with groups enables the topology to support unprecedented system size while maintaining a low-diameter network. Packets can be delivered minimally by traversing the network hierarchy between groups through global links and reaching their destinations through local links. In case of network congestion, packets can be forwarded non-minimally through any intermediate group to increase system throughput. As a result, all network resources are shared, and links and routers are not dedicated to any node pair. While link utilization is increased, shared network resources lead to inevitable network contention among different traffic flows, especially in systems that run multiple workloads at the same time. This network contention is observed as workload interference, which degrades system performance and lengthens workload execution time.
In this thesis, we first model and analyze the workload interference effect on the Dragonfly+ topology through extensive system simulation. Based on the comprehensive interference study, we propose Q-adaptive routing, a multi-agent reinforcement-learning-based solution for Dragonfly systems. Compared with existing routing solutions, the proposed Q-adaptive routing can learn to forward packets more efficiently, with lower packet latency and higher system throughput. Next, we demonstrate that intelligent routing algorithms such as Q-adaptive routing can greatly mitigate workload interference and optimize overall system performance. Subsequently, we propose a dynamic job placement strategy for workload interference prevention. When combined with Q-adaptive routing, dynamic job placement gives users the flexibility to either reduce workload interference from communication-intensive applications or protect target applications for higher performance stability.