- Title
- Efficient and Practical Cluster Scheduling for High Performance Computing
- Creator
- Li, Boyang
- Date
- 2023
- Description
-
Cluster scheduling plays a crucial role in the high-performance computing (HPC) area. It is responsible for allocating resources and determining the order in which jobs are executed. Existing HPC job schedulers typically leverage simple heuristics to schedule jobs, but such scheduling policies struggle to keep pace with modern changes and technology trends. This dissertation is motivated by two new trends in the HPC community: the rapid growth of heterogeneous system infrastructure and the emergence of artificial intelligence (AI) technologies. First, existing scheduling policies are solely CPU-centric. In contrast, systems are becoming more complex and heterogeneous, and emerging workloads have diverse resource requirements, such as CPU, burst buffer, power, network bandwidth, and so on. Second, previous heuristic scheduling approaches are manually designed. Such a manual design process prevents adaptive and informative scheduling decisions. A recent trend in HPC is to intertwine AI to better leverage the investment of supercomputers. This embrace of AI provides opportunities to design more intelligent scheduling methods. In this dissertation, we propose an efficient and practical cluster scheduling framework for HPC systems. Our framework leverages AI technologies and considers system heterogeneity. The framework comprises four major components. First, shared network systems such as dragonfly-based systems are vulnerable to performance variability due to network sharing. To mitigate workload interference on these shared network systems, we explore a dedicated scheduling policy. Next, emerging workloads in HPC have diverse resource requirements instead of being CPU-centric. To cater to this, we design an intelligent scheduling agent for multi-resource scheduling in HPC leveraging an advanced multi-objective reinforcement learning (MORL) algorithm.
Subsequently, we address the issues with existing state encoding approaches in RL-driven scheduling, which either lack critical scheduling information or suffer from poor scalability. To this end, we present an efficient and scalable encoding model. Lastly, the lack of interpretability of RL methods poses a significant challenge to deploying RL-driven scheduling in production systems. In response, we provide a simple, deterministic, and easily understandable model for interpreting RL-driven scheduling. The proposed models and algorithms are evaluated with real job traces from production supercomputers. Experimental results show our schemes can effectively improve job scheduling in terms of both user satisfaction and system utilization.