The job scheduler is a crucial component of high-performance computing (HPC) systems. It orders and allocates jobs according to site policies and resource availability, and it plays an important role in the efficient use of system resources and in user satisfaction. Existing HPC job schedulers typically rely on simple heuristics to schedule jobs. However, the rapid growth of system infrastructure and the introduction of diverse workloads pose serious challenges to these traditional heuristic approaches. First, current approaches concentrate on CPU footprint and ignore the performance of other resources. Second, scheduling policies are manually designed and consider only isolated pieces of job information, such as job size and runtime estimate. This manual design process prevents schedulers from making well-informed decisions by exploiting the abundant environment information (i.e., system and queue state). Moreover, such policies can hardly adapt to workload changes, leading to degraded scheduling performance. These challenges call for a new job scheduling framework that can extract useful information from diverse workloads and an increasingly complicated system environment, and ultimately make well-informed scheduling decisions in real time.

In this work, we propose an intelligent HPC job scheduling framework to address these emerging challenges. Our research takes advantage of advanced machine learning and optimization methods to extract useful workload- and system-specific information and to train the framework to make efficient scheduling decisions under various system configurations and diverse workloads. The framework comprises four major efforts. First, we focus on providing more accurate job runtime estimates. The estimated job runtime is one of the most important factors affecting scheduling decisions. However, user-provided runtime estimates are highly inaccurate, and existing solutions are prone to underestimation, which causes jobs to be killed. We leverage and enhance a machine learning method, the Tobit model, to improve the accuracy of job runtime estimates while reducing the underestimation rate. More importantly, using the improved runtime estimates produced by our method, TRIP, boosts scheduling performance by up to 45%.

Second, we conduct research on multi-resource scheduling. HPC systems have undergone significant changes in recent years. New hardware devices, such as GPUs and burst buffers, have been integrated into production HPC systems, significantly expanding the set of schedulable resources. Unfortunately, current production schedulers allocate jobs solely based on CPU footprint, which severely hurts system performance. In our work, we propose a framework that takes all schedulable resources into consideration by transforming the problem into a multi-objective optimization (MOO) problem and solving it rapidly via a genetic algorithm.
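To make the first effort concrete, the sketch below illustrates the censored-regression idea behind a Tobit-style runtime predictor: recorded runtimes are right-censored at the requested walltime, because a job killed at its limit reveals only a lower bound on its true runtime. This is a minimal illustrative sketch, not TRIP itself; the features, synthetic data, and shared walltime limit are hypothetical.

```python
# Minimal Tobit (censored regression) sketch for job runtime prediction.
# The Tobit likelihood handles right-censored runtimes, unlike ordinary
# least squares, which treats a killed job's limit as its true runtime.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_nll(params, X, y, censored):
    """Negative log-likelihood under right-censoring at the walltime limit."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)            # keep sigma positive
    z = (y - X @ beta) / sigma
    ll = np.where(
        censored,
        norm.logsf(z),                   # P(runtime > limit): job hit its limit
        norm.logpdf(z) - log_sigma,      # Gaussian density for observed runtimes
    )
    return -ll.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # hypothetical job features
true_runtime = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 500)
limit = np.quantile(true_runtime, 0.8)   # one shared walltime limit, for simplicity
censored = true_runtime > limit
y = np.minimum(true_runtime, limit)      # what the scheduler log actually records

res = minimize(tobit_nll, x0=np.zeros(X.shape[1] + 1),
               args=(X, y, censored), method="BFGS")
print("estimated coefficients:", res.x[:-1])
```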
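Similarly, the second effort can be sketched as a small genetic algorithm that evolves a set of jobs to start, judged on two objectives (CPU and burst-buffer utilization) under capacity constraints. The job data, window size, and GA parameters below are illustrative assumptions, and the Pareto-based survival step is a simplification of the actual MOO solver.

```python
# Toy GA for multi-resource job selection as multi-objective optimization.
# A candidate is a bit vector over a window of queued jobs; both resource
# utilizations are maximized subject to capacity.
import random

random.seed(1)
CPU_CAP, BB_CAP = 1000, 500                  # free CPU nodes / burst-buffer GB
WINDOW = 12
# (cpu_demand, burst_buffer_demand) per queued job, hypothetical values
jobs = [(random.randint(50, 400), random.randint(10, 200)) for _ in range(WINDOW)]

def objectives(bits):
    cpu = sum(j[0] for j, b in zip(jobs, bits) if b)
    bb = sum(j[1] for j, b in zip(jobs, bits) if b)
    if cpu > CPU_CAP or bb > BB_CAP:         # infeasible: worst fitness
        return (0.0, 0.0)
    return (cpu / CPU_CAP, bb / BB_CAP)      # maximize both utilizations

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and a != b

def evolve(pop_size=40, generations=60):
    pop = [[random.randint(0, 1) for _ in range(WINDOW)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(objectives(ind), ind) for ind in pop]
        # Pareto-based survival: keep individuals no one dominates
        front = [ind for f, ind in scored
                 if not any(dominates(g, f) for g, _ in scored)]
        children = []
        while len(front) + len(children) < pop_size:
            p1, p2 = random.sample(front, 2) if len(front) > 1 else (front[0], front[0])
            cut = random.randrange(1, WINDOW)        # one-point crossover
            child = p1[:cut] + p2[cut:]
            child[random.randrange(WINDOW)] ^= 1     # point mutation
            children.append(child)
        pop = front + children
    # pick one solution from the final population for illustration
    return max(pop, key=lambda ind: sum(objectives(ind)))

best = evolve()
print("selected jobs:", [i for i, b in enumerate(best) if b],
      "utilization:", objectives(best))
```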
Next, we leverage reinforcement learning (RL) to automatically learn efficient workload- and system-specific scheduling policies. Existing HPC schedulers use either generalized, simple heuristics or optimization methods that ignore workload and system characteristics. To overcome this issue, we design a new scheduling agent, DRAS, that automatically learns efficient scheduling policies. DRAS leverages advances in deep reinforcement learning and incorporates the key features of HPC scheduling in the form of a hierarchical neural network structure. We develop a three-phase training process to help DRAS effectively learn the scheduling environment (i.e., the system and its workloads) and rapidly converge to an optimal policy.

Finally, we explore the problem of scheduling mixed workloads, i.e., rigid, malleable, and on-demand workloads, on a single HPC system. Traditionally, rigid jobs have been the main tenants of HPC systems. In recent years, malleable applications, i.e., jobs that can change size before and during execution, have been emerging on HPC systems. In addition, dedicated clusters used to be the main platforms for running on-demand jobs, i.e., jobs that need to be completed in the shortest time possible. As the sizes of on-demand jobs grow, HPC systems become more cost-efficient platforms for them. However, existing studies do not consider the problem of scheduling all three types of workloads together. In our work, we propose six mechanisms, which combine checkpointing, shrinking, and expansion techniques, to schedule the mixed workloads on one HPC system.
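A skeletal view of the RL effort: a policy network scores each job in a visible window, and a REINFORCE-style update reinforces selections that lead to good episode returns. DRAS's actual hierarchical two-level network and three-phase training are considerably more elaborate; the environment, features, and reward below are hypothetical stand-ins.

```python
# REINFORCE sketch of learning a job-selection policy over a job window.
import torch
import torch.nn as nn

WINDOW, FEATS = 8, 4                       # jobs visible to the agent; features per job

policy = nn.Sequential(                    # scores each job from its features
    nn.Linear(FEATS, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def fake_episode(steps=16):
    """Roll out a toy scheduling episode; return log-probs and a scalar return."""
    logps, slowdown = [], 0.0
    for _ in range(steps):
        jobs = torch.rand(WINDOW, FEATS)   # (size, estimated runtime, wait, ...)
        scores = policy(jobs).squeeze(-1)  # one score per queued job
        dist = torch.distributions.Categorical(logits=scores)
        a = dist.sample()                  # pick one job to start next
        logps.append(dist.log_prob(a))
        # toy signal: penalize picking long jobs that have waited little
        slowdown += (jobs[a, 1] - jobs[a, 2]).item()
    return torch.stack(logps), -slowdown   # return = negative accumulated slowdown

for episode in range(200):                 # vanilla REINFORCE update
    logps, ret = fake_episode()
    loss = -(logps.sum() * ret)
    opt.zero_grad()
    loss.backward()
    opt.step()
```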
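One possible mechanism for the mixed-workload effort can be sketched as follows: when an on-demand job arrives and free nodes are insufficient, shrink malleable jobs first (cheap, since they keep running) and checkpoint rigid jobs only as a last resort. The job model and the ordering of actions are illustrative assumptions, not a reproduction of the six proposed mechanisms.

```python
# Toy mechanism: free enough nodes for an arriving on-demand job by
# shrinking malleable jobs, then checkpointing rigid ones if still short.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    min_nodes: int        # floor a malleable job can shrink to
    malleable: bool

def make_room(running, free_nodes, demand):
    """Return (shrink, checkpoint) actions that free at least demand - free_nodes nodes."""
    needed = demand - free_nodes
    shrink, checkpoint = [], []
    # 1) shrink malleable jobs toward their minimum size (job keeps running)
    for j in (j for j in running if j.malleable):
        if needed <= 0:
            break
        give = min(j.nodes - j.min_nodes, needed)
        if give > 0:
            shrink.append((j.name, give))
            needed -= give
    # 2) checkpoint-and-preempt rigid jobs, smallest first (job restarts later)
    for j in sorted((j for j in running if not j.malleable), key=lambda j: j.nodes):
        if needed <= 0:
            break
        checkpoint.append(j.name)
        needed -= j.nodes
    return shrink, checkpoint

running = [Job("sim-A", 64, 16, True), Job("train-B", 32, 32, False),
           Job("sim-C", 48, 24, True)]
print(make_room(running, free_nodes=8, demand=80))
```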