The batch scheduler is an important system software serving as the interface between users and HPC systems. Users submit their jobs via batch... Show moreThe batch scheduler is an important system software serving as the interface between users and HPC systems. Users submit their jobs via batch scheduling portal and the batch scheduler makes scheduling decision for each job based on its request for system resources and system availability. Jobs submitted to HPC systems are usually parallel applications and their lifecycle consists of multiple running phases, such as computation, communication and input/output data. Thus, the running of such parallel applications could involve various system resources, such as power, network bandwidth, I/O bandwidth, storage, etc. And most of these system resources are shared among concurrently running jobs. However, Today's batch schedulers do not take the contention and interference between jobs over these resources into consideration for making scheduling decisions, which has been identified as one of the major culprits for both the system and application performance variability. In this work, we propose a cooperative batch scheduling framework for HPC systems. The motivation of our work is to take important factors about jobs and the system, such as job power, job communication characteristics and network topology, for making orchestrated scheduling decisions to reduce the contention between concurrently running jobs and to alleviate the performance variability. Our contributions are the design and implementation of several coordinated scheduling models and algorithms for addressing some chronic issues in HPC systems. The proposed models and algorithms in this work have been evaluated by the means of simulation using workload traces and application communication traces collected from production HPC systems. Preliminary experimental results show that our models and algorithms can effectively improve the application and the system overall performance, HPC facilities' operation cost, and alleviate the performance variability caused by job interference. Ph.D. in Computer Science, May 2017 Show less