Resource management and job scheduling are crucial tasks on large-scale computing systems. Despite years of research, resource management and scheduling have not kept pace with recent changes and technology trends. This thesis is motivated by emerging issues observed in current production supercomputers, which stem from human behavior, application characteristics, and increasing system complexity. Specifically, users tend to provide inaccurate parameters for their jobs, yet the scheduler depends on those parameters; system owners have diverse goals that conflict with one another. Also, workload characteristics on production supercomputers keep changing unpredictably, which makes it hard to sustain scheduling performance because scheduling policies depend largely on workload characteristics. Further, increasing hardware complexity causes system issues and leads to new demands. For example, node fragmentation, failure interruption, power consumption, and I/O overhead have become common in large-scale systems. Existing resource management systems lack support for these issues and demands. In this study, we present an integrated resource management and scheduling framework that aims to address emerging issues and challenges in resource management for large-scale production supercomputers. We have designed a set of new schemes, including job parameter prediction, adaptive metric-aware job scheduling, cost-aware job scheduling, and multi-domain job coscheduling. We have implemented these approaches in the production resource manager Cobalt and evaluated them with real job traces from production supercomputers such as the Blue Gene/P system at Argonne National Laboratory.
Experimental results show that our schemes can effectively improve job scheduling in terms of both user satisfaction and system utilization.
Ph.D. in Computer Science, July 2012