Search results
(1 - 2 of 2)
- Title
- SCALABLE RESOURCE MANAGEMENT SYSTEM SOFTWARE FOR EXTREME-SCALE DISTRIBUTED SYSTEMS
- Creator
- Wang, Ke
- Date
- 2015, 2015-07
- Description
Distributed systems are growing exponentially in computing capacity. On the high-performance computing (HPC) side, supercomputers are predicted to reach exascale with billion-way parallelism around the end of this decade. Scientific applications running on supercomputers are becoming more diverse, including traditional large-scale HPC jobs, small-scale HPC ensemble runs, and fine-grained many-task computing (MTC) workloads. Similar challenges are cropping up in cloud computing, as data centers host an ever-growing number of servers, exceeding many of the top HPC systems in production today. The applications commonly found in the cloud are ushering in the era of big data, resulting in billions of tasks that process increasingly large amounts of data. However, the resource management system (RMS) software of distributed systems is still designed around the decades-old centralized paradigm, which falls far short of the ever-growing performance and scalability needs at extreme scales, due to the limited capacity of a centralized server. This gap between processing capacity and performance needs has driven us to develop next-generation RMSs that are orders of magnitude more scalable. In this dissertation, we first devise a general system software taxonomy to explore the design choices of system software, and propose that key-value stores can serve as a building block. We then design distributed RMSs on top of key-value stores. We propose a fully distributed architecture and a data-aware work stealing technique for MTC resource management, and develop the SimMatrix simulator to explore the distributed designs, which informs the real implementation of the MATRIX task execution framework. We also propose a partition-based architecture and resource sharing techniques for HPC resource management, and implement them in the Slurm++ workload manager and the SimSlurm++ simulator. We study the distributed designs on real systems with up to thousands of nodes, and through simulations with up to millions of nodes. Results show that the distributed paradigm has significant advantages over the centralized one. We envision that the contributions of this dissertation will be both evolutionary and revolutionary for the extreme-scale computing community, and will lead to a wealth of follow-on research and innovation towards tomorrow's extreme-scale systems.
Ph.D. in Computer Science, July 2015
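The abstract above describes a data-aware work stealing technique for MTC scheduling, realized in MATRIX and explored with SimMatrix. The toy model below is only an illustrative sketch of that general idea, not code from those systems: all names (Task, Scheduler, steal_from, the batch size, and the simulation setup) are hypothetical. It shows an idle per-node scheduler stealing a batch of tasks from a victim while preferring tasks whose input data it already holds locally.

```python
"""Illustrative sketch only: a toy, single-process model of data-aware work
stealing. Names and parameters are hypothetical, not from MATRIX/SimMatrix."""

import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    data_location: int  # id of the node holding this task's input data


class Scheduler:
    """One per-node scheduler with a local task queue."""

    def __init__(self, node_id, tasks):
        self.node_id = node_id
        self.queue = deque(tasks)
        self.completed = []

    def run_one(self):
        """Run the next local task, if any; return True if work was done."""
        if not self.queue:
            return False
        self.completed.append(self.queue.popleft())
        return True

    def steal_from(self, victim, batch=2):
        """Data-aware steal: prefer the victim's tasks whose input data
        already lives on this node, falling back to arbitrary tasks."""
        candidates = sorted(
            victim.queue,
            key=lambda t: 0 if t.data_location == self.node_id else 1,
        )
        for task in candidates[:batch]:
            victim.queue.remove(task)
            self.queue.append(task)


def simulate(num_nodes=4, num_tasks=40, seed=0):
    random.seed(seed)
    # All tasks start on node 0, mimicking a skewed initial submission.
    tasks = [Task(i, data_location=random.randrange(num_nodes)) for i in range(num_tasks)]
    nodes = [Scheduler(0, tasks)] + [Scheduler(i, []) for i in range(1, num_nodes)]

    while any(n.queue for n in nodes):
        for n in nodes:
            if not n.run_one():
                # Idle: pick a random victim and attempt a data-aware steal.
                victim = random.choice([v for v in nodes if v is not n])
                n.steal_from(victim)

    for n in nodes:
        local_hits = sum(1 for t in n.completed if t.data_location == n.node_id)
        print(f"node {n.node_id}: ran {len(n.completed)} tasks, "
              f"{local_hits} with local data")


if __name__ == "__main__":
    simulate()
```

In the real systems the schedulers are distributed processes coordinating through key-value stores rather than in-process objects; the sketch only conveys the locality-aware victim and task selection at the heart of the technique.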
- Title
- SCALABLE RESOURCE MANAGEMENT IN CLOUD COMPUTING
- Creator
- Sadooghi, Iman
- Date
- 2017, 2017-05
- Description
The exponential growth of data and application complexity has brought new challenges to the distributed computing field. Scientific applications are growing more diverse, with workloads ranging from traditional MPI high-performance computing (HPC) to fine-grained, loosely coupled many-task computing (MTC). Traditionally, these workloads have been shown to run well on supercomputers and highly tuned HPC clusters. The advent of cloud computing has drawn scientists' attention to exploiting these resources for scientific applications at a potentially lower cost. We investigate the nature of the cloud and its ability to run scientific applications efficiently. Delivering high throughput and low latency for various types of workloads at large scales has driven us to design and implement new job scheduling and execution systems that are fully distributed and able to run in public clouds. We discuss the design and implementation of a job scheduling and execution system (CloudKon). CloudKon is optimized to exploit cloud resources efficiently through a variety of cloud services (Amazon SQS and DynamoDB) in order to achieve the best performance and utilization. It also supports various workloads, including MTC and HPC applications, concurrently. To further improve the performance and flexibility of CloudKon, we designed and implemented a fully distributed message queue (Fabriq) that delivers an order of magnitude better performance than the Amazon Simple Queue Service (SQS). Designing Fabriq helped us expand our scheduling system to many other distributed systems, including non-Amazon clouds. With Fabriq as a building block, we were able to design and implement a multipurpose task scheduling and execution framework (Albatross) that can efficiently run various types of workloads at larger scales. Albatross provides data locality and task execution dependency support; those features enable Albatross to natively run MapReduce workloads. We evaluated CloudKon with synthetic MTC workloads, synthetic HPC workloads, and synthetic MapReduce applications on the Amazon AWS cloud with up to 1K instances. Fabriq was also evaluated with synthetic workloads on the Amazon AWS cloud with up to 128 instances. Performance evaluations of Albatross show its ability to outperform Spark and Hadoop in different scenarios.
Ph.D. in Computer Science
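The abstract above notes that CloudKon builds its task distribution on Amazon SQS and DynamoDB. The sketch below is a minimal, hypothetical worker loop in that general style, not CloudKon's actual code: the queue URL, table name, and task format are placeholders, and it uses standard boto3 calls to pull task messages from SQS and a DynamoDB conditional write to filter out SQS's at-least-once duplicate deliveries.

```python
"""Illustrative sketch only: a pull-based cloud worker in the general style the
abstract describes. Queue URL, table name, and task format are placeholders."""

import subprocess

import boto3
from botocore.exceptions import ClientError

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"  # placeholder
TABLE_NAME = "task-dedup"  # placeholder DynamoDB table keyed on "task_id"

sqs = boto3.client("sqs")
ddb = boto3.client("dynamodb")


def claim_task(task_id: str) -> bool:
    """Atomically record the task id; returns False if another worker claimed it."""
    try:
        ddb.put_item(
            TableName=TABLE_NAME,
            Item={"task_id": {"S": task_id}},
            ConditionExpression="attribute_not_exists(task_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise


def worker_loop():
    while True:
        # Long-poll the queue for a batch of task messages.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            task_id = msg["MessageId"]
            command = msg["Body"]  # here each task body is a shell command line
            if claim_task(task_id):
                subprocess.run(command, shell=True, check=False)
            # Delete in either case so duplicates are not redelivered forever.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    worker_loop()
```

Per the abstract, Fabriq later takes over the queue's role from SQS in the authors' systems; the pull-based pattern of workers fetching and idempotently claiming tasks is the part this sketch is meant to convey.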