Search results
(1 - 2 of 2)
- Title
- AN INTEGRATED DATA ACCESS SYSTEM FOR BIG COMPUTING
- Creator
- Yang, Xi
- Date
- 2016, 2016-07
- Description
Big data has entered every corner of science and engineering and has become part of human society. Scientific research and commercial practice increasingly depend on the combined power of high-performance computing (HPC) and high-performance data analytics. Due to this importance, several commercial computing environments have been developed in recent years to support big data applications. MapReduce is a popular mainstream paradigm for large-scale data analytics, and MapReduce-based data analytic tools commonly rely on underlying MapReduce file systems (MRFS), such as the Hadoop Distributed File System (HDFS), to manage massive amounts of data. At the same time, conventional scientific applications usually run in HPC environments, such as the Message Passing Interface (MPI), and keep their data in parallel file systems (PFS), such as Lustre and GPFS, for high-speed computing and data consistency. As scientific applications become data intensive and big data applications become computing hungry, there is a surging interest in and need for integrating HPC power and data processing power to support HPC on big data, the so-called big computing. A fundamental issue of big computing is the integration of data management and interoperability between the conventional HPC ecosystem and the newly emerged data processing/analytics ecosystem. However, data sharing between PFS and MRFS is currently limited, due to semantic mismatches, a lack of communication middleware, and divergent design philosophies and goals. Challenges also exist in cross-platform task scheduling and parallelism. At the application layer, the data model mismatch between the raw data kept on file systems and the data management software of an application impedes cross-platform data processing as well. To support cross-platform integration, we propose and develop the Integrated Data Access System (IDAS) for big computing.
IDAS extends the accessibility of programming models and integrates the HPC environment with the MapReduce/Hadoop data processing environment. Under IDAS, MPI applications and MapReduce applications can share and exchange data across PFS and MRFS transparently and efficiently. Through this sharing and exchange, MPI and MapReduce applications can collaboratively provide both high-performance computing and data processing power for a given application. IDAS achieves its goal in several steps. First, IDAS enhances MPI-IO so that MPI-based applications can access data stored in HDFS efficiently; here, HDFS itself is enhanced to support MPI-based applications. For instance, we have enhanced HDFS to transparently support N-to-1 file writes for better write concurrency. Second, IDAS enhances the Hadoop framework to enable MapReduce-based applications to transparently process data that resides on PFS. The term "enhance" is chosen deliberately: MPI-based applications can not only access data stored on HDFS but also continue to access data stored on PFS, and the same holds for MapReduce-based applications. Through these enhancements, we achieve seamless data sharing. In addition, we have integrated data access with several application tools; in particular, we have integrated image plotting, query, and data subsetting within one application for Earth Science data analysis. Finally, many data centers prefer erasure coding over triplication to achieve data durability, trading data availability for lower storage cost. To this end, we have also investigated performance optimization of the erasure-coded Hadoop system, to enhance the Hadoop component of IDAS.
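The N-to-1 write pattern mentioned above can be sketched briefly. In MPI-IO, N processes commonly write disjoint regions of one shared file at computed offsets; parallel file systems such as Lustre and GPFS support this directly, while stock HDFS permits only a single appending writer per file, which is the gap the described enhancement addresses. The following is a minimal, single-machine illustration of the access pattern itself (not IDAS code; the rank loop stands in for concurrent MPI processes):

```python
# Illustrative sketch of the N-to-1 shared-file write pattern from MPI-IO.
# Each "rank" writes its own block at a disjoint offset in one shared file.
# In real MPI these writes happen concurrently from separate processes.
import os
import tempfile

NUM_RANKS = 4
BLOCK = 16  # bytes written by each rank

fd, path = tempfile.mkstemp()
os.close(fd)

# Pre-size the shared file, then let each rank write at offset rank * BLOCK.
with open(path, "wb") as f:
    f.truncate(NUM_RANKS * BLOCK)

for rank in range(NUM_RANKS):  # concurrent in MPI; sequential here
    wfd = os.open(path, os.O_WRONLY)
    os.pwrite(wfd, bytes([rank]) * BLOCK, rank * BLOCK)  # disjoint region
    os.close(wfd)

with open(path, "rb") as f:
    data = f.read()

# Every rank's block lands in its own region with no interleaving.
assert data == b"".join(bytes([r]) * BLOCK for r in range(NUM_RANKS))
os.remove(path)
```

Because the regions are disjoint, no writer ever observes another's bytes; it is this concurrent, offset-addressed write model that an append-only, single-writer file system cannot express without extension.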
Ph.D. in Computer Science, July 2016
- Title
- SCALABLE RESOURCE MANAGEMENT IN CLOUD COMPUTING
- Creator
- Sadooghi, Iman
- Date
- 2017, 2017-05
- Description
The exponential growth of data and application complexity has brought new challenges to the distributed computing field. Scientific applications are growing more diverse, with workloads ranging from traditional MPI high-performance computing (HPC) to fine-grained, loosely coupled many-task computing (MTC). Traditionally, these workloads have been shown to run well on supercomputers and highly tuned HPC clusters. The advent of cloud computing has drawn the attention of scientists, who seek to exploit these resources for scientific applications at a potentially lower cost. We investigate the nature of the cloud and its ability to run scientific applications efficiently. The goal of delivering high throughput and low latency for various types of workloads at large scales has driven us to design and implement new job scheduling and execution systems that are fully distributed and able to run in public clouds. We discuss the design and implementation of a job scheduling and execution system (CloudKon). CloudKon is optimized to exploit cloud resources efficiently through a variety of cloud services (Amazon SQS and DynamoDB) in order to get the best performance and utilization. It also supports various workloads, including MTC and HPC applications, concurrently. To further improve the performance and flexibility of CloudKon, we designed and implemented a fully distributed message queue (Fabriq) that delivers an order of magnitude better performance than the Amazon Simple Queue Service (SQS). Designing Fabriq helped us expand our scheduling system to many other distributed systems, including non-Amazon clouds. With Fabriq as a building block, we were able to design and implement a multipurpose task scheduling and execution framework (Albatross) that can efficiently run various types of workloads at larger scales. Albatross provides data locality and task execution dependency; these features enable Albatross to natively run MapReduce workloads.
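The queue-centric scheduling idea above can be illustrated with a toy example. Schedulers built over a distributed message queue (CloudKon over SQS, or Fabriq) are pull-based: producers push fine-grained tasks into a shared queue, and workers pull and execute them at their own pace, so load balances itself without a central scheduler making per-task placement decisions. The sketch below is a single-machine stand-in for that pattern, not the Fabriq protocol or API:

```python
# Pull-based task-queue pattern: many workers drain one shared queue.
# This is an illustrative, in-process analogue of a distributed queue.
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            return
        results.put(item * item)  # stand-in for a real fine-grained task

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(100):              # producer enqueues 100 tiny tasks
    tasks.put(i)
for _ in workers:                 # one shutdown sentinel per worker
    tasks.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(100))
print(total)  # sum of squares 0..99 -> 328350
```

In a real system the queue itself is distributed and replicated across nodes, and its latency and throughput under many concurrent pullers dominate scheduler performance, which is what motivates replacing SQS with a faster purpose-built queue.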
We evaluated CloudKon with synthetic MTC workloads, synthetic HPC workloads, and synthetic MapReduce applications on the Amazon AWS cloud with up to 1K instances. Fabriq was also evaluated with synthetic workloads on the Amazon AWS cloud with up to 128 instances. Performance evaluations of Albatross show its ability to outperform Spark and Hadoop in a range of scenarios.
Ph.D. in Computer Science