Search results
(1 - 1 of 1)
- Title
- AN INTEGRATED DATA ACCESS SYSTEM FOR BIG COMPUTING
- Creator
- Yang, Xi
- Date
- 2016-07
- Description
- Big data has entered every corner of science and engineering and has become a part of human society. Scientific research and commercial practice increasingly depend on the combined power of high-performance computing (HPC) and high-performance data analytics. Owing to this importance, several commercial computing environments have been developed in recent years to support big data applications. MapReduce is a popular mainstream paradigm for large-scale data analytics, and MapReduce-based analytic tools commonly rely on underlying MapReduce file systems (MRFS), such as the Hadoop Distributed File System (HDFS), to manage massive amounts of data. At the same time, conventional scientific applications usually run in HPC environments such as the Message Passing Interface (MPI) and keep their data in parallel file systems (PFS), such as Lustre and GPFS, for high-speed computing and data consistency.

As scientific applications become data intensive and big data applications become computing hungry, there is a surging interest in, and need for, integrating HPC power with data processing power to support HPC on big data, the so-called big computing. A fundamental issue of big computing is the integration of data management and interoperability between the conventional HPC ecosystem and the newly emerged data processing and analytics ecosystem. However, data sharing between PFS and MRFS is currently limited, owing to semantic mismatches, the lack of communication middleware, and divergent design philosophies and goals. Challenges also exist in cross-platform task scheduling and parallelism. At the application layer, the mismatch between the data model of raw data kept on file systems and an application's data management software impedes cross-platform data processing as well.

To support cross-platform integration, we propose and develop the Integrated Data Access System (IDAS) for big computing. IDAS extends the accessibility of programming models and integrates the HPC environment with the MapReduce/Hadoop data processing environment. Under IDAS, MPI applications and MapReduce applications can share and exchange data on PFS and MRFS transparently and efficiently, and through this sharing they can collaboratively provide both high-performance computing and data processing power for a given application.

IDAS achieves its goal in several steps. First, IDAS enhances MPI-IO so that MPI-based applications can access data stored in HDFS efficiently; here, efficiency means that HDFS itself is enhanced to support MPI-based access patterns, for instance by transparently supporting N-to-1 file writes for better write concurrency (a sketch of this pattern follows the abstract). Second, IDAS enhances the Hadoop framework so that MapReduce-based applications can transparently process data that resides on PFS. We have carefully chosen the term "enhance": MPI-based applications can not only access data stored on HDFS but also continue to access data stored on PFS, and the same holds for MapReduce-based applications. Through these enhancements we achieve seamless data sharing. In addition, we have integrated data access with several application tools; in particular, we have integrated image plotting, query, and data subsetting within one application for Earth Science data analysis.
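As an illustration of the first step, the following is a minimal sketch, in plain C with standard MPI-IO calls, of the N-to-1 write pattern mentioned above: every rank writes its own region of one shared output file. The `hdfs:/user/demo/shared_output.dat` path and its ROMIO-style `hdfs:` prefix are assumptions made only for illustration; the abstract does not describe the actual naming convention or interface IDAS uses to expose HDFS through MPI-IO.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank prepares one contiguous chunk of the shared output file. */
    const int count = 1024;                      /* ints written per rank */
    int *buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++)
        buf[i] = rank;

    /* All ranks open the same file: the N-to-1 write pattern.
       The "hdfs:" prefix is a hypothetical file-system hint used here
       only to suggest that the target lives on HDFS. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hdfs:/user/demo/shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its chunk at a disjoint offset; the collective call
       lets the I/O layer coordinate the N concurrent writers. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

The collective MPI_File_write_at_all call makes the concurrent writers explicit to the I/O layer; since HDFS natively permits only a single writer per file, this is presumably the kind of access that required the HDFS-side enhancement described above.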
Many data centers prefer erasure coding over triplication to achieve data durability, trading data availability for lower storage cost. To this end, we have also investigated performance optimization of the erasure-coded Hadoop system, to enhance the Hadoop system within IDAS.
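As a concrete illustration of that trade-off (the abstract does not state which coding scheme was studied): under triplication every block is stored three times, a 3x storage overhead, and any one surviving replica can serve a read directly. Under a Reed-Solomon (6,3) code, to take a common example, six data blocks are stored with three parity blocks, a 1.5x overhead, and the group survives any three simultaneous losses; reading a lost block, however, requires fetching several surviving blocks and reconstructing it, which is the availability and performance cost that motivates optimizing the erasure-coded Hadoop system.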
Ph.D. in Computer Science, July 2016