- Title
- A Multi-level Data Integration Approach for the Convergence of HPC and Big Data Systems
- Creator
- Feng, Kun
- Date
- 2020
- Description
HPC is moving towards exascale (10^18 operations per second), following a trend that has continued for over half a century. Such compelling computing power brings huge opportunities for scientists to explore their problems at larger sizes and finer granularity. As a result, the data volume produced and consumed by extreme-scale computing has increased dramatically. To gain useful scientific insights, scientists analyze tremendous amounts of data, which stresses the storage systems and requires efficient data access. Besides the increase in data volume, the variety of I/O subsystems has grown as well to meet the drastically different, often conflicting I/O requirements of numerous applications. HPC and Big Data (BD), the two major camps of extreme-scale computing, have been developed separately for a long time and have diverged in their computing and storage paradigms. However, recent developments have shown that their convergence leads to more efficient scientific output. Hence, unification of these ecosystems is necessary to accelerate extreme-scale computing through the collaboration of applications from both camps. Integrated I/O has therefore become a major issue that needs to be addressed as the extreme-scale computing community moves forward.
This study explores improvements by proposing a new integrated data access system for extreme-scale computing. We enhance the BD framework to adapt to changing integrated data access requirements by enabling direct processing of scientific data from the parallel file system (PFS) at the HPC site. Our framework performs up to 8x faster than state-of-the-art solutions on representative workloads. We design a new advanced I/O middleware service that uses data aggregation resources to facilitate integrated data access in scientific workflows with both HPC and BD applications. Our middleware service reaches up to a 10x speedup over the default solution and 133% better performance than existing solutions.
We propose a novel storage integration solution on the storage side to unite all the storage resources, unify the namespace across all the storage systems, and provide an integrated data access service. The integrated solution can speed up a real workflow with integrated data access requirements by up to 6.86x over existing solutions. The three-level integration at the application, middleware, and storage levels provides a systematic, hierarchical I/O integration. Our implementation results show that the three-level optimized design and implementation are feasible and effective. The approach improves on state-of-the-art solutions and helps achieve an enhanced I/O system for extreme-scale computing that supports both HPC and BD applications.