High-Performance Computing (HPC) has passed the Petascale mark and is moving toward Exascale. As system ensemble size continues to grow, failures during the execution of parallel applications are the norm rather than the exception. Resilience is widely recognized as one of the key obstacles to Exascale computing. Checkpointing is currently the de facto fault-tolerance mechanism for parallel applications. However, parallel checkpointing at scale usually generates bursts of concurrent I/O requests, imposes considerable overhead on I/O subsystems, and limits the scalability of parallel applications. Although doubts about the feasibility of checkpointing continue to grow, no promising alternative is yet on the horizon. MapReduce is a new programming model for massive data processing. It has demonstrated compelling potential to reshape the landscape of HPC from various perspectives. The resilience of MapReduce applications and the potential of MapReduce to benefit HPC fault tolerance are active research topics that require extensive investigation. This thesis aims to build a systematic framework to support resilience in large-scale parallel systems. We address the identified checkpointing performance issue through a three-fold approach: reducing I/O overhead, exploiting storage alternatives, and determining the optimal checkpointing frequency. These three goals are pursued with three different mechanisms, namely system coordination and scheduling, use of the MapReduce framework, and stochastic modeling. To address the increasing concerns about MapReduce resilience, we also strive to improve the reliability of MapReduce applications and investigate the tradeoffs in programming model selection (e.g., MPI vs.
MapReduce) from the perspective of resilience. This thesis provides a thorough study of, and a practical solution to, the outstanding resilience problem of large-scale MPI-based HPC applications and beyond. It makes a noticeable contribution to the state of the art and opens a new research direction for many to follow.
Ph.D. in Computer Science, May 2012