Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. In this thesis, a new approach for failure prediction based on the Void Search (VS) algorithm is presented. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We explore its potential for failure prediction using environmental information and compare it to well-known prediction methods.

Another important issue for the HPC community is that next-generation supercomputers are expected to have more components and consume several times less energy per operation. Hence, supercomputer designers are pushing the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the hardware. Techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. Results show that it is possible to detect a large number of corruptions (i.e., above 90% in some cases) with less than 100% overhead using these techniques. Nevertheless, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication.

In this thesis, partial replication is explored to overcome this limitation. More specifically, it has been observed that not all processes of an MPI application experience the same level of data variability at exactly the same time.
Thus, one can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. Results indicate that this new approach can protect the MPI applications analyzed with 7–70% less overhead (depending on the application) than that of full duplication with similar detection recall.

Ph.D. in Computer Science, May 2017