RELIABILITY AND ENERGY ANALYSIS FOR EXTREME SCALE SYSTEMS
Reliability and energy are two of the major concerns in the development of today's supercomputers. To build a powerful machine while satisfying reliability requirements and energy constraints, HPC scientists continue to seek a better understanding of system and component behaviors. Toward this end, modern systems are deployed with various monitoring and logging tools that track reliability and energy data during system operation. Because these data contain important information about system reliability and energy, they are valuable resources for understanding system behavior. However, as system scale and complexity continue to grow, the process of going from collecting system data to extracting meaningful knowledge from the overwhelming volume of reliability and energy data faces a number of key challenges. To address these challenges, my work consists of three parts: data preprocessing, data analysis, and advanced modeling.