With the increasing scale and complexity of high performance computing (HPC) systems, reliability management is becoming a major concern.... Show moreWith the increasing scale and complexity of high performance computing (HPC) systems, reliability management is becoming a major concern. System logs are the primary source of information to understand and analyze system problems. Nevertheless, manual log processing is time-consuming, error-prone, and not scalable. Currently little study has been done on automated log analysis for practical use in HPC systems. In this thesis, we present a log analysis infrastructure by exploiting data mining and machine learning technologies. Our work can be broadly divided into four parts: log pre-processing, online failure prediction, automatic root cause diagnosis, and reliability modeling. We evaluate our results by means of system logs collected from production HPC systems. This work can greatly improve our understanding of faults and failures arising from hardware/software components and their interactions. It can further facilitate the reliability management for HPC systems. Ph.D. in Computer Science, July 2012 Show less