Scientific applications and other high-performance applications generate large amounts of data. Unstructured data is said to comprise more than 90% of the world’s information [IDC2011] and to be growing 60% annually [Grantz2008]. The large amounts of data generated by computation end up dispersed across the file system, and problems arise when these files must be located for later use. For a small number of files this may not be an issue, but as the number and size of the files grow, it becomes difficult to locate them on the file system using ordinary methods such as GNU Grep [8], which is commonly used in High Performance Computing and Many-Task Computing environments. This thesis addresses the problem of finding files in a distributed system environment. Our work leverages the FusionFS [1] distributed file system and the Apache Lucene [10] centralized indexing engine as fundamental building blocks. We designed and implemented a distributed search interface within the FusionFS file system that makes both indexing and searching the index across a distributed system simple. We have evaluated our system on up to 64 nodes, compared it with Grep, Hadoop, and Cloudera, and have shown that FusionFS’s indexing capabilities have lower overheads and faster response times.
M.S. in Computer Science, May 2016