Current web search engines only search unstructured data (web documents). A lot of data is available that is structured (weather, census, sports, etc.). A search engine that could search both structured data and text would be able to answer questions such as "show me the top three restaurants in Chicago" without simply looking for the word "three." Previous IPRO teams have been building a Java-based metasearch engine. Numerous issues have arisen: How do we identify which data source will be best for a given query? What is the best user interface for the system? For Fall 2001, the team will extend the search engine to include more sources. The team could really use the talents of non-computer-science students to help test the system and improve the user interface. Students should plan to work on this at least ten hours a week; anything less will result in a poor grade for the IPRO. Students in this IPRO are graded heavily on the success of the team, not just on individual performance. Weekly progress reports are required. Sponsorship: IIT Collaboratory for Interprofessional Studies. Project Plan for EnPRO 356: Intranet Mediators, Fall 2001 semester.
Scientific applications and other high-performance applications generate large amounts of data. Unstructured data is said to comprise more than 90% of the world's information [IDC2011], and it is growing 60% annually [Grantz2008]. The large amounts of data generated by computation lead to data being dispersed across the file system. Problems arise when we need to locate these files for later use. For a small number of files this might not be an issue, but as the number of files grows, and as the files themselves grow in size, it becomes difficult to locate them on the file system using ordinary methods such as GNU Grep [8], which is commonly used in High Performance Computing and Many-Task Computing environments. This thesis therefore tackles the problem of finding files in a distributed system environment. Our work leverages the FusionFS [1] distributed file system and the Apache Lucene [10] centralized indexing engine as fundamental building blocks. We designed and implemented a distributed search interface within the FusionFS file system that makes both building the index and searching it across a distributed system simple. We evaluated our system on up to 64 nodes, compared it with Grep, Hadoop, and Cloudera, and showed that FusionFS's indexing capabilities have lower overheads and faster response times. M.S. in Computer Science, May 2016
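The core reason indexing beats grep-style scanning can be illustrated with a minimal inverted index. This is a plain-Python toy under assumed names (`InvertedIndex`, `index_file`, `search`), not the FusionFS/Lucene implementation described above: a query becomes a dictionary lookup over precomputed postings rather than a full scan of every file's contents.

```python
# Toy inverted index: term -> set of file names containing that term.
# Illustrates why indexed search scales better than grep-style scans;
# NOT the actual FusionFS/Lucene code from the thesis.
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)

    def index_file(self, name, text):
        # One-time cost: tokenize the file and record which terms it holds.
        for term in text.lower().split():
            self.postings[term].add(name)

    def search(self, term):
        # Query cost is a single lookup, independent of total data size,
        # whereas grep must re-read every file on every query.
        return sorted(self.postings.get(term.lower(), set()))


idx = InvertedIndex()
idx.index_file("run1.log", "simulation converged after 42 steps")
idx.index_file("run2.log", "simulation diverged")
print(idx.search("simulation"))  # -> ['run1.log', 'run2.log']
print(idx.search("converged"))   # -> ['run1.log']
```

In a distributed design like the one the thesis describes, each node would hold an index of its locally stored files, and a search would fan out to all nodes and merge the partial results.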