Rapid advances in digital sensors, networks, storage, and computation, coupled with decreasing costs, are leading to the creation of huge collections of data. Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for storing and accessing data efficiently. However, as file system sizes and the amount of data "owned" by users grow, it is increasingly difficult to discover and locate data among the petabytes stored. While much research effort has focused on methods to efficiently store and process data, there has been relatively little focus on methods to efficiently explore, index, and search data using the same high-performance storage and compute systems. Users of large file systems either invest significant resources to implement specialized data catalogs for accessing and searching data, or resort to software tools that were not designed to exploit modern hardware. While it is now trivial to quickly discover websites from the billions of websites accessible on the Internet, it remains surprisingly difficult for users to search for data on large-scale storage systems. We initially explored the prospect of using existing search engine building blocks (e.g., CLucene) to integrate search into a high-performance distributed file system (e.g., FusionFS), by proposing and building the FusionDex system, a distributed indexing and query model for unstructured data.
We found indexing performance to be orders of magnitude slower than the theoretical speeds achievable with raw storage input and output, and sought to investigate a new clean-slate design for high-performance indexing and search. We proposed the SCANNS indexing framework to address the problem of efficiently indexing data in high-end systems, characterized by many-core architectures with multiple NUMA nodes and multiple PCIe NVMe storage devices. We designed SCANNS as a single-node framework that can be used as a building block for implementing high-performance indexed search engines, where the software architecture of the framework is scalable by design. The indexing pipeline is exposed and allows easy modification and tuning, enabling SCANNS to saturate storage, memory, and compute resources on different hardware. The proposed indexing framework uses a novel tokenizer and inverted index design to improve both indexing performance and search latency. Given the large amounts and the variety of data found in large-scale scientific file systems, it is natural to try to bridge the gap between the various data representations and to provide a more uniform search space. ScienceSearch is a search infrastructure for scientific data that uses machine learning to automate the creation of metadata tags from different data sources, such as published papers, proposals, images, and file system structure. ScienceSearch is a production system that is deployed on a container service platform at NERSC and provides search over data obtained from NCEM. We conducted a performance evaluation of the ScienceSearch infrastructure, focusing on scalability trends, in order to better understand the implications of performing search over an index built from the generated tags.
Drawing on the insights gained from SCANNS and the performance evaluation of ScienceSearch, we explored the problems of efficiently building and searching persistent indexes that do not fit into main memory. The SCIPIS framework builds on top of SCANNS and further optimizes the inverted index design and indexing pipeline by exposing new tuning parameters that allow the user to further adapt the index to the characteristics of the input data. The proposed framework allows the user to quickly build a persistent index and to efficiently run TF-IDF queries over the built index. We evaluated SCIPIS over three kinds of datasets (logs, scientific data, and file system metadata) and showed that it achieves high indexing and search performance and good scalability across all datasets.
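
For readers unfamiliar with the terms, the following is a minimal illustrative sketch of what an inverted index and a TF-IDF query are. It is not the SCANNS or SCIPIS design: the index here is a small in-memory dictionary rather than a persistent on-disk structure, the tokenizer is deliberately naive, the scoring uses one common TF-IDF variant (raw term frequency times log(N / document frequency)), and all function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    token, tokens = [], []
    for ch in text.lower():
        if ch.isalnum():
            token.append(ch)
        elif token:
            tokens.append("".join(token))
            token = []
    if token:
        tokens.append("".join(token))
    return tokens


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query, top_k=3):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical sample inputs loosely mirroring the three dataset kinds.
docs = {
    "job.log": "error: node failure, node restart",
    "run.h5": "simulation output temperature grid",
    "fs.meta": "owner alice size 4096 node 7",
}
index = build_index(docs)
results = search(index, len(docs), "node failure")
```

In this toy example, `job.log` ranks first because it contains both query terms, and the rarer term ("failure") contributes more to its score than the more common one ("node").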