Provenance, information about the origin of data and the queries and/or updates that produced it, is critical for debugging queries and transactions, auditing, establishing trust in data, and many other use cases. While how to model and capture the provenance of database queries has been studied extensively, optimization has been recognized as an important problem in provenance management, which includes storing, capturing, and querying provenance. However, previous work has focused almost exclusively on compressing provenance to reduce storage cost, and little work addresses optimizing the provenance capture process. Many approaches for capturing database provenance use SQL and represent provenance information as standard relations. However, even sophisticated query optimizers often fail to produce efficient execution plans for such queries because of their complexity and uncommon structure. To address this problem, we study algebraic equivalences and alternative ways of generating queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework that utilizes these optimizations.
While provenance has been well studied, no database optimizer uses provenance information to optimize query processing. Intuitively, provenance records exactly what data is relevant for a query. We can use this property to identify and filter out irrelevant input data early on, which speeds up query processing because the query runs only on the relevant input data instead of accessing the full input dataset. In this work, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches, which are concise encodings of what data is relevant for a query. A provenance sketch captured for one query can then be used to speed up subsequent queries, possibly by utilizing physical design artifacts such as indexes and zone maps.
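As a rough illustration of the idea behind provenance sketches, the following minimal Python sketch range-partitions a table on one attribute, records which fragments contain provenance for a top-k query, and then reruns the query over only those fragments. All names here (`fragment_of`, `capture_sketch`, `run_with_sketch`, the `id`/`price` columns, and the split points) are hypothetical and chosen for illustration; this is not the thesis's implementation, and the linear filter merely stands in for the range predicate that an index or zone map could evaluate in a real system.

```python
from bisect import bisect_right

def fragment_of(value, boundaries):
    """Return the index of the range fragment that contains `value`.

    `boundaries` is a sorted list of split points; fragment i covers
    values in [boundaries[i-1], boundaries[i]).
    """
    return bisect_right(boundaries, value)

def capture_sketch(rows, query, key, boundaries):
    """Run `query` on all rows and record which fragments hold its provenance.

    Assumption: for the simple query used here, the provenance is exactly
    the set of input rows the query returns.
    """
    provenance = query(rows)
    return {fragment_of(r[key], boundaries) for r in provenance}

def run_with_sketch(rows, query, key, boundaries, sketch):
    """Rerun `query` on only the rows whose fragment is in the sketch."""
    relevant = [r for r in rows if fragment_of(r[key], boundaries) in sketch]
    return query(relevant), len(relevant)

def top2_by_price(rows):
    """Example query: the two rows with the highest price."""
    return sorted(rows, key=lambda r: r["price"], reverse=True)[:2]

# Hypothetical table with eight rows, partitioned on `id` into four fragments.
rows = [{"id": i, "price": p} for i, p in enumerate([5, 40, 12, 95, 7, 33, 88, 21])]
boundaries = [2, 4, 6]

sketch = capture_sketch(rows, top2_by_price, "id", boundaries)
result, scanned = run_with_sketch(rows, top2_by_price, "id", boundaries, sketch)
full_result = top2_by_price(rows)
```

In this toy setting the sketch is just a small set of fragment numbers, so it is far more compact than the provenance itself. Reusing it is only safe under assumptions this example does not check, such as the data being unchanged and the later query being one for which the sketch still covers all relevant rows.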
The work presented in this thesis demonstrates that a tight integration between provenance management and query optimization can lead to significant performance improvements for query processing as well as for traditional database management tasks.