Search results
(1 - 3 of 3)
- Title
- WHY AND WHY-NOT PROVENANCE FOR QUERIES WITH NEGATION
- Creator
- Lee, Seokki
- Date
- 2020
- Description
Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, hypothetical reasoning about data, and data exploration. Both types of questions, i.e., why and why-not provenance, have been studied extensively, but mostly in isolation. A recent study shows that unification of why and why-not provenance can be achieved by developing a provenance model for queries with negation. In many complex queries, negation is natural and yields more expressive power. Thus, supporting both types of provenance together with negation can be useful for, e.g., debugging (missing) data over complex queries with negation. However, why-not provenance, and to a lesser degree why provenance, can be very large, resulting in severe scalability and usability challenges.
In this thesis, we introduce a framework that unifies why and why-not provenance. We develop a graph-based provenance model that is powerful enough to encode the evaluation of queries with negation (First-Order queries). We demonstrate that our model generalizes a wide range of provenance models from the literature. Using our model, we present the first practical approach that efficiently generates explanations, i.e., parts of the provenance that are relevant to the query outputs of interest. Furthermore, we present a novel approximate summarization technique to address the scalability and usability challenges. Our technique efficiently computes pattern-based provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. We implement these techniques in our PUG (Provenance Unification through Graphs) system, which runs on top of a relational database.
We demonstrate through extensive experiments that our approach scales to large datasets and produces comprehensive and meaningful (summaries of) provenance.
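To make the why/why-not distinction concrete, here is a toy sketch (not the PUG system itself, and far simpler than its graph model) of both kinds of explanation for a Datalog-style query with negation, Q(x) :- R(x), not S(x):

```python
# Toy relations for the query Q(x) :- R(x), not S(x).
R = {1, 2, 3}
S = {2}

def answers():
    """Evaluate Q: all x in R for which S(x) does not hold."""
    return {x for x in R if x not in S}

def why(x):
    """Why provenance: the facts supporting answer x (x must be in Q)."""
    assert x in answers()
    return {f"R({x})", f"not S({x})"}

def why_not(x):
    """Why-not provenance: the failed conditions explaining a missing answer."""
    assert x not in answers()
    reasons = set()
    if x not in R:
        reasons.add(f"not R({x})")  # x was never in R
    if x in S:
        reasons.add(f"S({x})")      # the negated goal blocked x
    return reasons

print(answers())   # {1, 3}
print(why(1))      # supporting facts for answer 1
print(why_not(2))  # {'S(2)'}: 2 is missing because S(2) holds
```

For real queries the why-not side enumerates conditions over the whole active domain, which is exactly the size problem the summarization techniques above address.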
- Title
- Effect of Pre-Processing Data on Fairness and Fairness Debugging using GOPHER
- Creator
- Sarkar, Mousam
- Date
- 2023
- Description
Artificial intelligence now contributes heavily to decision-making. Bias in machine learning models is a long-standing problem, and recent studies apply eXplainable Artificial Intelligence (XAI) approaches to identify and study it. Gopher [1] addresses the problem of locating and then mitigating bias: it generates interpretable top-k explanations for the unfairness of a model and identifies the subsets of training data that are the root cause of the unfair behavior. We utilize this system to study the effect of pre-processing on bias through provenance, implementing data lineage by tagging data points during and after the pre-processing stage. Our methodology and results provide a useful point of reference for studying how pre-processing the data relates to the unfairness of the resulting machine learning model.
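The lineage-tagging idea can be sketched in a few lines. This is a hypothetical illustration, not the thesis implementation: each raw record carries a provenance id, so after cleaning steps drop or rewrite rows, every surviving row can still be traced back to its source record (field names and steps are invented for the example):

```python
# Raw records, each tagged with a provenance id ("pid") before pre-processing.
raw = [
    {"pid": 0, "age": 25, "income": None},
    {"pid": 1, "age": 40, "income": 52000},
    {"pid": 2, "age": -1, "income": 31000},  # invalid age, will be dropped
]

def drop_invalid(rows):
    """Pre-processing step 1: remove rows with an invalid age."""
    return [r for r in rows if r["age"] >= 0]

def impute_income(rows, default=0):
    """Pre-processing step 2: fill in missing income values."""
    return [{**r, "income": r["income"] if r["income"] is not None else default}
            for r in rows]

clean = impute_income(drop_invalid(raw))
surviving = {r["pid"] for r in clean}  # ids that made it through: {0, 1}
```

Because the tags survive every step, subsets of cleaned training data flagged as a root cause of unfairness can be mapped back to the raw records and the pre-processing decisions that produced them.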
- Title
- Integrating Provenance Management and Query Optimization
- Creator
- Niu, Xing
- Date
- 2021
- Description
Provenance, information about the origin of data and the queries and/or updates that produced it, is critical for debugging queries and transactions, auditing, establishing trust in data, and many other use cases. While how to model and capture the provenance of database queries has been studied extensively, optimization remains an important open problem in provenance management, which spans storing, capturing, and querying provenance. Previous work has focused almost exclusively on compressing provenance to reduce storage cost; little work addresses optimizing the provenance capture process itself. Many approaches capture database provenance using SQL and represent provenance information as a standard relation. However, even sophisticated query optimizers often fail to produce efficient execution plans for such queries because of their complexity and uncommon structure. To address this problem, we study algebraic equivalences and alternative ways of generating queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework utilizing these optimizations.
While provenance has been well studied, no database optimizer exploits provenance information to optimize query processing. Intuitively, provenance records exactly which data is relevant for a query. We can use this to identify and filter out irrelevant input data early, speeding up query processing: instead of scanning the full input dataset, we run the query over only the relevant data. In this work, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches, concise encodings of which data is relevant for a query. In addition, a provenance sketch captured for one query is used to speed up subsequent queries, possibly by utilizing physical design artifacts such as indexes and zone maps. The work we present in this thesis demonstrates that tightly integrating provenance management and query optimization can lead to significant performance improvements for query processing as well as for traditional database management tasks.
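The core PBDS intuition, record which partitions of the input actually contributed to a result, then skip the rest on later runs, can be illustrated with a toy sketch. This is a simplification of the approach described above (the real system works over relational data inside a DBMS; partition names and predicates here are invented):

```python
# Input data split into horizontal partitions of (id, value) tuples.
partitions = {
    "p1": [(1, 10), (2, 20)],
    "p2": [(3, 500), (4, 900)],
    "p3": [(5, 30), (6, 40)],
}

def run_and_sketch(pred):
    """First run: evaluate pred over all partitions and record which
    partitions contributed to the result (the provenance sketch)."""
    sketch, result = set(), []
    for name, rows in partitions.items():
        hits = [r for r in rows if pred(r)]
        if hits:
            sketch.add(name)
            result.extend(hits)
    return result, sketch

def run_with_sketch(pred, sketch):
    """Later run: scan only the partitions the sketch marks as relevant."""
    return [r for name in sorted(sketch) for r in partitions[name] if pred(r)]

result, sketch = run_and_sketch(lambda r: r[1] > 100)
# Only partition p2 contributed, so a repeat run scans p2 alone.
fast = run_with_sketch(lambda r: r[1] > 100, sketch)
```

In a database, the "skip the other partitions" step is where physical design artifacts such as zone maps and indexes come in: the sketch tells the optimizer which ranges can be pruned before the query touches them.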