Independence and Graphical Models for Fitting Real Data

creator: Cho, Jason Y.
2023; Spring 2023; Thesis; Illinois Institute of Technology; MATH / Applied Mathematics
advisor: Kaul, Hemanshu
Mathematics; Fisher; Graphical; Hypothesis; Likelihood; Metropolis; Statistical model
en

Given a real-life dataset whose attributes take categorical values, with a corresponding r(1) × r(2) × … × r(m) contingency table with nonzero rows or nonzero columns, we test the goodness-of-fit of various independence models to the dataset using a variation of Metropolis-Hastings that uses Markov bases to obtain a Monte Carlo estimate of the p-value. This variation of Metropolis-Hastings can be found in Algorithm 3.1.1. Next we consider the problem: "Out of all undirected graphical models, each associated with a graph on m vertices, which one best fits the dataset?" Here, the m attributes are the vertices of the graph. Testing every model would require 2^(mC2) goodness-of-fit tests, since each of the mC2 = m(m-1)/2 vertex pairs is either an edge or not, giving 2^(mC2) possible undirected graphs on m vertices. Instead, we consider a backward-selection algorithm based on the likelihood-ratio test. We start with the complete graph G = K(m) and call the corresponding undirected graphical model ℳ(G) the parent model. Then, for each edge e in E(G), we apply the likelihood-ratio test to compare the fit of the child model ℳ(G-e) against that of the parent model ℳ(G), where ℳ(G-e) ⊆ ℳ(G). More details on this iterative process can be found in Algorithm 4.1.3. For our dataset, we use the alcohol dataset found at https://www.kaggle.com/datasets/sooyoungher/smoking-drinking-dataset, from which we use four attributes: "Gender" (male, female), "Age", "Total cholesterol (mg/dL)", and "Drinks alcohol or not?".
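To illustrate the kind of Markov-basis Metropolis-Hastings test described above, here is a minimal Python sketch for the simplest case, a two-way table under the independence model. It is not the thesis's Algorithm 3.1.1. The function names (chi_square, mh_p_value), the chi-square test statistic, the step count, and the absence of a burn-in period are all assumptions made for this example. The moves add +1 and -1 on a 2 × 2 subtable, so row and column sums never change, and the chain targets the hypergeometric distribution on tables with the observed margins.

```python
import random

def chi_square(table):
    """Pearson chi-square statistic for independence in a two-way table.
    Assumes every row sum and column sum is nonzero."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def mh_p_value(observed, steps=20000, seed=0):
    """Monte Carlo p-value estimate via a Metropolis-Hastings walk over
    tables sharing the observed margins (illustrative sketch only).
    Requires at least two rows and two columns."""
    rng = random.Random(seed)
    u = [list(r) for r in observed]
    n_rows, n_cols = len(u), len(u[0])
    obs_stat = chi_square(observed)
    hits = 0
    for _ in range(steps):
        # Propose a basic move: +1 at (i1,j1),(i2,j2), -1 at (i1,j2),(i2,j1).
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        # Reject moves that would create a negative cell.
        if u[i1][j2] > 0 and u[i2][j1] > 0:
            # Hypergeometric ratio: prod(u_ij!) / prod(u'_ij!).
            ratio = (u[i1][j2] * u[i2][j1]) / ((u[i1][j1] + 1) * (u[i2][j2] + 1))
            if rng.random() < min(1.0, ratio):
                u[i1][j1] += 1
                u[i2][j2] += 1
                u[i1][j2] -= 1
                u[i2][j1] -= 1
        # Count the current table (moved or not) if it is at least as extreme.
        if chi_square(u) >= obs_stat - 1e-9:
            hits += 1
    return hits / steps
```

Calling mh_p_value on a table of counts returns the fraction of visited tables whose statistic is at least as large as the observed one. A small value suggests that the independence model fits poorly.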
After testing the goodness-of-fit of three independence models corresponding to the independence statements "Gender vs Drink or not?", "Age vs Drink or not?", and "Total cholesterol vs Drink or not?", we found that the data came from a distribution in the two independence models corresponding to "Age vs Drink or not?" and "Total cholesterol vs Drink or not?". After applying the backward-selection likelihood-ratio method to the alcohol dataset, we found that the data came from a distribution in the undirected graphical model associated with the complete graph minus the edge {"Total cholesterol", "Drink or not?"}.

born digital; application/pdf
In Copyright: http://rightsstatements.org/page/InC/1.0/
Restricted Access
http://hdl.handle.net/10560/islandora:1025126