Search results
(1  2 of 2)
 Title
 Independence and Graphical Models for Fitting Real Data
 Creator
 Cho, Jason Y.
 Date
 2023
 Description

Given some real life dataset where the attributes of the dataset take on categorical values, with corresponding r(1) × r(2) × … × r(m)...
Show moreGiven some real life dataset where the attributes of the dataset take on categorical values, with corresponding r(1) × r(2) × … × r(m) contingency table with nonzero rows or nonzero columns, we will be testing the goodnessoffit of various independence models to the dataset using a variation of MetropolisHastings that uses Markov bases as a tool to get a Monte Carlo estimate of the pvalue. This variation of MetropolisHastings can be found in Algorithm 3.1.1. Next we will consider the problem: ``out of all possible undirected graphical models each associated to some graph with m vertices that we test to fit on our dataset, which one best fits the dataset?" Here, the m attributes are labeled as vertices for the graph. We would have to conduct 2^(mC2) goodnessoffit tests since there are 2^(mC2) possible undirected graphs on m vertices. Instead, we consider a backwards selection method likelihoodratio test algorithm. We first start with the complete graph G = K(m), and call the corresponding undirected graphical model ℳ(G) as the parent model. Then for each edge e in E(G), we repeatedly apply the likelihoodratio test to test the relative fit of the model ℳ(Ge), the child model, vs. ℳ(G), the parent model, where ℳ(Ge) ⊆ℳ(G). More details on this iterative process can be found in Algorithm 4.1.3. For our dataset, we will be using the alcohol dataset found in https://www.kaggle.com/datasets/sooyoungher/smokingdrinkingdataset, where the four attributes of the dataset we will use are ``Gender" (male, female), ``Age", ``Total cholesterol (mg/dL)", and ``Drinks alcohol or not?". After testing the goodnessoffit of three independence models corresponding to the independence statements ``Gender vs Drink or not?", ``Age vs Drink or not?", and "Total cholesterol vs Drink or not?", we found that the data came from a distribution from the two independence models corresponding to``Age vs Drink or not?" and "Total cholesterol vs Drink or not?" And after applying the backwards selection likelihoodratio method on the alcohol dataset, we found that the data came from a distribution from the undirected graphical model associated to the complete graph minus the edge {``Total cholesterol”, ``Drink or not?”}.
Show less
 Title
 Independence and Graphical Models for Fitting Real Data
 Creator
 Cho, Jason Y.
 Date
 2023
 Description

Given some real life dataset where the attributes of the dataset take on categorical values, with corresponding r(1) × r(2) × … × r(m)...
Show moreGiven some real life dataset where the attributes of the dataset take on categorical values, with corresponding r(1) × r(2) × … × r(m) contingency table with nonzero rows or nonzero columns, we will be testing the goodnessoffit of various independence models to the dataset using a variation of MetropolisHastings that uses Markov bases as a tool to get a Monte Carlo estimate of the pvalue. This variation of MetropolisHastings can be found in Algorithm 3.1.1. Next we will consider the problem: ``out of all possible undirected graphical models each associated to some graph with m vertices that we test to fit on our dataset, which one best fits the dataset?" Here, the m attributes are labeled as vertices for the graph. We would have to conduct 2^(mC2) goodnessoffit tests since there are 2^(mC2) possible undirected graphs on m vertices. Instead, we consider a backwards selection method likelihoodratio test algorithm. We first start with the complete graph G = K(m), and call the corresponding undirected graphical model ℳ(G) as the parent model. Then for each edge e in E(G), we repeatedly apply the likelihoodratio test to test the relative fit of the model ℳ(Ge), the child model, vs. ℳ(G), the parent model, where ℳ(Ge) ⊆ℳ(G). More details on this iterative process can be found in Algorithm 4.1.3. For our dataset, we will be using the alcohol dataset found in https://www.kaggle.com/datasets/sooyoungher/smokingdrinkingdataset, where the four attributes of the dataset we will use are ``Gender" (male, female), ``Age", ``Total cholesterol (mg/dL)", and ``Drinks alcohol or not?". After testing the goodnessoffit of three independence models corresponding to the independence statements ``Gender vs Drink or not?", ``Age vs Drink or not?", and "Total cholesterol vs Drink or not?", we found that the data came from a distribution from the two independence models corresponding to``Age vs Drink or not?" and "Total cholesterol vs Drink or not?" And after applying the backwards selection likelihoodratio method on the alcohol dataset, we found that the data came from a distribution from the undirected graphical model associated to the complete graph minus the edge {``Total cholesterol”, ``Drink or not?”}.
Show less