- Title
- Removing Confounds in Text Classification for Computational Social Science
- Creator
- Landeiro Dos Reis, Virgile
- Date
- 2018
- Description
-
Nowadays, one can use social media and other online platforms to communicate with friends and family, write a review for a product, ask questions about a topic of interest, or even share details of private life with the rest of the world. The ever-increasing amount of user-generated content has provided researchers with data that can offer insights into human behavior. As a result, the field of computational social science - at the intersection of machine learning and the social sciences - has grown rapidly in recent years, especially within public health research. However, working with large amounts of user-generated data creates new issues. In this thesis, we propose solutions for two problems encountered in computational social science, both related to confounding bias.

First, because of the anonymity that online forums, social networks, and blogging platforms provide through the common use of usernames, it is hard to get accurate information about users such as gender, age, or ethnicity. Therefore, although collecting data on a specific topic is made easier, conducting an observational study with this type of data is not simple. Indeed, when one wishes to run a study to measure the effect of one variable on another, one needs to control for potential confounding variables. In the case of user-generated data, these potential confounding variables are at best noisily observed or inferred and at worst not observed at all. In this work, we wish to provide a way to use these inferred latent attributes to conduct an observational study while reducing the effect of confounding bias as much as possible. We first present a simple matching method in a large-scale observational study. Then, we propose a method to retrieve relevant and representative documents through adaptive query building in order to build the treatment and control groups of an observational study.

Second, we focus on the problem of controlling for confounding variables when the influence of these variables on the target variable of a classification problem changes over time. Although identifying and controlling for confounding variables has been assiduously studied in empirical social science, it is often neglected in text classification. This is understandable: if we assume that the impact of confounding variables does not change between the training and the testing data, then prediction accuracy should only be slightly affected. Yet this assumption often does not hold when working with user-generated text. Because of this, computational social science studies are at risk of reaching false conclusions when based on text classifiers that do not control for confounding variables. In this document, we propose to build a classifier that is robust to confounding bias shift, and we show that we can build such a classifier in different situations: when there are one or more observed confounding variables, when there is one noisily predicted confounding variable, or when the confounding variable is unknown but can be detected through topic modeling.