Search results
(1 - 2 of 2)
- Title
- Anytime Active Learning
- Creator
- Ramirez Loaiza, Maria E.
- Date
- 2016-05
- Description
-
Machine learning is a subfield of artificial intelligence that deals with algorithms that can learn from data. These methods give computers the ability to learn from past data and make predictions for new data. Examples of machine learning applications include automated document categorization, spam detection, speech recognition, face detection and recognition, language translation, and self-driving cars. A common scenario for machine learning is supervised learning, where the algorithm analyzes known examples to train a model that can identify a concept. For instance, given example documents pre-annotated as personal, work, family, etc., a machine learning algorithm can be trained to automate organizing your documents folder. To train a model that makes as few mistakes as possible, the algorithm needs many training examples (e.g., documents and their categories). Obtaining these examples often involves consulting a human user or expert whose time is limited and valuable. Hence, the algorithm needs to use the human's time as efficiently as possible by focusing on the most cost-effective and informative examples. Active learning is a technique in which the algorithm selects the examples that would be most cost-effective and beneficial to ask the expert about. In a typical active learning setting, the algorithm simply chooses which examples to present to the expert. In this thesis, we take this one step further: we observe that we can make even better use of the expert's time by showing not the full example but only the relevant pieces of it, so that the expert can focus on what is relevant and provide an answer faster. For example, in document classification, the expert does not need to see the full document to categorize it; if the algorithm can show only the relevant snippet to the expert, the expert should be able to categorize the document much faster.
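The selection step described in the abstract is commonly implemented with a criterion such as uncertainty sampling, where the learner queries the example its current model is least sure about. A minimal sketch (uncertainty sampling is one common criterion for illustration; the abstract does not commit to a specific one):

```python
def most_uncertain(probs):
    """Uncertainty sampling: return the index of the unlabeled example
    whose predicted positive-class probability is closest to 0.5,
    i.e., the one the current model is least confident about."""
    return min(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))

# The active learner would ask the expert to label this example next.
query = most_uncertain([0.95, 0.48, 0.10, 0.70])  # index 1 is closest to 0.5
```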
However, automatically finding the relevant snippet is not a trivial task; showing an incorrect snippet can either hinder the expert's ability to provide an answer at all (if the snippet is irrelevant) or even cause the expert to provide incorrect information (if the snippet is misleading). For this to work, the algorithm needs to find a snippet to show the expert, estimate how much time the expert will spend on that snippet, and predict whether the expert will return an answer at all. Further, the algorithm must estimate the likelihood of the expert returning the correct answer. Similar to anytime algorithms, which can find better solutions as they are given more time, we call the proposed set of methods anytime active learning, where experts are expected to give better answers as they are shown longer snippets. In this thesis, we focus on three aspects of anytime active learning: i) anytime active learning with document truncation, where the algorithm assumes that the first words, sentences, and paragraphs of a document are the most informative and must decide on the snippet length, i.e., where to truncate the document; ii) snippet optimization, where, given a document, the algorithm optimizes both snippet location and length; and lastly, iii) a unified framework, where the algorithm chooses not only the snippet location and size but also which documents to draw snippets from, so that the snippet length, the correctness of the expert's response, and the informativeness of the document are all optimized together.
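Aspect i) can be caricatured as a cost-quality trade-off: a longer snippet is assumed to raise the chance of a correct answer but costs more expert time. A toy sketch, in which both the saturating reliability curve and the linear time penalty are illustrative assumptions, not the thesis's actual estimators:

```python
def choose_truncation(doc_words, lengths=(10, 25, 50), cost_weight=0.005):
    """Pick the truncation length (in words) that maximizes a toy utility:
    expected answer quality minus a penalty for the expert's reading time."""
    best_len, best_utility = None, float("-inf")
    for k in lengths:
        k_eff = min(k, len(doc_words))             # cannot exceed the document
        p_correct = k_eff / (k_eff + 10.0)         # toy: reliability saturates with length
        utility = p_correct - cost_weight * k_eff  # quality minus time penalty
        if utility > best_utility:
            best_len, best_utility = k_eff, utility
    return best_len
```

Under these toy numbers, a 100-word document is truncated at 25 words: 10 words are too unreliable, while 50 words buy little extra reliability for their reading cost.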
Ph.D. in Computer Science, May 2016
- Title
- Removing Confounds in Text Classification for Computational Social Science
- Creator
- Landeiro Dos Reis, Virgile
- Date
- 2018
- Description
-
Nowadays, one can use social media and other online platforms to communicate with friends and family, write a review for a product, ask questions about a topic of interest, or even share details of private life with the rest of the world. The ever-increasing amount of user-generated content has provided researchers with data that can offer insights on human behavior. Because of that, the field of computational social science - at the intersection of machine learning and the social sciences - has soared in recent years, especially within public health research. However, working with large amounts of user-generated data creates new issues. In this thesis, we propose solutions for two problems encountered in computational social science and related to confounding bias. First, because of the anonymity provided by online forums, social networks, and other blogging platforms through the common usage of usernames, it is hard to get accurate information about users, such as gender, age, or ethnicity. Therefore, although collecting data on a specific topic is made easier, conducting an observational study with this type of data is not simple. Indeed, when one wishes to run a study to measure the effect of one variable on another, one needs to control for potential confounding variables. In the case of user-generated data, these potential confounding variables are at best noisily observed or inferred and at worst not observed at all. In this work, we wish to provide a way to use these inferred latent attributes to conduct an observational study while reducing the effect of confounding bias as much as possible. We first present a simple matching method in a large-scale observational study.
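The matching step can be illustrated with its simplest variant, exact one-to-one matching on inferred attributes (the attribute names below are hypothetical placeholders; the study's actual matching procedure may differ):

```python
from collections import defaultdict

def exact_match(treated, control, keys=("gender", "age_band")):
    """Pair each treated unit with one control unit sharing the same
    (possibly noisily inferred) covariate values; unmatched units are
    dropped, so the two groups end up balanced on those covariates."""
    pool = defaultdict(list)
    for unit in control:
        pool[tuple(unit[k] for k in keys)].append(unit)
    pairs = []
    for unit in treated:
        bucket = pool[tuple(unit[k] for k in keys)]
        if bucket:
            pairs.append((unit, bucket.pop()))
    return pairs
```

Any effect estimate is then computed only over the matched pairs, so the (inferred) covariates cannot confound the comparison, at the cost of discarding unmatched units.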
Then, we propose a method to retrieve relevant and representative documents through adaptive query building in order to build the treatment and control groups of an observational study. Second, we focus on the problem of controlling for confounding variables when the influence of these variables on the target variable of a classification problem changes over time. Although identifying and controlling for confounding variables has been assiduously studied in empirical social science, it is often neglected in text classification. This can be explained by the fact that, if we assume the impact of confounding variables does not change between the training and the testing data, then prediction accuracy should only be slightly affected. Yet, this assumption often does not hold when working with user-generated text. Because of this, computational social science studies are at risk of reaching false conclusions when based on text classifiers that do not control for confounding variables. In this document, we propose to build a classifier that is robust to confounding bias shift, and we show that we can build such a classifier in different situations: when there are one or more observed confounding variables, when there is one noisily predicted confounding variable, or when the confounding variable is unknown but can be detected through topic modeling.
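One standard way to make a classifier robust when an observed confound's association with the label shifts is back-door adjustment: train with the confound as a feature, then at prediction time marginalize it out under its prior rather than under its test-time association with the text. A sketch of the prediction-time step only, with the conditional probabilities assumed given; this illustrates the general adjustment, not necessarily the thesis's full method:

```python
def backdoor_adjusted_prob(p_y_given_xz, p_z):
    """P_adj(y | x) = sum_z P(y | x, z) * P(z): average the
    confound-conditional predictions under the confound's prior, so a
    shifted confound/label association at test time cannot bias P(y)."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)

# Hypothetical numbers: the text model's prediction differs sharply by an
# inferred gender confound; adjusting under a uniform prior averages them.
p = backdoor_adjusted_prob({"male": 0.9, "female": 0.2},
                           {"male": 0.5, "female": 0.5})  # 0.55
```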