Search results
(1 - 3 of 3)
- Title
- Data used to develop #Polar scores
- Creator
- Culotta, Aron, Hemphill, Libby, Heston, Matthew
- Date
- 2013, 2016
- Description
-
We present a new approach to measuring political polarization, including a novel algorithm and open source Python code, which leverages Twitter content to produce measures of polarization for both users and hashtags. #Polar scores provide advantages over existing measures because they (1) can be calculated throughout the legislative cycle, (2) allow for easy differentiation between users with similar scores, (3) are chamber-agnostic, and (4) are a generic approach that can be applied beyond the U.S. Congress. #Polar scores leverage available information such as party labels, word frequency, and hashtags to create an accessible, straightforward algorithm for estimating polarity using text. (from the paper: Hemphill, L., Culotta, A., and Heston, M. (forthcoming) #Polar Scores: Measuring partisanship using social media content. Journal of Information Technology & Politics.)
The dataset contains one plain text TSV file with the following information for each of the 55,244 tweets used to develop #Polar scores: tweet_id, created_at, user_id, screen_name, tag, shortid, sex, party, state, chamber, name. The file contains one row per hashtag, and therefore tweets may appear more than once. The Python code for calculating #Polar scores is available here: http://doi.org/10.5281/zenodo.53888
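The one-row-per-hashtag layout described above can be loaded with Python's standard csv module. The sketch below uses hypothetical sample rows in the documented column order (the values are invented for illustration; real data comes from the distributed file), and computes a toy per-hashtag polarity from the party column — this toy measure is NOT the published #Polar algorithm, only an illustration of the raw inputs the description names (party labels and hashtags).

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the dataset's documented TSV layout (one row per
# hashtag, so a tweet_id can repeat). Real values come from the distributed file.
sample_tsv = "\t".join(
    ["tweet_id", "created_at", "user_id", "screen_name", "tag", "shortid",
     "sex", "party", "state", "chamber", "name"]
) + "\n" + (
    "1\t2013-01-05\t10\trep_a\tbudget\tH001\tF\tD\tIL\thouse\tRep. A\n"
    "1\t2013-01-05\t10\trep_a\tjobs\tH001\tF\tD\tIL\thouse\tRep. A\n"
    "2\t2013-01-06\t11\trep_b\tbudget\tH002\tM\tR\tTX\thouse\tRep. B\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Rows repeat per hashtag; distinct tweets are keyed by tweet_id.
distinct_tweets = {r["tweet_id"] for r in rows}

# Toy per-hashtag polarity from party labels, NOT the published #Polar score:
# (D uses - R uses) / (D uses + R uses), ranging over [-1, 1].
by_tag = {}
for r in rows:
    by_tag.setdefault(r["tag"], Counter())[r["party"]] += 1
polarity = {
    tag: (c["D"] - c["R"]) / (c["D"] + c["R"]) for tag, c in by_tag.items()
}
print(len(rows), len(distinct_tweets), polarity)
```

Because tweets repeat once per hashtag, any per-tweet analysis should first deduplicate on tweet_id, as the distinct_tweets set does here.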
- Title
- Machine Learning at the Bureau of Labor Statistics
- Creator
- Ellis, Robert, Kannan, Vinesh
- Date
- 2019-11-21
- Description
-
Vinesh Kannan (CS '19) shares his experiences working as a data science fellow at the Bureau of Labor Statistics (BLS). Vinesh worked on the team that produces occupation and wage data used by policymakers, hiring staff, job seekers, and researchers across the country. He helped improve machine learning systems at the BLS: automatically identifying problematic training data and classifying rare jobs. Vinesh offers advice for students who may be interested in applying for the 2020 Civic Digital Fellowship, a program that recruits university students at all levels to spend a summer working on civic technology projects with various federal agencies.
Sponsorship: College of Science, Department of Computer Science, Department of Applied Mathematics, Machine Learning at IIT
- Title
- Towards In-Network Semantic Analysis: A Case Study involving Spam Classification
- Creator
- Gueyraud, Cyprien, Sultana, Nik
- Date
- 2023-03-06
- Description
-
Analyzing free-form natural language expressions “in the network”—that is, on programmable switches and smart NICs—would enable packet-handling decisions that are based on the textual content of flows. This analysis would support richer, latency-critical data services that depend on language analysis—such as emergency response, misinformation classification, customer support, and query-answering applications. But packet forwarding and processing decisions usually rely on simple analyses based on table look-ups that are keyed on well-defined (and usually fixed-size) header fields. P4 is the state-of-the-art domain-specific language for programming network equipment, but, to the best of our knowledge, analyzing free-form text using P4 has not yet been investigated. Although there is an increasing variety of P4-programmable commodity network hardware available, using P4 presents considerable technical challenges for text analysis since the language lacks loops and fractional datatypes. This paper presents the first Bayesian spam classifier written in P4 and evaluates it using a standard dataset. The paper contributes techniques for the tokenization, analysis, and classification of free-form text using P4, and investigates trade-offs between classification accuracy and resource usage. It shows how classification accuracy can be tuned between 69.1% and 90.4%, and how resource usage can be reduced to 6% by trading off accuracy. It uses the spam filtering use case to motivate the need for more research into in-network text analysis to enable future “semantic analysis” applications in programmable networks.
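The classifier family the abstract describes can be sketched in ordinary Python. The snippet below is NOT the paper's P4 implementation — it is a minimal multinomial naive Bayes spam classifier, with log-probabilities pre-scaled to integers to mirror the constraint the abstract mentions (P4 has no fractional datatypes, so a switch pipeline would work with integer additions per token). The SCALE constant and the toy training messages are assumptions for illustration.

```python
import math
from collections import Counter

SCALE = 1000  # fixed-point scaling factor (an assumption for illustration)

def train(docs, labels):
    """Build integer-scaled log P(word | label) tables with Laplace smoothing."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in zip(docs, labels):
        counts[label].update(text.lower().split())
    tables = {}
    for label, c in counts.items():
        total, vocab = sum(c.values()), len(c)
        # Scale each log-probability to an integer, as fixed-point arithmetic would
        tables[label] = {
            w: int(SCALE * math.log((n + 1) / (total + vocab)))
            for w, n in c.items()
        }
        # Smoothed score for words unseen in this class
        tables[label + "_default"] = int(SCALE * math.log(1 / (total + vocab)))
    return tables

def classify(tables, text):
    """Sum integer per-token scores per class; pick the larger total."""
    scores = {}
    for label in ("spam", "ham"):
        table, default = tables[label], tables[label + "_default"]
        scores[label] = sum(table.get(w, default) for w in text.lower().split())
    return max(scores, key=scores.get)

tables = train(
    ["win free money now", "free prize click now",
     "meeting notes attached", "lunch at noon"],
    ["spam", "spam", "ham", "ham"],
)
print(classify(tables, "free money prize"))  # prints: spam
```

The per-token lookup-and-add structure is the point of the sketch: it needs no loops over the model and no floating point, which is what makes a table-driven switch pipeline plausible for this kind of analysis.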