Vinesh Kannan (CS '19) shares his experiences working as a
data science fellow at the Bureau of Labor Statistics (BLS). Vinesh worked
on the team that produces occupation and wage data used by policymakers,
hiring staff, job seekers, and researchers across the country. He helped
improve machine learning systems at the BLS: automatically identifying
problematic training data and classifying rare jobs. Vinesh offers advice
for students who may be interested in applying for the 2020 Civic Digital
Fellowship, a program that recruits university students at all levels to
spend a summer working on civic technology projects with various federal
agencies. Sponsorship: College of Science, Department of Computer
Science, Department of Applied Mathematics, Machine Learning at IIT
This report provides supplementary technical details to the conference paper that introduced C-Saw, a language for expressing software architecture patterns. It gives additional examples of using C-Saw, presents supplementary evaluation details, and defines the formal semantics of the language.
Analyzing free-form natural language expressions “in the network” (that is, on programmable switches and smart NICs) would enable packet-handling decisions that are based on the textual content of flows. This analysis would support richer, latency-critical data services that depend on language analysis, such as emergency response, misinformation classification, customer support, and query-answering applications. But packet forwarding and processing decisions usually rely on simple analyses based on table look-ups that are keyed on well-defined (and usually fixed-size) header fields. P4 is the state-of-the-art domain-specific language for programming network equipment, but, to the best of our knowledge, analyzing free-form text using P4 has not yet been investigated. Although an increasing variety of P4-programmable commodity network hardware is available, using P4 presents considerable technical challenges for text analysis since the language lacks loops and fractional datatypes.
This paper presents the first Bayesian spam classifier written in P4 and evaluates it using a standard dataset. The paper contributes techniques for the tokenization, analysis, and classification of free-form text using P4, and investigates trade-offs between classification accuracy and resource usage. It shows how classification accuracy can be tuned between 69.1% and 90.4%, and how resource usage can be reduced to 6% by trading off accuracy. It uses the spam filtering use-case to motivate the need for more research into in-network text analysis to enable future “semantic analysis” applications in programmable networks.
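To illustrate why the lack of loops and fractional datatypes is a constraint, here is a minimal Python sketch of a naive Bayes spam score that needs only table look-ups and integer addition at classification time. It is not the paper's P4 implementation. The names `SCALE`, `train`, `classify`, and `max_tokens`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration. Floating-point arithmetic is used only in the offline training step, and the token loop is bounded so that it could in principle be unrolled.

```python
import math

SCALE = 1024  # hypothetical fixed-point scale for log-odds (assumption)


def train(messages):
    """Build an integer log-odds table offline from (text, is_spam) pairs.

    Assumes at least one spam and one non-spam message are present.
    """
    counts = {True: {}, False: {}}
    totals = {True: 0, False: 0}
    docs = {True: 0, False: 0}
    for text, is_spam in messages:
        docs[is_spam] += 1
        for tok in text.lower().split():
            counts[is_spam][tok] = counts[is_spam].get(tok, 0) + 1
            totals[is_spam] += 1
    vocab = set(counts[True]) | set(counts[False])
    v = len(vocab)
    table = {}
    for tok in vocab:
        # Add-one smoothing keeps every probability non-zero.
        p_spam = (counts[True].get(tok, 0) + 1) / (totals[True] + v)
        p_ham = (counts[False].get(tok, 0) + 1) / (totals[False] + v)
        # Store the scaled log-odds as an integer; no fractions at run time.
        table[tok] = round(SCALE * math.log(p_spam / p_ham))
    prior = round(SCALE * math.log(docs[True] / docs[False]))
    return table, prior


def classify(text, table, prior, max_tokens=8):
    """Return True for spam using only look-ups and integer addition."""
    score = prior
    # Fixed upper bound on tokens, standing in for an unrolled loop.
    for tok in text.lower().split()[:max_tokens]:
        score += table.get(tok, 0)  # unknown tokens contribute nothing
    return score > 0
```

In this sketch, lowering `max_tokens` or shrinking the table reduces the work and memory needed per message at the cost of accuracy, which is the general kind of accuracy versus resource trade-off the abstract describes.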