CFSS: Computing for the Social Sciences

Overview

Introduce regular expressions
Identify the basic workflow for conducting text analysis
Define the tidy text format (Chapter 1)
Compute word frequencies and tf-idf (Chapters 1 and 3)

Before class

For this lecture and the next, we use the book Text Mining with R. Before this class, read Chapters 1, 3, and 4. Before next class, read Chapters 2 and 6.

Class materials

Run the code below in your console to download today’s in-class materials:

usethis::use_course("CFSS-MACSS/text-analysis-fundamentals")

Additional resources

Basic text analysis:
* The three case studies included in the book “Text Mining with R” (see assigned readings) provide in-depth examples of how to perform text analysis from start to finish. I recommend Chapter 9, “Case study: analyzing usenet text”.
* For an excellent theoretical explanation of these topics, see Speech and Language Processing by Daniel Jurafsky & James H. Martin, especially Chapters 2, 3, and 4.

Regular expressions:
* Chapter 14 of our textbook, R for Data Science
* Chapter 15 by Rochelle Terman, which explains stringr (using the R for Data Science textbook)
* For an overview of regular expressions in R, see Chapter 17 of R Programming for Data Science. This book covers the full range of regular expression packages and functions; in class we focus only on the stringr package.
* stringr documentation and cheatsheet