Text analysis: fundamentals

Overview

Introduce regular expressions
Identify the basic workflow for conducting text analysis
Define the tidy text format (Chapter 1)
Compute word frequencies and tf-idf (Chapters 1 and 3)

Before class

For this and the following lecture we use the book Text Mining with R: A Tidy Approach. Start with Chapters 1 and 2.

Class materials

Run the code below in your console to download today’s in-class materials:

usethis::use_course("CFSS-MACSS/text-analysis-fundamentals")

Additional resources

Text Mining with R: read Chapters 1, 3, and 4. Before next class, read Chapters 2 and 6.

Basic text analysis:

The three case studies included in the book “Text Mining with R” (see assigned readings) provide in-depth examples of how to perform text analysis from A to Z. I recommend Chapter 9, “Case study: analyzing usenet text”.
For an excellent theoretical explanation of these topics, see Speech and Language Processing by Daniel Jurafsky & James H. Martin, especially Chapters 2, 3, and 4.

Regular Expressions:

Chapter 14 of our textbook, R for Data Science
Chapter 15 by Rochelle Terman, which explains stringr (using the R for Data Science textbook)
For an overview of regular expressions in R, see Chapter 17 of R Programming for Data Science. This book covers the entire range of regular expression packages and functions; in class we focus only on the stringr package.
stringr documentation and cheatsheet

© 2025 Jean Clipperton (materials adapted from Benjamin Soltoff and Sabrina Nardin) All Rights Reserved