Text analysis: fundamentals

Overview

  • Introduce regular expressions
  • Identify the basic workflow for conducting text analysis
  • Define the tidy text formats (Chapter 1)
  • Compute word frequencies and tf-idf (Chapters 1 and 3)

Before class

For this lecture and the next, we use the book Text Mining with R. Start with Chapters 1 and 2.

Class materials

  • Run the code below in your console to download today’s in-class materials: usethis::use_course("CFSS-MACSS/text-analysis-fundamentals")

Additional resources

Text Mining with R: read Chapters 1, 3, and 4. Before next class, read Chapters 2 and 6.

Basic text analysis:

  • The three case studies included in the book “Text Mining with R” (see assigned readings) provide in-depth examples of how to perform text analysis from A to Z. I recommend Chapter 9, “Case study: analyzing usenet text”
  • For an excellent theoretical explanation of these topics, see Speech and Language Processing by Daniel Jurafsky & James H. Martin, especially Chapters 2, 3, and 4

Regular Expressions:

  • Chapter 14 of our textbook R for Data Science
  • Chapter 15 by Rochelle Terman, which explains the stringr package (using the R for Data Science textbook)
  • For an overview of regular expressions in R, see Chapter 17 of R Programming for Data Science. This book covers the entire range of regular expression packages and functions; in class we focus only on the stringr package
  • stringr documentation and cheatsheet