class: center, middle, inverse, title-slide

.title[
# Text analysis: fundamentals
]
.author[
### MACSS 30500
University of Chicago
]

---

# Agenda

* Reminder: Regular Expressions
  * basic understanding of how it works
  * intro to the stringr package
* Workflow for text analysis
* Three examples:
  * Emily Dickinson (basic process with small example)
  * Jane Austen (more complex example with some analysis)
  * Billboard hot 100: merging data, answering questions, building workflow

---

class: inverse, middle

# Blast from the past!

## Regular Expressions

[refresher: revisit old slides if needed](https://cfss-macss.netlify.app/slides/32-strings/#1)

---

### What are regular expressions? What are they for?

We use them to manipulate character data, aka strings.

Regular Expressions or regexes (singular regex): a **language for pattern matching**. They are strings containing normal characters and special meta-characters that describe a particular pattern we want to match in a given text.

Regular Expressions are used:

* in **many programming languages**
* for **any task that deals with text:** NLP or data-cleaning tasks (e.g., find words that include a given set of letters, count how often past tenses occur in a text, find emails or phone numbers, find and replace leftover HTML tags from scraping, etc.)

<!-- Given our ability to manipulate strings and our ability to test for equivalence (==) or test whether some string contains another (in), we don't technically need special functions for pattern matching (e.g., regular expressions). That said, it becomes very tedious very quickly if we have to write all our pattern-matching code ourselves -->

---

# Regex: how we match

* **Anchors**: match a position before or after other characters
* **Types**: match types of characters (e.g., digits, word characters, whitespace)
* **Classes**: ranges or sets of characters
* **Quantifiers**: specify how many times a pattern can match
* **Repetition**: match more than a single instance
* **Patterns and backreferences**: name and extract specific chunks
* **Lookahead**: require that certain elements appear after your match, without including them in the match
* **Literal matches and modifiers**: you can specify particular matches (e.g., case)
* **Unicode**: particularly useful if you're working with other languages

---

# Regex: lazy vs greedy

Examples of lazy (non-greedy) quantifiers are `??`, `*?`, `+?`, and `{n,m}?`:

* They match as few characters as possible: the regex engine moves forward through the string one character at a time and stops at the first position where the pattern is satisfied
* Example: the regex `a+?` will match as few "a"s as possible in the string "aaaa". Thus, it matches the first character "a" and stops
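To make the difference concrete, here is a minimal illustrative sketch (not part of the original example) comparing the greedy and lazy versions of the same pattern with `stringr::str_extract()`:

``` r
library(stringr)

# greedy "+" consumes as many characters as it can
str_extract("aaaa", "a+")
#> [1] "aaaa"

# lazy "+?" consumes as few characters as it can
str_extract("aaaa", "a+?")
#> [1] "a"

# the same idea with an HTML-tag pattern:
# greedy ".+" runs to the last ">", lazy ".+?" stops at the first ">"
str_extract("<b>bold</b>", "<.+>")
#> [1] "<b>bold</b>"
str_extract("<b>bold</b>", "<.+?>")
#> [1] "<b>"
```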
---

### Regex examples

Examples: download today's in-class materials from the website:

`usethis::use_course("CFSS-MACSS/text-analysis-fundamentals")`

Resources:

* [stringr cheat sheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) for a complete overview of all `stringr` functions
* [Chapter 14 "Strings" of R for Data Science](https://r4ds.had.co.nz/strings.html#strings), especially section 14.4 "Tools" for examples of each of these functions
* [Regular expressions cheat sheet](https://www.datacamp.com/cheat-sheet/regular-expresso)
* [Excellent (but a bit complex) tutorial](https://github.com/ziishaned/learn-regex/blob/master/README.md)
* [Transform Strings slides](https://cfss-macss.netlify.app/slides/32-strings/#1)

---

### The `stringr` package in R

When you use regular expressions in your analysis, most likely you will use your regular expression together with one of the functions from the `stringr` package.

This package includes several functions that let you: detect matches in a string, count the number of matches, extract them, replace them with other values, or split a string based on a match.

---

### The `stringr` package in R

Fundamental `stringr` functions:

* `str_detect()`: detect matches in a string
* `str_count()`: count the number of matches
* `str_extract()` and `str_extract_all()`: extract matches
* `str_replace()` and `str_replace_all()`: replace matches
* `str_split()`: split a string based on a match

Key resources:

* [Chapter 14 "Strings" of R for Data Science](https://r4ds.had.co.nz/strings.html#strings), especially section 14.4 "Tools" for examples of each of these functions
* [Cheat sheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)
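A quick illustrative sketch of these functions together (added as an example; the strings and the deliberately simplified email-like pattern below are made up for demonstration):

``` r
library(stringr)

strings <- c("maria@uchicago.edu", "no email here", "write to ben@gmail.com")
pattern <- "[\\w.]+@[\\w.]+"   # simplified email-like pattern, for illustration only

str_detect(strings, pattern)    #> TRUE FALSE TRUE
str_count(strings, pattern)     #> 1 0 1
str_extract(strings, pattern)   #> "maria@uchicago.edu" NA "ben@gmail.com"

str_replace(strings, pattern, "<email>")
#> "<email>" "no email here" "write to <email>"

str_split("to be or not to be", " ")
#> a list with one element: "to" "be" "or" "not" "to" "be"
```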
---

class: inverse, middle

*A flowchart of a typical text analysis workflow*

*Source: [Text Mining with R](https://www.tidytextmining.com/tidytext.html#the-unnest_tokens-function)*

---

## Basic workflow for text analysis

We can think of the basic workflow as a 4-step process:

1. Obtain your textual data
1. Data cleaning and pre-processing
1. Data transformation
1. Perform analysis

Let's review each step...

---

## 1. Obtain your textual data

**Common data sources for text analysis:**

* Online (scraping and/or APIs)
* Databases
* PDF documents
* Digital scans of printed materials

---

## 1. Obtain your textual data

**Corpus and document:**

* Textual data is usually referred to as a **corpus**: a general term for a collection of texts, stored as raw strings (e.g., a set of articles from the NYT, novels by an author, one or multiple books, etc.)
* Each corpus might have separate articles, chapters, pages, or even paragraphs. Each individual unit is called a **document**. You decide what constitutes a document in your corpus.

---

## 2. Data cleaning and pre-processing

**Standard cleaning and pre-processing tasks:**

* Tokenize the text (to n-grams)
* Convert to lower case
* Remove punctuation and numbers
* Remove stopwords (standard and custom/domain-specific)
* Remove or replace other unwanted tokens
* Stemming or lemmatization

---

## 2. Data cleaning and pre-processing

**Tokenizing the text (to n-grams) means splitting your text into single tokens.**

**Token:** a word, alphanumeric character, punctuation mark (!, ?), number, emoticon, etc. Most tokenizers split on white spaces, but they also need to consider exceptions such as contractions (I'll, dog's) and hyphens in or between words (e-mail, co-operate, take-it-or-leave, 30-year-old).

**N-gram:** a contiguous sequence of n items from a given text (items can be syllables, letters, words, etc.). We usually keep unigrams (single words), but there are instances in which bigrams are helpful: for example, "Taylor Swift" is a bigram.

---

## 2. Data cleaning and pre-processing

**Remove stopwords** (standard and custom/domain-specific)

* Examples: the, is, are, a, an, in, etc.
* Why do we want to remove them?
* Example: if you are working on a corpus that talks about "President Biden" you might want to add "Biden" to your stop words

---

## 2. Data cleaning and pre-processing

**Stemming** and **lemmatization** are similar in that both aim to simplify words (aka tokens) to their base form, but they do it differently. Why do we want to do it?

--

**Stemming:** reducing a token to its **root stem** by crudely removing parts of it

* Examples: dogs becomes dog, walked becomes walk
* Faster, but not always accurate. Example: caring becomes car, changing becomes chang, better becomes bett

**Lemmatization:** reducing a token to its **root lemma** by using its meaning, so the token is converted to the concept that it represents

* Examples: dogs becomes dog, walked becomes walk
* Slower, but more accurate. Example: caring becomes care, changing becomes change, better becomes good

---

## 2. Data cleaning and pre-processing

More advanced pre-processing tasks (only applied to specific analyses):

* POS or Part-Of-Speech tagging (nouns, verbs, adjectives, etc.)
* NER or Named Entity Recognition tagging (person, place, company, etc.)
* Parsing

---

## 3. Data transformation

Transformation means *converting the text into numbers*, i.e., some quantifiable measure that a computer can process.

Usually you want to transform your raw textual data (your documents) into a vector of countable units:

* Bag-of-words model: creates a document-term matrix (one row for each document, and one column for each term)
* Word embedding models

---

## 4. Perform analysis

We will learn the following:

* Basic exploratory analysis
  * Word frequency
  * TF-IDF (weighted version of word frequency)
  * Correlations
* More advanced
  * Sentiment analysis
  * Topic modeling

---

# Motivating example 1: Emily Dickinson

We can think of the basic workflow as a 4-step process:

1. Obtain your textual data: *This is just to say*
1. Data cleaning and pre-processing *(simple)*
1. Data transformation *(one token per row)* [see next slides!]
1. Perform analysis *(basic histogram)*

---

class: inverse, middle

# WARNING: GRAPHICS ARE NOT NICELY FORMATTED!!

*do as I say, not as I do!*

---

## Poem: This is just to say

```
This Is Just To Say
By William Carlos Williams

I have eaten
the plums
that were in
the icebox

and which
you were probably
saving
for breakfast

Forgive me
they were delicious
so sweet
and so cold
```

---

## Analysis: what does this mean?

<div class="figure">
<img src="ex_plums.png" alt="Histogram of text from exercise 1" width="60%" />
<p class="caption">Histogram of text from exercise 1</p>
</div>

---

class: inverse, middle

# Text analysis with R tidyverse

---

### The tidy text format

There are different ways to complete all these steps in R, and different packages have their own approach. We learn how to perform these operations within the tidyverse.

For the data cleaning and pre-processing step: we start by converting text into a tidy format, which follows the same principles of tidy data we have learned so far.

---

### The tidy text format

A tidy text format is defined as **a table with one-token-per-row** (this is different from the document-term matrix, which has one-document-per-row and one-term-per-column).

Steps:

* take your text
* put it into a tibble
* convert it into the tidy text format using `unnest_tokens()`
  * punctuation is automatically removed
  * lower case is automatically applied

See [`tidytext`](https://github.com/juliasilge/tidytext) for more info
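A minimal sketch of these steps, added for illustration and using the poem from example 1 (the object names are arbitrary):

``` r
library(tidyverse)
library(tidytext)

# 1. Obtain your textual data: the poem, one line per element
poem <- c(
  "I have eaten", "the plums", "that were in", "the icebox",
  "and which", "you were probably", "saving", "for breakfast",
  "Forgive me", "they were delicious", "so sweet", "and so cold"
)

# 2. Put it into a tibble
poem_df <- tibble(line = seq_along(poem), text = poem)

# 3. Convert to the tidy text format: one token per row,
#    lowercased, punctuation removed
tidy_poem <- poem_df %>%
  unnest_tokens(word, text)

# 4. A first analysis: word frequencies (after dropping stop words)
tidy_poem %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```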
---

## Example 2: Jane Austen example (more complex)

We're going to move on to our next example, using Jane Austen's novels. Now that we have a lot more text to work with, we need to do a bit more cleaning.

--

Recall the following graphic: *the flowchart of a typical text analysis workflow, from Text Mining with R*

---

## Example 2: Jane Austen example (more complex)

Now, if we were to follow and replicate what we did in example one, we'd end up with this:

--

<div class="figure">
<img src="https://www.tidytextmining.com/01-tidy-text_files/figure-html/plotcount-1.png" alt="Histogram of text without stop words" width="50%" />
<p class="caption">Histogram of text without stop words</p>
</div>

---

## Tidy text: Stop words

## **HOW TO ELIMINATE STOP WORDS?**

--

We have a dataset of stop words we can use, but how?!

``` r
library(tidytext)
data(stop_words)
head(stop_words)
```

```
## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>
## 1 a         SMART
## 2 a's       SMART
## 3 able      SMART
## 4 about     SMART
## 5 above     SMART
## 6 according SMART
```

--

ANTI JOIN!!!

---

### Removing stop words with an anti join

``` r
# remove stop words with anti_join()
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
```

---

# COOL!! NOW WHAT?!!!

Once you have your data, you can start looking at the text to get a sense of what's there and what isn't there, and then move in multiple directions:

* What words are and are not there (ex: authorship of the Federalist papers)
* What words are and are not connected to one another (ex: Garkov)

--

* What ideas / themes ... **TOPICS** come up
* The overall feelings (**SENTIMENT**) in the text

---

## Example: building on this (next steps)

.panelset[
.panel[.panel-name[Plot of books]

<div class="figure">
<img src="https://www.tidytextmining.com/02-sentiment-analysis_files/figure-html/sentimentplot-1.png" alt="By book analysis from Austen" width="60%" />
<p class="caption">By book analysis from Austen</p>
</div>

]
.panel[.panel-name[Table of Books]

|book                |word      |   n|        tf|      idf|    tf_idf|
|:-------------------|:---------|---:|---------:|--------:|---------:|
|Sense & Sensibility |elinor    | 623| 0.0051935| 1.791759| 0.0093056|
|Sense & Sensibility |marianne  | 492| 0.0041015| 1.791759| 0.0073488|
|Mansfield Park      |crawford  | 493| 0.0030724| 1.791759| 0.0055050|
|Pride & Prejudice   |darcy     | 373| 0.0030523| 1.791759| 0.0054689|
|Persuasion          |elliot    | 254| 0.0030362| 1.791759| 0.0054401|
|Emma                |emma      | 786| 0.0048821| 1.098612| 0.0053635|
|Northanger Abbey    |tilney    | 196| 0.0025199| 1.791759| 0.0045151|
|Emma                |weston    | 389| 0.0024162| 1.791759| 0.0043293|
|Pride & Prejudice   |bennet    | 294| 0.0024058| 1.791759| 0.0043106|
|Persuasion          |wentworth | 191| 0.0022831| 1.791759| 0.0040908|

]
]

---

# TIME TO FLY FREE! Example 3

In groups: work through the example below and see what you learn.

**Example 3:** How often is each U.S. state mentioned in a popular song? We'll define popular songs as those in Billboard's Year-End Hot 100 from 1958 to the present.

Data: Billboard Year-End Hot 100 (1958-present) and the Census Bureau ACS
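One possible starting point is sketched below (added as a hint, not a solution: the `hot100` data frame and its `lyrics` column are hypothetical placeholders for whatever the in-class data actually provides; `state.name` is R's built-in vector of the 50 state names):

``` r
library(tidyverse)

# count how many times each state name appears across all lyrics
# (hypothetical `hot100` data frame with a `lyrics` column)
state_counts <- map_dfr(state.name, function(state) {
  pattern <- regex(str_c("\\b", state, "\\b"), ignore_case = TRUE)
  tibble(
    state      = state,
    n_mentions = sum(str_count(hot100$lyrics, pattern), na.rm = TRUE)
  )
})

state_counts %>%
  arrange(desc(n_mentions))
```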
---

### Recap: The tidy text format

Examples of basic exploratory text analysis using the R tidyverse. These examples are all in your in-class materials for today and might be useful for your final project and the optional A7.

**Example 1:** from the book, "1.2 Emily Dickinson example"

**Example 2:** from the book, "1.3 Jane Austen example"

**Example 3:** How often is each U.S. state mentioned in a popular song? We'll define popular songs as those in Billboard's Year-End Hot 100 from 1958 to the present. Data: Billboard Year-End Hot 100 (1958-present) and the Census Bureau ACS

**More examples available** in the assigned readings and suggested resources

---

# Recap

* Regex: how we deal with text. We need it to find particular terms and to get a basic understanding of a text.
* Workflow: multiple steps. You need to start thinking about the process; how you handle the complexity of your text can be crucial to your success.

---

## Acknowledgments

The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff's "Computing for the Social Sciences" course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.