Transforming: Factors and Dates

---

# R Base Data Structures


---

### R Base Data Structures

R data structures:
* Vectors
* Matrices
* Lists
* Data frames
* Arrays

These data structures can be organized by:
- their dimensions (1d, 2d, or nd)
- whether they are homogeneous (all contents must be of the same type, like atomic vectors and matrices) or heterogeneous (contents can be of different types, like lists and data frames)

Please review:
* Lecture 2 (`introR_lecture`) to define, subset, and manipulate these data structures
* Chapter 20 "Vectors" in R for Data Science

---

### R is fundamentally a vector-based program

So far, we have been using predominantly data frames, which are very common when working with social science data.

However, data frames are not actually the most fundamental type of object in R: **vectors are the ultimate building blocks of objects within R**.

A matrix is made of vectors, a list is made of vectors (a list is still a vector in R, just not an atomic one), data frames are built from lists, etc.

Basically in R either something is a vector, or it's NULL...

---


# Atomic vectors

---

### Types of atomic vectors

Remember: **All values in an atomic vector must be of the same type**.

**Logical**: you have used it every time you use a conditional test or operation (e.g., when you filter a data frame)

``` r
logical_vector <- c(TRUE, TRUE, FALSE, TRUE, NA)  # unquoted: quoted values would be character
```

**Numeric**: can be integer or double (default)

``` r
integer_vector <- c(1L, 5L, 3L, 4L, 12423L)  # the L suffix creates integers
double_vector <- c(4.2, 4, 6, 53.2)          # numbers are doubles by default
```

**Character**: note you can use single or double quotation marks; you just need to be consistent within each string

``` r
character_vector <- c("WOOHOO", "'1,2,3 ready!'", "R", '7/2/2025')
```
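
As a quick check, `typeof()` reports how R stores each vector. The sketch below uses made-up vector names (they are not part of the lecture) and also shows that mixing types inside `c()` silently coerces everything to a single type.

``` r
# Illustrative vectors; the names are hypothetical
lgl <- c(TRUE, FALSE, NA)
int <- c(1L, 5L, 3L)   # the L suffix requests integers
dbl <- c(4.2, 4, 6)
chr <- c("a", "b")

# typeof() returns the storage type of each vector
types <- c(typeof(lgl), typeof(int), typeof(dbl), typeof(chr))
types

# Mixing types coerces to the most flexible type (here, character)
mixed <- c(1, "a", TRUE)
typeof(mixed)
```

This is why a number or a logical value wrapped in quotes ends up in a character vector.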

---

### Scalars

In math a scalar is defined as a single real number. R has no concept of a scalar: **in R, a scalar is simply a vector of length 1**

``` r
# set up a vector x of length 10
(x <- sample(10))
```

```
##  [1] 10  6  5  4  1  8  2  7  9  3
```

``` r
# add 100 to x
x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```

``` r
# add 100 to x: the R way (vector recycling)
x + 100
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```
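
A minimal sketch of the "scalar is a length-1 vector" idea, using a hypothetical object named `single`:

``` r
# A single number is stored as a vector of length 1
single <- 100

n_elements <- length(single)     # 1
is_vec     <- is.vector(single)  # TRUE
same       <- identical(single, c(100))  # c() around one value changes nothing
first      <- single[1]          # it can be indexed like any other vector
```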

---

### Vector Recycling

When two vectors are involved in an operation, **R repeats the elements of the shorter vector to match the length of the longer vector**.

This will work for any vector of any length. For example:

``` r
# x1 is a sequence of numbers from 1 to 2
(x1 <- seq(from = 1, to = 2))
```

```
## [1] 1 2
```

``` r
# x2 is a sequence of numbers from 1 to 10
(x2 <- seq(from = 1, to = 10))
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### Vector Recycling

If we add `x1` and `x2` together, R will do it, but the result might not be what we expect:

``` r
(x1 + x2)
```

```
##  [1]  2  4  4  6  6  8  8 10 10 12
```

The shorter vector, `x1`, is repeated five times to match the length of the longer vector, `x2`.

This behavior is called **vector recycling** and happens automatically in R. You need to pay attention if this is what you intended to do. If not, extend the length of the shorter vector manually first, then add them up.

Note: if the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but prints a warning message.
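
The sketch below illustrates that warning with two hypothetical vectors of length 10 and 3. It captures the warning text with `tryCatch()` and then computes the recycled result with the warning suppressed, so both can be inspected.

``` r
x_long  <- 1:10
x_short <- 1:3   # 10 is not a multiple of 3

# Capture the warning message instead of the result
warning_text <- tryCatch(
  x_long + x_short,
  warning = function(w) conditionMessage(w)
)
warning_text

# R still recycles: x_short is reused as 1 2 3 1 2 3 1 2 3 1
partial <- suppressWarnings(x_long + x_short)
partial
```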

---

### Subsetting vectors: slicing

To subset a vector we use the index location of its elements:

``` r
x <- c("one", "two", "three", "four", "five")
```

``` r
# keep the first element
x[1]

# keep the first through third elements
x[c(1, 2, 3)]   # long way
x[1:3]          # shorter
x[c(seq(1, 3))] # sequence 
x[-c(4:5)]      # negative indexing (values that you do not want to keep)
x[-c(4,5)]      # negative indexing

x[c(-1,2,3)]   # error! do not mix negative and positive subscripts
```

---

### Subset with a logical vector: conditional subsetting

Sometimes, rather than slicing as we did in the previous example, we want to keep certain values based on a **condition**.

This is closer to a filtering operation than to slicing, and it takes two steps:
1. create a logical vector of TRUEs and FALSEs that marks, for each element of the original vector, whether we want to keep it

2. apply that logical vector to the vector we want to subset

---

### Subset with a logical vector: conditional subsetting

Given a vector `x`:

``` r
x <- c(NA, 10, 3, 5, 8, 1, NA)
```

We want to keep all the non-missing values in `x`. To find them we can use `is.na()`, which returns a logical vector of TRUEs and FALSEs.
Notice that `!` negates the output, so we get TRUE for non-missing and FALSE for missing values; we want to keep the TRUEs:

``` r
!is.na(x)
```

```
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Then, we put this expression inside `[]` to apply it to our `x` vector. This says "keep all elements of this vector for which the test is TRUE":

``` r
x[!is.na(x)]
```

```
## [1] 10  3  5  8  1
```

---

### Subset with a logical vector: conditional subsetting

This applies to any kind of conditional test. For example, given the same vector `x`:

``` r
x <- c(NA, 10, 3, 5, 8, 1, NA)
```

We might want to get all even or missing values of `x`. To do so, we first use the modulo operator `%%`, which returns the remainder of a division. The test returns `NA` wherever `x` is missing:

``` r
x %% 2 == 0 
```

```
## [1]    NA  TRUE FALSE FALSE  TRUE FALSE    NA
```

Then, we apply it to our vector `x`:

``` r
x[x %% 2 == 0]
```

```
## [1] NA 10  8 NA
```
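
If the missing values are not wanted, two common base R options are to combine the test with `!is.na()` or to wrap it in `which()`. This is a small sketch; the names `even_only` and `even_which` are made up for illustration.

``` r
x <- c(NA, 10, 3, 5, 8, 1, NA)

# Option 1: require the value to be non-missing AND even
even_only <- x[!is.na(x) & x %% 2 == 0]

# Option 2: which() returns the positions that are TRUE and skips NA
even_which <- x[which(x %% 2 == 0)]

even_only
even_which
```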

---

# Lists

---

## Lists

Lists are another type of vector, but they are not atomic vectors. They differ from atomic vectors in two main ways:

1. They **store heterogeneous elements** (vs. all values in an atomic vector must be of the same type)
2. They **are structured differently** and are created with the `list()` function, not with the `c()` function. Notice the output is different than the output from an atomic vector:

``` r
x <- list(1, 2, 3)
x
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
```

---

## Lists structure

List objects are structured as a list of **independent elements**. Use `str()` to see their structure:

``` r
x <- list(1, 2, 3)
str(x)
```

```
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
```

Here we have a list of length 3, and each of the elements of this list is a numeric atomic vector of length 1.

---

## Lists elements

Unlike atomic vectors, lists can contain **multiple data types**, and we can also name each of them:

``` r
x_named <- list(a = "abc", b = 2, c = c(1, 2, 3))
str(x_named)
```

```
## List of 3
##  $ a: chr "abc"
##  $ b: num 2
##  $ c: num [1:3] 1 2 3
```

Here we have a list of length 3, and each of the elements of this list is a different object: we have a character vector of length 1, one numeric vector of length 1, and one numeric vector of length 3.

---

## Nested lists

You can also store lists inside a list: **nested list structure**.

In this object `z` we have two lists:

``` r
z <- list(list(1, 2), list(3, 4))
str(z)
```

```
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4
```

This is often useful when you interact with APIs to get data from the web (APIs frequently return this type of nested list as output).

---

## Secret lists: data frames!

Notice that we have been using lists extensively in this class: a data frame is a list, and each of its columns is one element of that list:

``` r
library(stevedata)
str(chile88)
```

```
## tibble [2,700 × 8] (S3: tbl_df/tbl/data.frame)
##  $ region: chr [1:2700] "N" "N" "N" "N" ...
##  $ pop   : num [1:2700] 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex   : num [1:2700] 0 0 1 1 1 1 0 1 1 0 ...
##  $ age   : num [1:2700] 65 29 38 49 23 28 26 24 41 41 ...
##  $ educ  : chr [1:2700] "P" "PS" "P" "P" ...
##  $ income: num [1:2700] 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ sq    : num [1:2700] 1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote  : chr [1:2700] "Y" "N" "Y" "N" ...
```

The main difference between a data frame and an ordinary list is that every element (column) of a data frame has to be the same length (a data frame is rectangular).
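
A small base R sketch of this idea, using a hypothetical two-column data frame rather than `chile88`:

``` r
# A tiny, made-up data frame
df <- data.frame(
  region = c("N", "S", "C"),
  age    = c(65, 29, 38)
)

df_is_list <- is.list(df)   # TRUE: a data frame is a list
df_type    <- typeof(df)    # "list"
n_columns  <- length(df)    # 2: one list element per column

# Each column can be pulled out like a list element
age_col <- df[["age"]]

# Columns of different lengths are rejected (3 rows vs. 2 rows)
unequal <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error: columns differ in length"
)
unequal
```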

---

## Subsetting lists

Lists have a more complex structure than vectors, thus subsetting them also requires more attention.

For example, `a` is a list that contains four elements: 
* a numeric vector
* a character vector
* a numeric vector
* a list object which in turns contains two distinct numeric vectors (notice the space in the middle)
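
The sketch below is an assumed reconstruction of a list matching that description (the element names and values are illustrative). It shows the three subsetting operators: `[` returns a smaller list, while `[[` and `$` extract a single element.

``` r
# Hypothetical list `a` with four elements, as described above
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# `[` keeps the list structure: the result is still a list
first_two <- a[1:2]
str(first_two)

# `[[` extracts the element itself (here, a numeric vector)
first_element <- a[[1]]

# `$` is shorthand for `[[` with a name
b_value <- a$b

# Chain `[[` to reach inside the nested list
second_of_d <- a[["d"]][[2]]
```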


---

## Subsetting lists

---

# Factors

---

## Factors

* Used for **categorical (discrete) variables**
* Factors store the values of a categorical variable as integer codes paired with text labels (levels), rather than as characters (e.g., a Likert scale)
* Historically used for efficiency
* Best used to sort categorical variables in an order other than alphabetical (e.g., from 1 to 5)
* The `forcats` package in the `tidyverse` provides tools to manipulate factors
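
To make the "stored as numbers" point concrete, here is a minimal base R sketch with a made-up Likert-style variable. Each value is stored as the integer position of its level.

``` r
# Hypothetical survey answers with an explicit level order
likert <- factor(
  c("agree", "disagree", "neutral", "agree"),
  levels = c("disagree", "neutral", "agree")
)

likert_levels <- levels(likert)      # the labels, in the order we defined
likert_codes  <- as.integer(likert)  # the underlying integer codes
likert_type   <- typeof(likert)      # how R stores the factor

likert_codes
```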

---

## Character vector

Define a character vector with four months and sort it:

``` r
(x1 <- c("Dec", "Apr", "Jan", "Mar"))
```

```
## [1] "Dec" "Apr" "Jan" "Mar"
```

``` r
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

Notice that R sorts character vectors alphabetically by default. As humans, we understand that this is not a very meaningful way to sort months. Instead, we might want to sort months chronologically. To tell R to do that, we need to convert them to a factor.

---

### Step 1: Levels

To convert a character vector to a factor, the first thing to do is to define all possible values that the variable can take. We do so by creating another character vector:

``` r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

---

### Step 2: Factor

We then use `factor()` (base R) or `parse_factor()` (from `readr`) to convert this character vector into a factor and apply the given order to it:

``` r
(y1 <- factor(x1, levels = month_levels))
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

``` r
library(readr) # parse_factor() comes from readr (loaded with the tidyverse)
parse_factor(x1, levels = month_levels)
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

### Step 3: Sort

Finally, we sort the new factor vector `y1`, exactly like we did for the original character vector `x1`:

``` r
# sort y1: chronological order (follows the factor levels)
sort(y1)
```

```
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

``` r
# sort x1: alphabetically sorted
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

---

## Different levels/labels

In another common situation, the categories are stored as numeric codes rather than as character values:

``` r
(x2 <- c(12, 4, 1, 3))
```

```
## [1] 12  4  1  3
```

Define levels and labels separately:

``` r
y2 <- factor(x2,
  levels = seq(from = 1, to = 12),
  labels = month_levels
)
y2
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

## `forcats` package

Provides a suite of tools that solve common problems with factors. Some examples include:

- `fct_reorder()`: Reordering a factor by another variable
- `fct_infreq()`: Reordering a factor by the frequency of values
- `fct_relevel()`: Changing the order of a factor by hand
- `fct_lump()`: Collapsing the least/most frequent values of a factor into “other”

Documentation and Cheat Sheet: https://forcats.tidyverse.org/
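
A minimal sketch of three of these helpers on a tiny, made-up factor (it assumes the `forcats` package is installed). Looking at `levels()` before and after shows that only the level order changes, not the data.

``` r
library(forcats)

# Hypothetical factor: "a" appears once, "b" twice, "c" three times
f <- factor(c("b", "a", "c", "b", "c", "c"))

default_levels <- levels(f)                    # alphabetical by default
by_frequency   <- levels(fct_infreq(f))        # most frequent level first
c_first        <- levels(fct_relevel(f, "c"))  # move "c" to the front by hand
reversed       <- levels(fct_rev(f))           # reverse the current order

by_frequency
```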

---

## `forcats::gss_cat`

``` r
forcats::gss_cat
```

```
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # ℹ 21,473 more rows
```

---
### Ordering bars:

``` r
library(tidyverse) # loads dplyr, ggplot2, and forcats
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```

![](index_files/figure-html/unnamed-chunk-12-1.png)

---

### Recoding factors:

``` r
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  )
```

```
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Independe… Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Republica… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Independe… Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Democrat,… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Democrat,… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Republica… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Independe… Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Democrat,… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Republica… Prot… Sout…       3
## # ℹ 21,473 more rows
```

``` r
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
```

```
## # A tibble: 4 × 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180
```

---

## Summarizing factors: `fct_lump_*`
.panelset[.panel[.panel-name[`lump_n`]

``` r
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 5)) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 6 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
## 4 Other        913
## 5 Christian    689
## 6 Jewish       388
```
]
.panel[.panel-name[`lump_min`]

``` r
gss_cat |>
  mutate(relig = fct_lump_min(relig, min = 800, other_level = "Other")) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 4 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
## 4 Other       1990
```
]
.panel[.panel-name[`lump_prop`]

``` r
gss_cat |>
  mutate(relig = fct_lump_prop(relig, prop = 0.4)) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 2 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Other      10637
```
]
]

---
### Factors and missingness
.panelset[.panel[.panel-name[h1]

``` r
health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

```
## Warning: There were 2 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `min_age = min(age)`.
## ℹ In group 1: `smoker = yes`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
```

```
## # A tibble: 2 × 6
##   smoker     n mean_age min_age max_age sd_age
##   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
## 1 yes        0      NaN     Inf    -Inf   NA  
## 2 no         5       60      34      88   21.6
```

]
.panel[.panel-name[h2]

``` r
health |> 
  group_by(smoker, .drop = TRUE) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

```
## # A tibble: 1 × 6
##   smoker     n mean_age min_age max_age sd_age
##   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
## 1 no         5       60      34      88   21.6
```

]
.panel[.panel-name[flights]

``` r
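flights |>
  filter(day %% 5 == 0) |>
  group_by(day) |>
  summarize(
    # note: despite its name, this is the share of flights that
    # departed at most 60 minutes late
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    # number of flights that arrived 300 or more minutes late
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )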
```

```
## # A tibble: 6 × 3
##     day proportion_delayed count_long_delay
##   <int>              <dbl>            <int>
## 1     5              0.947               21
## 2    10              0.888               60
## 3    15              0.953                5
## 4    20              0.946                6
## 5    25              0.905               19
## 6    30              0.930                8
```

]
]
---
class: center, inverse, middle

# Dates
### BRING YOUR PATIENCE

---
## `lubridate` package

* functions that return the current date (`today()`) or the current date-time (`now()`)
* reformatting options (see table on next page and [linked here](https://r4ds.hadley.nz/datetimes.html#tbl-date-formats))

---
| Type  | Code |             Meaning            |     Example     |
|:-----:|:----:|:------------------------------:|:---------------:|
|  Year | %Y   |          4 digit year          |       2021      |
|       | %y   |          2 digit year          |        21       |
| Month | %m   |             Number             |        2        |
|       | %b   |        Abbreviated name        |       Feb       |
|       | %B   |            Full name           |     February    |
|  Day  | %d   |        One or two digits       |        2        |
|       | %e   |           Two digits           |        02       |
|  Time | %H   |          24-hour hour          |        13       |
|       | %I   |          12-hour hour          |        1        |
|       | %p   |              AM/PM             |        pm       |
|       | %M   |             Minutes            |        35       |
|       | %S   |             Seconds            |        45       |
|       | %OS  | Seconds with decimal component |      45.35      |
|       | %Z   |         Time zone name         | America/Chicago |
|       | %z   |         Offset from UTC        |      +0800      |
| Other | %.   |       Skip one non-digit       |        :        |
|       | %*   |  Skip any number of non-digits |                 |
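
A brief sketch of how a few of these codes are used, written in base R with made-up dates. It sticks to numeric codes (`%Y`, `%y`, `%m`, `%d`, `%H`, `%M`, `%S`) so the results do not depend on the locale. The skip codes `%.` and `%*` appear to be specific to `readr`'s parsers, so they are not used here.

``` r
# Parse a day/month/year string by describing its layout
d <- as.Date("02/07/2025", format = "%d/%m/%Y")

# Reformat the same date using other codes
iso_date   <- format(d, "%Y-%m-%d")  # 4 digit year, month, day
short_year <- format(d, "%y")        # 2 digit year

# Parse a date-time and keep only the 24-hour time
dt <- as.POSIXct("2021-02-02 13:35:45",
                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
clock_time <- format(dt, "%H:%M")

iso_date
clock_time
```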

---
## Formatting dates: bringing together information in tables

``` r
library(nycflights13) # flights data
library(lubridate)    # make_datetime()
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = make_datetime(year, month, day, hour, minute))
```

```
## # A tibble: 336,776 × 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ℹ 336,766 more rows
```

---

### Flights: using time

``` r
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt |> 
  filter(dep_time < ymd(20130102)) |> 
  ggplot(aes(x = dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```

![](index_files/figure-html/unnamed-chunk-22-1.png)
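
The helper `make_datetime_100()` relies on integer division (`%/%`) and the remainder operator (`%%`) to split times stored as `HHMM` numbers. A small base R sketch with made-up departure times:

``` r
# Hypothetical times in HHMM form: 5:17, 5:33, and 14:59
dep_time <- c(517, 533, 1459)

hours   <- dep_time %/% 100  # integer division keeps the hour part
minutes <- dep_time %% 100   # the remainder is the minute part

hours
minutes
```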

---
### Commands: extracting date-time components

* `year()`
* `month()`
* `mday()` (day of the month) 
* `yday()` (day of the year)
* `wday()` (day of the week)
* `hour()`
* `minute()`
* `second()`
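
A minimal sketch of these accessor functions applied to a single date-time (it assumes the `lubridate` package is installed). The value is the first departure shown earlier; `week_start = 7` is passed explicitly so that Sunday counts as day 1.

``` r
library(lubridate)

# 2013-01-01 05:15:00, parsed as UTC by default
departure <- ymd_hms("2013-01-01 05:15:00")

parts <- c(
  year   = year(departure),
  month  = month(departure),
  mday   = mday(departure),                  # day of the month
  yday   = yday(departure),                  # day of the year
  wday   = wday(departure, week_start = 7),  # Tuesday is day 3 when Sunday is 1
  hour   = hour(departure),
  minute = minute(departure),
  second = second(departure)
)
parts
```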

---
## Acknowledgments

The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.