class: center, middle, inverse, title-slide

.title[
# Transforming: Factors and Dates
]
.author[
### MACSS 30500
University of Chicago
]

---
class: inverse, middle

# R Base Data Structures

---

### R Base Data Structures

R data structures:

* Vectors
* Matrices
* Lists
* Data frames
* Arrays

These data structures can be organized by:

- their dimensions (1d, 2d, or nd)
- whether they are homogeneous (all contents must be of the same type, like atomic vectors and matrices) or heterogeneous (contents can be of different types, like lists and data frames)

Please review:

* Lecture 2 (`introR_lecture`) to define, subset, and manipulate these data structures
* Chapter 20 "Vectors" in R for Data Science

---

### R is fundamentally a vector-based program

So far, we have been using predominantly data frames, which are very common when working with social science data.

However, data frames are not actually the most fundamental type of object in R: **vectors are the ultimate building blocks of objects within R**.

A matrix is made of vectors, a list is made of vectors (a list is still a vector in R, just not an atomic one), a data frame is a list of equal-length vectors, etc. Basically, in R either something is a vector or it's `NULL`.

---

### R is fundamentally a vector-based program

<img src="https://r4ds.had.co.nz/diagrams/data-structures-overview.png" width="60%" />

<!-- focus today is on atomic vectors and lists -->

---
class: inverse, middle

# Atomic vectors

---

### Types of atomic vectors

Remember: **All values in an atomic vector must be of the same type**.
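A quick sketch of what that rule implies (the vectors below are made up for illustration): if you mix types inside `c()`, R does not raise an error but silently coerces every value to the most flexible type, which you can check with `typeof()`.

``` r
# mixing types inside c() coerces everything to the most flexible type
mixed_num <- c(1L, 2.5)        # integer + double    -> double
mixed_chr <- c(1, "a", TRUE)   # anything + character -> character
mixed_lgl <- c(TRUE, 2L)       # logical + integer   -> integer

typeof(mixed_num)
typeof(mixed_chr)
typeof(mixed_lgl)
```

This is why quoting numbers or logical values by accident matters: once one element is a character string, the whole vector becomes character.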
**Logical**: you have used it every time you use a conditional test or operation (e.g., when you filter a data frame)

``` r
# no quotes: quoted values would create a character vector
logical_vector <- c(TRUE, TRUE, FALSE, TRUE, NA)
```

**Numeric**: can be integer or double (default)

``` r
# the L suffix creates integers; plain numbers are doubles
integer_vector <- c(1L, 5L, 3L, 4L, 12423L)
double_vector <- c(4.2, 4, 6, 53.2)
```

**Character**: note you can use single or double quotes, as long as each string opens and closes with the same kind

``` r
character_vector <- c("WOOHOO", "'1,2,3 ready!'", "R", '7/2/2025')
```

---

### Scalars

In math a scalar is defined as a single real number. R has no concept of a scalar: **in R, a scalar is simply a vector of length 1**

``` r
# set up a vector x of length 10
(x <- sample(10))
```

```
##  [1] 10  6  5  4  1  8  2  7  9  3
```

``` r
# add 100 to x
x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```

``` r
# add 100 to x: the R way (vector recycling)
x + 100
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```

<!-- The second way to add the numbers is more efficient but can also be dangerous...-->

---

### Vector Recycling

When two vectors are involved in an operation, **R repeats the elements of the shorter vector to match the length of the longer vector**. This will work for any vector of any length. For example:

``` r
# x1 is a sequence of numbers from 1 to 2
(x1 <- seq(from = 1, to = 2))
```

```
## [1] 1 2
```

``` r
# x2 is a sequence of numbers from 1 to 10
(x2 <- seq(from = 1, to = 10))
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### Vector Recycling

If we add `x1` and `x2` together, R will do it, but the result might not be what we expect:

``` r
(x1 + x2)
```

```
##  [1]  2  4  4  6  6  8  8 10 10 12
```

The shorter vector, `x1`, is repeated five times in order to match the length of the longer vector `x2`.

This behavior is called **vector recycling** and happens automatically in R. Always check whether this is what you intended to do.
If not, extend the length of the shorter vector manually first, then add them up. Note that if the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but prints a warning message.

---

### Subsetting vectors: slicing

To subset a vector we use the index location of its elements:

``` r
x <- c("one", "two", "three", "four", "five")
```

```
# keep the first element
x[1]

# keep the first through third elements
x[c(1, 2, 3)]    # long way
x[1:3]           # shorter
x[c(seq(1, 3))]  # sequence

x[-c(4:5)]       # negative indexing (values that you do not want to keep)
x[-c(4,5)]       # negative indexing
x[c(-1,2,3)]     # error! do not mix negative and positive subscripts
```

---

### Subset with a logical vector: conditional subsetting

Sometimes, rather than slicing as in the previous example, we want to keep certain values based on a **condition**.

This is closer to a filtering operation (vs. slicing), and it takes two steps:

1. create a logical vector of TRUEs and FALSEs that identifies, for each element of the original vector, whether we want to keep it
2. apply that logical vector to the vector we want to subset

---

### Subset with a logical vector: conditional subsetting

Given a vector `x`:

``` r
x <- c(NA, 10, 3, 5, 8, 1, NA)
```

We want to keep all the non-missing values in `x`. To find them we can use `is.na()`. This function outputs a logical vector of TRUEs and FALSEs. Notice that `!` negates the output, so we get TRUE for non-missing and FALSE for missing values. We want to keep the TRUEs:

``` r
!is.na(x)
```

```
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Then, we put the test inside `[]` to apply it to our `x` vector. This says "keep all elements for which the test is TRUE":

``` r
x[!is.na(x)]
```

```
## [1] 10  3  5  8  1
```

---

### Subset with a logical vector: conditional subsetting

This applies to any kind of conditional test. For example, given the same vector `x`:

``` r
x <- c(NA, 10, 3, 5, 8, 1, NA)
```

We might want to get all even or missing values of `x`.
To do so, we first use the modulo operator `%%` (the test returns `NA` for missing values):

``` r
x %% 2 == 0
```

```
## [1]    NA  TRUE FALSE FALSE  TRUE FALSE    NA
```

Then, we apply it to our vector `x`:

``` r
x[x %% 2 == 0]
```

```
## [1] NA 10  8 NA
```

---
class: inverse, middle

# Lists

---

## Lists

Lists are another type of vector, but they are not atomic vectors. They differ from atomic vectors in two main ways:

1. They **store heterogeneous elements** (vs. all values in an atomic vector must be of the same type)
2. They **are structured differently** and are created with the `list()` function, not with the `c()` function.

Notice the output is different from the output of an atomic vector:

``` r
x <- list(1, 2, 3)
x
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
```

---

## Lists structure

List objects are structured as a list of **independent elements**. Use `str()` to see their structure:

``` r
x <- list(1, 2, 3)
str(x)
```

```
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
```

Here we have a list of length 3, and each of the elements of this list is a numeric atomic vector of length 1.

---

## Lists elements

Unlike atomic vectors, lists can contain **multiple data types**, and we can also name each of them:

``` r
x_named <- list(a = "abc", b = 2, c = c(1, 2, 3))
str(x_named)
```

```
## List of 3
##  $ a: chr "abc"
##  $ b: num 2
##  $ c: num [1:3] 1 2 3
```

Here we have a list of length 3, and each of the elements of this list is a different object: one character vector of length 1, one numeric vector of length 1, and one numeric vector of length 3.

---

## Nested lists

You can also store lists inside a list: **nested list structure**.
In this object `z`, the outer list contains two lists:

``` r
z <- list(list(1, 2), list(3, 4))
str(z)
```

```
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4
```

This is often useful when you interact with APIs to get data from the web (API responses frequently arrive as this type of nested list).

---

## Secret lists: data frames!

Notice, we have been using lists extensively in the class. A data frame is a list, and each column is one element of that list:

``` r
library(stevedata)
str(chile88)
```

```
## tibble [2,700 × 8] (S3: tbl_df/tbl/data.frame)
##  $ region: chr [1:2700] "N" "N" "N" "N" ...
##  $ pop   : num [1:2700] 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex   : num [1:2700] 0 0 1 1 1 1 0 1 1 0 ...
##  $ age   : num [1:2700] 65 29 38 49 23 28 26 24 41 41 ...
##  $ educ  : chr [1:2700] "P" "PS" "P" "P" ...
##  $ income: num [1:2700] 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ sq    : num [1:2700] 1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote  : chr [1:2700] "Y" "N" "Y" "N" ...
```

The main difference between a data frame and a general list is that every element (column) of a data frame has to have the same length (a data frame is rectangular).

---

## Subsetting lists

Lists have a more complex structure than atomic vectors, so subsetting them also requires more attention.
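A minimal sketch of the three subsetting operators, using a small made-up list (`a_demo` is a hypothetical name chosen for this illustration): `[` returns a smaller list, while `[[` and `$` extract a single element from the list.

``` r
# hypothetical example list with four named elements
a_demo <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

a_demo[1:2]          # `[` keeps the list structure: a list of 2 elements
a_demo[["a"]]        # `[[` extracts one element: the integer vector 1:3
a_demo$a             # `$` is shorthand for `[[` with a name
a_demo[["d"]][[1]]   # chain `[[` to reach inside a nested list
```

A useful check when you are unsure which operator you need is `str()`: if the result still says "List of ...", you used `[`.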
.pull-left[
For example, `a` is a list that contains four elements:

* a numeric vector
* a character vector
* a numeric vector
* a list object which in turn contains two distinct numeric vectors (notice the space in the middle)
]

.pull-right[
<img src="lists-subsetting-a-only.png" width="30%" />
]

---

## Subsetting lists

<img src="https://r4ds.had.co.nz/diagrams/lists-subsetting.png" width="50%" />

---
class: inverse, middle

# Factors

---

## Factors

* Used for **categorical (discrete) variables**
* Factors store the values of a categorical variable as integer codes with labels (levels) rather than as plain characters (e.g., a Likert scale)
* Historically used for purposes of efficiency
* Most useful when categories should be sorted in an order other than alphabetical (e.g., 1 to 5)
* Use `forcats` (part of the `tidyverse`) to manipulate factors

---

## Character vector

Define a character vector with four months and sort it:

``` r
(x1 <- c("Dec", "Apr", "Jan", "Mar"))
```

```
## [1] "Dec" "Apr" "Jan" "Mar"
```

``` r
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

Notice that R sorts character vectors alphabetically by default. As humans, we understand that this is not a very meaningful way to sort months. Instead, we might want to sort months chronologically. To tell R to do that, we need to convert them to a factor.

<!-- note we use sort() because this is a standalone vector, while we used arrange() when working with dataframes -->

---

### Step 1: Levels

To convert a character vector to a factor, the first thing to do is to define all possible values that the variable can take.
We do so by creating another character vector:

``` r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

---

### Step 2: Factor

We then use the `factor()` function (base R) or the `parse_factor()` function (from `readr`) to convert the original character vector `x1` into a factor, and apply the given order to it:

``` r
(y1 <- factor(x1, levels = month_levels))
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

``` r
parse_factor(x1, levels = month_levels)
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

### Step 3: Sort

Finally, we sort the new factor vector `y1`, exactly like we did for the original character vector `x1`:

``` r
# sort y1: sorted chronologically
sort(y1)
```

```
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

``` r
# sort x1: sorted alphabetically
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

---

## Different levels/labels

Another situation you might encounter is that, rather than character vectors, you are given their numerical representation:

``` r
(x2 <- c(12, 4, 1, 3))
```

```
## [1] 12  4  1  3
```

Define levels and labels separately:

``` r
y2 <- factor(x2,
  levels = seq(from = 1, to = 12),
  labels = month_levels
)
y2
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

## `forcats` package

Provides a suite of tools that solve common problems with factors.
Some examples include:

- `fct_reorder()`: Reordering a factor by another variable
- `fct_infreq()`: Reordering a factor by the frequency of values
- `fct_relevel()`: Changing the order of a factor by hand
- `fct_lump()`: Collapsing the least/most frequent values of a factor into “other”

Documentation and Cheat Sheet: https://forcats.tidyverse.org/

<!-- show hw3 work with factors -->

---

## `forcats::gss_cat`

``` r
forcats::gss_cat
```

```
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # ℹ 21,473 more rows
```

---

### Ordering bars:

``` r
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```

---

### Recoding factors:

.panelset[
.panel[.panel-name[recode]

``` r
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  )
```

```
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Independe… Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Republica… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Independe… Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Democrat,… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Democrat,… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Republica… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Independe… Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Democrat,… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Republica… Prot… Sout…       3
## # ℹ 21,473 more rows
```
]

.panel[.panel-name[collapse]

``` r
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
```

```
## # A tibble: 4 × 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180
```
]
]

---

## Summarizing factors: `fct_lump_*`

.panelset[.panel[.panel-name[`lump_n`]

``` r
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 5)) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 6 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
## 4 Other        913
## 5 Christian    689
## 6 Jewish       388
```
]

.panel[.panel-name[`lump_min`]

``` r
gss_cat |>
  mutate(relig = fct_lump_min(relig, min = 800, other_level = "Other")) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 4 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
## 4 Other       1990
```
]

.panel[.panel-name[`lump_prop`]

``` r
gss_cat |>
  mutate(relig = fct_lump_prop(relig, prop = 0.4)) |>
  count(relig, sort = TRUE)
```

```
## # A tibble: 2 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Other      10637
```
]
]

---

### Factors and missingness

.panelset[.panel[.panel-name[health]

``` r
health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34, 88, 75, 47, 56)
)

# .drop = FALSE keeps factor levels that have no observations ("yes")
health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

```
## Warning: There were 2 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `min_age = min(age)`.
## ℹ In group 1: `smoker = yes`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
```

```
## # A tibble: 2 × 6
##   smoker     n mean_age min_age max_age sd_age
##   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
## 1 yes        0      NaN     Inf    -Inf   NA  
## 2 no         5       60      34      88   21.6
```
]

.panel[.panel-name[h2]

``` r
# .drop = TRUE (the default) drops the empty "yes" level
health |>
  group_by(smoker, .drop = TRUE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

```
## # A tibble: 1 × 6
##   smoker     n mean_age min_age max_age sd_age
##   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
## 1 no         5       60      34      88   21.6
```
]

.panel[.panel-name[flights]

``` r
flights |>
  filter(day %% 5 == 0) |>
  group_by(day) |>
  summarize(
    # share of flights that departed no more than 60 minutes late
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    # number of flights that arrived at least 300 minutes late
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
```

```
## # A tibble: 6 × 3
##     day proportion_delayed count_long_delay
##   <int>              <dbl>            <int>
## 1     5              0.947               21
## 2    10              0.888               60
## 3    15              0.953                5
## 4    20              0.946                6
## 5    25              0.905               19
## 6    30              0.930                8
```
]
]

---
class: center, inverse, middle

# Dates

### BRING YOUR PATIENCE

---

## `lubridate` package

* commands to produce today's date (e.g.
`today()` or `now()`)
* reformatting options (see table on next page and [linked here](https://r4ds.hadley.nz/datetimes.html#tbl-date-formats))

---

| [Type](https://r4ds.hadley.nz/datetimes.html#tbl-date-formats) | Code | Meaning | Example |
|:-----:|:----:|:------------------------------:|:---------------:|
| Year  | %Y   | 4 digit year                   | 2021            |
|       | %y   | 2 digit year                   | 21              |
| Month | %m   | Number                         | 2               |
|       | %b   | Abbreviated name               | Feb             |
|       | %B   | Full name                      | February        |
| Day   | %d   | One or two digits              | 2               |
|       | %e   | Two digits                     | 02              |
| Time  | %H   | 24-hour hour                   | 13              |
|       | %I   | 12-hour hour                   | 1               |
|       | %p   | AM/PM                          | pm              |
|       | %M   | Minutes                        | 35              |
|       | %S   | Seconds                        | 45              |
|       | %OS  | Seconds with decimal component | 45.35           |
|       | %Z   | Time zone name                 | America/Chicago |
|       | %z   | Offset from UTC                | +0800           |
| Other | %.   | Skip one non-digit             | :               |
|       | %*   | Skip any number of non-digits  |                 |

---

## Formatting dates: bringing together information in tables

``` r
library(nycflights13)

# make_datetime() comes from lubridate and builds a date-time from its parts
flights |>
  select(year, month, day, hour, minute) |>
  mutate(departure = make_datetime(year, month, day, hour, minute))
```

```
## # A tibble: 336,776 × 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ℹ 336,766 more rows
```

---

### Flights: using time

``` r
# times are stored as integers such as 517 (5:17 am):
# integer division gives the hour, the remainder gives the minute
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights |>
  filter(!is.na(dep_time), !is.na(arr_time)) |>
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |>
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt |>
  filter(dep_time < ymd(20130102)) |>
  ggplot(aes(x = dep_time)) +
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```

---

### Commands: extracting date-time components

These `lubridate` accessor functions pull out individual components of a date-time, such as the ones built with `make_datetime()`:

* `year()`
* `month()`
* `mday()` (day of the month)
* `yday()` (day of the year)
* `wday()` (day of the week)
* `hour()`
* `minute()`
* `second()`

---

## Acknowledgments

The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.