Data wrangling: import and tidy data in R

.title[
# Data wrangling: import and tidy data in R
]
.author[
### MACSS 30500 <br /> University of Chicago
]

---

# Agenda:

* Importing (Ch 7)
* Exporting data (writing to file) (Ch 7)
* Tidy data (intro / beginning of a long, happy relationship)

---

# Importing data in R

---

## base R VS `readr`

To load data into R we need importing functions.

There are a number of them depending on the type of file we want to import (see Chapter 7 of R for Data Science for details).

The most common importing functions are those that read **comma delimited files**. There are two versions of them:

- **base R**: `read.csv()`

- **`readr` package**: `read_csv()`

> They are similar, in that they both import comma delimited files, but one comes from base R, while the other comes from the newest `readr` package (part of the `tidyverse`, like `ggplot2` and `dplyr`). We focus on `read_csv`

---

## `read_csv()`

The `read_csv()` function takes several arguments, see https://readr.tidyverse.org/reference/read_delim.html. For example:

```
read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"))
```
--

The `file` argument must specified, the other arguments can be left as default:

```
library(readr)

# load data into my local R Studio
test <- read_csv(file = "/Users/jclip/Desktop/testdata.csv")

# trick if you're not sure where it is / want to select it
test <- read_csv(file = file.choose())
```

---

## Practice

1. Create a `testdata.csv` file on your desktop, and save it as plain `csv` [Try to have at least three columns, including integers, characters, and maybe a factor/categorical variable. Include at least one NA.]

1. Look at your current directory by typing `getwd()` in the console. That's where R looks at files by default

1. Load the data into R using the `read_csv()` function. Make sure to specify the correct path

1. Modify the `read_csv()` function arguments: set `col_names = FALSE` (default is TRUE). What happens?

---

### Modify `read_csv()` arguments: `col_types`

```
read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"))
```

``` r
# set col_types option 1
test <- read_csv("test.csv", #note: if I don't specify the location, it assumes my working directory
                 col_types = cols(
                   num_teas = col_integer(),
                   days = readr::col_factor( #ASK ME ABOUT THIS!!
                     c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
                     ),
                   favorite = col_character()
                   )
                 )
```

```
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
```

``` r
test
```

```
## # A tibble: 7 × 3
##   num_teas days      favorite
##      <int> <fct>     <chr>   
## 1        5 Monday    matcha  
## 2        3 Tuesday   black   
## 3        2 Wednesday green   
## 4        4 Thursday  chai    
## 5        5 Friday    black   
## 6       NA Saturday  matcha  
## 7       NA Sunday    <NA>
```

``` r
# set col_types option 2
test <- read_csv("test.csv",
                 col_types = ("ifc") #i(integer)f(factor)c(character)
                 )
```

```
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
```

``` r
test
```

---

### Modify `read_csv()` arguments: `na`

```
read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"))
```

Load the data and leave the `na` argument as default. Check your loaded data and notice what happened to the missing values.

Load the data again, but this time modify the `na` argument as follows:

```
test <- read_csv("/Users/jeanclipperton/Library/CloudStorage/Box-Box/Teaching/CFSS/course-site/static/slides/data-wrangling-tidy-data/test.csv",
                na = c("na", "NA")
                 )
```

Check your loaded data and notice what happened to the missing values. Then modify the `na` argument again using the code below, and check:

---

## Other file formats

The `readr` package and other packages include several functions to load almost all possible file formats that you might encounter (when given an option though, choose a csv over other formats). For example:

* **Comma separated csv** use `read_csv()` from the `readr` package
* **Semi column separated csv** use `read_csv2()`from the `readr` package
* **Tab separated files** use `read_tsv()`from the `readr` package
* **RDS** use `readRDS()` or `read_rds()`
* **Excel** use `read_excel()` from the `readxl` package
* **SPSS/Stata** use the`haven` package (several functions)

Cheat Sheet `readr` and `readxl`:
**Help > Cheat Sheets > Browse Cheat Sheets**

---

## `readxl`

From the `readxl` package https://readxl.tidyverse.org/, use `read_excel()` to import excel files:

``` r
library(readxl)
xlsx_example <- readxl_example(path = "datasets.xlsx")
read_excel(path = xlsx_example)
```

```
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows
```

---

## `readxl`

We can specify the specific worksheet by name or position

``` r
excel_sheets(path = xlsx_example)
```

```
## [1] "mtcars"   "chickwts" "quakes"
```

``` r
read_excel(path = xlsx_example, sheet = "chickwts")
```

```
## # A tibble: 71 × 2
##    weight feed     
##     <dbl> <chr>    
##  1    179 horsebean
##  2    160 horsebean
##  3    136 horsebean
##  4    227 horsebean
##  5    217 horsebean
##  6    168 horsebean
##  7    108 horsebean
##  8    124 horsebean
##  9    143 horsebean
## 10    140 horsebean
## # ℹ 61 more rows
```

---

## `haven` and SAS

``` r
library(haven)

read_sas(data_file = system.file("examples", "iris.sas7bdat",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
```

---

## `haven` and SPSS

``` r
read_sav(file = system.file("examples", "iris.sav",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##           <dbl>       <dbl>        <dbl>       <dbl> <dbl+lbl> 
##  1          5.1         3.5          1.4         0.2 1 [setosa]
##  2          4.9         3            1.4         0.2 1 [setosa]
##  3          4.7         3.2          1.3         0.2 1 [setosa]
##  4          4.6         3.1          1.5         0.2 1 [setosa]
##  5          5           3.6          1.4         0.2 1 [setosa]
##  6          5.4         3.9          1.7         0.4 1 [setosa]
##  7          4.6         3.4          1.4         0.3 1 [setosa]
##  8          5           3.4          1.5         0.2 1 [setosa]
##  9          4.4         2.9          1.4         0.2 1 [setosa]
## 10          4.9         3.1          1.5         0.1 1 [setosa]
## # ℹ 140 more rows
```

---

## `haven` and Stata

``` r
read_dta(file = system.file("examples", "iris.dta",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    sepallength sepalwidth petallength petalwidth species
##          <dbl>      <dbl>       <dbl>      <dbl> <chr>  
##  1        5.10       3.5         1.40      0.200 setosa 
##  2        4.90       3           1.40      0.200 setosa 
##  3        4.70       3.20        1.30      0.200 setosa 
##  4        4.60       3.10        1.5       0.200 setosa 
##  5        5          3.60        1.40      0.200 setosa 
##  6        5.40       3.90        1.70      0.400 setosa 
##  7        4.60       3.40        1.40      0.300 setosa 
##  8        5          3.40        1.5       0.200 setosa 
##  9        4.40       2.90        1.40      0.200 setosa 
## 10        4.90       3.10        1.5       0.100 setosa 
## # ℹ 140 more rows
```

---

# Exporting Data from R

---

## `write_csv()`

Similar to the `read_csv()` function used for reading in csv files into R, there is a `write_csv()` function that **generates csv files** from R data frames.

Documentation: https://readr.tidyverse.org/reference/write_delim.html

```
# import
test <- read_csv(file = "/Users/jeanclipperton/Library/CloudStorage/Box-Box/Teaching/CFSS/course-site/static/slides/data-wrangling-tidy-data/test.csv)

# export
write_csv(test, file = "/Users/jeanclipperton/Library/CloudStorage/Box-Box/Teaching/CFSS/course-site/static/slides/data-wrangling-tidy-data/testdata_cleaned.csv")
```

---

# Tidy data

---

## Tidy data

.footnote[Illustrations from the [Openscapes](https://www.openscapes.org/) blog [*Tidy Data for reproducibility, efficiency, and collaboration*](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst)]

<!-- tidy data is a way of standardizing info in a dataframe
but it is not the only way and we are going to see some examples
the opposite of tidy would be messy data or untidy 
the reason why tidy data is popular is because provides a STANDARDIZED form
all packages we have learned so far ggplot, dplyr work with tidy data
which means you can simply load the dataset and start working on it
without reshaping it or cleaning it up (if tidy)
so point here: as soon as you get data and you know u want to work on them
within the tidyverse (ggplot, dplyr etc) get them in a tidy format first, 
then focus with the analyses or anything else u want to do!

NB: this also means if you are working outside tidyverse, in another package 
or basic R you might not need tidy data

are there ? on the tidy structure, i wanna make sure you konw not only 
the definition but also why we emphasize it so much here
-->

---

## Tidy data

---

## Tidy data

---

## Common tidying tasks

* Pivoting
    * Longer
    * Wider
* Separating
* Uniting

We are going to illustrate these tasks with datasets from the readings (chapter 5 R for Data Science). Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organises the values in a different way.

Remember the tidy data principles:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell

---

## Pivot longer

Look at this dataset. Why is it messy/untidy?

``` r
library(tidyverse)
table4a
```

```
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

"Each variable must have its own column". Thus, the column names should be names of variables. Instead, here they are values of a variable: 1999 and 2000 are values of the year variable

"Each observation must have its own row". Here we have one row for every country, but that's not sufficient because this is panel data. We should have the country-year pair to define one observation, rather than only country.

---

## Pivot longer

``` r
table4a
```

```
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

]

``` r
pivot_longer(
  data = table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
```

```
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```

]

---

## Pivot wider

Look at this dataset. Why is it messy/untidy?

``` r
library(tidyverse)
table2
```

```
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
```

]

"Each variable must have its own column". The current values of the type        column are not values but are variables names. 
  
"Each observation must have its own row". Here an observation is scattered across multiple rows: an observation is a country in a year, but each observation is spread across two rows.

]

---

## Pivot wider

``` r
table2
```

]

``` r
pivot_wider(
  data = table2,
  names_from = type,
  values_from = count
)
```

```
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

]

---

## Separating

Look at this dataset. Why is it messy/untidy?

``` r
table5
```

```
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
```

---

## Separating

``` r
table3
```

```
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
```

]

``` r
separate(
  data = table3,
  col = rate,
  into = c(
    "cases",
    "population"
  ),
  convert = TRUE
)
```

```
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

]

---

## Uniting

Look at this dataset. Why is it messy/untidy?

``` r
table5
```

---

## Uniting

``` r
table5
```

]

``` r
unite(
  data = table5,
  col = "year",
  century, year
)
```

```
## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
```

]

---

## Uniting

``` r
table5
```

]

``` r
unite(
  data = table5,
  col = "year",
  century, year,
  # remove underscore
  sep = ""
)
```

```
## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583
```

]

---

## Uniting

``` r
table5
```

]

``` r
unite(
  data = table5,
  col = "year",
  century, year,
  # remove underscore
  sep = ""
) %>%
  # store as numeric
  mutate(year = parse_number(year))
```

---
class: center

# Let's get messy!

[go to our practice repo for today](https://classroom.github.com/a/7WcsKaTn)

---

## Acknowledgments

The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.

I also used our relevant textbook chapter for ideas to riff on [for our in-class example](https://classroom.github.com/a/7WcsKaTn)