Data wrangling: relational data

---

# Data: working with data

---
# Agenda

* Recap
* Relational data
* Joins/merges

---
## Recap

* Importing data
* Tidying data
  - `pivot_longer`
  - `pivot_wider`
  - `separate`
  - `unite`
* Writing data
* Other delimiters

---

## Introduction to relational data

A **relational database** can be defined as a set of multiple tables that are linked based on data common to them.

* These tables answer research questions only when combined together.

* The important elements in the analysis are not defined by individual rows or columns in one table, but they emerge from the relations between tables.

---

## Example

Example of a relational database: `library(nycflights13)` from R for Data Science, Chapter 19.

Five tables: 
* `flights` flights info
* `airlines` info about the full name of airplane company, identified the career abbreviated code
* `airports` info about each airport, identified by the faa airport code
* `planes` info about each plane, identified by its tailnum
* `weather` info about the weather at each NYC airport for each hour

---

## Example

Graphical representation of the relations among the tables in `nycflights13`:

To understand diagrams like this, remember that each relation always concerns a **pair of tables**. Even if you have several tables in your relational database, relations are defined between a pair.

---

## Formal definitions

A **KEY** of a table is a subset of columns (or attributes). There are two types of keys:

* **PRIMARY KEY**
uniquely identifies an observation in its own table; e.g. `tailnum` is the primary key of the `planes` table

* **FOREIGN KEY**
matches the primary key of another table; e.g. `tailnum` is a foreign key of the `flights` plane (it appears in the flights table where it matches each flight to a unique plane)

A relation is defined between a **pair of tables**: a primary key and the corresponding foreign key in another table form a **relation**.

<!--

-->

---
## Relations and unique identifiers

Relations can be
* one-to-one
* one-to-many: each flight has one plane, but each plane has many flights
* many-to-many

You want keys to be UNIQUE identifiers: one per item to help with the merge
--

```
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
```

---
# What if I don't have an id?

You can make a key either based on a COMBINATION of variables or assign one to rows. There are a number of reasons why we might be better off with a unique identifier within the dataset, but sometimes you work with what you've got!

```
## # A tibble: 336,776 × 20
##       id  year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     1  2013     1     1      517            515         2      830
##  2     2  2013     1     1      533            529         4      850
##  3     3  2013     1     1      542            540         2      923
##  4     4  2013     1     1      544            545        -1     1004
##  5     5  2013     1     1      554            600        -6      812
##  6     6  2013     1     1      554            558        -4      740
##  7     7  2013     1     1      555            600        -5      913
##  8     8  2013     1     1      557            600        -3      709
##  9     9  2013     1     1      557            600        -3      838
## 10    10  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

# Basic joins

---

#  Tools for combining tables:  Mutating joins

* **inner join**: keeps observations that appear in both tables
* **left join**: keeps all observations in left table
* **right join**: keeps all observations in right table
* **full join**: keeps all observations
  * AKA **outer join** 
--

Venn diagram of mutating joins:

---

## inner_join()

Keeps observations that appear in both tables. Unmatched rows: not included in the result

]

```
inner_join(x, y, by = "key")

# with pipes
x %>% 
  inner_join(y, by = "key")
  
# if two cols do not have same name
inner_join(x, y, by = c("a" = "b")) 
  
```

]

---

## left_join()

Keeps all observations in the left table (x), even if there is not a match in y

]

```
left_join(x, y, by = "key")
```

]

---

## right_join()

Keeps all observations in the right table (y), even if there is not a match in x

]

```
right_join(x, y, by = "key")
```

]

---

## full_join()

Keeps all observations, matches and non-matches

]

```
full_join(x, y, by = "key")
```

]

---

### Extended practice: flights

Suppose we want to fully explore how different this might be. First, to maximize the differences, we're going to look at a subest of our original data.

``` r
flights2 <- flights %>% filter(dep_delay > 60)
airlines2<- airlines %>% filter(!(carrier %in% c("AA", "US", "VX")))
dim(flights2)
```

```
## [1] 26581    19
```

---
# Join comparision
Recall that we are starting with 26581 x 19

``` r
flights2 |> inner_join(airlines2) |> dim()
```

```
## [1] 23449    20
```

``` r
flights2 |> right_join(airlines2) |> dim()
```

```
## [1] 23449    20
```

``` r
flights2 |> left_join(airlines2) |> dim()
```

```
## [1] 26581    20
```

``` r
flights2 |> full_join(airlines2) |> dim()
```

```
## [1] 26581    20
```
---
# Join comparision, cont'd
Recall that we are starting with 26581 x 19

``` r
flights2 |>
    left_join(airlines2) %>% select(name) %>% table(useNA = "always")
```

```
## name
## AirTran Airways Corporation        Alaska Airlines Inc. 
##                         314                          39 
##        Delta Air Lines Inc.           Endeavor Air Inc. 
##                        2651                        1966 
##                   Envoy Air    ExpressJet Airlines Inc. 
##                        1996                        6861 
##      Frontier Airlines Inc.      Hawaiian Airlines Inc. 
##                          73                          10 
##             JetBlue Airways          Mesa Airlines Inc. 
##                        4571                          79 
##       SkyWest Airlines Inc.      Southwest Airlines Co. 
##                           4                        1061 
##       United Air Lines Inc.                        <NA> 
##                        3824                        3132
```

---

#  Moving to more advanced options for combining tables: Filtering joins

---

## Filtering joins

- **semi_join**: keeps all observations in x that have a match in y
- **anti_join** drops all observations in x that have a match in y

Essentially the filtering operation uses information from the second data frame (y) to filter the first data frame (x).

---

## semi_join()

Keeps all observations in x that have a match in y. Only keeps columns from x

]

```
semi_join(x, y, by = "key")
```

]

---

## anti_join()

Drops all observations in x that have a match in y. Only keeps columns from x

]

```
anti_join(x, y, by = "key")
```
]

---
# SO WHAT???

You will use the left, inner, right, and anti joins a fair amount (most likely) in your work. For the filtering, you can think of it as using a second/outside dataset to let you know what is/not needed.

Example: you have a list of governments and a list of sanctions for many sanctions/countries. You can use a filtering join to see who is in the sanction list that is in your country list and an anti-join to see who on your list does NOT have a sanction.

---
class: center
# Best practices: coding

<img src="http://www.phdcomics.com/comics/archive/phd101212s.gif" width="35%" />
---
class: inverse 
# Takeaways

## Ch 6: Scripts and Projects
- Use projects when possible
- Don't save your workspace
- Use good (informative!) file names

## Ch 8: Getting help

- Reproducible examples when you are stuck
- GOOGLE: stack overflow (plus posting on Ed Discussion!)

---
## Acknowledgments

The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.