class: center, middle, inverse, title-slide .title[ # Data wrangling: relational data ] .author[ ### MACSS 30500
University of Chicago ] --- class: inverse, middle # Data: working with data --- # Agenda * Recap * Relational data * Joins/merges --- ## Recap * Importing data * Tidying data - `pivot_longer` - `pivot_wider` - `separate` - `unite` * Writing data * Other delimiters --- ## Introduction to relational data A **relational database** can be defined as a set of multiple tables that are linked based on data common to them. <!-- Data you need for the analysis is not and cannot be stored in one single table but it is split across tables; usually two but potentially more --> -- * These tables answer research questions only when combined together. * The important elements in the analysis are not defined by individual rows or columns in one table, but they emerge from the relations between tables. --- ## Example Example of a relational database: `library(nycflights13)` from R for Data Science, Chapter 19. Five tables: * `flights` flights info * `airlines` info about the full name of airplane company, identified the career abbreviated code * `airports` info about each airport, identified by the faa airport code * `planes` info about each plane, identified by its tailnum * `weather` info about the weather at each NYC airport for each hour --- ## Example Graphical representation of the relations among the tables in `nycflights13`: <img src="https://r4ds.hadley.nz/diagrams/relational.png" width="60%" /> To understand diagrams like this, remember that each relation always concerns a **pair of tables**. Even if you have several tables in your relational database, relations are defined between a pair. --- ## Formal definitions A **KEY** of a table is a subset of columns (or attributes). There are two types of keys: * **PRIMARY KEY** uniquely identifies an observation in its own table; e.g. `tailnum` is the primary key of the `planes` table * **FOREIGN KEY** matches the primary key of another table; e.g. `tailnum` is a foreign key of the `flights` plane (it appears in the flights table where it matches each flight to a unique plane) <!-- A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airports table. --> -- A relation is defined between a **pair of tables**: a primary key and the corresponding foreign key in another table form a **relation**. <!-- --> --- ## Relations and unique identifiers Relations can be * one-to-one * one-to-many: each flight has one plane, but each plane has many flights * many-to-many You want keys to be UNIQUE identifiers: one per item to help with the merge -- ``` ## # A tibble: 0 × 2 ## # ℹ 2 variables: tailnum <chr>, n <int> ``` -- ``` ## # A tibble: 0 × 2 ## # ℹ 2 variables: tailnum <chr>, n <int> ``` --- # What if I don't have an id? You can make a key either based on a COMBINATION of variables or assign one to rows. There are a number of reasons why we might be better off with a unique identifier within the dataset, but sometimes you work with what you've got! ``` ## # A tibble: 336,776 × 20 ## id year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <int> <dbl> <int> ## 1 1 2013 1 1 517 515 2 830 ## 2 2 2013 1 1 533 529 4 850 ## 3 3 2013 1 1 542 540 2 923 ## 4 4 2013 1 1 544 545 -1 1004 ## 5 5 2013 1 1 554 600 -6 812 ## 6 6 2013 1 1 554 558 -4 740 ## 7 7 2013 1 1 555 600 -5 913 ## 8 8 2013 1 1 557 600 -3 709 ## 9 9 2013 1 1 557 600 -3 838 ## 10 10 2013 1 1 558 600 -2 753 ## # ℹ 336,766 more rows ## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- class: inverse, middle # Basic joins --- # Tools for combining tables: Mutating joins * **inner join**: keeps observations that appear in both tables * **left join**: keeps all observations in left table * **right join**: keeps all observations in right table * **full join**: keeps all observations * AKA **outer join** -- Venn diagram of mutating joins: <img src="https://r4ds.hadley.nz/diagrams/join/venn.png" width="60%" /> --- ## inner_join() Keeps observations that appear in both tables. Unmatched rows: not included in the result .pull-left[ <img src="join-setup.png" width="50%" /> ] -- .pull-right[ <img src="join-inner.png" width="120%" /> ``` inner_join(x, y, by = "key") # with pipes x %>% inner_join(y, by = "key") # if two cols do not have same name inner_join(x, y, by = c("a" = "b")) ``` ] <!-- by convention, x is assigned as the first dataframe or left one, and y as the second or right one; the by argument specifies that we are joining it based on the key column (which you cannot see from the drawing but its the column name of the colored columns in each x and y). Compare this to the left_join() operation which is another form of mutating join --> --- ## left_join() Keeps all observations in the left table (x), even if there is not a match in y .pull-left[ <img src="join-setup.png" width="50%" /> ] .pull-right[ <img src="join-outer-left.png" width="100%" /> ``` left_join(x, y, by = "key") ``` ] --- ## right_join() Keeps all observations in the right table (y), even if there is not a match in x .pull-left[ <img src="join-setup.png" width="50%" /> ] .pull-right[ <img src="join-outer-right.png" width="100%" /> ``` right_join(x, y, by = "key") ``` ] <!-- same thing as left join but reversing the order of the data frame or table typically right join is utilized less because by convention we think as the primary data for these kind of operations as the left or x table --> --- ## full_join() Keeps all observations, matches and non-matches <!-- more missing values --> .pull-left[ <img src="join-setup.png" width="50%" /> ] .pull-right[ <img src="join-outer-full.png" width="100%" /> ``` full_join(x, y, by = "key") ``` ] --- ### Extended practice: flights Suppose we want to fully explore how different this might be. First, to maximize the differences, we're going to look at a subest of our original data. ``` r flights2 <- flights %>% filter(dep_delay > 60) airlines2<- airlines %>% filter(!(carrier %in% c("AA", "US", "VX"))) dim(flights2) ``` ``` ## [1] 26581 19 ``` --- # Join comparision Recall that we are starting with 26581 x 19 ``` r flights2 |> inner_join(airlines2) |> dim() ``` ``` ## [1] 23449 20 ``` ``` r flights2 |> right_join(airlines2) |> dim() ``` ``` ## [1] 23449 20 ``` ``` r flights2 |> left_join(airlines2) |> dim() ``` ``` ## [1] 26581 20 ``` ``` r flights2 |> full_join(airlines2) |> dim() ``` ``` ## [1] 26581 20 ``` --- # Join comparision, cont'd Recall that we are starting with 26581 x 19 ``` r flights2 |> left_join(airlines2) %>% select(name) %>% table(useNA = "always") ``` ``` ## name ## AirTran Airways Corporation Alaska Airlines Inc. ## 314 39 ## Delta Air Lines Inc. Endeavor Air Inc. ## 2651 1966 ## Envoy Air ExpressJet Airlines Inc. ## 1996 6861 ## Frontier Airlines Inc. Hawaiian Airlines Inc. ## 73 10 ## JetBlue Airways Mesa Airlines Inc. ## 4571 79 ## SkyWest Airlines Inc. Southwest Airlines Co. ## 4 1061 ## United Air Lines Inc. <NA> ## 3824 3132 ``` --- class: inverse, middle # Moving to more advanced options for combining tables: Filtering joins --- ## Filtering joins - **semi_join**: keeps all observations in x that have a match in y - **anti_join** drops all observations in x that have a match in y Essentially the filtering operation uses information from the second data frame (y) to filter the first data frame (x). --- ## semi_join() Keeps all observations in x that have a match in y. Only keeps columns from x .pull-left[ <img src="join-setup.png" width="50%" /> ] .pull-right[ <img src="join-semi.png" width="100%" /> ``` semi_join(x, y, by = "key") ``` ] --- ## anti_join() Drops all observations in x that have a match in y. Only keeps columns from x .pull-left[ <img src="join-setup.png" width="50%" /> ] .pull-right[ <img src="join-anti.png" width="100%" /> ``` anti_join(x, y, by = "key") ``` ] --- # SO WHAT??? You will use the left, inner, right, and anti joins a fair amount (most likely) in your work. For the filtering, you can think of it as using a second/outside dataset to let you know what is/not needed. Example: you have a list of governments and a list of sanctions for many sanctions/countries. You can use a filtering join to see who is in the sanction list that is in your country list and an anti-join to see who on your list does NOT have a sanction. --- class: center # Best practices: coding <img src="http://www.phdcomics.com/comics/archive/phd101212s.gif" width="35%" /> --- class: inverse # Takeaways ## Ch 6: Scripts and Projects - Use projects when possible - Don't save your workspace - Use good (informative!) file names -- ## Ch 8: Getting help - Reproducible examples when you are stuck - GOOGLE: stack overflow (plus posting on Ed Discussion!) --- ## Acknowledgments The content of these slides is derived in part from Sabrina Nardin and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.