class: center, middle, inverse, title-slide .title[ # Visualizations and the Grammar of Graphics ] .author[ ### MACSS 30500
University of Chicago ]

---

# WELCOME!

Agenda:

* Intro / recap
* Grammar of Graphics
* Succeeding in this class and getting started

---
class: middle

# Intro: recap

**Professor: Jean Clipperton,** clipperton@uchicago.edu, OH: Mon/T/W/Th 1:00-2:00 on Zoom

**TA: Sarthak Dhanke,** sarthakdhanke@uchicago.edu, OH: daily, 4:00pm-5:30pm on Zoom

---

# Working with R:

* RStudio makes life nicer
* Setup: projects and connecting to GitHub
* Install each package once before you can use it (`install.packages('package_name')`)
* Load the package with `library(package_name)` in every session where you want to use it

---

# Tour of RStudio

* Become familiar with RStudio
* Loading packages vs installing packages
* Various panes / information

---

## Markdown: md_document vs html_document

* The output type controls how the final version is formatted
* Output: html is 'prettier' (use for reports or making a webpage) but markdown renders nicely on GitHub
* You can switch back and forth!

---

# Practice with git

* Accept this assignment [https://classroom.github.com/a/A_UexHRZ](https://classroom.github.com/a/A_UexHRZ)
* Go to your new repo
* Copy the link
* In terminal/shell, navigate to where you want the repo to live, then run: `git clone <url>`

---

# Terminal/shell

[see class page here](https://cfss-macss.netlify.app/setup/shell/)

* key commands:
  * `pwd` (print working directory)
  * `ls` (I remember as 'list stuff')
  * `cd` (change directory)
  * `cd ..` (moves you one level up)
  * `cd` followed by Tab can autofill directory names

---

# In RStudio!

* Accept this assignment [https://classroom.github.com/a/A_UexHRZ](https://classroom.github.com/a/A_UexHRZ)
* Go to your new repo
* Copy the link
* In RStudio (top left): go to File > New Project > Version Control > Git, then navigate to where you want the project and fill in the relevant details (folder name, git url!)

---

# Workflow: what is workflow?

* 'pattern' of work: how you go about setting things up
* every assignment (and future project!)
will have a similar pattern: * 'accept' the assignment on github * go to your new repo * copy the repo link * in terminal navigate to where you want, then: git clone (url) --- class: center, inverse, middle # Getting started with ggplot2 --- # R: advantage: plotting * When working with data, you can visualize and summarize it to get a deeper sense of what's going on. <img src="index_files/figure-html/unnamed-chunk-1-1.png" width="50%" /> --- class: inverse, middle # Visualizing --- class: center, middle Consider the following 13 datasets: <table> <thead> <tr> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> ID </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(N\)` </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(\bar{X}\)` </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(\bar{Y}\)` </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(\sigma_{X}\)` </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(\sigma_{Y}\)` </th> <th style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> `\(R\)` </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26610 </td> <td style="text-align:right;"> 47.83472 </td> <td style="text-align:right;"> 16.76983 </td> <td style="text-align:right;"> 26.93974 </td> <td style="text-align:right;"> -0.0641284 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 2 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26873 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83082 </td> <td 
style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.76924 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93573 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0685864 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26732 </td> <td style="text-align:right;"> 47.83772 </td> <td style="text-align:right;"> 16.76001 </td> <td style="text-align:right;"> 26.93004 </td> <td style="text-align:right;"> -0.0683434 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 4 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26327 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83225 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.76514 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93540 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0644719 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26030 </td> <td style="text-align:right;"> 47.83983 </td> <td style="text-align:right;"> 16.76774 </td> <td style="text-align:right;"> 26.93019 </td> <td style="text-align:right;"> -0.0603414 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 6 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26144 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83025 </td> <td 
style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.76590 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93988 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0617148 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26881 </td> <td style="text-align:right;"> 47.83545 </td> <td style="text-align:right;"> 16.76670 </td> <td style="text-align:right;"> 26.94000 </td> <td style="text-align:right;"> -0.0685042 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 8 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26785 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83590 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.76676 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93610 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0689797 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26588 </td> <td style="text-align:right;"> 47.83150 </td> <td style="text-align:right;"> 16.76885 </td> <td style="text-align:right;"> 26.93861 </td> <td style="text-align:right;"> -0.0686092 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 10 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26734 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83955 </td> <td 
style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.76896 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93027 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0629611 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26993 </td> <td style="text-align:right;"> 47.83699 </td> <td style="text-align:right;"> 16.76996 </td> <td style="text-align:right;"> 26.93768 </td> <td style="text-align:right;"> -0.0694456 </td> </tr> <tr> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 12 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 142 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 54.26692 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 47.83160 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 16.77000 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> 26.93790 </td> <td style="text-align:right;background-color: rgba(128, 0, 0, 255) !important;"> -0.0665752 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 54.26015 </td> <td style="text-align:right;"> 47.83972 </td> <td style="text-align:right;"> 16.76996 </td> <td style="text-align:right;"> 26.93000 </td> <td style="text-align:right;"> -0.0655833 </td> </tr> </tbody> </table> --- class: center, middle If we estimate linear regression models for each dataset, we obtain virtually identical coefficients, again suggesting the relationships are identical <img src="index_files/figure-html/datasaurus-lm-1.png" width="50%" /> --- class: center, middle But what happens if we draw a picture? 
<img src="index_files/figure-html/datasaurus-graph-1.gif" width="60%" />

---
class: center, middle

These 13 datasets have virtually the same summary statistics, yet they are drastically different in appearance!

<img src="index_files/figure-html/datasaurus-graph-static-1.png" width="60%" />

---
class: inverse, middle

# The Grammar of Graphics

---

## Grammar and Grammar of Graphics

> A **Grammar** can be broadly defined as the whole system and structure of a language or of languages in general, usually taken as consisting of syntax and morphology (including inflections) and sometimes also phonology and semantics. It is what makes communication possible.

> Applied to visualizations, a **Grammar of Graphics** is a grammar that makes it possible to create a wide range of statistical graphics!

---

### Grammar of Graphics

* A grammar used to create a wide range of statistical graphics
* Grammar of graphics approach: implemented in **[`ggplot2`](https://cran.r-project.org/web/packages/ggplot2/index.html)**, a widely used graphics library for R
* ggplot2 is part of the **[`tidyverse`](https://www.tidyverse.org/)**, a collection of R packages designed for data science that share the same grammar and data structures. We will learn how to use multiple packages from the tidyverse in this course.

---
class: inverse, middle

# Main components of the Grammar of Graphics

> Go to "The Grammar of Graphics" notes on our website to follow along

---

# Grammar of Graphics: the layer cake approach

.footnote[*Thanks to Jennifer Lin for this metaphor*]

* graph layer +
* data layer +
* label layer +
* theme layer +
* other layer +

---

# Data: Gapminder and other data

Gapminder data cover multiple countries over multiple years and include information on life expectancy, population, and GDP per capita. This is one of multiple 'sample' datasets available for R (others include iris and mtcars). It's neat because it is easy to use: install it once with `install.packages("gapminder")`, load it with `library(gapminder)`, and then (optionally) call `data(gapminder)`.
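The loading steps described above can be sketched as follows. This is a minimal sketch that assumes the gapminder package is already installed; the install line is commented out so it does not rerun every time the chunk is executed.

```r
# install.packages("gapminder")  # run once per machine, not every session
library(gapminder)               # attach the package for this session

data(gapminder)                  # optional: shows the data in the Environment pane
head(gapminder)                  # first six rows
str(gapminder)                   # column names and types
```

The columns used later in these slides are `country`, `year`, `pop`, `lifeExp`, and `gdpPercap`.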
---

# Basic graph: preloaded data

Let's start with mtcars (you will see MANY EXAMPLES using this on Stack Overflow). In a fresh Rmd file, create a code chunk that calls `data(mtcars)`.

``` r
data(mtcars)
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---

# Where we're going: a somewhat prettier graph

<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="50%" />

---

# ggplot layer

This layer sets up the graph itself. Note that if you only set the data and the axes, as I do here, you basically get a blank slate.

--

I have told it what I want for the data and my respective axes but I haven't actually plotted anything yet!

``` r
library(ggplot2)  # load once per session

ggplot(mtcars, aes(x = mpg, y = wt))
```

<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="50%" />

---

# graph layer

Then, we just add each element with a `+` after. Here, we have a few options for how we want the "things" on the graph to appear.

Option 1: points (note, there are multiple ways we could choose to set this up).

``` r
ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point()
```

<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="50%" />

---

# graph layer

Then, we just add each element with a `+` after. Here, we have a few options for how we want the "things" on the graph to appear.

Option 2: text (note, there are multiple ways we could choose to set this up).

``` r
ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_text(label = rownames(mtcars), check_overlap = TRUE)
```

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="50%" />

---

# points (intermediate / advanced) layer

(back to points!) Here, our points vary in color, shape, and size according to different variables.
``` r
ggplot(mtcars) +
  geom_point(aes(x = mpg, y = wt,
                 color = factor(cyl),
                 shape = factor(cyl),
                 size = hp))
```

<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="50%" />

---

# label layer

``` r
ggplot(mtcars) +
  geom_point(aes(x = mpg, y = wt,
                 color = factor(cyl),
                 shape = factor(cyl),
                 size = hp)) +
  labs(title = "Car MPG vs weight",
       x = 'Miles per gallon',
       y = 'Car weight',
       shape = "Cylinders",
       size = "Horsepower",
       color = "Cylinders",
       caption = "Source: mtcars")
```

<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="50%" />

---

# theme layer

.pull-left[
``` r
ggplot(mtcars) +
  geom_point(aes(x = mpg, y = wt,
                 color = factor(cyl),
                 shape = factor(cyl),
                 size = hp)) +
  labs(title = "Car MPG vs weight",
       x = 'Miles per gallon',
       y = 'Car weight',
       shape = "Cylinders",
       size = "Horsepower",
       color = "Cylinders",
       caption = "Source: mtcars") +
  theme_bw()
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="90%" />
]

---

# Final version!

.pull-left[
``` r
ggplot(mtcars) +
  geom_point(aes(x = mpg, y = wt,
                 color = factor(cyl),
                 shape = factor(cyl),
                 size = hp)) +
  labs(title = "Car MPG vs weight",
       x = 'Miles per gallon',
       y = 'Car weight',
       shape = "Cylinders",
       size = "Horsepower",
       color = "Cylinders",
       caption = "Source: mtcars") +
  theme_bw() +
  scale_color_manual(values = c("black", "gray", "blue"))
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="90%" />
]

---

# Basic graph: gapminder

> Which is the best name for our dataset?

a) df or dat

b) gapminder

c) gapminder_2007

d) (something else)

---

# Basic graph: gapminder

(NOTE: BAD CODE (why?))

``` r
data(gapminder) # note: not necessary but this will have it show up in your environment like with 'regular' data
```

--

### Why is this code better?

``` r
data(gapminder) # note: not necessary but this will
                # have it show up in your environment
                # like with 'regular' data
```

--

What might make a good `x` or `y`?
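One possible answer, as a sketch rather than the only reasonable choice: put GDP per capita on `x` and life expectancy on `y`, the same pairing used in the animated gapminder example in these slides. This assumes ggplot2 and gapminder are installed. The object name `gapminder_2007` is borrowed from the naming quiz, and filtering to the year 2007 is an illustrative assumption.

```r
library(ggplot2)
library(gapminder)

# Keep a single year so each country appears exactly once
gapminder_2007 <- subset(gapminder, year == 2007)

# Data + aesthetics, then a graph layer, a scale, labels, and a theme
p <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +  # GDP per capita is heavily right-skewed
  labs(x = "GDP per capita",
       y = "Life expectancy",
       caption = "Source: Gapminder") +
  theme_bw()

p  # print the plot
```

Naming the filtered data `gapminder_2007` keeps the original `gapminder` object intact and tells the reader what the subset contains.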
---

# Gapminder: example

<img src="index_files/figure-html/gapminder-over-time-1.gif" width="50%" />

---

# gapminder code

    # Required packages
    library(ggplot2)
    library(gganimate)
    library(gapminder)
    library(scales)

    # Animated plot code
    ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = country)) +
      geom_point(alpha = 0.5) +
      scale_color_manual(values = country_colors, guide = "none") +
      scale_size(range = c(2, 12),
                 breaks = c(1e07, 1e08, 5e08),
                 labels = label_comma(scale_cut = cut_short_scale())) +
      scale_x_log10(labels = label_dollar(scale_cut = cut_short_scale())) +
      labs(
        title = "Quality of life over time",
        subtitle = 'Year: {frame_time}',
        x = 'GDP per capita',
        y = 'Life expectancy',
        size = "Population",
        caption = "Source: Gapminder"
      ) +
      theme_classic() +
      theme(legend.position = "bottom") +
      transition_time(year) +
      ease_aes('linear')

---

# Acknowledgments

The content of these slides is derived in part from Sabrina Nardin's and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.