class: center, middle, inverse, title-slide .title[ # Transform: Logic and Booleans ] .author[ ### MACSS 30500
University of Chicago ] --- # Agenda: * Comparisons * Boolean Algebra * Summaries * Conditional Transformations * Making numbers * Counts * Numeric transformations * General transofrmations * Numeric summaries --- ## Logical statements: when you want to subset in some way * Comparisons * Can have simple statements * Can layer and combine * Parentheses are your friend! --- ## Comparisons * `<`: less than * `<=`: less than or equal to * `>`: greater than * `>=`: greater than or equal to * `!=`: not equal to * `==`: not equal to --- ### Comparisons: applications ``` r library(nycflights13) flights |> filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20) ``` ``` ## # A tibble: 172,286 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 601 600 1 844 850 ## 2 2013 1 1 602 610 -8 812 820 ## 3 2013 1 1 602 605 -3 821 805 ## 4 2013 1 1 606 610 -4 858 910 ## 5 2013 1 1 606 610 -4 837 845 ## 6 2013 1 1 607 607 0 858 915 ## 7 2013 1 1 611 600 11 945 931 ## 8 2013 1 1 613 610 3 925 921 ## 9 2013 1 1 615 615 0 833 842 ## 10 2013 1 1 622 630 -8 1017 1014 ## # ℹ 172,276 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### Comparisons: applications Task: filter data for over 65 and male gender: ``` r library(stevedata) data("anes_vote84") ``` -- ``` r anes_vote84 %>% filter(age >65 & female == 0) ``` ``` ## # A tibble: 96 × 9 ## uid stateabb vote age educ female south polint govrace ## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 57 OR 1 75 7 0 0 0 0 ## 2 68 FL 1 78 7 0 1 1 0 ## 3 86 WI 1 76 1 0 0 1 0 ## 4 100 <NA> 1 70 3 0 0 1 0 ## 5 111 VA 0 70 1 0 1 1 0 ## 6 133 TX NA 70 5 0 1 1 0 ## 7 136 KS 0 83 1 0 0 1 0 ## 8 138 CA 1 67 6 0 0 1 0 ## 9 147 MN 1 70 5 0 0 0 0 ## 10 189 AL NA 68 4 0 1 1 0 ## # ℹ 86 more rows ``` --- # Aside: NAs You can't use something like `==NA` * `is.na()`: will return T or F * `!is.na()`: can filter for non-NA --- # Boolean: Venn Diagrams (more parentheses can help with the logic!)  --- ## Boolean * `&`: and * `!`: not * `|`: or (upright bar) * `%in&`: can use for lists -- ### Don'ts * `&&` and `||`: these are going to return a single T/F --- ## Boolean examples ``` r flights |> filter(month == 1 & day == 1) |> arrange(desc(is.na(dep_time)), dep_time) ``` ``` ## # A tibble: 842 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 NA 1630 NA NA 1815 ## 2 2013 1 1 NA 1935 NA NA 2240 ## 3 2013 1 1 NA 1500 NA NA 1825 ## 4 2013 1 1 NA 600 NA NA 901 ## 5 2013 1 1 517 515 2 830 819 ## 6 2013 1 1 533 529 4 850 830 ## 7 2013 1 1 542 540 2 923 850 ## 8 2013 1 1 544 545 -1 1004 1022 ## 9 2013 1 1 554 600 -6 812 837 ## 10 2013 1 1 554 558 -4 740 728 ## # ℹ 832 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### Boolean ex, cont'd ``` r library(stevedata) data(turnips) turnips %>% filter(price > 100 & price < 200 & time != "12:00 p.m.") ``` ``` ## # A tibble: 244 × 3 ## date time price ## <date> <chr> <dbl> ## 1 2021-04-16 8:00 a.m. 171 ## 2 2021-04-18 5:00 a.m. 109 ## 3 2021-04-23 8:00 a.m. 126 ## 4 2021-04-24 8:00 a.m. 184 ## 5 2021-04-25 5:00 a.m. 106 ## 6 2021-05-04 8:00 a.m. 131 ## 7 2021-05-05 8:00 a.m. 159 ## 8 2021-05-09 5:00 a.m. 107 ## 9 2021-05-11 8:00 a.m. 123 ## 10 2021-05-12 8:00 a.m. 183 ## # ℹ 234 more rows ``` --- ### Boolean ex, cont'd ``` r library(stevedata) data(chile88) chile88 %>% filter(region %in% c("C","M","N") & vote == "N") ``` ``` ## # A tibble: 330 × 8 ## region pop sex age educ income sq vote ## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> ## 1 N 175000 0 29 PS 7500 -1.30 N ## 2 N 175000 1 49 P 35000 -1.03 N ## 3 N 175000 1 23 S 35000 -1.10 N ## 4 N 175000 1 28 P 7500 -1.05 N ## 5 N 175000 0 26 PS 35000 -0.786 N ## 6 N 175000 1 24 S 15000 -1.11 N ## 7 N 175000 0 41 P 15000 -1.30 N ## 8 N 175000 1 20 PS 15000 -0.856 N ## 9 N 175000 0 20 PS 35000 -0.893 N ## 10 N 175000 0 44 PS 35000 1.17 N ## # ℹ 320 more rows ``` --- ## Logical Summaries * `any()`: can return true if ANY item in the group returns true * `all()`: can return true if ALL items in the group returns true --- ### Application: task * Using the flights dataset: group by days, find days that have delays over 750 minutes and make a table. -- ``` r flights %>% group_by(day) %>% filter(any(arr_delay > 750)) %>% select(day) %>% table() ``` ``` ## day ## 1 3 5 9 10 14 15 17 18 19 20 22 24 ## 11036 11211 10858 10857 11227 11008 11317 11222 11399 11086 11111 11345 11041 ## 27 ## 11084 ``` --- # Summaries! * These can be a great way to distill your dataframe based on some criteria * You'll leverage your prior 'verbs' from before and (likely) your logical and boolean elements --- ## Summaries: Example ``` r flights |> group_by(year, month, day) |> summarize( proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE), count_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .groups = "drop" ) ``` ``` ## # A tibble: 365 × 5 ## year month day proportion_delayed count_long_delay ## <int> <int> <int> <dbl> <int> ## 1 2013 1 1 0.939 3 ## 2 2013 1 2 0.914 3 ## 3 2013 1 3 0.941 0 ## 4 2013 1 4 0.953 0 ## 5 2013 1 5 0.964 1 ## 6 2013 1 6 0.959 0 ## 7 2013 1 7 0.956 1 ## 8 2013 1 8 0.975 0 ## 9 2013 1 9 0.986 1 ## 10 2013 1 10 0.977 2 ## # ℹ 355 more rows ``` --- ### Summaries: cont'd ``` r flights |> group_by(day) |> summarize( long_delay = any(arr_delay >= 750, na.rm = TRUE), .groups = "drop" ) %>% filter(long_delay == T) %>% knitr::kable() ``` | day|long_delay | |---:|:----------| | 1|TRUE | | 3|TRUE | | 5|TRUE | | 9|TRUE | | 10|TRUE | | 14|TRUE | | 15|TRUE | | 17|TRUE | | 18|TRUE | | 19|TRUE | | 20|TRUE | | 22|TRUE | | 24|TRUE | | 27|TRUE | --- ### Summaries: Other tools: * `if_else()` * `case_when()` -- ``` r flights |> group_by(day) |> mutate( long_delay = if_else(arr_delay >= 750, 1,0), .groups = "drop" ) |> select(day,carrier, arr_delay, long_delay) ``` ``` ## # A tibble: 336,776 × 4 ## # Groups: day [31] ## day carrier arr_delay long_delay ## <int> <chr> <dbl> <dbl> ## 1 1 UA 11 0 ## 2 1 UA 20 0 ## 3 1 AA 33 0 ## 4 1 B6 -18 0 ## 5 1 DL -25 0 ## 6 1 UA 12 0 ## 7 1 B6 19 0 ## 8 1 EV -14 0 ## 9 1 B6 -8 0 ## 10 1 AA 8 0 ## # ℹ 336,766 more rows ``` --- ## Logic recap: THINK IT THROUGH: what are you trying to get? What are your pieces? How can you break it down? --- class: center, middle, inverse # Numbers --- ## Important functions: * `parse_number()` * `counts()`: * can layer with other commands, such as wt (weighted counts) * min/max: both `min()`/`max()` and `pmin()`/`pmax()` * transformations: logs and rounding * cuts: `cut()` (note from ggplot: also `cut_interval()`, `cut_number()`, and `cut_width()`) * rolling aggregates: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, and `cummean()` * `rank` * offsets: `lead()` and `lag()` * central tendency and description: `mean()`, `median()`, `IQR()`, `sd()`, and `quantile()` --- ### Applications `cut()`: ``` r y <- c(NA, -10, 5, 10, 30, 1, 2, 5, 10, 15, 20) cut(y, breaks = c(0, 5, 10, 15, 20), labels = c("sm", "md", "lg", "xl")) ``` ``` ## [1] <NA> <NA> sm md <NA> sm sm sm md lg xl ## Levels: sm md lg xl ``` --- ### Applications: Rank `rank()`: how to calculate places / order ``` r library(nycflights13) flights %>% group_by(carrier) %>% summarize(delays_d = sum(dep_delay, na.rm = T), delays_a = sum(arr_delay, na.rm = T)) %>% mutate(best_d = rank(delays_d), best_a = rank(delays_a), worst_d = rank(desc(delays_d)), worst_a = rank(desc(delays_a))) %>% filter(best_a < 5 & best_d < 5 ) ``` ``` ## # A tibble: 4 × 7 ## carrier delays_d delays_a best_d best_a worst_d worst_a ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 AS 4133 -7041 3 1 14 16 ## 2 HA 1676 -2365 2 2 15 15 ## 3 OO 365 346 1 3 16 14 ## 4 YV 10353 8463 4 4 13 13 ``` --- ## Recap: SO MUCH WE CAN DO!! * Goal is for you to understand what is possible * Approach with a curious mind!