Transform: Logic and Booleans

class: center, middle, inverse, title-slide

.title[
# Transform: Logic and Booleans
]
.author[
### MACSS 30500 <br /> University of Chicago
]

---

# Agenda:

* Comparisons
* Boolean Algebra
* Summaries
* Conditional Transformations
* Making numbers
* Counts
* Numeric transformations
* General transformations
* Numeric summaries

---

## Logical statements: when you want to subset in some way

* Comparisons
* Can have simple statements
* Can layer and combine
* Parentheses are your friend!

---

## Comparisons

* `<`: less than
* `<=`: less than or equal to 
* `>`: greater than
* `>=`: greater than or equal to 
* `!=`: not equal to 
* `==`: equal to
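
A minimal sketch of these operators on a made-up vector `x` (not part of the course data). Each comparison returns one `TRUE`/`FALSE` per element. The last two lines show a common gotcha: `==` on decimal numbers can fail because of floating-point rounding, so a tolerance-based check such as `all.equal()` is usually safer.

``` r
x <- c(1, 5, 10)   # illustrative values only

x < 5    # TRUE FALSE FALSE
x <= 5   # TRUE TRUE FALSE
x == 5   # FALSE TRUE FALSE
x != 5   # TRUE FALSE TRUE

# Floating-point gotcha: exact equality fails, a tolerance check passes
sqrt(2)^2 == 2                     # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))    # TRUE
```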

---
### Comparisons: applications

``` r
library(dplyr)         # filter(), arrange(), and the other verbs used below
library(nycflights13)  # flights data
flights |> 
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
```

```
## # A tibble: 172,286 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      601            600         1      844            850
##  2  2013     1     1      602            610        -8      812            820
##  3  2013     1     1      602            605        -3      821            805
##  4  2013     1     1      606            610        -4      858            910
##  5  2013     1     1      606            610        -4      837            845
##  6  2013     1     1      607            607         0      858            915
##  7  2013     1     1      611            600        11      945            931
##  8  2013     1     1      613            610         3      925            921
##  9  2013     1     1      615            615         0      833            842
## 10  2013     1     1      622            630        -8     1017           1014
## # ℹ 172,276 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
### Comparisons: applications
Task: filter the data for male respondents over 65:

``` r
library(stevedata)
data("anes_vote84")
```

``` r
anes_vote84 %>% filter(age >65 & female == 0)
```

```
## # A tibble: 96 × 9
##      uid stateabb  vote   age  educ female south polint govrace
##    <int> <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>
##  1    57 OR           1    75     7      0     0      0       0
##  2    68 FL           1    78     7      0     1      1       0
##  3    86 WI           1    76     1      0     0      1       0
##  4   100 <NA>         1    70     3      0     0      1       0
##  5   111 VA           0    70     1      0     1      1       0
##  6   133 TX          NA    70     5      0     1      1       0
##  7   136 KS           0    83     1      0     0      1       0
##  8   138 CA           1    67     6      0     0      1       0
##  9   147 MN           1    70     5      0     0      0       0
## 10   189 AL          NA    68     4      0     1      1       0
## # ℹ 86 more rows
```

---
# Aside: NAs
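You can't test for missing values with `== NA`: any comparison with `NA` returns `NA`

* `is.na()`: returns `TRUE` for missing values and `FALSE` otherwise
* `!is.na()`: use it inside `filter()` to keep the non-missing rows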
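
A small sketch with a made-up vector `x` showing why `== NA` does not work and what `is.na()` returns instead:

``` r
x <- c(1, NA, 3)   # illustrative values only

x == NA        # NA NA NA: comparing with NA never gives TRUE or FALSE
is.na(x)       # FALSE TRUE FALSE
x[!is.na(x)]   # 1 3: keep only the non-missing values
```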

---
# Boolean: Venn Diagrams
(more parentheses can help with the logic!)

![](https://r4ds.hadley.nz/diagrams/transform.png)

---
## Boolean

* `&`: and
* `!`: not
* `|`: or (upright bar)
* `%in%`: tests whether each value appears in a vector of options

--
### Don'ts

* `&&` and `||`: these return a single `TRUE`/`FALSE`, so don't use them on whole columns inside `filter()` or `mutate()`
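
A short sketch of the vectorized operators on a made-up vector `x`. The `&&` line uses single values only, since recent versions of R are understood to reject longer vectors there.

``` r
x <- c(1, 5, 10)   # illustrative values only

x > 2 & x < 8      # FALSE TRUE FALSE: both conditions must hold
x < 2 | x > 8      # TRUE FALSE TRUE: either condition may hold
!(x > 2)           # TRUE FALSE FALSE: negation
x %in% c(1, 10)    # TRUE FALSE TRUE: membership in a set of values

# && and || work on single values, e.g. inside if () statements
TRUE && FALSE      # FALSE
```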

---
## Boolean examples

``` r
flights |> 
  filter(month == 1 & day == 1) |> 
  arrange(desc(is.na(dep_time)), dep_time)
```

```
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     1      517            515         2      830            819
##  6  2013     1     1      533            529         4      850            830
##  7  2013     1     1      542            540         2      923            850
##  8  2013     1     1      544            545        -1     1004           1022
##  9  2013     1     1      554            600        -6      812            837
## 10  2013     1     1      554            558        -4      740            728
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

### Boolean ex, cont'd

``` r
library(stevedata)
data(turnips)

turnips %>% filter(price > 100 & price < 200 & time != "12:00 p.m.")
```

```
## # A tibble: 244 × 3
##    date       time      price
##    <date>     <chr>     <dbl>
##  1 2021-04-16 8:00 a.m.   171
##  2 2021-04-18 5:00 a.m.   109
##  3 2021-04-23 8:00 a.m.   126
##  4 2021-04-24 8:00 a.m.   184
##  5 2021-04-25 5:00 a.m.   106
##  6 2021-05-04 8:00 a.m.   131
##  7 2021-05-05 8:00 a.m.   159
##  8 2021-05-09 5:00 a.m.   107
##  9 2021-05-11 8:00 a.m.   123
## 10 2021-05-12 8:00 a.m.   183
## # ℹ 234 more rows
```

---

### Boolean ex, cont'd

``` r
library(stevedata)
data(chile88)

chile88 %>% filter(region %in% c("C","M","N") & vote == "N")
```

```
## # A tibble: 330 × 8
##    region    pop   sex   age educ  income     sq vote 
##    <chr>   <dbl> <dbl> <dbl> <chr>  <dbl>  <dbl> <chr>
##  1 N      175000     0    29 PS      7500 -1.30  N    
##  2 N      175000     1    49 P      35000 -1.03  N    
##  3 N      175000     1    23 S      35000 -1.10  N    
##  4 N      175000     1    28 P       7500 -1.05  N    
##  5 N      175000     0    26 PS     35000 -0.786 N    
##  6 N      175000     1    24 S      15000 -1.11  N    
##  7 N      175000     0    41 P      15000 -1.30  N    
##  8 N      175000     1    20 PS     15000 -0.856 N    
##  9 N      175000     0    20 PS     35000 -0.893 N    
## 10 N      175000     0    44 PS     35000  1.17  N    
## # ℹ 320 more rows
```

---

## Logical Summaries

* `any()`: returns `TRUE` if ANY item in the group is `TRUE`
* `all()`: returns `TRUE` only if ALL items in the group are `TRUE`
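
A minimal sketch on a made-up vector `delays`, including how a missing value affects the result:

``` r
delays <- c(10, 200, NA, 800)   # illustrative values only

any(delays > 750)                  # TRUE: one TRUE is enough, even with an NA
all(delays < 1000)                 # NA: the missing value could break the rule
all(delays < 1000, na.rm = TRUE)   # TRUE once the NA is dropped
any(delays > 900, na.rm = TRUE)    # FALSE
```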

---
### Application: task

* Using the flights dataset: group by day of the month, keep the days that have at least one arrival delay over 750 minutes, and tabulate the number of flights on those days.

``` r
flights %>% group_by(day) %>% filter(any(arr_delay > 750)) %>% select(day) %>% table()
```

```
## day
##     1     3     5     9    10    14    15    17    18    19    20    22    24 
## 11036 11211 10858 10857 11227 11008 11317 11222 11399 11086 11111 11345 11041 
##    27 
## 11084
```

---

# Summaries!

* These can be a great way to distill your dataframe based on some criteria
* You'll build on the 'verbs' you already know and (likely) your logical and Boolean tools

---

## Summaries: Example

``` r
flights |> 
  group_by(year, month, day) |> 
  summarize(
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
```

```
## # A tibble: 365 × 5
##     year month   day proportion_delayed count_long_delay
##    <int> <int> <int>              <dbl>            <int>
##  1  2013     1     1              0.939                3
##  2  2013     1     2              0.914                3
##  3  2013     1     3              0.941                0
##  4  2013     1     4              0.953                0
##  5  2013     1     5              0.964                1
##  6  2013     1     6              0.959                0
##  7  2013     1     7              0.956                1
##  8  2013     1     8              0.975                0
##  9  2013     1     9              0.986                1
## 10  2013     1    10              0.977                2
## # ℹ 355 more rows
```

---

### Summaries: cont'd

``` r
flights |> 
  group_by(day) |> 
  summarize(
    long_delay = any(arr_delay >= 750, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(long_delay == T) %>% knitr::kable()
```

| day|long_delay |
|---:|:----------|
|   1|TRUE       |
|   3|TRUE       |
|   5|TRUE       |
|   9|TRUE       |
|  10|TRUE       |
|  14|TRUE       |
|  15|TRUE       |
|  17|TRUE       |
|  18|TRUE       |
|  19|TRUE       |
|  20|TRUE       |
|  22|TRUE       |
|  24|TRUE       |
|  27|TRUE       |

---

### Summaries: Other tools:

* `if_else()`
* `case_when()`

``` r
flights |> 
  group_by(day) |> 
  mutate(
    # 1 if the arrival delay is at least 750 minutes, 0 otherwise (NA stays NA)
    long_delay = if_else(arr_delay >= 750, 1, 0)
  ) |> 
  select(day, carrier, arr_delay, long_delay)
```

```
## # A tibble: 336,776 × 4
## # Groups:   day [31]
##      day carrier arr_delay long_delay
##    <int> <chr>       <dbl>      <dbl>
##  1     1 UA             11          0
##  2     1 UA             20          0
##  3     1 AA             33          0
##  4     1 B6            -18          0
##  5     1 DL            -25          0
##  6     1 UA             12          0
##  7     1 B6             19          0
##  8     1 EV            -14          0
##  9     1 B6             -8          0
## 10     1 AA              8          0
## # ℹ 336,766 more rows
```
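
`case_when()` extends the same idea to more than two outcomes. A minimal sketch, assuming a small made-up tibble `delays` and arbitrary cut-offs (neither comes from the flights data). Conditions are checked in order, and the first match wins.

``` r
library(dplyr)

# Hypothetical data for illustration only
delays <- tibble(arr_delay = c(-20, 5, 45, 800, NA))

result <- delays |> 
  mutate(
    status = case_when(
      is.na(arr_delay)  ~ "missing",    # handle NA first
      arr_delay >= 750  ~ "very long",
      arr_delay > 15    ~ "late",
      TRUE              ~ "on time"     # everything else
    )
  )

result
```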

---
## Logic recap:

THINK IT THROUGH: what are you trying to get? What are your pieces? How can you break it down?

---
class: center, middle, inverse

# Numbers

---

## Important functions: 
* `parse_number()`
* `count()`: 
  * can layer with other arguments, such as `wt` (weighted counts)
* min/max: both `min()`/`max()` (one value overall) and `pmin()`/`pmax()` (element-wise)
* transformations: logs and rounding 
* cuts: `cut()` (ggplot2 also provides `cut_interval()`, `cut_number()`, and `cut_width()`)
* cumulative aggregates: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, and `cummean()`
* `rank()`
* offsets: `lead()` and `lag()`
* central tendency and description: `mean()`, `median()`, `IQR()`, `sd()`, and `quantile()`
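
A quick sketch of a few of these on made-up inputs (`x`, `y`, and the `sales` tibble are hypothetical):

``` r
library(dplyr)

x <- c(1, 5, 3)
y <- c(4, 2, 6)

min(x, y)        # 1: a single overall minimum
pmin(x, y)       # 1 2 3: element-wise ("parallel") minimum
cumsum(x)        # 1 6 9: running total
dplyr::lag(x)    # NA 1 5: previous value
dplyr::lead(x)   # 5 3 NA: next value

# Plain versus weighted counts
sales <- tibble(shop = c("a", "a", "b"), units = c(2, 3, 10))
plain    <- count(sales, shop)               # n = 2 rows for a, 1 row for b
weighted <- count(sales, shop, wt = units)   # n = 5 units for a, 10 for b
```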

---

### Applications

`cut()`:

``` r
y <- c(NA, -10, 5, 10, 30, 1, 2, 5, 10, 15, 20)
cut(y, breaks = c(0, 5, 10, 15, 20), labels = c("sm", "md", "lg", "xl"))
```

```
##  [1] <NA> <NA> sm   md   <NA> sm   sm   sm   md   lg   xl  
## Levels: sm md lg xl
```
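
In the output above, `-10` and `30` become `NA` because they fall outside the breaks. By default the intervals are also closed on the right, so a value equal to the lowest break is `NA` too. A small sketch of the two arguments that change this, on a made-up vector `z`:

``` r
z <- c(0, 5, 20)   # illustrative values sitting exactly on the breaks
b <- c(0, 5, 10, 15, 20)

default  <- cut(z, breaks = b)                         # <NA>  (0,5]  (15,20]
lowest   <- cut(z, breaks = b, include.lowest = TRUE)  # [0,5] [0,5]  (15,20]
left_cls <- cut(z, breaks = b, right = FALSE)          # [0,5) [5,10) <NA>
```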

---
### Applications: Rank

`rank()`: gives each value its place in the ordering (1 = smallest); wrap the input in `desc()` to rank from largest to smallest

``` r
library(nycflights13)

flights %>% group_by(carrier) %>% 
  summarize(delays_d = sum(dep_delay, na.rm = TRUE),
            delays_a = sum(arr_delay, na.rm = TRUE)) %>% 
  mutate(best_d = rank(delays_d),
         best_a = rank(delays_a),
         worst_d = rank(desc(delays_d)),
         worst_a = rank(desc(delays_a))) %>%
  filter(best_a < 5 & best_d < 5 )
```

```
## # A tibble: 4 × 7
##   carrier delays_d delays_a best_d best_a worst_d worst_a
##   <chr>      <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
## 1 AS          4133    -7041      3      1      14      16
## 2 HA          1676    -2365      2      2      15      15
## 3 OO           365      346      1      3      16      14
## 4 YV         10353     8463      4      4      13      13
```
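
Ties are where the ranking functions differ. A minimal sketch on a made-up vector `x`, comparing base `rank()` with dplyr's `min_rank()` and `dense_rank()`:

``` r
library(dplyr)

x <- c(10, 20, 20, 30)   # illustrative values with one tie

rank(x)             # 1.0 2.5 2.5 4.0: tied values share the average rank
min_rank(x)         # 1 2 2 4: tied values share the lowest rank, then a gap
dense_rank(x)       # 1 2 2 3: like min_rank(), but without the gap
min_rank(desc(x))   # 4 2 2 1: largest value ranked first
```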
---
## Recap: SO MUCH WE CAN DO!!

* Goal is for you to understand what is possible
* Approach with a curious mind!