class: center, middle, inverse, title-slide

.title[
# Introduction to Computing for the Social Sciences
]
.author[
### MACSS 30500
University of Chicago
]

---

class: inverse, middle

# Intro to the course

---

## Instructor

- Jean Clipperton

## TA

- Sarthak Dhanke: sarthakdhanke@uchicago.edu

## You

---

## Course Objectives

The goal of this course is to acquire and practice **basic computational skills**. We focus on learning R and learning practices for reproducible research.

* Programming as a means to an end: analyzing social science data
* You will learn fundamental computational skills and techniques, and gain the confidence necessary to learn new techniques as you encounter them in your own research

---

## Major Topics

* Basic R, Git/GitHub, R Markdown
* Data Visualization and Data Transformation
* Exploratory Data Analysis
* Data Wrangling
* R Programming: Control Structures, Functions, Data Structures
* Debugging and Defensive Programming
* Reproducible Research
* Scraping
* Text Analysis

---

## How we will do this

We will start small and progressively build more complex code

``` r
print("Hello world!")
```

```
## [1] "Hello world!"
```

---

class: inverse, middle

# Succeeding in the class

---

## General Expectations/Suggestions

* Attend classes and be engaged
* Have RStudio up
* Complete the readings before coming to class
* Please limit email, texts, social media, etc. during lectures
* Collaborate and ask for help
* Start working on the homework assignments early
* Understand our [Academic Integrity Policy](https://cfss-macss.netlify.app/syllabus/#chatgptcopilotllmsgenerative-ai)

---

## Asking for help

### 15-minute rule

When stuck, you HAVE to try on your own for 15 min. After 15 min, you HAVE to ask for help.

### Resources

* [Google](https://www.google.com)
* [StackOverflow](http://stackoverflow.com/)
* Me and TA
* Fellow students (e.g. study groups)
* [Class discussion page on Ed Discussion](https://edstem.org/us/courses/79778/discussion)
* [How to ask for help](https://cfss-macss.netlify.app/faq/asking-questions/)
* ChatGPT: PROCEED WITH CAUTION

---

## Plagiarism

### Collaboration is good - *to a point*

* Collaboration is good: researchers usually collaborate with one another on projects. Developers work in teams to write programs. Learning from others and/or the internet is fine (e.g., to debug something, to get unstuck on a specific part of your code, etc.)
* BUT you are expected to complete your own work, understand what the code is doing, and be prepared to explain it in detail to someone else.
* At the top of your homework, please mention the resources you used to complete it (e.g., I met with Sabrina for the part of the code ... I asked my colleague [name] for advice on ...)

---

## ChatGPT / AI / Generative AI / LLMs

We're in a brave new world: there are many tools and options available to 'help' with assignments. We recognize this, and we also recognize that, tempting as these tools are, they are likely to help only in the short term, at the expense of long-term learning.

Every single assignment needs to include a statement about how/if you used AI in the completion of the assignment (if you did not use it, you can say so; if you used other sources, you can document them). For your assignment, you need to document how you sought help, the prompts you used, and how the model helped you move forward on your assignment.

In this particular class, you should aim to complete ~95% of your work solo. Your first stop for assistance should be your instructor or TA, then the web, and then ChatGPT / LLMs.

---

## Homework and Evaluation

* **Regular programming assignments**: six in total (you can complete a seventh to replace your lowest score).
* **Final assignment**: an opportunity to bring all the skills together
* **Late Work Policy**: you can submit up to two assignments up to 48 hours late without penalty, no questions asked. After 48 hours, we do not accept late work: we need to keep moving, and late work would negatively impact our timeline.
* **Extraordinary circumstances** (illness, family emergency, etc.): I may grant additional extensions on a case-by-case basis. Whenever possible, inform me before the deadline. Please note that having a heavy workload in a given week does not qualify as an extraordinary circumstance. The purpose of the two extensions is precisely to give you some flexibility in weeks when you are busier than usual.
* **[Rubric](https://cfss-macss.netlify.app/faq/homework-evaluations/)**

---

## Software Setup

Go to our course website under **[Setup](https://cfss-macss.netlify.app/setup/)**

* We will follow the on-your-machine setup

---

class: inverse, middle

# Group chat

---

# In groups of 3-4 people

* Introduce yourself
* Share your expectations/goals for this course
* Share strategies
    * reflect upon the best and worst classes you have taken in this field or related fields; describe what made the course successful (or not) for you
    * strategies you plan to adopt to be successful in the course
    * anything else
* Explore the website
* Questions

---

class: inverse, middle

# Programming and reproducible workflow

---

## Program

> A series of instructions that specifies how to perform a computation

* Input
* Output
* Math
* Logic (e.g. conditional execution)
* Repetition

---

class: middle

## Windows 3.1 GUI

---

class: middle

## Mac OSX GUI

---

Stata is, for the most part, GUI software. GUI software runs on all the basic programming elements, but the end user is not aware of any of this. Instructions in GUI software are implicit to the user, whereas programming requires the user to make instructions explicit.

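
To make "explicit instructions" concrete, here is a minimal sketch in R that spells out each basic element of a program. The temperature values and the 80-degree cutoff are made up purely for illustration.

```r
# Input: hypothetical daily temperatures in Celsius (made-up values)
temps_c <- c(18, 25, 31, 22)

# Pre-allocate a character vector to hold one label per temperature
labels <- character(length(temps_c))

# Repetition: visit each temperature in turn
for (i in seq_along(temps_c)) {
  # Math: convert Celsius to Fahrenheit
  temp_f <- temps_c[i] * 9 / 5 + 32

  # Logic: conditional execution based on an arbitrary cutoff
  if (temp_f > 80) {
    labels[i] <- "hot"
  } else {
    labels[i] <- "mild"
  }
}

# Output: show the result
print(labels)
```

Every step a GUI would hide behind a menu click is written out here, which is what makes the analysis repeatable.
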
**What are the advantages of learning to program vs. using GUI software, such as Stata?**

---

## Two different approaches

**TASK**: Write a report analyzing the relationship between ice cream consumption and crime rates in New York City.

**Consider**: How might two people approach this problem?

--

.pull-left[

### Jane: a GUI workflow

1. Searches for data files online
1. Cleans the files in Excel
1. Analyzes the data in Stata
1. Writes her report in Google Docs

]

--

.pull-right[

### Sally: a programmatic workflow

1. Creates a folder specifically for this project
    * `data`
    * `graphics`
    * `output`
1. Searches for data files online
1. Cleans the files in R
1. Analyzes the files in R
1. Writes her report in R Markdown

]

---

class: middle

<img src="https://i.imgflip.com/1szkun.jpg" width="70%" height="70%" />

---

## Automation

* Jane forgets how she transformed and analyzed the data
    * Extension of analysis will fall flat
* Sally uses *automation*
    * Re-run programs
    * Fewer manual mistakes
    * Much easier to implement *in the long run*

---

## Reproducibility

* Are my results valid? Can they be *replicated*?
* The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
* Also allows researchers to precisely replicate their own analysis

---

## Version control

* Revisions in research
* Tracking revisions
* Multiple copies
    * `analysis-1.r`
    * `analysis-2.r`
    * `analysis-3.r`
* Cloud storage (e.g. Dropbox, Google Drive, Box) vs. Version Control Software
    * Repository on your computer for which the VCS tracks all changes (who, when, what, etc.). Example: Git
    * Usually implemented in conjunction with remote servers to store backups of your repository. Example: GitHub

---

## Documentation

* *Comments* are the what
* *Code* is the how
* Computer code should also be *self-documenting*
* Future-proofing

---

## Badly documented code

``` r
library(tidyverse)
library(rtweet)
tml1 <- get_timeline("MeCookieMonster", 3000)
tml2 <- get_timeline("Grover", 3000)
tml3 <- get_timeline("elmo", 3000)
tml4 <- get_timeline("CountVonCount", 3000)
ts_plot(group_by(bind_rows(select(tml1, created_at), select(tml2, created_at), select(tml3, created_at), select(tml4, created_at), .id = "screen_name"), screen_name), by = "months")
```

* What does this program do?
* What do the variable names mean?
* What does 3000 refer to?
* What are we doing with the `ts_plot()` function?

---

## Good code (hypothetical example)

.tiny[

``` r
# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of best Sesame Street residents
cookie <- get_timeline(
  user = "MeCookieMonster",
  n = 3000
)

grover <- get_timeline(
  user = "Grover",
  n = 3000
)

elmo <- get_timeline(
  user = "elmo",
  n = 3000
)

count_von_count <- get_timeline(
  user = "CountVonCount",
  n = 3000
)

# combine, group by character, and plot monthly tweet frequency
bind_rows(
  `Cookie Monster` = cookie %>% select(created_at),
  Grover = grover %>% select(created_at),
  Elmo = elmo %>% select(created_at),
  `Count von Count` = count_von_count %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "months")
```

]

---

# Troubleshooting and let's give it a go

* Test out the following code:

``` r
# install.packages("palmerpenguins")
library(palmerpenguins)
```

```
## 
## Attaching package: 'palmerpenguins'
```

```
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw
```

``` r
library(tidyverse)

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, col = species)) +
  geom_point() +
  labs(title = "Relationship Between Flipper Size and Body Mass for 342 Palmer Penguins",
       x = "Body Mass (g)",
       y = "Flipper Length (mm)")
```

```
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
```

---

# To complete by next class

* See today's topic on our [homepage](https://cfss-macss.netlify.app)
* Zoom images
* [CONFIGURE GIT](https://cfss-macss.netlify.app/setup/git/git-configure/) (must be done prior to class tomorrow!)
* Look over [Assignment 1](https://cfss-macss.netlify.app/assignments/1-edit-readme/)

---

# Acknowledgments

The content of these slides is derived in part from Sabrina Nardin's and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY-NC 4.0 Creative Commons License. Any errors or oversights are mine alone.