HW06: Generating reproducible social science research

Homework 07 due XXX

Overview

Due by 11:59 pm on July 10th.

The goal of this assignment is to practice and further consolidate what we have been learning so far, by applying these skills and tools to a social science research context.

Accessing the `hw06` repository

Go to this link to accept and create your private hw06 repository on GitHub. Once you do so, your repository will be built in a few seconds. It follows the naming convention hw06-<USERNAME>
Once your repository has been created, click on the link you see, which will take you to your repository.
Finally, clone the repository to your computer (or R workbench) following the process below.

Notice the repo you clone for this assignment is empty: you will have to fill it with your data and code, and push them to your github repo.

Cloning your `hw06` repository

After you have accessed the hw06 repository (see above), follow the same steps you completed for hw01 to clone the repository.

General workflow

Your general workflow will be as follows:

Accept the repo and clone it (see above)
Make changes locally to the files in RStudio
Save your changes
Stage-Commit-Push: stage and commit your changes to your local Git repo; then push them online to GitHub. You can complete these steps using the Git GUI integrated into RStudio. In general, you do not want to directly modify your online GitHub repo (if you do so, remember to pull first); instead modify your local Git repo, then stage-commit-push your changes up to your online GitHub repo.

Assignment description

We’re nearing the end of our course (!) and I want to check and make sure everyone has acquired the major skills learned so far:

Importing and tidying data
Transforming, visualizing, and exploring data
Interpreting and communicating results
Basic programming principles (conditional statements, functions, data structures)
Debugging and defensive programming
Reproducible Workflow
Working with R Markdown documents

I want you to practice these skills by applying them to research that interests you. Therefore in this assignment, I ask you to write a short report on a research question of your own interest.

Your report should include the following:

A clearly formulated research question (or set of questions)
Use social science data and include a description of these data for the reader
Some descriptive statistics and plots of the data. You are not expected and do not need to (unless you want to) include complex statistical modeling that we have not learned. The actual analysis can be relatively simple (think exploratory!): analyzing and interpreting the distribution of variables and how they are related to one another at the bivariate level is more than adequate
Summarize the relationships you discover with a written summary. Conjecture as to why they occur and/or why they may be spurious, and further hypotheses to evaluate
The final output should be a md (i.e., github_document), word_document, pdf_document or anything else that can be viewed directly on GitHub (do not use html_document). Use of code chunk and YAML options to customize your final document. Frame your report as you would if you were submitting it for a substantive seminar in your research field or to a presentation for a company. It should be approximately 500-750 words in length. It is fine if the report is longer (up to 1000 words), but it should not be less than 500.
- You do not need to stuff everything into the final document. Think of this like a traditional report. You might describe how you obtained and prepared the data, but you won’t include all the code and output from that process in the final document. For example: if you write several lines of code for data cleaning and tidying, you can write that code in a separate R script (and load into the .Rmd only the cleaned and tidy data); or, if you create multiple plots, you can write the code in a R script, and select only a few plots for the report, etc. As long as all code is stored in the repo, you ensure reproduciblility (refer to the extra code in the README). We will access and grade everything that is included in the repo.

In terms of code, your submission should contain the following:

Code to load the required libraries and import the data using a relative path. This code needs to be at the top of the document only (do not import libraries or load data in other parts of the script)
Code to clean the data (e.g., rename variables, recode variables, etc.) as necessary and/or code to tidy the data as necessary (this will vary from project to project)
Code to generate descriptive statistics (tables, summaries, etc.) and plots (roughly 3-4 well designed plots, I expect more if your data require only minimal data cleaning/tidying)
Your code should incorporate programming techniques such as loops or alternatives, if-else statements, and user-defined functions
Your code should take into account all topics we learned, including coding style and defensive programming, reproducible workflow, and use of Rmd documents (I expect you to consult the slides and the assigned readings, for example on how to generate graphics for communication, etc.)
Include relevant output, but omit lengthy, irrelevant, and hard-to-read output (e.g., include the head of the data set but do not include the full print of it)
Your code should be appropriately commented (not underly and not overly commented)

Your repository should include an informative `README.md` file that:

Explains the purpose of the repository
Lists all scripts contained in the repo and explains how to execute them to produce the same results as you
Lists any additional packages a user should be expected to install prior to executing the files (you don’t need to specify basic packages like dplyr, tidyverse, rmarkdown, etc.)
Provides any other relevant information that the user needs to know in order to use your repo and replicate your results
Provides 1-2 paragraphs of reflections on what was hard/easy about this homework, what was enjoyable, problems you solved and how you solved them, helpful resources, etc.

What data should I use?

Whatever you want! The important thing is that your assignment shows the skills you learned so far in the class, and it is useful for you. Please notice, the entire analysis must be reproducible; that is, we will clone your repository on our computers and attempt to reproduce your results.

I can’t think of anything to analyze!

You can analyze one of the datasets we have used in the class, as long as they contain social science data (e.g., cars, diamonds, or other teaching-related datasets won’t be appropriate) and as long as we have not used them for past homework assignments. For example you could use:

gapminder in library(gapminder)
gun_deaths
- In library(rcis)
- Raw data for gun_deaths from FiveThirtyEight
scorecard
- In library(rcis)
- Use the rscorecard library to download your own subset of the Department of Education’s College Scorecard data
Check out this archive of datasets from the Data Is Plural Newsletter
Check out #TidyTuesday, a weekly data analysis challenge for individuals to practice and develop their data analysis skills in R. They post a new challenge every Tuesday, and publish a complete archive of all of their past challenges and source data.

How can I download the data

There are functions in R and programs for the shell that allow you to do automatically download data. If you are using one of the data sets already provided to you in library(rcis), you do not need to download them again. But you can. For example, to download gapminder from the original GitHub repo, there are three options:

Option 1: via an R script using downloader::download or RCurl::getURL.

downloader::download("https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv", "gun_deaths.csv")
cat(file = "gun_deaths.csv",
RCurl::getURL("https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv"))

Option 2: in a shell script using curl or wget.

curl -O https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv
wget https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv

Option 3: manually download and save a copy of the data file(s) in your repo. Make sure to commit and push them to GitHub.

What if my data file is large?

Because of how Git tracks changes in files, GitHub will not allow you to commit and push a file larger than 100mb. If you try to do so, you will get an error message and the commit will not push. Worse yet, you now have to find a way to strip all trace of the data file from the Git repo (including previous commits) before you can sync up your fork. This is a pain. Avoid it as much as possible.

If you follow options 1 and 2 above, then you do not need to store the data file in the repo because it is automatically downloaded by your script/Rmd document.

If you have to store a large data file in your repo, use Git Large File Storage. It is a separate program you need to install via the shell, but the instructions are straight-forward. It integrates smoothly into GitHub, and makes version tracking of large files far easier. If you include it in a course-related repo (i.e. a fork of the homework repos), then there is no cost. If you want to use Git LFS for your own work, there are separate fees charged by GitHub for storage and bandwidth usage.

Aim higher!

You want to review and consolidate what you have learned so far, but try to push a bit outside your comfort zone. If in doubt, you can check in with me and I can help you assess if your plan is appropriate
Use a unique dataset, preferably something related to your own research interests
It is hard to provide a precise estimate of the time you should spend on this assignment, but you can take the average time you spent on each assignment so far as a reference. The time spent on this homework should be at least equal to that average time

Please notice: The goal of this homework is to show and strengthen what you have learned so far in the course. We expect you to complete it alone (you can rely on classmates for advice, but the homework itself needs to be completed alone). We expect you to adopt coding strategies that make sense to you, and code that you fully understand and can debug on your own.

Submit the assignment

Your repo should include everything you have used to produce your analyses (R scripts, R Markdown documents, data files, figures, etc.)

Make sure to stage-commit-push your original .Rmd file (or files if you used more than one Rmd or R script), and the report you generate from it as a github_document or a pdf document

To submit the assignment, simply push to your repository the last version of your assignment before the deadline. Then copy your repository URL (e.g., https://github.com/cfss-hmwks-s23/hw06-jmclip) and submit it to Canvas under HW06 before the deadline.

Rubric

Needs improvement: Cannot reproduce your results. Scripts require interactive coding to fix. Markdown documents are not generated and/or README.md is insufficient or absent. Minimal code. Graphs and tables are basic and/or don’t have appropriate labels or formatting. There is no consistency to code’s style.

Satisfactory: Solid effort. Hits all the elements. Finished all components of the assignment with only minor deficiencies. Easy to follow (both the code and the output).

Excellent: Repository contains a detailed README.md explaining how the files in the repo should be executed. Displays solid understanding of course materials including data analysis or coding skills. Code is complex and went beyond requirements. Graphs and tables are well labeled. Excellent implementation of a consistent style guide. Analysis is insightful.