A2: Exploring and visualizing data

Assignment 2 due 6/23/25

Overview

Now that you’ve demonstrated that your software is set up, the goal of this assignment is to practice transforming and visually exploring data.

Accessing your `a2` repository

Go to this link to accept and create your private a2 repository on GitHub. Once you do so, your repository will be built in a few seconds. It follows the naming convention a2-<USERNAME>.
Once your repository has been created, click on the link you see, which will take you to your repository.
Finally, clone the repository to your computer following the process below.

Cloning your `a2` repository

After you have accessed the a2 repository (see above), follow the same steps you completed for a1 to clone the repository.

General workflow

Your general workflow will be:

Accept the repo and clone it (see above)
Make changes locally to the files in RStudio
Save your changes
Stage-Commit-Push: stage and commit your changes to your local Git repo; then push them online to GitHub. You can complete these steps using the Git GUI integrated into RStudio. In general, you do not want to directly modify your online GitHub repo (if you do, remember to pull those changes into your local repo before you continue working locally); instead modify your local Git repo, then stage-commit-push your changes up to your online GitHub repo.

Please note that for this assignment we expect more attention to formatting and reproducibility: submit an assignment that fully complies with the Homework Guidelines.

Obtain the data

We’re going to use a dataset on wages, occupation, and gender, gss_wages.

If you are using R on your local computer, you first need to install the stevedata package. Type in your console install.packages("stevedata"). Then call library(stevedata) and data(gss_wages).
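The steps above can be sketched as a short script. The requireNamespace() guard is an addition for convenience (it skips the install when the package is already present) and is not part of the original instructions; names() is just one way to confirm that the data loaded.

```r
# Install stevedata only if it is not already available (this guard is an assumption, for convenience).
if (!requireNamespace("stevedata", quietly = TRUE)) {
  install.packages("stevedata")
}

# Attach the package and load the dataset into the workspace.
library(stevedata)
data(gss_wages)

# Confirm that the data frame is available and see which variables it contains.
names(gss_wages)
```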

Explore the data

We are using a sample of GSS data. To learn more, see the dataset description here.

Very specific questions

Load and describe the data frame ‘gss_wages’.
Generate a data frame that summarizes the number of women and men per education category. Print the data frame as a formatted kable() table (see below).
Generate a boxplot visualizing the wages of individuals, by recoded occupational category.
Generate a bar chart that shows the total number of children in each education category. The bars should be sorted from highest to lowest.

Very open-ended questions

Answer the following questions. Generate appropriate figures/tables to support your conclusions. Provide 1-2 paragraphs of written interpretation of your results for each question. Graphs and/or tables alone will not be sufficient to answer these questions. You will be graded on your code and your analysis.

We generated the total number of children by education category. Is this helpful for understanding whether people with different educational backgrounds have more children? Explain why or why not, and include any supplemental visualizations you wish.
Consider the findings regarding income and occupation. What additional variable might you use to help you uncover any additional underlying trends? Be specific and provide additional visualizations and/or tables as needed.

AI / Resources statement

All assignments need about one paragraph describing the resources you used (including links and prompts, as relevant) in completing the assignment. This helps us learn about your process. Include this in your final assignment.

Formatting Guide

Formatting graphs

While you are practicing exploratory data analysis, your final graphs should be appropriate for sharing with outsiders. That means your graphs should have:

A title
Labels on the axes (see ?labs for details)

This is just a starting point. Consider adopting your own color scales, taking control of your legends (if any), playing around with themes, etc.
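As a minimal sketch of these requirements, the example below adds a title and axis labels with labs(). It deliberately uses the built-in mtcars data rather than gss_wages, so the variables, label text, and theme are illustrative assumptions, not the expected answer for this assignment.

```r
library(ggplot2)

# mtcars ships with R; it stands in for your own data here.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Heavier cars tend to get fewer miles per gallon",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  ) +
  theme_minimal()  # one of several built-in themes; try others

# Printing the object draws the plot.
p
```

Stating the main takeaway in the title, rather than only naming the variables, is one way to make a graph easier for outsiders to read.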

Formatting tables

When presenting tabular data (e.g., the output of dplyr::summarize()), make sure you format it correctly. Use the kable() function from the knitr package to format the table for the final document. For instance, this is a poorly presented table:

library(dplyr)

# count respondents by number of children
count(gss_wages, childs)

## # A tibble: 10 × 2
##    childs     n
##     <dbl> <int>
##  1      0 16906
##  2      1  9864
##  3      2 15375
##  4      3  9590
##  5      4  4905
##  6      5  2246
##  7      6  1138
##  8      7   626
##  9      8   858
## 10     NA   189

Instead, use kable() to format the table, add a caption, and label the columns:

library(knitr)

count(gss_wages, childs) %>%
  kable(
    caption = "Number of children across respondents",
    col.names = c("Total number of children", "Number of respondents")
  )

Number of children across respondents
Total number of children	Number of respondents
0	16906
1	9864
2	15375
3	9590
4	4905
5	2246
6	1138
7	626
8	858
NA	189

Run ?kable in the console to see additional options.
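For illustration, here is a small sketch of two of those options, digits and align, alongside the caption and col.names arguments used above. It summarizes the built-in mtcars data so that it runs without gss_wages; the summary and the label text are assumptions chosen only for the example.

```r
library(knitr)

# Mean miles per gallon by cylinder count, using the built-in mtcars data.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

tbl <- kable(
  mpg_by_cyl,
  digits = 1,                # round numeric columns to one decimal place
  align = c("l", "r"),       # left-align the first column, right-align the second
  col.names = c("Cylinders", "Mean miles per gallon"),
  caption = "Average fuel economy by number of cylinders"
)

# Printing the object shows the formatted table.
tbl
```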

Submit the assignment

To submit the assignment, push the final version of your assignment to your repository before the deadline. Then copy your repository URL (e.g., https://github.com/css-fall22/a2-brinasab) and submit it to Canvas under A2 before the deadline.

Your assignment should be submitted as an R Markdown document (.Rmd). Need a refresher on R Markdown? Read this or this.

Make sure to stage-commit-push ALL of your files.

Rubric

Needs improvement: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. Shows incomplete understanding of the packages needed for the assignment. No record of commits other than the final push to GitHub.

Satisfactory: Solid effort. Hits all the elements. Minor omissions but no clear mistakes. Easy to follow (both the code and the output). Shows sufficient understanding of the packages needed for the assignment.

Excellent: Finished all components of the assignment correctly. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Uses multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output. Shows solid understanding of the packages needed for the assignment.

For further details, see the general rubric we adopt for grading.