A2: Exploring and visualizing data
Overview
Now that you’ve demonstrated that your software is set up, the goal of this assignment is to practice transforming and visually exploring data.
Accessing your a2 repository
- Go to this link to accept and create your private a2 repository on GitHub. Once you do so, your repository will be built in a few seconds. It follows the naming convention a2-<USERNAME>.
- Once your repository has been created, click on the link you see, which will take you to your repository.
- Finally, clone the repository to your computer following the process below.
Cloning your a2 repository
After you have accessed the a2 repository (see above), follow the
same steps you completed for a1 to
clone the repository.
General workflow
Your general workflow will be:
- Accept the repo and clone it (see above)
- Make changes locally to the files in RStudio
- Save your changes
- Stage-Commit-Push: stage and commit your changes to your local Git repo; then push them online to GitHub. You can complete these steps using the Git GUI integrated into RStudio. In general, you do not want to directly modify your online GitHub repo (if you do so, remember to pull first); instead modify your local Git repo, then stage-commit-push your changes up to your online GitHub repo.
Please note that for this assignment we expect more work on formatting and reproducibility: submit an assignment that fully complies with the Homework Guidelines.
Obtain the data
We’re going to use a dataset on wages, occupation, and gender,
gss_wages.
- If you are using R on your local computer, you first need to install the stevedata package. In your console, type install.packages("stevedata"). Then call library(stevedata) and data(gss_wages).
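The setup steps above can be sketched as follows. The requireNamespace() guard and the explicit CRAN mirror are additions for convenience, not part of the assignment instructions; plain install.packages("stevedata") in the console works just as well.

```r
# Install stevedata only if it is missing (requires internet access).
# The repos argument is an assumption; any CRAN mirror will do.
if (!requireNamespace("stevedata", quietly = TRUE)) {
  install.packages("stevedata", repos = "https://cloud.r-project.org")
}

library(stevedata)  # attach the package
data(gss_wages)     # load the gss_wages data frame into the workspace

# Quick sanity checks on what was loaded
dim(gss_wages)      # number of rows and columns
names(gss_wages)    # column names
```

You only need to install the package once per computer, but library(stevedata) and data(gss_wages) must run every time your .Rmd is knit.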
Explore the data
We are using a sample of GSS data. To learn more, see the dataset description here.
Very specific questions
- Load and describe the dataframe ‘gss_wages’
- Generate a data frame that summarizes the number of women and men per education category. Print the data frame as a formatted kable() table (see below).
- Generate a boxplot visualizing the wages of individuals, by recoded occupational category.
- Generate a bar chart that identifies the total number of children by each education category. The bars should be sorted from highest to lowest.
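As a rough illustration of the techniques these questions call for, here is a minimal sketch on a made-up data frame. The toy data frame and its column names (group, type, value) are hypothetical and unrelated to gss_wages; adapt the idea to the assignment variables yourself.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, NOT the assignment data
toy <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  type  = c("x", "y", "x", "y", "x", "y"),
  value = c(3, 1, 10, 4, 2, 5)
)

# Number of rows for each combination of two categorical variables
counts <- count(toy, group, type)

# Total of a numeric variable within each category
totals <- toy %>%
  group_by(group) %>%
  summarize(total = sum(value, na.rm = TRUE))

# reorder(group, -total) sorts the bars from highest to lowest total
p <- ggplot(totals, aes(x = reorder(group, -total), y = total)) +
  geom_col() +
  labs(title = "Total value by group", x = "Group", y = "Total value")
p
```

Using reorder() inside aes() keeps the sorting next to the plot code; converting the column to a factor beforehand is an equally valid choice.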
More open-ended questions
Answer the following questions. Generate appropriate figures/tables to support your conclusions.
- How many women and how many men have a high school education?
- How does the distribution of wages vary across the different (recoded) occupations? Reorder your chart from above to sort from lowest to highest median.
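One possible way to sort boxplots by median is base R's reorder() with FUN = median. The sketch below uses the mpg dataset that ships with ggplot2, not gss_wages, so the variable names are stand-ins only.

```r
library(ggplot2)

# mpg ships with ggplot2; class is categorical, hwy is numeric.
# reorder() sorts the levels of class by the median of hwy (lowest first).
# If the numeric variable has missing values, add na.rm = TRUE after FUN.
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway mileage by vehicle class",
    x = "Vehicle class (sorted by median)",
    y = "Highway miles per gallon"
  )
p
```

When category labels are long, adding coord_flip() can make them easier to read.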
Very open-ended questions
Answer the following questions. Generate appropriate figures/tables to support your conclusions. Provide 1-2 paragraphs of written interpretation of your results for each question. Graphs and/or tables alone will not be sufficient to answer these questions. You will be graded on your code and your analysis.
- We generated the total number of children by education category. Is this helpful for answering the question of whether people with different educational backgrounds have more children? Explain why or why not, and include any supplemental visualizations you wish.
- Consider the findings regarding income and occupation. What additional variable might you use to help you uncover any additional underlying trends? Be specific and provide additional visualizations and/or tables as needed.
AI / Resources statement
All assignments need about one paragraph describing the resources you used (including links and prompts, as relevant) in completing the assignment. This helps us learn about your process. Include this in your final assignment.
Formatting Guide
Formatting graphs
While you are practicing exploratory data analysis, your final graphs should be appropriate for sharing with outsiders. That means your graphs should have:
- A title
- Labels on the axes (see ?labs for details)
This is just a starting point. Consider adopting your own color scales, taking control of your legends (if any), playing around with themes, etc.
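A minimal sketch of a labeled graph, again using ggplot2's built-in mpg dataset rather than the assignment data. The label text and the choice of theme_minimal() are illustrative only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Drive train"   # legend title
  ) +
  theme_minimal()           # one of several built-in themes
p
```

labs() also accepts subtitle and caption arguments if you want to add context or a data source.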
Formatting tables
When presenting tabular data (for example, the output of dplyr::summarize()), make sure you format it correctly. Use the kable() function from the knitr package to format the table for the final document. For instance, this is a poorly presented table:
# calculate total children
count(gss_wages, childs)
## # A tibble: 10 × 2
## childs n
## <dbl> <int>
## 1 0 16906
## 2 1 9864
## 3 2 15375
## 4 3 9590
## 5 4 4905
## 6 5 2246
## 7 6 1138
## 8 7 626
## 9 8 858
## 10 NA 189
Instead, use kable() to format the table, add a caption, and label the
columns:
library(dplyr)  # provides count() and %>%
library(knitr)  # provides kable()

count(gss_wages, childs) %>%
  kable(
    caption = "Number of children across respondents",
    col.names = c("Total number of children", "Number of respondents")
  )
| Total number of children | Number of respondents |
|---|---|
| 0 | 16906 |
| 1 | 9864 |
| 2 | 15375 |
| 3 | 9590 |
| 4 | 4905 |
| 5 | 2246 |
| 6 | 1138 |
| 7 | 626 |
| 8 | 858 |
| NA | 189 |
Run ?kable in the console to see additional options.
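As a sketch of a few of those options, the example below rounds numbers with digits and sets column alignment with align. It uses the built-in mtcars dataset, so the column names and labels are illustrative and not part of the assignment.

```r
library(dplyr)
library(knitr)

tbl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n()) %>%
  kable(
    caption = "Average fuel economy by number of cylinders",
    col.names = c("Cylinders", "Mean miles per gallon", "Number of cars"),
    digits = 1,    # round numeric columns to one decimal place
    align = "lrr"  # left-align the first column, right-align the others
  )
tbl
```

Rounding inside kable() leaves the underlying data frame unchanged, so later calculations still use full precision.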
Submit the assignment
To submit the assignment, push the final version of your assignment to your repository before the deadline. Then copy your repository URL (e.g., https://github.com/css-fall22/a2-brinasab) and submit it to Canvas under A2 before the deadline.
Your assignment should be submitted as an R Markdown document (.Rmd).
Need a refresher on R Markdown? Read
this or
this.
Make sure to stage-commit-push ALL of your files.
Rubric
Needs improvement: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. Shows incomplete understanding of the packages needed for the assignment. No record of commits other than the final push to GitHub.
Satisfactory: Solid effort. Hits all the elements. Minor omissions but no clear mistakes. Easy to follow (both the code and the output). Shows sufficient understanding of the packages needed for the assignment.
Excellent: Finished all components of the assignment correctly. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Uses multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output. Shows solid understanding of the packages needed for the assignment.
For further details, see the general rubric we adopt for grading.