HW07: Collecting and analyzing data from the web

Overview

Due by 11:59 pm on July 13th

We learned two main ways of collecting data from the web:

  • Using APIs, with two options:
    • Accessing data using ad-hoc packages that wrap APIs
    • Running API queries by interacting directly with APIs
  • Web scraping

For the homework, you will create a new dataset using an API or web scraping and analyze it.

Accessing the hw07 repository

Go to this link and find your copy of the hw07 repository; it follows the naming convention hw07-<USERNAME>. Once your repository has been created, click on the link you see, which will take you to your repository. Finally, clone the repository to your computer (or R Workbench) following the process below.

Notice that the repo you clone for this assignment is empty: you will have to fill it with your data and code, and push them to your GitHub repo.

Cloning your hw07 repository

After you have accessed the hw07 repository (see above), follow the same steps you completed for hw01 to clone the repository.

General workflow

Your general workflow will be as follows:

  • Accept the repo and clone it (see above)
  • Make changes locally to the files in RStudio
  • Save your changes
  • Stage-Commit-Push: stage and commit your changes to your local Git repo, then push them online to GitHub. You can complete these steps using the Git GUI integrated into RStudio. In general, you do not want to modify your online GitHub repo directly (if you do, remember to pull first); instead, modify your local Git repo, then stage-commit-push your changes up to your online GitHub repo.

Assignment description

Goal: for this assignment, I want you to create a new dataset by collecting data from the web, and then analyze it.

You can create a new dataset using any of the following options:

  1. Use an API wrapper for R
  2. Write an API query function to interact directly with APIs
  3. Scrape a website

You are free to choose your own adventure, but please notice:

  • I recommend selecting option 2 or 3 if you are interested in gaining experience collecting data yourself: for these options you will need to write your own code to query the server and obtain the data. I recommend selecting option 1 if you are interested in learning how to use an API wrapper package for R and in further practicing your data analysis and visualization skills.

  • If you go with option 3, you will need to use rvest to scrape the content of a web page and extract the relevant information (see Thursday's lecture; a minimal sketch also appears after this list). Do not scrape pages that have dynamic components (e.g., if you need to scroll down a page to visualize the data, or need to click on pop-up windows, etc.); those websites require advanced scraping skills that are not covered in this class.

  • If you go with option 1 or 2, check what type of authentication/registration is required by the API you plan to use, and how long the process will take; this can vary from seconds, as for the GeoNames API we have seen in class, up to weeks, as for the Twitter API (anything that takes more than 24 hours is not suitable for this homework). Also, make sure to search for and read the documentation on how to use the R wrapper package and/or the API. A sketch of a direct API query (option 2) appears after the "For all options" list below.

  • If you go with option 1, I expect thorough and detailed graphs and analyses (at the level of HW3), since you are choosing an easier method to obtain your data (a minimal wrapper-based sketch appears further below). If you pick this option and need a specific API wrapper package installed on the R Workbench, let me know ASAP to ensure it gets installed on time; you cannot install packages on your own on R Workbench. Otherwise, feel free to use (and expand on!) one of the several examples we have seen in class.

  • If you go with option 2 or 3, the analysis can be more limited, depending on how easy or hard it was for you to get the data (e.g., whether you developed code from scratch or expanded on code we covered in class or code you found online).
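To make option 3 concrete, here is a minimal scraping sketch in the spirit of the "presidential statements" example. This is an illustration under assumptions, not a template: the URL and the CSS selectors (.article-title, .article-date) are hypothetical placeholders you would replace for your own page, and the robots.txt check assumes the robotstxt package is available.

```r
# Minimal, hypothetical scraping sketch (option 3).
# The URL and CSS selectors are placeholders -- adapt them to your page.
library(rvest)
library(robotstxt)  # to check whether scraping the page is allowed
library(tibble)

scrape_page <- function(url) {
  # politely check robots.txt before requesting the page
  if (!paths_allowed(url)) stop("robots.txt disallows scraping this page")

  page <- read_html(url)

  # extract one value per record; the selectors are hypothetical
  tibble(
    title = page |> html_elements(".article-title") |> html_text2(),
    date  = page |> html_elements(".article-date") |> html_text2()
  )
}

# Usage (placeholder URL):
# articles <- scrape_page("https://www.example.com/articles")
# readr::write_csv(articles, "data/articles.csv")
```

Wrapping the logic in a function, as here, is what makes a scraper easy to extend (e.g., by mapping it over many pages).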

For all options:

  • You can use (but need to expand on) the examples we have reviewed in class. Quote all sources you consulted and explain how you used them. You are welcome to draw inspiration and suggestions from online tutorials, as frequently happens in real life. However, if you rely on online sources, you must quote them and explain what you added or modified. The code produced for the assignment must be mostly novel and written by you (e.g., it cannot mirror, or make only minimal adjustments to, code found in online sources).

  • The expected minimal complexity of the code should be along the same lines as a fully developed example using an R wrapper package for an API, the "OMDb" example (an API without a wrapper), or the "presidential statements" example we saw in class (for scraping), including functions and anything else that would make the scraper more efficient and scalable.

  • Some rules to follow: (1) everything that is publicly available can generally be scraped (what counts as publicly available data is debated, but the hiQ Labs v. LinkedIn court case is the most common reference); (2) everything that is password protected cannot be scraped; that is, private data that require a username and passcode cannot be scraped (if you need to log in to access the data, do not scrape them). For this specific assignment, I would suggest staying away from social media, unless you use their APIs. In general, if there is an API, use the API and do not scrape. Some websites have stricter rules, and they make them explicit, either in their robots.txt file or in their Terms of Service (ToS).

  • Save the data you collect in your repository as a .csv file; the end result must be a tidy data frame stored in the repository, with some analytical component (exploratory descriptions and visualizations). Submit working and reproducible code (e.g., no bugs, use relative paths, document your code, etc.)[1] [2]
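As a sketch of what option 2 might look like (in the spirit of the "OMDb" example), the code below queries a hypothetical endpoint with httr and saves a tidy .csv using a relative path. The endpoint https://api.example.com/v1/search, the query parameters, and the results field are all placeholders; the key is read from an environment variable rather than hard-coded (see footnote [2]).

```r
# Minimal, hypothetical direct-API sketch (option 2).
# Endpoint, parameters, and response fields are placeholders.
library(httr)
library(jsonlite)
library(readr)

query_api <- function(search_term,
                      key = Sys.getenv("MY_API_KEY")) {  # key lives in .Renviron, not in code
  if (key == "") stop("Set MY_API_KEY in your .Renviron (see the README)")

  resp <- GET(
    url = "https://api.example.com/v1/search",  # placeholder endpoint
    query = list(q = search_term, apikey = key)
  )
  stop_for_status(resp)  # fail loudly on HTTP errors

  # parse the JSON body into a data frame (field names are hypothetical)
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  as.data.frame(parsed$results)
}

# Usage: collect the data, then store the tidy data frame in the repo
# results <- query_api("chicago")
# write_csv(results, "data/results.csv")
```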

Please note: I have not tested or run code for all these suggested APIs. They are options for you to consider, but you are free to use something else. Make sure to check whether an R wrapper exists for the API you are interested in. Please do not ask the instructional staff questions about how to use these APIs; this homework's primary goal is for you to commit to one API that you find interesting and learn how to get data from it, or commit to one webpage you want to get data from and learn how to scrape it.
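For option 1, a minimal wrapper-based sketch using the geonames package (the wrapper for the GeoNames API we saw in class) might look like the following. This assumes you have a free GeoNames account and that the package is installed; the search parameters and the column names returned by the API may differ, so check the package documentation.

```r
# Minimal wrapper-based sketch (option 1), using the geonames package.
library(geonames)
library(dplyr)

# the wrapper reads your username from an option; keep it in .Renviron
options(geonamesUsername = Sys.getenv("GEONAMES_USER"))

# query the API through the wrapper, then keep a few columns
# (column names follow the GeoNames search response and may differ)
cities <- GNsearch(name_equals = "Chicago", country = "US") |>
  select(toponymName, countryCode, lat, lng, population)

# readr::write_csv(cities, "data/cities.csv")
```

Remember that with this option the analysis itself carries most of the weight of the grade.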

Submit the assignment

Your repo should include everything you have used to collect the data and produce your analyses (R scripts, R Markdown documents, data files, etc.).

In your README.md:

  • explain the purpose of the repository
  • include an explanation of what your code does and how to use it, with a description of the API (and of the API wrapper for R, if you are using one) or of the webpage you are scraping
  • provide any other relevant information that the user needs to know in order to use your repo and replicate your results
  • provide 1-2 paragraphs of reflections on what was hard/easy about this homework, what was enjoyable, problems you solved and how you solved them, helpful resources, etc.
    • list any collaborators and their role

Make sure to stage-commit-push your original .Rmd file, knit it to .md (e.g., using the github_document output format), and submit both the .Rmd and .md files.
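For instance, a minimal YAML header for your .Rmd that knits to a GitHub-flavored .md (the title is a placeholder):

```yaml
---
title: "HW07: Collecting and analyzing data from the web"
output: github_document
---
```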

To submit the assignment, push the final version of your work to your repository. Then copy your repository URL (e.g., https://github.com/cfss-hmwks-s23/hw07-jmclip) and submit it to Canvas under HW07 before the deadline.

Rubric

Needs improvement: Data collection relies on short/elementary code, and/or so do the analyses (e.g., graphs do not include any elements beyond basic ggplot code, are unclear, and/or lack appropriate labels or formatting). There is little attention to reproducibility issues and little consistency in the code's style.

Satisfactory: Solid effort. Hits all the elements. Finished all components of the assignment with only minor deficiencies. Easy to follow (both the code and the output).

Excellent: Displays in-depth understanding of course materials, including data analysis and coding skills. Writes complex and refined code to get the data and/or to produce graphs and tables. An appropriate way to store authentication keys/passwords is implemented (if using an API or API wrapper).

Acknowledgments

```{r child = here::here("R", "_ack_stat545.Rmd")}
```

[1] If you are scraping from a web page that frequently updates its content, we may not perfectly reproduce your results. That’s fine - just make sure you’ve saved a copy of the data frame in the repo (as a .csv).

[2] Also, if you write your own API function for a site that requires authentication, make sure to include instructions on where to store an API key, so that we can run your code with our own key without you sharing your private key.
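One common approach (an assumption, not a requirement; any secure method is fine) is to keep the key in your user-level .Renviron file, which never enters the repository, and read it with Sys.getenv(). The variable name below is a placeholder.

```r
# Open your user-level .Renviron for editing (requires the usethis package):
usethis::edit_r_environ()

# Add a line like the following to .Renviron, save, and restart R:
# MY_API_KEY=your-key-here

# Your code (and ours, with our own key) can then read the key without
# it ever appearing in the repository:
key <- Sys.getenv("MY_API_KEY")
```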