class: center, middle, inverse, title-slide

.title[
# Getting data from the web: API access
]
.author[
### MACSS 30500
University of Chicago
]
---
class: inverse, middle

# Getting data from the web (aka web scraping)

---

## Agenda

1. Broad overview of APIs and getting data
1. Examples, presented in increasing order of difficulty:
    1. World Bank data
    1. GeoNames
    1. OMDb

> Note: You'll want to get your accounts set up so that you have an API key for today's exercises, if you haven't already: [see the class page for links
> and details](https://macs30500.netlify.app/syllabus/getting-data-from-the-web-api-access/)

---

<img src="webscraping.png" width="70%" style="display: block; margin: auto;" />

[Image Source](https://medium.com/analytics-vidhya/web-scraping-in-python-for-data-analysis-6bf355e4fdc8)

---

## Web scraping

Web scraping is **the process of collecting or "scraping" information from a website**.

If you have ever copied and pasted information from the Internet, you have performed the same task as any scraper, just on a small scale! Web scraping automates this process, allowing you to collect hundreds, thousands, or even millions of pieces of information.

--

Examples:

* companies' names, emails, and phone numbers
* newspaper articles
* customer reviews
* products' prices, descriptions, and characteristics
* real estate data
* statistical data
* social media data
* etc.

---
class: inverse, middle

# Two main ways to scrape data from the web

### Option 1: Using a web API (with or without an R package)

### Option 2: Directly scraping the website

---

## Two main ways to scrape data from the web

### Option 1: Using a web API

> A web API (Application Programming Interface) is an interface provided by a website that helps users collect data from that specific website.

When using an API, we might have two approaches available to us:

* approach 1: using the API through an R package that someone has written to interact with it. The package acts as a **wrapper for the given API** and is generally easier to use than the API itself. If such a package exists (not all APIs have an R wrapper), all you have to do is install it and learn how to use it (e.g., read the documentation for that specific package)

* approach 2: if no R wrapper exists, you can **use the API directly**. In this case, you rely on R to interact directly with the website's API.

NB: sometimes both approaches are available; other times only approach 2 is available (if no one has developed an R package to interact with that specific API).

---

## Two main ways to scrape data from the web

### Option 2: Directly scraping the website

> Every website, behind the pretty display we see, is made up of code (a mix of HTML, CSS, and JavaScript). Therefore, each website, and the information stored in it, is directly accessible to users who know how to interact with the website's raw code.

> Often, scraping a website directly is necessary to collect the data we need. However, (1) if a website has an API, the general rule is to use it; and (2) you should check the website's rules for scraping (its ToS and `robots.txt`; see the sketch on the next slide).
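---

## Checking a website's scraping rules

A minimal sketch of checking a site's `robots.txt` from R, assuming the `robotstxt` package is installed (the domain is just an example):

```r
# install.packages("robotstxt")
library(robotstxt)

# is a scraper allowed to fetch the site's root path?
paths_allowed(paths = "/", domain = "en.wikipedia.org")

# print the raw robots.txt rules for manual inspection
cat(get_robotstxt("en.wikipedia.org"))
```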
---

## Our plan

Today we focus on option 1: using APIs. We will see examples of both approaches: using an API in R, with and without an R wrapper. Next lecture we focus on option 2: direct scraping.

But before we delve into the world of APIs, there are two broad concepts we need to learn:

> Concept 1: **What is behind a website** (e.g., what websites are made of)

> Concept 2: **How the web works** (e.g., how computers interact on the web)

---
class: inverse, middle

# Concept 1: What is behind a website (e.g., what websites are made of)

---

## What is behind a website

A website is made of the following elements:

+ **HTML**, which stands for *HyperText Markup Language*, is the core element of a website. HTML uses a set of tags to organize the webpage (i.e., make text bold, create body text and paragraphs, insert hyperlinks, etc.), but when the page is displayed the markup language is hidden

+ **CSS**, which stands for *Cascading Style Sheets*, adds styling to make the page look nicer

+ **JavaScript (JS)** code, which is used to add interactive elements to the page (you need "dynamic web scraping" techniques to interact with JS)

+ **Other stuff**, such as images, hyperlinks, videos, or other multimedia

---

## HTML

* the most important element we need to learn for web scraping
* forms the "skeleton," or structure, of a website
* quite messy to read, but it follows a hierarchical, tree-like structure, since it embeds tags within tags (everything marked with `<>` is a tag)

Standard HTML syntax, simplified example:

```
<html>
  <head>
    <title>general info about the page</title>
  </head>
  <body>
    <p>a paragraph with some text about the page</p>
    <p>another paragraph with more text</p>
    <p>...</p>
  </body>
</html>
```

Tree-like structure: [click here for a visual example of HTML](https://www.researchgate.net/figure/HTML-source-code-represented-as-tree-structure_fig10_266611108)

---

## HTML Tags

In web scraping, we collect information from websites using **tags**. Tags:

* are organized in a tree-like structure and are nested within each other
* go in pairs, one on each end of the content they display; for example, in `<p>ciao</p>` only the word "ciao" shows up on the webpage (the `/` signals the end of the tag)
* can have attributes, which provide more information; for example, in `<a href="https://example.com">a link</a>` the `href` attribute stores the destination of the link

Refer to this **[list of tags](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)** for your scraping projects

<!--
add more info here https://plsc-31101.github.io/course/collecting-data-from-the-web.html#webscraping
* more frequently used tags
* more on tags attributes
* CSS
* CSS and HTML
* see stuff from my Python course, lecture 2 (e.g. every page is different, no perfect structure, etc.)
-->

--

**Knowing how to read the HTML (and CSS) languages is fundamental for web scraping**, especially when we scrape a website directly. When we use an API and/or an API wrapper, this is less important, but still useful.

---
class: inverse, middle

# Concept 2: How the web works (e.g., how computers interact on the web)

---

## High-level: How do computers interact on the web?

Computers talk to each other on the web by sending **data requests** and receiving **data responses** (e.g., via the HTTP methods GET and POST): some make requests, some receive and answer them, and some do both. Every computer has an address that other computers can refer to.

When you click on a webpage, the **web browser** on your computer (e.g., Chrome, Safari, etc.) makes a data request to the **web server** of that page (where all the info about that page is stored), and gets back a response.

<img src="request_response.png" width="50%" style="display: block; margin: auto;" />

Image Source [at this link](https://www.linkedin.com/pulse/what-happens-when-you-enter-url-browser-he-asked-victor-ohachor)
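---

## How do computers interact? A peek from R

A minimal sketch of sending an HTTP request and inspecting the response from R, assuming the `httr` package is installed:

```r
library(httr)

# send a GET request to the web server hosting the page
response <- GET("https://macss.uchicago.edu/current-student-resources")

# status code 200 means the request succeeded
status_code(response)

# the response content is a bunch of HTML; look at the first characters
substr(content(response, as = "text", encoding = "UTF-8"), 1, 200)
```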
<img src="request_response.png" width="50%" style="display: block; margin: auto;" /> Image Source [at this link](https://www.linkedin.com/pulse/what-happens-when-you-enter-url-browser-he-asked-victor-ohachor) --- ## Example: How do computers interact on the web? Navigating the web basically means **sending a bunch of requests to different servers and receiving different files written in HTML.** For example, if you type `https://macss.uchicago.edu/current-student-resources` into your web browser and hit enter, these steps occur under the hood: -- 1. your web browser translates what you typed into a HTTP request to tell the `macss` web server that you would like to access the info stored at `/content/current-student-resources` and that the protocol to use is the `https` -- 1. the web server that hosts `macss` receives your request and sends back to your web browser an HTTP response and the response content (a bunch of files written in HTML) -- 1. your browser receives and transforms the response content into a nice visual display that might include texts, graphics, hyperlinks, etc. -- When we perform these operations with the goal of scraping data, there are packages in R that perform these steps for us (either via APIs or without). --- class: inverse, middle # Scraping using an API: Application Programming Interface --- ## API: Terminology > A web API (Application Programming Interface) is an interface provided by the website that helps users to collect data from that specific website. The majority of web APIs use a particular style know as **REST** or **RESTful** which stands for *Representational State Transfer*. These allows to query the website database using URLs, just like you would construct an URL to view a web page. An **URL**, which stays for *Uniform Resource Location*, is a string of characters that uses the **HTTP**, or *HyperText Transfer Protocol*, to send request for data. --- ## Sending queries to APIs The process described in the `macss` example is very similar to how APIs work, with only a few changes: * The character string that you use for sending requests (aka queries) to an API (e.g., the URL or website address) will have search terms and/or filtering parameters, and authentication codes specific to that API. * The response you get back from the API server is not formatted as HTML, but it will likely be a raw text response. * You need R to parse that response and convert it into a format that you like (dataframe, lists, etc.) and then export it as `.csv` or `.json`. --- ## Using APIs with a wrapper package * Specific packages written for existing APIs. If a package for a given API exists, that package is tailored to that specific API and not transferable to other APIs. * Wrapper packages can be useful because they are * Reproducible * Up-to-date (ideally) * Easy to access * How do I know if a R wrapper package for a specific API exists? **Google it** * How do I learn how to use a specific R wrapper package? **Each package should come with documentation (either a pdf, or a GitHub repo, or both) that explains how to use it and the main functions defined in it. It is essential to read the documentation to understand how to use a package.** --- ## Using APIs with a wrapper package: examples The class materials for today (download it from the website: `usethis::use_course("CFSS-MACSS/web-api-access")`) include five examples of APIs that have a wrapper package: * Wordbank * GeoNames * Omdb database * Census * Manifesto Project We will review the first two in class. 
---

## Using APIs with a wrapper package: examples

**Census Bureau API**: `tidycensus` R wrapper package

* statistical data from the US Census Bureau
* the `tidycensus` package provides an R wrapper for the US Census Bureau's decennial Census and five-year American Community Survey APIs

**Manifesto Project API**: `manifestoR` R wrapper package

* political party manifestos from around the world
* covers 1,000+ parties in 50+ countries on five continents, from 1945 until today
* the `manifestoR` package provides an R wrapper

---

## Using APIs directly (without a wrapper package)

Not all APIs have an R wrapper package to help us interact with them. If such a wrapper does not exist, we can always interact directly with the API! This usually entails more code (such as writing our own functions), but also gives us the ability to customize our data requests.

The class materials for today include one example of this kind: the **OMDb Open Movie Database** (see the sketch on the next slide).

**Looking for more examples?** See the A6 instructions for suggestions
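---

## Using APIs directly: a sketch

A minimal sketch of querying the OMDb API directly with `httr` and `jsonlite`, assuming you have registered for a (free) OMDb API key; `"your_key"` is a placeholder:

```r
library(httr)
library(jsonlite)

# build the request: t = movie title, apikey = your personal key
response <- GET(
  "http://www.omdbapi.com/",
  query = list(t = "Interstellar", apikey = "your_key")
)

# parse the raw JSON text of the response into an R list
movie <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
movie$Title
movie$Year
```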
---
class: inverse, middle

# Scraping using an API: registering for access

---

## Using APIs: register for access

**Many APIs require you to register for access.** Sometimes this is as quick as providing an email and password and receiving an email with your username, private API key, etc. Other times you have to submit an application and go through a review process (e.g., Twitter). Often this process is free, but some APIs require you to pay.

Registering for access allows APIs to track which users are submitting queries and to manage demand. If you submit too many queries too quickly, you might be **rate-limited** and your requests de-prioritized or blocked: when in doubt, check the API access policy of the website to determine what these limits are.

Note that if a given API requires you to register and obtain a username, password, or key to interact with it, you will also have to provide this same information to the R wrapper package.

---

## Using APIs: store username and other private info

**To tell R your API username (and key, if necessary), you have two options.** For example, the `geonames` API requires you to register and set a username:

1. You could run the line `options(geonamesUsername = "my_user_name")` in R. However, this is insecure, especially if you plan to put your work on GitHub: we don't want to risk committing this line and pushing it to our public GitHub page.

1. Instead, you should store this line in a `.Rprofile` file (please do so for your A6):
    * create a file in the same place as your `.Rproj` file. To do that, run the following command from the R console: `usethis::edit_r_profile(scope = "project")`. This will create a special file called `.Rprofile` in the same directory as your `.Rproj` file (assuming you are working in an R project)
    * the file should open automatically in your RStudio script editor; otherwise, open the file and add `options(geonamesUsername = "my_user_name")` to it, replacing `my_user_name` with your GeoNames username

---

## Using APIs: store username and other private info

Important:

* Make sure your `.Rprofile` ends with a blank line
* Make sure `.Rprofile` is included in your `.gitignore` file, otherwise it will be posted on GitHub
* Restart RStudio after modifying `.Rprofile` in order to load any new keys into memory
* Spelling matters when you set the option in your `.Rprofile`

---

## Exercises

For the remainder of class, we'll walk through these resources: `usethis::use_course("CFSS-MACSS/web-api-access")`

1. **World Bank**: this has a wrapper and does not require any username / prior registration
1. **GeoNames**: this has a wrapper and requires a username with prior registration (free)
1. **OMDb**: this does not have a wrapper and requires an API key with prior registration (free)

---

## Acknowledgments

API examples are drawn from Rochelle Terman's "Finding APIs" page [here](https://plsc-31101.github.io/course/collecting-data-from-the-web.html#finding-apis) and Sabrina Nardin and Benjamin Soltoff's "Computing for the Social Sciences" course materials, licensed under the CC BY-NC 4.0 Creative Commons License. Any errors or oversights are mine alone.