Overview
- Discuss the need for distributed computing
- Illustrate the split-apply-combine analytical pattern
- Define parallel processing
- Define SQL
- Demonstrate how to access local and remote SQL databases
- Introduce Hadoop and Spark as distributed computing platforms
- Introduce the sparklyr package
- Demonstrate how to use sparklyr for machine learning using the Titanic data set
Before class
Install sparklyr and H2O on your local computer. Run the code below to install all necessary packages and set the correct options.

    # Install the R interfaces to Spark (sparklyr) and to H2O Sparkling Water (rsparkling)
    install.packages(c("sparklyr", "rsparkling"))
    # Tell rsparkling which Sparkling Water version to use
    options(rsparkling.sparklingwater.version = "2.1.0")
    library(sparklyr)
    # Download and install a local copy of Spark
    spark_install(version = "2.1.0")

Last year, 70% of students installed these packages without problems; the rest ran into errors. Attempt the installation before class so that we can debug any errors before you need to use the packages.
Class materials
- The split-apply-combine strategy for data analysis - paper by Hadley Wickham establishing a general overview of split-apply-combine problems. Note that the plyr package is now deprecated in favor of dplyr and the other tidyverse packages
- bigrquery - instructions for setting up an account to access Google BigQuery databases
- sparklyr - introduction to the sparklyr interface for Spark