Introduction to Data Science Workshop (5/16/16 - 5/20/16)
Dominican University

Christopher Malone, Ph.D.
Brant Deppa, Ph.D.
Professors of Statistics & Data Science
Winona State University

Part 1 - Data Science Basics

1 - Introduction to R

2 - Programming in R

3 - Summaries in R

4 - Basic Graphics in R

5 - Data Management in R

6 - Association Rules

7 - Writing Custom Functions

8 - Twitter Mining

9 - Web Scraping

Part 2 - Supervised Learning/Predictive Analytics

1 - Introduction to Supervised/Statistical Learning (Predictive Analytics)


2 - Introduction to Multiple Regression

3 - Cross-validation to Estimate Predictive Performance

4 - Tree-based Models for Regression

  1. Regression Trees

  2. Bagging and Random Forests

  3. Boosting


5 - Tree-based Models for Classification

  1. Classification Trees

  2. Bagging and Random Forests

  3. Boosting


Note: Some of the content in these handouts was co-created with our WSU colleagues Tisha Hooks and Silas Bergen.



Alzheimers(workshop).csv - This is a classification problem. Samples of cerebrospinal fluid from subjects with and without signs of Alzheimer's were used in this study. Your goal is to develop a model to classify the impairment status of the subjects from the measurements (predictors) collected from the cerebrospinal fluid. You should also discuss which measurements are the best indicators of impairment.

Digit Recognition - Digits(train).csv and Digits(test).csv - This is a classification problem where we are trying to correctly classify a handwritten digit (0 - 9) based upon a 16 x 16 pixel grayscale image. Here is a picture of some of the writing samples from the training data (Digit Images).

Polk County Iowa Home Selling Prices - (Polk(train).csv and Polk(test).csv) - This is a regression problem. Your goal is to develop a model to accurately predict the selling price of homes in Polk County, IA (the county Des Moines is in). Data description file: Polk County.docx.

Compressive Strength of Concrete - This is a regression problem where you will try to predict the compressive strength of concrete as a function of several predictors related to the composition of the concrete and the curing time. For model building, you should split these data into training/validation or training/validation/test sets. Data file: Concrete.csv
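
The split described above can be sketched as follows. The workshop itself works in R; this is only a minimal illustration in Python, using the standard library alone. The function name split_rows, the 60/20/20 proportions, and the seed are assumptions made for the example, not part of the workshop materials, and the row indices stand in for the rows of a file such as Concrete.csv.

```python
import random


def split_rows(rows, fractions=(0.6, 0.2, 0.2), seed=1):
    """Shuffle rows and cut them into training/validation/test lists.

    fractions and seed are illustrative defaults (assumptions).
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)                   # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed: reproducible split
    n = len(shuffled)
    n_train = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains
    return train, valid, test


# Stand-in for the row indices of a data set with 100 rows.
rows = list(range(100))
train, valid, test = split_rows(rows)
print(len(train), len(valid), len(test))
```

For a plain training/validation split, the same function can be called with fractions=(0.7, 0.3, 0.0), which leaves the test list empty. Shuffling before cutting matters because data files are often sorted in some way, and a fixed seed lets you reproduce the same split later.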

Classifying Music Genre - Marsyas (Music Analysis, Retrieval, and Synthesis for Audio Signals) is an open source software framework for audio processing with specific emphasis on Music Information Retrieval applications. Using this software, music samples from 6 different music genres (Rock, Metal, Pop, Jazz, Blues, and Classical) were analyzed, producing 191 predictors based on results from Marsyas. Your goal is to develop a model to classify the genre of music using the training data, and to measure the accuracy of your model by predicting the genre of the test samples. Genre(train).csv and Genre(test).csv. For a smaller version in case the big ones don't load, use GenreTrain(Rock vs Blues).csv and split it into training/validation or training/validation/test sets in R.
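
Measuring accuracy on test samples amounts to comparing the predicted genres with the actual ones and reporting the proportion that match. A minimal sketch, again in Python rather than the R used in the workshop; the function name accuracy and the four labels below are made up for illustration and do not come from the Genre files.

```python
def accuracy(actual, predicted):
    """Return the proportion of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for four test samples (not real workshop data).
actual = ["Rock", "Blues", "Jazz", "Rock"]
predicted = ["Rock", "Jazz", "Jazz", "Rock"]
print(accuracy(actual, predicted))  # 3 of 4 correct -> 0.75
```

The misclassification rate is one minus this value. With several genres, it is usually worth also tabulating actual against predicted labels to see which genres get confused with each other.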

Water Solubility - This is not the same as the example used in the workshop, though it is similar. This is from a different study where, using information about the chemical structure, we are trying to predict whether the chemical is water soluble (1 = yes, 0 = no); thus this is a classification problem. The datasets are contained in the files WaterSol (train).csv and WaterSol (actual).csv. Use WaterSol (actual).csv as your validation set for assessing the predictive accuracy of your model.

More Diamonds - I pulled a fresh diamond data set from an online diamond seller this morning. The file BigDiamonds(round).csv contains 5,000 round diamond prices along with the same characteristics we used plus a few others. You will need to split it to form train/validation or train/validation/test data sets. The file BigDiamonds.csv (DON'T OPEN THIS ONE!!) contains the prices of 130,000 diamonds of various shapes and sizes. Don't try to open the big one; I think it will kill the computer lab!

CA - Developmentally Disabled Expenditures Analysis

Census Bureau API


LINKS OF INTEREST (in no particular order)