Introduction to Data Science Workshop (5/16/16 - 5/20/16)
Dominican University

Alzheimers(workshop).csv - This is a classification problem. Samples of cerebrospinal fluid from subjects with and without signs of Alzheimer's were used in this study. Your goal is to develop a model to classify the impairment status of the subjects from the measurements (predictors) collected from the cerebrospinal fluid. You should also discuss which measurements are the best indicators of impairment.

Digit Recognition - Digits(train).csv and Digits(test).csv - This is a classification problem where we are trying to classify a hand-written digit correctly (0 - 9) based upon a 16 x 16 pixel grayscale image. Here is a picture of some of the writing samples from the training data (Digit Images).

Polk County Iowa Home Selling Prices - (Polk(train).csv and Polk(test).csv) - This is a regression problem. Your goal is to develop a model to accurately predict the selling price of homes in Polk County, IA (this is the county Des Moines is in). Data description file: Polk County.docx.
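One common way to judge how accurately a regression model predicts selling prices on the test set is the root mean squared error (RMSE). The workshop itself works in R and does not prescribe a metric, so the following is only a minimal Python sketch under that assumption; the function name `rmse` and the prices are made up for illustration and do not come from the Polk County files.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up selling prices for illustration only (not from Polk(test).csv).
actual_prices = [150000, 200000, 250000]
predicted_prices = [160000, 190000, 250000]

print(round(rmse(actual_prices, predicted_prices), 2))
```

A smaller RMSE on the test homes means the predictions are, on average, closer to the actual selling prices, measured in the same units (dollars).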

Compressive Strength of Concrete - This is a regression problem where you will try to predict the compressive strength of concrete as a function of several predictors related to the composition of the concrete and the curing time. For model building you should split these data into training/validation or training/validation/test sets. Concrete.csv
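The split described above can be sketched as follows. The workshop itself works in R; this is a minimal Python sketch using only the standard library. The function name `split_rows`, the 60/20/20 proportions, and the seed are illustrative assumptions, not something prescribed for the project.

```python
import random

def split_rows(rows, train_frac=0.6, valid_frac=0.2, seed=1):
    """Randomly partition rows into training, validation, and test lists.

    The fractions and seed are illustrative assumptions; whatever is left
    after the training and validation shares goes to the test set.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Stand-in for the rows of a data file (row indices only, no real data).
rows = list(range(100))
train, valid, test = split_rows(rows)
print(len(train), len(valid), len(test))
```

Setting `valid_frac` so that the two fractions sum to 1 leaves the test list empty, which gives a plain training/validation split. Fixing the seed means the same rows land in the same sets every time the code is rerun.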

Classifying Music Genre - Marsyas (Music Analysis, Retrieval, and Synthesis for Audio Signals) is an open source software framework for audio processing with specific emphasis on Music Information Retrieval applications. Using this software, music samples from 6 different music genres (Rock, Metal, Pop, Jazz, Blues, and Classical) were analyzed, producing 191 predictors based on results from Marsyas. Your goal is to develop a model to classify the genre of music using the training data, and to measure the accuracy of your model by predicting the genre of the test samples. Genre(train).csv and Genre(test).csv. For a smaller version in case the big ones don't load, use GenreTrain(Rock vs Blues).csv and split it into training/validation or training/validation/test sets in R.

Water Solubility - This is not the same as the example used in the workshop, though it is similar. This is from a different study where, using information about the chemical structure, we are trying to predict whether the chemical is water soluble (1 = yes, 0 = no); thus this is a classification problem. The datasets are contained in the files WaterSol (train).csv and WaterSol (actual).csv. Use WaterSol (actual).csv as your validation set for assessing the predictive accuracy of your model.
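Assessing predictive accuracy on a validation set comes down to comparing the predicted classes with the actual ones. The workshop itself works in R; the following is a minimal Python sketch of the idea. The function names `accuracy` and `confusion_counts` and the labels are made up for illustration and do not come from the WaterSol files.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of cases where the predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair, i.e. the cells of a confusion matrix."""
    return Counter(zip(actual, predicted))

# Made-up labels for illustration only (1 = soluble, 0 = not soluble).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted))
```

The confusion counts show which kind of mistake the model makes (soluble chemicals called insoluble, or the reverse), which the overall accuracy alone hides.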

More Diamonds - I pulled a fresh diamond data set from an online diamond seller this morning. The file BigDiamonds(round).csv contains 5,000 round diamond prices along with the same characteristics we used plus a few others. You will need to split it to form train/validation or train/validation/test data sets. Files: BigDiamonds(round).csv and BigDiamonds.csv (DON'T OPEN THIS ONE!!), which actually contains the prices of 130,000 diamonds of various shapes and sizes. Don't try to open the big one; I think it will kill the computer lab!

CA - Developmentally Disabled Expenditures Analysis

Data: Excel: CSV | Native: CSV
Data Description: Link
Code: Link

Census Bureau API

Example Link: Example API
Variable Descriptions: Link
Census Bureau Developers Site: Link
Code: Link
Crosswalk Info from FIPS Codes to State to Region: Data: CSV | Code (to read in file): Link

LINKS OF INTEREST (in no particular order)

Brant Deppa's Website
This is the website for all of the courses I regularly teach at Winona State University. Of particular interest to budding data scientists would be my lecture notes/handouts for DSCI 425 - Supervised Learning and STAT 360 - Regression Analysis. I will also be developing a similar set of notes for DSCI 415 - Unsupervised Learning during the next year.
Comprehensive R Archive Network (CRAN)
This website allows you to download R and the plethora of packages available.
R-Studio
This link will allow you to download R-Studio, which is an R development environment.
UCI Machine Learning Repository
A great source of datasets to practice your predictive analytic skills.
Kaggle
This is a predictive analytics competition site where you can win big money predicting stuff. Thousands of data scientists from across the globe regularly compete, so don't expect to win. There are several practice datasets found here and sample R scripts you can download and learn from. There are also good resources for learning about data science and job opportunities available in the field.
KDNuggets
A website hosting lots of information about data mining, analytics, big data, and data science.
Data Science 101
A blog about Data Science by Ryan Swanstrom, a data scientist for Microsoft. Lots of good stuff here about how to get started in a career as a data scientist.
R-bloggers
A blog containing the latest R news and tutorials. There is a great discussion board for learning how to do things in R and overcome obstacles.
Data Camp
Offers numerous short courses (about $25 each) to teach R and Python, two of the main programming environments for doing data science.
Coursera
Lots of data science related courses available, including a certificate program developed by Johns Hopkins University.
Udacity
Another place for relatively cheap online classes related to data science.
Redfin
A real estate website which allows for easy creation of datasets related to home selling prices. You can build yourself a giant database of homes currently for sale and develop predictive models for home prices. If you can do this well, perhaps Zillow.com or Redfin.com will hire you.

Christopher Malone, Ph.D.
Brant Deppa, Ph.D.
Professors of Statistics & Data Science
Winona State University

Part 1 - Data Science Basics

1 - Introduction to R

2 - Programming in R

3 - Summaries in R

4 - Basic Graphics in R

5 - Data Management in R

6 - Association Rules

7 - Writing Custom Functions

8 - Twitter Mining

9 - Web Scraping

Part 2 - Supervised Learning/Predictive Analytics

1 - Introduction to Supervised/Statistical Learning (Predictive Analytics)

REGRESSION METHODS

2 - Introduction to Multiple Regression

3 - Cross-validation to Estimate Predictive Performance

4 - Tree-based Models for Regression

Regression Trees

Bagging and Random Forests

Boosting

CLASSIFICATION METHODS

5 - Tree-based Models for Classification

Classification Trees

Bagging and Random Forests

Boosting

Note: Some of the content in these handouts was co-created with our colleagues from WSU, Tisha Hooks and Silas Bergen.

POTENTIAL DATASETS FOR PROJECTS
