USCOTS 2015 - Exercises for Teaching Statistics with "Big" Data

Brant Deppa, Ph.D.
Professor of Statistics and Department Chair
Dept. of Mathematics & Statistics
Winona State University
Winona, MN 55987
web: http://course1.winona.edu/bdeppa
e-mail: bdeppa@winona.edu

Below you will find links to datasets, some of which I used in my presentation, and links to additional resources for finding or "creating" your own datasets. The data files are all in JMP format.

USCOTS 2015 - Handout - This is the handout for my portion of the workshop.

1) North Carolina Birth Statistics

I have used these data in some form in just about all of my courses, ranging from our first introductory statistics course to biostatistics courses for our department majors and statistics courses for WSU graduate nursing programs. They can be used to demonstrate just about any bivariate statistical method and to develop multiple OLS and logistic regression models. Of particular interest would be to identify risk factors associated with adverse birth outcomes such as premature birth, low birth weight (< 2500 g), or small for gestational age. Students can examine risk factors such as maternal smoking during pregnancy, drinking alcohol during pregnancy, and adequacy of prenatal care. They can also consider the role of numerous physiological conditions such as hypertension and diabetes. There are also some demographic factors such as maternal race, education level, and age. There is very limited information about the father, and you will find that it is missing for numerous mother/infant dyads (this is an interesting issue to explore).

Data Source: North Carolina Vital Statistics Dataverse (via The Odum Institute) - At The Odum Institute site you can find numerous other databases that might be of use in creating large real databases for use in your teaching. You can easily build a database consisting of millions of mother/infant dyads by downloading individual years and combining them. (Data Producer: State Center for Health Statistics - http://www.schs.state.nc.us/)

Data File: NC Birth Data (2003-2007).JMP - This is the data file used in the workshop. I have done some cleaning of the raw data, created new variables, and added some outside source information about maternal/health initiatives in North Carolina. I have randomly sampled 50,000 mother/infant dyads from each of these study years. Variable descriptions for the variables in this JMP file are contained in the Word document linked here and the variable description file from NC State Center for Health Statistics linked here. Because the sample sizes are very large, just about every comparison you might look at will be "statistically significant" (i.e. p < .05). Thus quantification of effect size is very important!
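To make the point about significance versus effect size concrete, here is a minimal Python sketch. It does not use the NC birth data; the two groups are simulated birth weights with an assumed true difference of 30 g and an assumed SD of 500 g, and the function name two_sample_summary is my own. It reports the difference in means, Cohen's d, and a large-sample two-sided p-value.

```python
import math
import random
import statistics

def two_sample_summary(x, y):
    """Return (difference in means, Cohen's d, two-sided large-sample p-value)."""
    nx, ny = len(x), len(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    diff = mx - my
    # Pooled standard deviation, used as the yardstick for Cohen's d
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    d = diff / sp
    # z statistic with unequal variances; normal approximation is fine for large n
    z = diff / math.sqrt(vx / nx + vy / ny)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return diff, d, p

random.seed(2015)
# Simulated birth weights in grams (assumed values, NOT the NC data)
group_a = [random.gauss(3330, 500) for _ in range(50_000)]
group_b = [random.gauss(3300, 500) for _ in range(50_000)]

diff, d, p = two_sample_summary(group_a, group_b)
print(f"difference = {diff:.1f} g, Cohen's d = {d:.3f}, p = {p:.2g}")
```

With 50,000 observations per group the p-value is far below .05 even though the standardized difference is small by the usual rules of thumb, which is exactly the situation students will meet with these data.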

Smaller NC Birth Data (2003 - 2007).JMP

2) Mercury Contamination in Minnesota Walleyes

I have used these data in Biometry (STAT 305), which is an introductory statistics course specifically for students majoring in the sciences (biology and geology primarily), and in our Regression Analysis (STAT 360) course. They can be used to demonstrate simple linear regression, data transformation, and inverse prediction. Walleyes are the state fish of Minnesota and are the most important game fish in MN (estimated 43,000 jobs and $2.8 billion in retail spending). The main contaminant found in MN walleyes is mercury, which can have health consequences if ingested. Every year the MN Dept. of Natural Resources (DNR) and the MN Dept. of Health (MDH) publish waterway-specific fish consumption guidelines for at-risk populations - children under age 15 and women who are or may become pregnant (to see current consumption guidelines click here).
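As a rough illustration of the regression, transformation, and inverse prediction sequence mentioned above, here is a short Python sketch. Everything in it is assumed for illustration: the data are simulated (not the FCMP measurements), the log transformation of mercury is one plausible choice, and the 0.5 ppm cutoff is hypothetical rather than an official guideline. It gives only a point estimate for the inverse prediction, with no interval.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def inverse_predict(b0, b1, y0):
    """Solve b0 + b1 * x = y0 for x (point estimate only)."""
    return (y0 - b0) / b1

random.seed(1)
# Simulated walleye-like data (assumed): length in inches, mercury in ppm,
# generated so that log(Hg) is linear in length.
lengths = [random.uniform(10, 28) for _ in range(500)]
hg = [math.exp(-3.0 + 0.10 * length + random.gauss(0, 0.3)) for length in lengths]

# Fit on the log scale, then invert the fitted line at the log of the cutoff
b0, b1 = fit_line(lengths, [math.log(h) for h in hg])
threshold_ppm = 0.5  # hypothetical cutoff, for illustration only
length_at_threshold = inverse_predict(b0, b1, math.log(threshold_ppm))

print(f"log(Hg) = {b0:.2f} + {b1:.3f} * length")
print(f"estimated length where Hg reaches {threshold_ppm} ppm: {length_at_threshold:.1f} in")
```

In class the same steps can of course be done in JMP; the sketch is only meant to show what "inverse prediction" computes once the model has been fit on the transformed scale.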

Data Source: These data come from Minnesota's Fish Contaminant Monitoring Program (FCMP), which is a joint effort by the DNR, MDH, MN Dept. of Agriculture (MDA), and the Minnesota Pollution Control Agency (MPCA). The data cover the years 1990-1998.

Data Files: Walleyes Hg (1990-1998).JMP and Walleyes Hg (1990-1998) Major Waterways.JMP - the first file contains the results from all waterways sampled in this time period and the second contains data from the most sampled waterways in the first database. Here are files for only two waterways near Duluth, MN - Island Lake and Fish Lake Flowage (Walleyes Fish vs. Island.JMP, Island Lake.JMP, Fish Lake Flowage.JMP)

3) Listing Prices of Homes in Minneapolis-St. Paul

I have primarily used these data in my supervised learning course (DSCI 425, formerly STAT 425), though they could certainly be used to demonstrate multiple regression analyses. These data were obtained using a real estate browsing site called Redfin (www.redfin.com). By using this site you can search numerous metropolitan areas within the U.S. for homes currently on the market using a mapping tool and then download numerous characteristics of the properties on the map into a .csv file. By using Redfin you should be able to create a nice home price database near you for use in your classes.

Data Source: www.redfin.com

Data File: Twin Cities Homes (Single Family).JMP

Another good source of home prices and sales data is county assessors' offices, which will often make property sales data available in .txt, .csv, or .xlsx format. For example, here are links to two county assessors that have such files:

Polk County, IA (http://www.assess.co.polk.ia.us/web/exports/basic/resAllS.html)
Boulder, CO (http://www.bouldercounty.org/dept/assessor/pages/comp2013salesres.aspx).

County assessors generally provide more detail about the homes than redfin.com including assessed values and home quality measures. Prediction accuracy of models developed using these data can be compared to the reported accuracies of Zillow "Zestimates" (http://www.zillow.com/zestimate/#acc).

Examples of County Assessor Data Files: Polk County.JMP, Ames Housing.JMP, and Boulder.JMP

4) Sales of Orthopedic Equipment

These data come from the book "Statistical Consulting" by Javier Cabrera and Andrew McDougall and are available from the website for their book (http://pages.csam.montclair.edu/~mcdougal/zBook/book.html). The objective of this study is to identify U.S. hospitals where an orthopedic equipment company has the potential to increase its current sales. This might mean increasing current sales levels at hospitals that are already purchasing equipment or developing sales at hospitals where the company is not currently generating any. The level to which students are able to attack this problem will certainly depend on their level of statistical training, but creative graphical analysis along with regression modeling should allow them to produce worthwhile results. I used these data in our unsupervised learning/data mining course (STAT 415).

Data Source: Taken from "Statistical Consulting" by Javier Cabrera and Andrew McDougall (data link)

Data File: Ortho.JMP

Data Description File: Ortho.docx

5) Breast Cancer Diagnosis using Fine Needle Aspiration

These data come from a study regarding the use of fine needle aspiration (FNA) to determine whether a breast tumor is malignant or benign. This work is the result of a collaboration at the University of Wisconsin-Madison between Prof. Olvi L. Mangasarian of the Computer Sciences Department and Dr. William H. Wolberg of the departments of Surgery and Human Oncology, who collected and analyzed these data in several published works. Here is a link to a website explaining more about these data and the FNA method (http://pages.cs.wisc.edu/~olvi/uwmp/cancer.html).

In FNA, tumor cells are extracted from a breast tumor using a needle, the cells are examined under a microscope, and a digitized image of the tumor cells is taken. The cell nuclei in the image are then traced using a mouse, and the computer computes twelve size, shape, and texture readings from the traced cells. The final data consist of the mean value, SE, and worst case (largest) value for each of these twelve cell characteristics from a sample of 569 tumors (212 Malignant and 357 Benign).

Introductory level students can use EDA to identify measured characteristics that would be best for discriminating between benign and malignant tumors. Given a characteristic such as cell radius students can comment on what makes benign and malignant tumors different. In regression you can use cell radius, perimeter, and area to examine how circular (spherical) tumor cells are and whether malignant & benign tumors differ in terms of how circular they are. Advanced students can build models or use multivariate methods to classify tumors as cancerous or not.
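The circularity idea above can be sketched in a few lines of Python. For a perfect circle, perimeter = 2*pi*radius and area = pi*radius^2, so a no-intercept regression of perimeter on radius should have slope 2*pi, and the ratio 4*pi*area/perimeter^2 should equal 1. The numbers below are made-up radii in arbitrary units, not values from BreastDiag.JMP, and the function names are my own. How radius, perimeter, and area were measured in the actual file is not assumed here, so real nuclei will only follow these relationships approximately.

```python
import math

def circularity(area, perimeter):
    """Isoperimetric quotient 4*pi*A / P**2: 1 for a perfect circle, smaller otherwise."""
    return 4 * math.pi * area / perimeter ** 2

def slope_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = b * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Hypothetical radii (arbitrary units) used to build perfectly circular outlines
radii = [6.0, 9.5, 12.0, 15.5, 20.0]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

slope = slope_through_origin(radii, perimeters)
print(f"slope of perimeter on radius: {slope:.4f} (2*pi = {2 * math.pi:.4f})")
print([round(circularity(a, p), 3) for a, p in zip(areas, perimeters)])

# For contrast, a square with side 10 (area 100, perimeter 40) is less circular
square_value = circularity(100.0, 40.0)
print(f"circularity of a square: {square_value:.3f}")
```

Students can compute the same slope and ratio separately for the benign and malignant groups and compare how far each departs from the circular benchmark.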

Data File: BreastDiag.JMP

6) Right Heart Catheterization and Mortality

I first found out about these data from a news item in our local newspaper stating that a study in JAMA had shown a 25% increase in 30-day mortality for patients who had this procedure done. This was not surprising given this quote from the American Heart Association's webpage on RHC: "Diagnostic cardiac catheterization can be used to clarify a confusing or obscure situation in a patient whose clinical findings and noninvasive testing are unclear." Thus, if these patients are more ill, or healthcare professionals are confused about what a patient's heart is doing, it stands to reason they would have a higher mortality rate. Mayo Clinic, which is 45 miles from Winona, weighed in, saying they were critical of the finding and felt the procedure was a valuable tool in treating heart patients.

These data come from the aforementioned study of the effectiveness of right heart catheterization (RHC) in critically ill cardiac patients by Connors et al. (link to JAMA paper). I have used these data in biometry and biostatistics courses for majors and graduate nursing students. These data can be used to evaluate risk factors for mortality in heart patients, look at confounding issues, develop logistic regression models for mortality, and conduct survival analyses. Students can examine for themselves whether the use of RHC is indeed associated with increased mortality.
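To show what "looking at confounding issues" can mean here, the following Python sketch compares a crude odds ratio with a Mantel-Haenszel odds ratio stratified by illness severity. The counts are entirely hypothetical and were chosen by me so that sicker patients are both more likely to receive RHC and more likely to die; they are not taken from the study, and the two-level severity variable is an assumption for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a, b = deaths, survivors with RHC; c, d = deaths, survivors without RHC.
    """
    return (a * d) / (b * c)

def collapse(tables):
    """Add the strata cell by cell to get the single crude 2x2 table."""
    return tuple(sum(cells) for cells in zip(*tables))

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Hypothetical counts (NOT from the study), one table per severity stratum
strata = [
    (20, 80, 100, 400),   # less severe: 20% mortality with and without RHC
    (200, 200, 50, 50),   # more severe: 50% mortality with and without RHC
]

crude = odds_ratio(*collapse(strata))
adjusted = mantel_haenszel_or(strata)
print(f"crude OR = {crude:.2f}, severity-adjusted (Mantel-Haenszel) OR = {adjusted:.2f}")
```

In this made-up example RHC has no association with mortality within either severity group, yet the crude odds ratio is well above 1, which is the pattern students should be on the lookout for when they analyze the real data.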

Data File: Right Heart Catheterization.JMP

Data and Variable Descriptions: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.html

7) Fatty Acid Content of Italian Olive Oils

These data consist of the percentage composition of eight fatty acids found in the lipid fraction of 572 Italian olive oils. The data come from three regions: Southern Italy, Sardinia, and Northern Italy. Within each region there are a number of different areas. Southern Italy comprises North Apulia, Calabria, South Apulia, and Sicily. Sardinia is divided into Inland Sardinia and Coastal Sardinia. Northern Italy comprises Umbria, East Liguria, and West Liguria. The primary question is “How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?”

Data Source: M. Forina, C. Armanino, S. Lanteri, E. Tiscornia, Classification of Olive Oils from their Fatty Acid Composition, in H. Martens, H.Jr Russwurm Eds, Food Research and Data Analysis, Applied Science Pub., London, (1983) 189-214

Data File: Olive Oils.JMP

Statistical Characterization of Sicilian Olive Oils from the Peloritana and Maghrebian Zones According to Fatty Acid Profile (Di Bella, et al.)

8) Survey of WSU STAT 110 Students

At Winona State University we have 10+ sections of our introductory statistics course (STAT 110) each semester with approx. 40 students in each. In the past we have had students enrolled in these sections complete a survey that generates a nice dataset that can be used in teaching the course. Below are two JMP files containing the results of two surveys from past years. Online tools like Qualtrics or SurveyMonkey can be used to generate the survey and collect the results. The "data start-up" time is very low, making such a database ideal for use in an introductory statistics course.

Data Source: Surveys developed by Brant Deppa and April Kerby - Winona State University

Data Files: Survey 110.JMP and Student Survey Fall 2012.JMP