Multiple Regression in JMP
Example: Berkeley Guidance Study (BGSgirls.JMP in the Biometry JMP folder)
The data for this example are excerpted from the Berkeley Guidance Study, a longitudinal monitoring of boys and girls born in Berkeley, CA, between January 1928 and June 1929. The variables in the data for girls are:
In this example we will develop a multiple regression model for SOMA at age 18, using as potential predictors the variables from ages 2 and 9 only. We begin by examining a scatterplot matrix of the potential predictors and the response, somatotype. To do this in JMP, select Multivariate from the Analyze menu, place the predictors (WT2, HT2, WT9, HT9, LEG9, STR9) and the response (SOMA) in the right-hand box, and click OK. To obtain pairwise correlations and tests of their significance, select the Pairwise Correlations option from the Multivariate pull-down menu. The results are shown below:
We can see that weight and leg circumference at age 9 exhibit the strongest linear relationships with the response, while height at age 2 and strength at age 9 exhibit the weakest correlations with somatotype.
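For readers who want to see what a pairwise correlation and its significance test compute, here is a minimal Python sketch. It assumes the usual t test of a zero correlation on n - 2 degrees of freedom; the numbers are made up for illustration and are not the Berkeley Guidance Study data, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Made-up illustrative values, NOT the Berkeley Guidance Study data.
data = {
    "WT9":  [25.0, 28.5, 31.0, 34.2, 29.8, 40.1, 27.3, 33.0],
    "LEG9": [24.0, 25.5, 26.8, 28.0, 26.0, 31.2, 25.1, 27.9],
    "SOMA": [3.0, 4.0, 4.5, 5.5, 4.0, 7.0, 3.5, 5.0],
}

n = len(data["SOMA"])
pairwise = {}
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    pairwise[(a, b)] = r
    print(f"{a:>5} vs {b:<5} r = {r:+.3f}  t = {t_statistic(r, n):.2f}")
```

A large t statistic in absolute value corresponds to a small p-value for that pair of variables.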
We now will fit a preliminary multiple regression model using all potential predictors.
SOMA = b0 + b1 WT2 + b2 HT2 + b3 WT9 + b4 HT9 + b5 LEG9 + b6 STR9
To fit this model in JMP, select Fit Model from the Analyze menu, place SOMA in the Y box, and place all of the predictors in the Effects in Model box. The summary of this preliminary model is given below.
Before beginning any model simplification, we will examine a residual plot to check basic model assumptions. This plot is given at the bottom of the column of output with the heading Response Soma. Another way to do this is to save the residuals and fitted values to the spreadsheet and plot them using the Fit Y by X option. To save these quantities, select Predicted Values and Studentized Residuals from the Save Columns pull-out menu located under the Response Soma heading.
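Outside JMP, the least-squares computation behind a fit of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it solves the normal equations directly, the helper names are hypothetical, and the data are made up so that the true coefficients are known in advance.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the residual sum of squares."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up rows of (x1, x2); y is built to equal 1 + 2*x1 - 0.5*x2 exactly.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0), (6.0, 4.0)]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]
coefs = fit_least_squares(rows, y)
print([round(b, 6) for b in coefs])
```

Solving the normal equations keeps the sketch short; statistical software generally uses more numerically stable decompositions for the same problem.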
The plot suggests no obvious model violations. There is a mild outlier (X) in the lower right-hand corner of the plot. The stripes in the plot are due to the ordinal/discrete nature of the response and are of little concern. A normal quantile plot for the residuals is shown below. This is obtained in the usual way but we need the residuals saved to the data spreadsheet first.
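The coordinates of a normal quantile plot can also be computed by hand once the residuals are saved. The sketch below pairs each sorted residual with a standard normal quantile using the Blom plotting position (i - 0.375)/(n + 0.25); that choice is an assumption, and JMP may use a different convention. The residuals are made up for illustration.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, value in enumerate(ordered, start=1):
        # Blom plotting position; an assumption, other conventions exist.
        prob = (i - 0.375) / (n + 0.25)
        pairs.append((std_normal.inv_cdf(prob), value))
    return pairs

# Made-up residuals for illustration only.
residuals = [-2.1, -0.4, 0.3, 0.1, -0.8, 1.2, 0.6, -0.2, 0.9]
for z, e in normal_quantile_pairs(residuals):
    print(f"{z:+.3f}  {e:+.2f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted; an outlier shows up as a point well off that line at one end.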
With the exception of the outlier evident in the previous plot, normality appears to be satisfied. The effect tests for the individual predictors suggest that the model could be simplified by removing several terms.
The individual tests suggest that WT2, HT9, LEG9, and STR9 could potentially be removed from the model. We begin by taking out the height at age 9 (HT9) term. The results for this simpler model are shown below.
Leg circumference at age 9 (LEG9) would be removed next, resulting in the following:
Finally we will remove weight at age 2 (WT2).
Although STR9 does not test as significant at the .05 level, we will leave it in. This gives the following model for the mean somatotype given HT2, WT9, and STR9.
SOMA = 9.0498 - .08673 HT2 + .12302 WT9 - .009827 STR9
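As a worked illustration of how the fitted equation is used, the short Python sketch below evaluates it at one set of predictor values. The coefficients are those in the equation above; the input values are hypothetical and chosen only to show the arithmetic, and the function name is not part of JMP.

```python
def predict_soma(ht2, wt9, str9):
    """Estimated mean somatotype from the fitted three-predictor equation."""
    return 9.0498 - 0.08673 * ht2 + 0.12302 * wt9 - 0.009827 * str9

# Hypothetical predictor values, not taken from the data set.
base = predict_soma(ht2=87.0, wt9=30.0, str9=60.0)
heavier = predict_soma(ht2=87.0, wt9=31.0, str9=60.0)

print(round(base, 3))            # estimated mean SOMA at the hypothetical values
print(round(heavier - base, 5))  # change for one more unit of WT9, others held fixed
```

The second printed value equals the WT9 coefficient, which is the usual interpretation of a multiple regression coefficient: the change in the estimated mean response per unit change in that predictor with the other predictors held fixed.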
The negative coefficients for HT2 and STR9 seem surprising considering the fact that both are positively correlated with somatotype. To help understand how this can happen do the following:
In multiple regression the marginal relationships between the response (Y) and the individual predictors (X) convey little useful information about their role in a multiple regression model!
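A small numerical sketch may help show how a predictor that is positively correlated with the response can still receive a negative coefficient. For two predictors, the standardized coefficient of x1 is (r_y1 - r_y2 r_12) / (1 - r_12^2), so it turns negative whenever r_y2 r_12 exceeds r_y1. The correlations used below are hypothetical and are not estimates from the study data.

```python
def standardized_slopes(r_y1, r_y2, r_12):
    """Standardized coefficients for regressing y on x1 and x2 together."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Hypothetical correlations, NOT estimates from the study data:
# x1 is weakly and positively correlated with y, x2 more strongly,
# and the two predictors are correlated with each other.
beta1, beta2 = standardized_slopes(r_y1=0.2, r_y2=0.6, r_12=0.6)
print(round(beta1, 4), round(beta2, 4))
```

Here x1 has a positive marginal correlation with y, yet its coefficient in the two-predictor model is negative, because most of its apparent association with y is carried by the correlated predictor x2.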
Diagnostic plots (residuals vs. fitted and residual normal quantile) for the final three-predictor model are shown below.
Again, no model violations are suggested. The plots below are called Effect Leverage Plots. They are equivalent to a more commonly employed graphical device called an Added Variable Plot (AVP). These plots show the relationship between the response (SOMA) and each of the predictors, adjusted for the other terms in the model. The negative estimated coefficients for HT2 and STR9 are supported by the negative adjusted relationships for these terms. The outlier from the residual plots clearly stands out in each of these plots. Each plot also contains a visualization of the significance test for that term: if the dashed red lines do not completely contain the horizontal blue line, the term is deemed significant. We can see the marginal significance of STR9 from its AVP, as the red lines appear to completely contain the horizontal blue line.
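The idea behind an added variable plot can be sketched directly: regress the response on the other predictors, regress the predictor of interest on the other predictors, and plot one set of residuals against the other. The slope of that plot equals the predictor's coefficient in the full model. The Python sketch below uses made-up data with a single "other" predictor to keep the arithmetic simple; the helper names are hypothetical, and the axis scaling of JMP's leverage plots may differ.

```python
def simple_fit(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    b0, b1 = simple_fit(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up predictors; y equals 1 + 2*x1 - 3*x2 exactly, so the
# multiple regression coefficient of x1 is known to be 2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 5.0, 3.0, 8.0, 4.0]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]

# Added variable plot for x1: both axes are adjusted for x2.
e_y = residuals(x2, y)    # vertical axis: y with the linear effect of x2 removed
e_x1 = residuals(x2, x1)  # horizontal axis: x1 with the linear effect of x2 removed
_, avp_slope = simple_fit(e_x1, e_y)
_, marginal_slope = simple_fit(x1, y)

print(round(avp_slope, 6), round(marginal_slope, 6))
```

The adjusted slope recovers the coefficient of x1 from the full model, while the marginal slope of y on x1 alone is different, which is why the adjusted plots rather than the marginal scatterplots reflect each term's role in the model.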
A plot of the actual somatotype values vs. the fitted values from the model is shown below. The RSq = .52 is the square of the correlation between Soma Actual and Soma Predicted. RMSE is an estimate of the error standard deviation.
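Both summary quantities can be computed from the actual and predicted columns. The sketch below is a minimal illustration on made-up data with one predictor; it assumes the fitted values come from a least-squares fit with an intercept, in which case the squared actual-versus-predicted correlation agrees with 1 - SSE/SST. The function name is hypothetical.

```python
from math import sqrt

def r_squared_and_rmse(actual, predicted, n_predictors):
    """RSq as the squared actual-vs-predicted correlation, and RMSE."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    sap = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    saa = sum((a - ma) ** 2 for a in actual)
    spp = sum((p - mp) ** 2 for p in predicted)
    rsq = sap ** 2 / (saa * spp)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Error degrees of freedom: n minus the predictors minus the intercept.
    rmse = sqrt(sse / (n - n_predictors - 1))
    return rsq, rmse

# Made-up data; fitted values come from a one-predictor least-squares line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 10.0]
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
fitted = [my + slope * (a - mx) for a in x]

rsq, rmse = r_squared_and_rmse(y, fitted, n_predictors=1)
sst = sum((b - my) ** 2 for b in y)
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
print(round(rsq, 4), round(1 - sse / sst, 4), round(rmse, 4))
```

For the three-predictor somatotype model, n_predictors would be 3, so the error degrees of freedom are n - 4.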