Multiple Regression in JMP

Example: Berkeley Guidance Study (BGSgirls.JMP in the Biometry JMP folder)

The data for this example are excerpted from the Berkeley Guidance Study, a longitudinal monitoring of boys and girls born in Berkeley, CA, between January 1928 and June 1929. The variables in the data for girls are:

- WT2 = weight at age 2 (kg)
- HT2 = height at age 2 (cm)
- WT9 = weight at age 9 (kg)
- HT9 = height at age 9 (cm)
- LEG9 = leg circumference at age 9 (cm)
- STR9 = a composite measure of strength at age 9 (high values = stronger)
- WT18 = weight at age 18 (kg)
- HT18 = height at age 18 (cm)
- LEG18 = leg circumference at age 18 (cm)
- STR18 = strength at age 18
- SOMA = somatotype, a seven-point scale, as a measure of fatness (1 = slender, 7 = fat), determined using a photograph taken at age 18.

In this example we will develop a multiple regression model for SOMA at age 18, using as potential predictors the variables from ages 2 and 9 only. We begin by examining a scatterplot matrix of the potential predictors and the response, somatotype. To do this in JMP, select **Multivariate** from the **Analyze** menu, place the predictors (WT2, HT2, WT9, HT9, LEG9, STR9) and the response (SOMA) in the right-hand box, and click OK. To obtain pairwise correlations and tests of their significance, select the **Pairwise Correlations** option from the **Multivariate** pull-down menu. The results are shown below:

We can see that weight and leg circumference at age 9 exhibit the strongest linear relationships with the response, while height at age 2 and strength at age 9 exhibit the weakest correlations with somatotype.
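For readers who want to see what a pairwise correlation and its significance test compute, here is a minimal sketch in plain Python. The data values are made up for illustration and are not the BGSgirls data. The t statistic shown is the usual test of zero correlation; that JMP's reported significance probability is based on this same statistic is an assumption.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: correlation = 0, on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up illustrative values, NOT the BGSgirls data.
wt9 = [25.0, 28.5, 30.1, 33.4, 36.0, 41.2]
soma = [3.0, 4.0, 4.0, 5.0, 5.0, 6.0]

r = pearson_r(wt9, soma)
t = t_statistic(r, len(wt9))
print(f"r = {r:.3f}, t = {t:.2f} on {len(wt9) - 2} df")
```

A large |t| relative to the t distribution on n - 2 degrees of freedom corresponds to a small significance probability.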

We now will fit a preliminary multiple regression model using all potential predictors.

**SOMA = b0 + b1 WT2 + b2 HT2 + b3 WT9 + b4 HT9 + b5 LEG9 + b6 STR9**

To fit this model in JMP, select **Fit Model** from the **Analyze** menu and place **SOMA** in the **Y** box and all of the predictors in the **Effects in Model** box. The summary of this preliminary model is given below.
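The fit itself is ordinary least squares. As a rough sketch of what happens behind the dialog, the following plain-Python code solves the normal equations (X'X)b = X'y for a model with an intercept. The function names `solve_linear_system` and `ols_fit` are hypothetical helpers written for this illustration, the data are made up, and JMP's internal algorithm is not claimed to work this way.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols_fit(rows, y):
    """Least-squares coefficients [b0, b1, ...] for y on the predictor rows."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    XtX = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    Xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve_linear_system(XtX, Xty)

# Made-up data that follow y = 1 + 2*x1 - 3*x2 exactly (NOT the BGSgirls data).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coef = ols_fit(rows, y)
print([round(c, 6) for c in coef])
```

With the real data, each row would hold the six predictor values for one girl and `y` would hold her SOMA score. In practice a numerical library would be preferred over hand-rolled elimination.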

Before beginning any model simplification, we will examine a residual plot to check basic model assumptions. This plot is given at the bottom of the column of output with the heading **Response Soma**. Another way to do this is to save the residuals and fitted values to the spreadsheet and plot them using the **Fit Y by X** option. To save these quantities, select **Predicted Values** and **Studentized Residuals** from the **Save Columns** pull-out menu located under the **Response Soma** heading.
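To make the saved quantities concrete, here is a simplified sketch that computes fitted values and studentized residuals for a regression with a single predictor, where the leverage has a simple closed form. It uses the internally studentized form e_i / (s * sqrt(1 - h_i)); that JMP's saved column uses exactly this form is an assumption. The data are made up, with one point deliberately placed low.

```python
import math

def studentized_residuals(x, y):
    """Fitted values and internally studentized residuals, one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    fitted = [b0 + b1 * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    # s estimates the error standard deviation (n - 2 degrees of freedom).
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    # Leverage of each point in simple linear regression.
    lev = [1 / n + (a - mx) ** 2 / sxx for a in x]
    stud = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
    return fitted, stud

# Made-up data; the point at x = 7 is deliberately placed well below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1, 3.0, 6.1]

fitted, stud = studentized_residuals(x, y)
worst = max(range(len(x)), key=lambda i: abs(stud[i]))
print(f"largest |studentized residual| at x = {x[worst]}")
```

Plotting `stud` against `fitted` gives the same kind of residual plot described above.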

The plot suggests no obvious model violations. There is a mild outlier (X) in the lower right-hand corner of the plot. The stripes in the plot are due to the ordinal/discrete nature of the response and are of little concern. A normal quantile plot for the residuals is shown below. This is obtained in the usual way, but the residuals must first be saved to the data spreadsheet.
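A normal quantile plot pairs the sorted residuals with the quantiles expected from a standard normal distribution. The sketch below builds those pairs using the plotting positions (i - 0.5)/n, which is one common convention. JMP's exact convention may differ, and the residual values are made up.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Return (theoretical normal quantile, ordered residual) pairs."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5)/n: an assumed, commonly used convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Made-up residuals for illustration only.
residuals = [0.4, -1.1, 0.2, 1.3, -0.3]
pairs = normal_quantile_pairs(residuals)
for q, e in pairs:
    print(f"{q:6.3f}  {e:5.2f}")
```

If the residuals are roughly normal, the pairs fall close to a straight line. An outlier shows up as a point far from that line at one end.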

With the exception of the outlier evident in the previous plot, normality appears to be satisfied. The effect tests for the individual predictors suggest that the model could be simplified by removing several terms.

The individual tests suggest that WT2, HT9, LEG9, and STR9 could potentially be removed from the model. We begin by taking out the height at age 9 (HT9) term. The results for this simpler model are shown below.

Leg circumference at age 9 (LEG9) would be removed next, resulting in the following:

Finally we will remove weight at age 2 (WT2).

Although STR9 does not test as significant at the .05 level, we will leave it in. This gives the following model for the mean somatotype given HT2, WT9, and STR9:

**SOMA = 9.0498 - 0.08673 HT2 + 0.12302 WT9 - 0.009827 STR9**
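The fitted equation can be used directly to estimate the mean somatotype for given predictor values. The short sketch below wraps it in a function. The coefficients come from the equation above, while the function name `predict_soma` and the input values are hypothetical and chosen only for illustration.

```python
def predict_soma(ht2, wt9, str9):
    """Estimated mean SOMA from the final three-predictor model."""
    return 9.0498 - 0.08673 * ht2 + 0.12302 * wt9 - 0.009827 * str9

# Hypothetical girl: 88 cm tall at age 2, 30 kg at age 9, strength score 60.
estimate = predict_soma(ht2=88, wt9=30, str9=60)
print(f"estimated mean SOMA: {estimate:.2f}")

# Effect of one extra kg at age 9 with HT2 and STR9 held fixed.
delta = predict_soma(88, 31, 60) - predict_soma(88, 30, 60)
print(f"change per extra kg of WT9: {delta:.5f}")
```

The second calculation illustrates how each coefficient is read: it is the change in estimated mean SOMA per unit change in that predictor with the other predictors held fixed.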

The negative coefficients for HT2 and STR9 seem surprising, given that both are positively correlated with somatotype. To help understand how this can happen, do the following:

- Use **Distribution** to obtain a histogram for WT9.
- Use **Fit Y by X** to construct scatterplots of SOMA vs. HT2 and SOMA vs. STR9.
- Click on bars in the histogram for WT9 and examine the relationship between SOMA and HT2 for the highlighted points in the scatterplot. Do the same for the scatterplot of SOMA vs. STR9.

You should see that the relationship between SOMA and HT2 is negative when conditioning on WT9. This will also be the case for the relationship between SOMA and STR9.

In multiple regression the marginal relationships between the response (Y) and the individual predictors (X) convey little useful information about their role in a multiple regression model!
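The sign reversal can be reproduced with a small synthetic example. In the sketch below, the variables `wt9_like` and `ht2_like` are made-up stand-ins (not the BGSgirls data) built so that the response rises with the first predictor and falls with the second, while the two predictors are strongly positively related. The two-predictor least-squares coefficients are computed from centered sums of squares and cross-products.

```python
def cross_sum(u, v):
    """Centered sum of cross-products of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Synthetic stand-ins, NOT the BGSgirls data.
wt9_like = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
offset = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
ht2_like = [w + d for w, d in zip(wt9_like, offset)]
y = [2 * w - h for w, h in zip(wt9_like, ht2_like)]  # true coefficients 2, -1

s11 = cross_sum(wt9_like, wt9_like)
s22 = cross_sum(ht2_like, ht2_like)
s12 = cross_sum(wt9_like, ht2_like)
s1y = cross_sum(wt9_like, y)
s2y = cross_sum(ht2_like, y)

# Marginal (one-predictor) slope of y on ht2_like.
marginal_slope = s2y / s22

# Two-predictor least-squares slopes.
det = s11 * s22 - s12 ** 2
b_wt = (s22 * s1y - s12 * s2y) / det
b_ht = (s11 * s2y - s12 * s1y) / det

print(f"marginal slope on ht2_like: {marginal_slope:+.2f}")
print(f"adjusted slope on ht2_like: {b_ht:+.2f}")
```

The marginal slope is positive because taller values go with heavier values, and heavier values go with a larger response. Once the weight-like variable is held fixed, the remaining association is negative.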

Diagnostic plots (residuals vs. fitted and residual normal quantile) for the final three-predictor model are shown below.

Again, no model violations are suggested. The plots below are called **Effect Leverage Plots**. They are equivalent to a more commonly employed graphical device called an **Added Variable Plot (AVP)**. These plots show the relationship between the response (SOMA) and each of the predictors, adjusted for the other terms in the model. The negative estimated coefficients for HT2 and STR9 are supported by the negative adjusted relationships for these terms. The outlier from the residual plots clearly stands out in each of these plots. Each plot also gives a visualization of the significance test for its term: if the dashed red lines do not completely contain the horizontal blue line, the term is deemed significant. The borderline status of STR9 is visible in its AVP, where the red lines appear to just contain the horizontal blue line, consistent with STR9 not testing as significant at the .05 level.
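The idea behind an added variable plot can be sketched in a few lines: regress the response on the other predictor(s) and keep the residuals, regress the predictor of interest on the other predictor(s) and keep those residuals, then plot one set of residuals against the other. The slope in that plot equals the predictor's coefficient in the full model. The example below checks this for a two-predictor model on made-up data. It illustrates the construction only and is not a reproduction of JMP's leverage plot.

```python
def cross_sum(u, v):
    """Centered sum of cross-products of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def residuals(x, y):
    """Residuals from the simple regression of y on x (with intercept)."""
    slope = cross_sum(x, y) / cross_sum(x, x)
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Made-up data, NOT the BGSgirls data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 2.0, 5.0, 6.0, 5.0, 6.0, 9.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.1]
y = [2 * a - b + e for a, b, e in zip(x1, x2, noise)]

# Added variable plot coordinates for x2, adjusting for x1.
x2_adj = residuals(x1, x2)
y_adj = residuals(x1, y)
avp_slope = cross_sum(x2_adj, y_adj) / cross_sum(x2_adj, x2_adj)

# Coefficient of x2 in the full two-predictor least-squares fit.
s11, s22, s12 = cross_sum(x1, x1), cross_sum(x2, x2), cross_sum(x1, x2)
s1y, s2y = cross_sum(x1, y), cross_sum(x2, y)
b2_full = (s11 * s2y - s12 * s1y) / (s11 * s22 - s12 ** 2)

print(f"AVP slope: {avp_slope:.4f}, full-model coefficient: {b2_full:.4f}")
```

A point with an unusual response shows up as an outlier in the vertical direction of every such plot.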

A plot of the actual somatotype values vs. the fitted values from the model is shown below. RSq = .52 is the square of the correlation between Soma Actual and Soma Predicted. RMSE is an estimate of the error standard deviation.
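Both summary quantities are easy to verify by hand. The sketch below fits a one-predictor least-squares line to made-up data and confirms that R-squared computed as 1 - SSE/SST matches the squared correlation between actual and fitted values. RMSE is computed as the square root of SSE divided by n minus the number of estimated coefficients. That is n - 2 here, and it would be n - 4 for the three-predictor model with intercept.

```python
import math

def cross_sum(u, v):
    """Centered sum of cross-products of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Made-up data, NOT the BGSgirls data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.6, 8.4, 8.7]
n = len(x)

# Least-squares line and fitted values.
b1 = cross_sum(x, y) / cross_sum(x, x)
b0 = sum(y) / n - b1 * sum(x) / n
fitted = [b0 + b1 * a for a in x]

sse = sum((b - f) ** 2 for b, f in zip(y, fitted))  # error sum of squares
sst = cross_sum(y, y)                                # total sum of squares
r_squared = 1 - sse / sst

# Squared correlation between actual and fitted values.
corr = cross_sum(y, fitted) / math.sqrt(sst * cross_sum(fitted, fitted))

# Two coefficients (b0, b1) were estimated, so the divisor is n - 2.
rmse = math.sqrt(sse / (n - 2))

print(f"R^2 = {r_squared:.4f}, corr^2 = {corr ** 2:.4f}, RMSE = {rmse:.4f}")
```

The equality of the two R-squared calculations relies on the fit being least squares with an intercept.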