Stepwise Regression

September 2015

You would like to be able to predict what sales will be.  You have a database that contains 40 different variables that might impact sales.  How can you go through these 40 variables to see which ones really impact sales and could become part of a model to predict sales?  Well, you could run a full regression analysis with all 40 variables and see which ones are “significant.”  But regression models change as variables are added or removed.

This is where stepwise regression can help.  This is an automated process that builds a regression model for you by going through a series of steps, each of which adds the most significant variable or removes the least significant variable.

This month’s publication takes a look at stepwise regression.  On the surface, this technique sounds great.  Sit back and let the model be built for you.  As with all techniques, there are some caveats about using stepwise regression.  So, we will take a look at how stepwise regression can easily build a model for you as well as a few of the drawbacks of stepwise regression.

Regression Review

In regression, we are trying to build a model to predict Y based on certain predictor variables (x1, x2, etc.).  For example, if we have two predictor variables, we would be building a regression equation of the form:

Y = b0 + b1x1 + b2x2

where b0 is the y-intercept and b1 and b2 are the coefficients for the predictor variables x1 and x2.

If a predictor variable (e.g., x1) does not impact Y, we would expect the coefficient (e.g., b1) to be zero.  However, there is variation in our processes and, when you run a regression, the coefficients that do not impact Y are not zero.  We need a method of determining if a coefficient is sufficiently close to zero to be called zero or is far enough away from zero to be considered significant.  We do this through the p value associated with a t-test.

For example, consider the dataset in Table 1.  We have collected data on x1, x2 and Y.  We want to use regression analysis to build a model for Y.

Table 1: Regression Dataset

Sample   x1   x2     Y      Sample   x1   x2     Y
     1   88   52   186          11   95   57   196
     2   82   57   174          12  117   29   242
     3   95   60   197          13   95   21   199
     4  101   25   210          14  104   22   218
     5  102   41   214          15  102   42   209
     6   97   33   204          16   87   24   183
     7  101   47   209          17  105   60   219
     8   99   24   208          18  101   20   208
     9   92   37   190          19  106   60   220
    10   93   31   196          20  110   42   228

Part of the output from the regression analysis using SPC for Excel is shown in Table 2.  The table is described below.

Table 2: Partial Regression Output

             Coeff.   t Stat   p Value   95% Lower   95% Upper
Intercept     13.62    2.755    0.0135       3.189       24.05
x1            1.954   41.056    0.0000       1.854       2.055
x2          -0.0206   -0.774    0.4497     -0.0770      0.0357

The second column in Table 2 gives the coefficients (the b values in the regression equation).  The “t Stat” column gives the t statistic associated with each coefficient.  The “p Value” column gives the p value associated with each t statistic.
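As a rough illustration, the coefficient column of Table 2 can be approximated from the Table 1 data with ordinary least squares.  This is not how SPC for Excel computes its output; it is a minimal pure-Python sketch that solves the normal equations, and the function name least_squares is made up for this example.

```python
# Table 1 data as (x1, x2, Y) tuples, samples 1 to 20.
DATA = [
    (88, 52, 186), (82, 57, 174), (95, 60, 197), (101, 25, 210), (102, 41, 214),
    (97, 33, 204), (101, 47, 209), (99, 24, 208), (92, 37, 190), (93, 31, 196),
    (95, 57, 196), (117, 29, 242), (95, 21, 199), (104, 22, 218), (102, 42, 209),
    (87, 24, 183), (105, 60, 219), (101, 20, 208), (106, 60, 220), (110, 42, 228),
]


def least_squares(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


rows = [[1.0, x1, x2] for x1, x2, _ in DATA]  # the leading 1.0 is the intercept column
y = [row[2] for row in DATA]
b0, b1, b2 = least_squares(rows, y)
print(f"Y = {b0:.2f} + {b1:.4f}*x1 + {b2:.4f}*x2")
```

The printed coefficients should land close to the Coeff. column of Table 2; small differences in the last digit would only reflect rounding.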

This p value is one key to interpreting the results.  It tests the null hypothesis that the coefficient is equal to zero.  A low p value (usually < 0.05) suggests that the coefficient is not zero.  This means that the predictor variable most likely belongs in the model because changes in that predictor variable appear to impact the Y response.  A high p value means there is not enough evidence to conclude that the coefficient differs from zero.  In that case, the predictor variable does not appear to impact the Y response and is usually left out of the model.
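To make the link between the t statistic and the p value concrete, the sketch below converts the t statistics from Table 2 into two-sided p values.  It assumes 17 residual degrees of freedom (20 samples minus 3 estimated coefficients) and, to avoid any external library, integrates the Student's t density numerically with Simpson's rule; a statistics package would normally do this step.  The function name t_two_sided_p is hypothetical.

```python
import math


def t_two_sided_p(t_stat, df, steps=4000):
    """Two-sided p value for a t statistic, via Simpson's rule on the t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    central = area * h / 3  # P(0 < T < |t|)
    # Clamp at zero: for very large |t| the subtraction is dominated by rounding.
    return max(0.0, 1.0 - 2.0 * central)


DF = 20 - 3  # samples minus estimated coefficients (intercept, x1, x2)
t_stats = {"Intercept": 2.755, "x1": 41.056, "x2": -0.774}
p_values = {name: t_two_sided_p(t, DF) for name, t in t_stats.items()}
for name, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: p = {p:.4f} ({verdict} at 0.05)")
```

Because the t statistics in Table 2 are rounded, the results may differ from the published p values in the last decimal place.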

Table 2 also shows the 95% confidence interval for each coefficient.  If that interval contains zero, then it is possible that the coefficient is zero (at that confidence level), and the predictor variable should not be in the model.
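The intervals in Table 2 can be roughly reconstructed from the other columns.  The sketch below recovers each standard error as coefficient divided by t statistic and uses 2.110 as the two-sided 95% critical value of the t distribution with 17 degrees of freedom; that critical value is taken from a standard t table and is an assumption of this example, as are the names used.

```python
T_CRIT_95_DF17 = 2.110  # assumed two-sided 95% critical value for 17 degrees of freedom


def confidence_interval(coef, t_stat, t_crit=T_CRIT_95_DF17):
    """Return (lower, upper) using the standard error implied by coef / t_stat."""
    std_err = abs(coef / t_stat)
    half_width = t_crit * std_err
    return coef - half_width, coef + half_width


# (coefficient, t statistic) pairs copied from Table 2.
table2 = {"Intercept": (13.62, 2.755), "x1": (1.9544, 41.056), "x2": (-0.0206, -0.774)}
intervals = {name: confidence_interval(c, t) for name, (c, t) in table2.items()}

for name, (lower, upper) in intervals.items():
    contains_zero = lower <= 0.0 <= upper
    print(f"{name}: [{lower:.4f}, {upper:.4f}] contains zero: {contains_zero}")
```

Since the inputs are rounded table values, the limits agree with Table 2 only to about two or three significant figures.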

From Table 2, it can be seen that x1 appears to be significant (p value < 0.05 and interval does not contain 0), while x2 is not (p value > 0.05 and interval contains zero).  The regression could be re-run with just x1 in the model to create the final model.
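Re-running the regression with only x1 reduces to simple linear regression, which has a closed-form solution.  The following is a minimal sketch using the x1 and Y columns of Table 1; the function name simple_regression is made up for this example.

```python
# x1 and Y columns from Table 1, samples 1 to 20.
X1 = [88, 82, 95, 101, 102, 97, 101, 99, 92, 93,
      95, 117, 95, 104, 102, 87, 105, 101, 106, 110]
Y = [186, 174, 197, 210, 214, 204, 209, 208, 190, 196,
     196, 242, 199, 218, 209, 183, 219, 208, 220, 228]


def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = simple_regression(X1, Y)
print(f"Y = {intercept:.2f} + {slope:.4f}*x1")
```

The slope stays close to the x1 coefficient from the two-variable model, which is what you would expect when the dropped variable contributes very little.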

As we will see, stepwise regression uses the p value to add or remove predictor variables from the model.

Introduction to Stepwise Regression

Stepwise regression adds or removes predictor variables based on their p values.   The first step is to determine what p value you want to use to add a predictor variable to the model or to remove a predictor variable from the model.  A common approach is to use the following:

p value to enter = Penter = 0.15

p value to remove = Premove = 0.15

A process flow diagram for stepwise regression is shown in Figure 1.

Figure 1: Stepwise Regression

You start with no predictor variables in the model.  Regress each predictor variable, individually, with Y.  Then you compare the p values for each predictor variable.  If there are no p values less than Penter, then there are no predictor variables that go into the model and the stepwise regression ends.  If there are p values less than Penter, then the predictor variable with the lowest p value is added to the model.

At this point, the model contains one predictor variable, and the looping action begins.  Each predictor variable that is not in the model is added to the model, one at a time, and the regression is run.  So, if x1 is in the model, you would run the regression for (x1, x2), (x1, x3), and so on up to (x1, xk), where k is the number of predictor variables.  If any xi not in the model has a p value less than Penter, then the xi with the smallest p value is added to the model.

Adding predictor variables to the model can change the p values of the predictor variables already in the model.  A check is made to see if any predictor variable in the model has a p value greater than Premove.  If so, the predictor variable with the greatest p value is removed.

This process continues until no more predictor variables can be added to or removed from the model.  The example below shows the process of adding and removing predictor variables.
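The loop described above can be sketched in a few dozen lines.  The code below is an illustrative pure-Python version, not the SPC for Excel implementation: all function names are hypothetical, p values come from numerically integrating the t density (a statistics library would normally be used), and a max_steps guard is added as a safety assumption.  It is run on the small Table 1 dataset.

```python
import math


def t_two_sided_p(t_stat, df, steps=4000):
    """Two-sided p value from Student's t density, integrated by Simpson's rule."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    pdf = lambda x: math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    h = abs(t_stat) / steps
    area = pdf(0.0) + pdf(abs(t_stat))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3)


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def p_values(data, names, y):
    """Fit Y on the named predictors (plus intercept); return {name: p value}."""
    rows = [[1.0] + [data[name][i] for name in names] for i in range(len(y))]
    k = len(names) + 1
    xtx_inv = invert([[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)])
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y))
    df = len(y) - k
    s2 = sse / df  # residual variance
    return {name: t_two_sided_p(coef[i + 1] / math.sqrt(s2 * xtx_inv[i + 1][i + 1]), df)
            for i, name in enumerate(names)}


def stepwise(data, y, p_enter=0.15, p_remove=0.15, max_steps=100):
    """Add the lowest-p candidate below p_enter, then drop the highest-p term above p_remove."""
    selected = []
    for _ in range(max_steps):
        changed = False
        trial = {c: p_values(data, selected + [c], y)[c] for c in data if c not in selected}
        if trial:
            best = min(trial, key=trial.get)
            if trial[best] < p_enter:
                selected.append(best)
                changed = True
        if selected:
            current = p_values(data, selected, y)
            worst = max(current, key=current.get)
            if current[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected


PREDICTORS = {
    "x1": [88, 82, 95, 101, 102, 97, 101, 99, 92, 93,
           95, 117, 95, 104, 102, 87, 105, 101, 106, 110],
    "x2": [52, 57, 60, 25, 41, 33, 47, 24, 37, 31,
           57, 29, 21, 22, 42, 24, 60, 20, 60, 42],
}
Y = [186, 174, 197, 210, 214, 204, 209, 208, 190, 196,
     196, 242, 199, 218, 209, 183, 219, 208, 220, 228]

selected = stepwise(PREDICTORS, Y)
print("Predictors kept:", selected)
```

On this dataset the sketch keeps only x1, which agrees with the conclusion drawn from Table 2.  Real software typically adds safeguards that are left out here, such as handling perfectly collinear predictors.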

Stepwise Regression Example

You are a VP of sales and have responsibility for 41 stores.  You have collected data from the stores on advertising costs, store size in square feet, % employee retention, customer satisfaction score, whether a promotion was run or not and sales.  You want to build a model that can predict sales based on these five variables.  The data are shown in Table 3.

Table 3: Store Data

Store  Advertising Costs  Size (Sq. Ft)  % Employee Retention  Customer Satisfaction  Promotion  Sales
    1              124.4          22560                    59                     32          1   1581
    2                154          31181                    62                     33          1   2139
    3              123.5          16314                    78                     28          0   1043
    4                163          24205                    66                     43          0   1702
    5                107          17574                    82                     22          1   1339
    6              143.9          19584                    67                     34          0    521
    7              133.7          22682                    57                     32          1   1720
    8              121.4          23398                    64                     23          1   1197
    9              104.6          19507                    88                     25          0    950
   10               99.2          11443                    87                     17          0    266
   11               93.8          16832                    82                     29          1   1718
   12              133.8          24326                    70                     37          1   1820
   13              131.3          18541                    86                     27          1   1805
   14              123.2          22099                    67                     30          0   1042
   15               88.3          16928                    80                     21          0    655
   16              154.3          16237                    69                     28          1   1480
   17              112.1          15290                    87                     19          1   1057
   18                114          12947                    80                     23          0    953
   19               91.7          16326                    77                     24          0    364
   20              113.7          13024                    82                     27          0    783
   21              105.7          22054                    86                     28          1    792
   22              161.3          22637                    75                     34          1   2185
   23                143          18733                    64                     26          1   1051
   24              113.7          21126                    58                     30          0   1456
   25                 69          16819                    57                     18          0    146
   26              106.7          16992                    74                     22          0    899
   27              125.8          18355                    88                     25          1   1243
   28               52.4          13958                    87                     25          0    421
   29              114.7          18298                    71                     24          0    318
   30              142.5          18016                    55                     27          0    383
   31              114.6          22317                    80                     27          1    993
   32              150.5          26221                    71                     31          1   1766
   33               74.5          19494                    68                     24          1   1123
   34              111.4          18406                    85                     24          1   1523
   35              117.7          18880                    86                     24          1   1281
   36              108.4          21312                    81                     30          1   1899
   37              171.7          26618                    72                     35          1   2508
   38              156.5          15981                    72                     25          0   1042
   39              156.3          20749                    61                     32          0   1416
   40              140.6          22458                    77                     27          1   1659
   41              155.1          18579                    85                     32          0   1370

A stepwise regression was done on these data using the SPC for Excel software.  The p values to add and remove were both set at 0.15.

The first step was to regress Y on each predictor variable.  This simply means running a regression for each predictor variable alone versus Y.  Then, the predictor variable with the lowest p value is added to the model (as long as there is a predictor variable with a p value < 0.15).  The store size had the lowest p value, so it is added to the model in the first step.  The output is shown below.

Step 1: Added Size (Sq. Ft)
Variable        Coefficient   t Stat   p Value
Intercept            -633.8   -1.893     0.066
Size (Sq. Ft)        0.0946    5.618     0.000

The second step is to include each remaining predictor variable (one at a time) in the model that contains store size and run the regression.  If any p value for a predictor variable that is not in the model is less than 0.15, that predictor variable is added to the model.  Note that if two of the predictor variables have a p value less than 0.15, the predictor variable with the lowest p value is added to the model.  You are only adding or removing one variable at a time.  In this case, promotion had the lowest p value, so it is added to the model.

Step 2: Added Promotion
Variable        Coefficient   t Stat   p Value
Intercept            -355.3   -1.163     0.252
Size (Sq. Ft)        0.0675    4.036     0.000
Promotion             464.5    3.501     0.001

After a variable has been added, you check to see if any of the predictor variables in the model now have a p value greater than 0.15 (the p value to remove).  This is not the case, so the model now has two terms: store size and promotion.

The process repeats.  Each predictor variable not in the model is added, one at a time, to the model with two predictor variables, and the regression is run.  In this step, customer satisfaction had the lowest p value below 0.15 and is added to the model as shown below.

Step 3: Added Customer Satisfaction
Variable                Coefficient   t Stat   p Value
Intercept                    -826.8   -2.941     0.006
Size (Sq. Ft)                 0.012    0.613     0.543
Customer Satisfaction         54.15    4.107     0.000
Promotion                     594.5    5.132     0.000

Since p values change as additional predictor variables are added, you have to check to see if any of the predictor variables in the model now have a p value greater than 0.15 (the p value to remove).  The store size has a p value greater than 0.15, so it is removed from the model.

Step 4: Removed Size (Sq. Ft)
Variable                Coefficient   t Stat   p Value
Intercept                    -766.8   -2.933     0.006
Customer Satisfaction         59.76    6.343     0.000
Promotion                     630.7    6.382     0.000

The process repeats again using this model that now contains customer satisfaction score and promotion.   Advertising costs has the lowest p value under 0.15 and is added to the model.

Step 5: Added Advertising Costs
Variable                Coefficient   t Stat   p Value
Intercept                    -899.8   -3.423     0.002
Advertising Costs             4.456    1.880     0.068
Customer Satisfaction         45.13    3.764     0.001
Promotion                     608.8    6.382     0.000

When the process is repeated at this point, there are no predictor variables not in the model that have a p value less than 0.15.  So, there are no predictor variables to add or remove and the stepwise regression is completed.  This analysis led to three of the predictor variables being included in the model: advertising costs, customer satisfaction score and whether the promotion was run or not.  The model is:

Y = -899.8 + 4.456(Advertising Costs) + 45.13(Customer Satisfaction Score) + 608.8(Promotion)
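As a small usage example, the final equation can be wrapped in a function to predict sales for a store.  The function and argument names below are hypothetical; the coefficients are the ones from Step 5, and the inputs are store 1's values from Table 3.

```python
def predict_sales(advertising_costs, customer_satisfaction, promotion):
    """Evaluate the final stepwise model; promotion is 1 if a promotion ran, else 0."""
    return (-899.8
            + 4.456 * advertising_costs
            + 45.13 * customer_satisfaction
            + 608.8 * promotion)


# Store 1 from Table 3: advertising 124.4, satisfaction 32, promotion run, sales 1581.
predicted = predict_sales(124.4, 32, 1)
actual = 1581
print(f"predicted = {predicted:.1f}, actual = {actual}, residual = {actual - predicted:.1f}")
```

Comparing predicted and actual values store by store is one simple way to see how well the model fits before relying on it.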

You can easily run a full regression analysis that includes only these variables to complete the analysis.

Caveats about Stepwise Regression

This process seems very inviting.  You have lots of possible predictor variables and this process adds or removes predictor variables until a single model is reached.  Pretty neat.

However, if you search the internet about potential problems with stepwise regression, you will get quite a few hits.  To me, the biggest problem (not unique to stepwise regression) is that it encourages us not to think.  Here is the model.  You are done.  The problem is that the process is automated.  It can’t contain the knowledge of the subject expert.  You need to look at the model generated by stepwise regression from a practical point of view: does it make sense to the people who are the experts?

Another concern is that, if there is excessive linear dependence among the predictor variables (called multicollinearity), the procedure can end up throwing most of the variables into the model.  This can also occur if the number of variables to be tested in the model is large compared to the number of samples in your data.  Stepwise regression may not give you the model with the highest R² value (a measure of how well the model explains the variation in the data).  Some even say that stepwise regression usually doesn’t pick the best model.
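One quick, informal screen for linear dependence is to look at the correlation between pairs of predictor variables.  The sketch below computes the Pearson correlation between store size and customer satisfaction, with both lists transcribed from Table 3.  This check is an addition for illustration only, it is not part of the stepwise procedure itself, and the function name pearson_r is made up.

```python
import math

# Store size (sq. ft) and customer satisfaction score for stores 1 to 41 (Table 3).
SIZE = [22560, 31181, 16314, 24205, 17574, 19584, 22682, 23398, 19507, 11443,
        16832, 24326, 18541, 22099, 16928, 16237, 15290, 12947, 16326, 13024,
        22054, 22637, 18733, 21126, 16819, 16992, 18355, 13958, 18298, 18016,
        22317, 26221, 19494, 18406, 18880, 21312, 26618, 15981, 20749, 22458,
        18579]
SATISFACTION = [32, 33, 28, 43, 22, 34, 32, 23, 25, 17,
                29, 37, 27, 30, 21, 28, 19, 23, 24, 27,
                28, 34, 26, 30, 18, 22, 25, 25, 24, 27,
                27, 31, 24, 24, 24, 30, 35, 25, 32, 27,
                32]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(SIZE, SATISFACTION)
print(f"correlation between store size and customer satisfaction: {r:.2f}")
```

A sizable positive correlation here would be consistent with what happened in Steps 3 and 4 of the example, where store size lost its significance once customer satisfaction entered the model.  A pairwise correlation is only a rough indicator, though, and it cannot reveal dependence that involves three or more variables.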

But, in reality, you have to use your knowledge of the process to decide if the model makes sense.  And you can always run validation experiments to confirm the model.

Summary

This month’s publication introduced stepwise regression.  This is an automated technique for building a model from a larger number of predictor variables.  The procedure adds or removes predictor variables from a model based on their p values.  This procedure, like any automated procedure, cannot take into account your knowledge of the process.  When you have your final model from stepwise regression, you should ensure that it makes sense to you, the subject expert.  And you should also perform confirmation runs to ensure that the model is valid.

Thanks so much for reading our SPC Knowledge Base. We hope you find it informative and useful. Happy charting and may the data always support your position.

Sincerely,

Dr. Bill McNeese
BPI Consulting, LLC

2 Comments
Chris Lynn

It would be fun to see this technique used to analyze the best value when buying a used car. You could create a regression model based on variables like condition, mileage, year etc using the listings on Carmax.com or whatever for a particular make & model, and use it to look for bargains based on actual v predicted price. On reflection, I guess professionals already use this…(Presumably the last column in the 'Step n' tables should be headed 'p values' not 't-stat'?)

Bill

Hi Chris,
Yes, you could use stepwise regression for selling those used cars.  Getting the data might be tough.  Yes, you are right on the column headings.  Can't believe that was missed!  Corrected now.
