Working To Build Better Predictive Models (Pt 2)
Wednesday, August 3rd, 2011
In the first part of this discussion we outlined ways to increase the number of available predictor variables. What’s needed next is a repeatable process for identifying the key variables among the host of variables that appear in our databases. Here, statistical techniques like “correlation tables” and simple cross tabs, which show the relationship between each potential variable and response, can help. And, of course, the marketing people should always tell the modeler which variables they either know or think to be significant predictors.
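To make the idea of a cross tab and a correlation table concrete, here is a minimal sketch in plain Python. The variable names (`channel`, `orders`, `responded`) and the toy data are hypothetical, chosen only for illustration; in practice these figures would come from a statistical package run against the full promotion file.

```python
from collections import defaultdict
from math import sqrt

def response_rate_by_category(values, responses):
    """Cross tab: response rate for each category of one candidate variable."""
    counts = defaultdict(lambda: [0, 0])  # category -> [mailed, responded]
    for value, responded in zip(values, responses):
        counts[value][0] += 1
        counts[value][1] += responded
    return {cat: resp / mailed for cat, (mailed, resp) in counts.items()}

def correlation(xs, ys):
    """Pearson correlation between a numeric variable and a 0/1 response flag."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: six customers, one categorical and one numeric variable.
channel = ["web", "web", "mail", "mail", "mail", "web"]
orders = [5, 3, 1, 0, 2, 4]
responded = [1, 1, 0, 0, 0, 1]

rates = response_rate_by_category(channel, responded)
r = correlation(orders, responded)
print(rates)
print(round(r, 3))
```

One row of a correlation table is simply this correlation computed for one candidate variable against the response flag; repeating it for every numeric candidate gives the full table.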
However, we think the best technique for identifying potential variables is CHAID.
CHAID can be used to pictorially display the differences in response rates for each potential variable, one at a time. When used in this manner, the marketing person is on an equal footing with the analyst or statistician, because the results, with just a little bit of explanation, are so easy to understand. (Whether CHAID should be used beyond this point as a replacement for a regression model is a subject we won’t get into here.)
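The “one variable at a time” screening can be sketched as follows. This is not a full CHAID: it skips the merging of similar categories, the adjusted significance tests, and the recursive tree building that CHAID software performs. It only computes the chi-square statistic that underlies each first-level comparison and ranks the candidates by it. The variable names and data are hypothetical.

```python
from collections import Counter

def chi_square(values, responses):
    """Pearson chi-square statistic for a category-by-response table."""
    n = len(values)
    cell = Counter(zip(values, responses))  # observed count per (category, outcome)
    row = Counter(values)                   # totals per category
    col = Counter(responses)                # totals per outcome (0 or 1)
    stat = 0.0
    for category in row:
        for outcome in col:
            expected = row[category] * col[outcome] / n
            observed = cell[(category, outcome)]
            stat += (observed - expected) ** 2 / expected
    return stat

def rank_variables(candidates, responses):
    """Rank candidate variables by chi-square, strongest relationship first."""
    scores = {name: chi_square(vals, responses) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: eight customers, two candidate variables.
candidates = {
    "region": ["N", "N", "S", "S", "N", "S", "N", "S"],
    "channel": ["web", "web", "web", "web", "mail", "mail", "mail", "mail"],
}
responded = [1, 1, 1, 1, 0, 0, 0, 0]

ranking = rank_variables(candidates, responded)
print(ranking)
```

In this toy file `channel` separates responders from non-responders completely while `region` does not separate them at all, so `channel` ranks first.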
Needless to say, a CHAID can’t be done for every conceivable potential variable, so some combination of judgement and reliance on the correlation table will be required in this initial variable selection process.
Now, let’s assume for the purpose of this discussion that we identify 20 to 30 or even 50 variables, other than the basic RFM variables, that are each individually related to response. The last thing in the world we would want to do is use all of them in a model at the same time. The model would so “overfit” the data that while a Decile Analysis of the Calibration sample (the sample upon which the model was built) or even the Validation sample (the hold-out sample intended to prove the validity of the model) would look wonderful, the results of the model would never be replicated upon roll-out.
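For readers who have not built one, a Decile Analysis can be sketched in a few lines: rank the sample by model score, cut it into ten equal groups, and compare each group’s response rate with the overall rate. The function name, the lift index (100 = average), and the toy data below are our own illustrative assumptions, and the sketch assumes at least ten records and at least one responder.

```python
def decile_analysis(scores, responses, groups=10):
    """Response rate and lift index for each score-ranked group (best first)."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall = sum(responses) / n  # overall response rate of the sample
    table = []
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        rate = sum(r for _, r in chunk) / len(chunk)
        table.append({"decile": g + 1, "rate": rate, "lift": 100 * rate / overall})
    return table

# Hypothetical toy sample: 20 customers, 4 responders, scores 20 (best) to 1.
scores = list(range(20, 0, -1))
responses = [1, 1, 1, 0, 1, 0] + [0] * 14

table = decile_analysis(scores, responses)
for row in table:
    print(row["decile"], row["rate"], round(row["lift"]))
```

Running the same function on the Calibration sample and on the Validation sample, and then again on actual roll-out results, is what reveals whether the spread between the top and bottom deciles holds up.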
To at least some degree, you don’t have to worry about this danger, because the programs that produce regression models, if used correctly, will prevent it from happening. What may happen, however, is that these very same programs (stepwise regression programs) will frequently produce models that contain “too many” variables, even though the statistics describing these variables will suggest that they are significant.
When this happens, even though the Decile Analysis done on the Validation sample will look good, the model will have less than an optimum chance to hold up on roll-out promotions. To prevent this from happening, or to at least reduce the chances of this happening, we suggest “pruning away” the least significant of the significant variables and observing the effect on the Decile Analysis.
If the Decile Analysis is not significantly affected (made worse), then drop the variable. As often as not, you will find that dropping the unnecessary variables actually improves the Decile Analysis: it increases the spread and removes “bumps” in the model. If all of these steps are followed, you will have a good chance of replacing your RFM models.
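The pruning step can be sketched as a simple loop. Everything named here is an assumption for illustration: `significance` stands in for whatever statistic the regression program reports for each variable, and `evaluate` stands in for refitting the model on the Calibration sample with the given variables and returning a summary of the Validation Decile Analysis (for example, top-decile lift). For simplicity the sketch keeps the significance ranking fixed, whereas in practice it would be re-read after each refit.

```python
def prune(variables, significance, evaluate, tolerance=0.0):
    """Drop the least significant variable while validation does not get worse."""
    kept = sorted(variables, key=lambda v: significance[v], reverse=True)
    best = evaluate(kept)
    while len(kept) > 1:
        trial = kept[:-1]  # remove the least significant remaining variable
        score = evaluate(trial)
        if score < best - tolerance:
            break  # validation got meaningfully worse, so keep this variable
        kept, best = trial, max(best, score)
    return kept, best

# Hypothetical stand-ins for real regression output and a real validation run.
significance = {"tenure": 9.0, "web_orders": 7.0, "gift_buyer": 4.0,
                "noise_a": 2.2, "noise_b": 2.0}
useful = {"tenure", "web_orders", "gift_buyer"}

def evaluate(variables):
    """Toy validation metric: useful variables help, noise variables hurt a little."""
    return sum(100 if v in useful else -5 for v in variables)

kept, best = prune(list(significance), significance, evaluate)
print(kept, best)
```

The `tolerance` argument is where judgement enters: it sets how much deterioration in the Decile Analysis counts as “significantly affected”.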