
Working to Build Better Regression Models
By: David Shepard
This article first appeared in Direct Magazine
In last month’s column I noted that I was surprised that fewer than 30% of the companies surveyed in the DMA PricewaterhouseCoopers survey of CRM practices used regression models. By way of contrast, close to 50% were using RFM models. If regression is really a better tool, if for no other reason than the obvious observation that regression models can call on variables other than RFM, why this disparity?
I don’t know. But part of the answer may have to do with modeling attempts that did not work, or did not work better than RFM.
For starters, it should be clear that for a regression model to “work better” than an RFM model, it has to incorporate variables other than the RFM variables that aid in the prediction of the dependent variable.
To keep things relatively simple, let’s concentrate on response models, because most RFM models are used to predict response. Let’s further stipulate that, for the purpose of this discussion, to “work better” means to improve the “Lift”, that is, the ratio of responders to names promoted at some agreed-upon depth of file. For example, for a regression model to “work better” than an RFM model at a depth of, say, 30% of the file, the regression model would have to identify significantly more responders than the RFM model would have identified at the same depth. Also, the argument that it’s easier to score a file with a single regression equation than it is to manage an RFM process won’t count in this discussion, even though it’s true.
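To make the comparison concrete, here is a minimal Python sketch of the ratio of responders to names promoted at a chosen depth of file. The function name, the ten-customer file and both sets of scores are invented for illustration; they are not survey or client data.

```python
def response_rate_at_depth(scores, responded, depth):
    """Responders divided by names promoted in the top `depth` share of the file.

    scores    -- model scores, higher means more likely to respond
    responded -- 1 if the customer responded, 0 otherwise
    depth     -- share of the file promoted, e.g. 0.30 for 30%
    """
    # Rank the file from best score to worst.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    names_promoted = max(1, round(len(ranked) * depth))
    responders = sum(flag for _, flag in ranked[:names_promoted])
    return responders / names_promoted


# Hypothetical file of ten customers, scored two ways.
responded = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
rfm_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.5, 0.4, 0.05]
regression_scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.5, 0.6, 0.05]

rfm_rate = response_rate_at_depth(rfm_scores, responded, 0.30)
regression_rate = response_rate_at_depth(regression_scores, responded, 0.30)
print(f"RFM: {rfm_rate:.2f}  Regression: {regression_rate:.2f}")
```

In this toy file the regression scores find more responders in the top 30% than the RFM scores do, which is the sense of “work better” used here. On a real file the difference would also need to be tested for significance.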
So we get back to the question of identifying more variables, variables other than the RFM variables (Recency of purchase, Frequency of purchase and some measure of Monetary Value).
One way to do this is simply to create new variables out of the FM variables: for example, the total number of purchases, or total sales, divided by months on file or by the number of times promoted.
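As a rough illustration, these ratio variables could be derived as follows. The field names are assumptions about how a customer record might be laid out, not a standard schema.

```python
def derived_fm_variables(total_purchases, total_sales, months_on_file, times_promoted):
    """Build rate variables from the raw Frequency and Monetary totals."""

    def ratio(numerator, denominator):
        # Guard against brand-new or never-promoted customers.
        return numerator / denominator if denominator else 0.0

    return {
        "purchases_per_month": ratio(total_purchases, months_on_file),
        "sales_per_month": ratio(total_sales, months_on_file),
        "purchases_per_promotion": ratio(total_purchases, times_promoted),
        "sales_per_promotion": ratio(total_sales, times_promoted),
    }


# Hypothetical customer: 6 purchases, $300 in sales, 24 months on file, promoted 12 times.
example = derived_fm_variables(6, 300.0, 24, 12)
print(example)
```

Returning zero when the denominator is zero is only one possible choice; a separate flag for such customers may be more informative in a model.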
Another key variable that frequently appears is Tenure, or the length of time a customer has been on the database. This is such an important variable that it is frequently the basis for creating separate models, one for relatively new customers, and one or more models for customers that have been on the file a longer period of time.
Then there is product purchase data: which particular products or product categories has the customer purchased? This data can be handled through the use of “dummy” or 0/1 coded variables. And, as we have mentioned in past articles, the best way to handle this data is through the use of Principal Components Analysis, a technique that gets at the pattern of purchases over the entire set of purchase possibilities.
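A short sketch of the 0/1 coding step, using made-up category names and customer IDs. The Principal Components step is not shown; it would take these 0/1 columns as its input.

```python
def dummy_code(purchases_by_customer, categories):
    """Return one 0/1 flag per category for each customer.

    purchases_by_customer -- dict of customer id -> set of categories purchased
    categories            -- ordered list of every category to code
    """
    return {
        customer: [1 if category in bought else 0 for category in categories]
        for customer, bought in purchases_by_customer.items()
    }


# Hypothetical categories and purchase histories.
categories = ["apparel", "books", "garden"]
purchases = {
    "C1": {"books"},
    "C2": {"apparel", "garden"},
    "C3": set(),
}
coded = dummy_code(purchases, categories)
for customer, flags in coded.items():
    print(customer, flags)
```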
Of course, what’s needed most is a repeatable process for identifying key variables from the host of variables that appear on our databases. Here statistical techniques like “correlation tables” and simple cross tabs, which show the relationship between potential variables and response, can help. And, of course, the marketing people should always tell the modeler which variables they either know or think to be significant predictors.
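A simple cross tab of this kind can be sketched in a few lines. The variable (a tenure band) and the eight records below are hypothetical.

```python
from collections import defaultdict


def response_crosstab(values, responded):
    """Cross tab one candidate variable against response.

    Returns, for each value of the variable, the names promoted,
    the responders and the response rate.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [names, responders]
    for value, flag in zip(values, responded):
        counts[value][0] += 1
        counts[value][1] += flag
    return {
        value: {"names": names, "responders": hits, "rate": hits / names}
        for value, (names, hits) in sorted(counts.items())
    }


# Hypothetical candidate variable and response flags for eight customers.
tenure_band = ["new", "new", "new", "new", "old", "old", "old", "old"]
responded = [0, 0, 0, 1, 1, 1, 0, 1]

table = response_crosstab(tenure_band, responded)
for value, row in table.items():
    print(value, row)
```

A variable whose response rates differ widely from one row to the next is a candidate worth carrying forward; one whose rows all look alike probably is not.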
However, we think the best technique for identifying potential variables is CHAID.
CHAID can be used to pictorially display the differences in response rates looking at each potential variable, one at a time. When used in this manner, the marketing person is on an equal footing with the analyst or statistician, because the results, with just a little bit of explanation, are so easy to understand. (Whether CHAID should be used beyond this point as a replacement for a regression model is a subject we won’t get into here.) Needless to say, a CHAID can’t be done for every conceivable potential variable, so some combination of judgement and reliance on the correlation table will be required in this initial variable selection process.
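CHAID takes its name from the chi-square test it uses to decide whether response rates differ across the values of a variable. The sketch below computes only that statistic for one variable at a time; it is not a full CHAID, which also merges similar categories and grows a tree. The counts are invented.

```python
def chi_square_statistic(table):
    """Pearson chi-square for a table of {category: (responders, non_responders)}.

    Assumes at least one responder and one non-responder overall,
    otherwise an expected count would be zero.
    """
    total_responders = sum(resp for resp, _ in table.values())
    total_non_responders = sum(non for _, non in table.values())
    grand_total = total_responders + total_non_responders

    statistic = 0.0
    for resp, non in table.values():
        row_total = resp + non
        for observed, column_total in ((resp, total_responders), (non, total_non_responders)):
            # Count expected if response were unrelated to the variable.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: 100 names promoted in each of two categories.
differs = chi_square_statistic({"A": (30, 70), "B": (10, 90)})
no_difference = chi_square_statistic({"A": (20, 80), "B": (20, 80)})
print(differs, no_difference)
```

The larger the statistic, the stronger the evidence that response varies with the variable, so ranking candidate variables by it is one way to decide which ones deserve a full CHAID display.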
Now, let’s assume for the purpose of this discussion that we identify 20 to 30 or even 50 variables, other than the basic RFM variables, that are each individually related to response. The last thing in the world we would want to do is use all of them in a model at the same time. The model would so “overfit” the data that while a Decile Analysis of the Calibration sample (the sample upon which the model was built) or even the Validation sample (the hold-out sample intended to prove the validity of the model) would look wonderful, the results of the model would never be replicated upon roll-out.
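For readers who have not built one, a Decile Analysis can be sketched as follows: rank the file by model score, cut it into ten equal groups, and compare each group’s response rate with the file average. The scores and responses below are invented, and the function assumes a file of at least ten names with at least one responder.

```python
def decile_analysis(scores, responded):
    """Return one row per decile, best-scored decile first."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(responded) / n

    rows = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        responders = sum(flag for _, flag in group)
        rate = responders / len(group)
        rows.append({
            "decile": d + 1,
            "names": len(group),
            "responders": responders,
            "rate": rate,
            # Index of 100 means the decile responds at the file average.
            "index": 100 * rate / overall_rate,
        })
    return rows


# Hypothetical file of 20 names, already scored.
scores = list(range(20, 0, -1))
responded = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rows = decile_analysis(scores, responded)
for row in rows:
    print(row["decile"], row["names"], row["responders"], round(row["index"]))
```

Running the same function on the Calibration sample and on the Validation sample gives the two tables being compared here. An overfit model can look strong on both and still disappoint on roll-out.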
To at least some degree, this is a danger you don’t have to worry about, because the programs that produce regression models, if used correctly, will prevent this from happening. But what may happen is that these very same programs (stepwise regression programs) will frequently produce models that contain “too many” variables, even though the statistics describing these variables will suggest that they are significant.
When this happens, even though the Decile Analysis done on the Validation sample will look good, the model will have a less than optimum chance of holding up on roll-out promotions. To prevent this from happening, or at least to reduce the chances of it happening, we suggest “pruning away” the least significant of the significant variables and observing the effect on the Decile Analysis. If the Decile Analysis is not made significantly worse, then drop the variable. As often as not, you will find that dropping the unnecessary variables actually improves the Decile Analysis: it increases the spread and removes “bumps” in the model. If all of these steps are followed, you will have a good chance of replacing your RFM models.
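The pruning step can be sketched as a loop. This is a simplified illustration under stated assumptions, not a prescribed procedure: `evaluate` is a hypothetical function you would supply that refits the model on the given variables and returns a single number summarizing the Validation-sample Decile Analysis (for example, the top-decile index), and the p-values are taken as fixed, whereas in practice they would be re-estimated after every refit.

```python
def prune_variables(variables, p_values, evaluate, tolerance=0.0):
    """Try dropping variables, least significant first.

    variables -- names of the variables in the stepwise model
    p_values  -- dict of name -> p-value (larger means less significant)
    evaluate  -- callable taking a list of names, returning a validation
                 score where higher is better (hypothetical, user supplied)
    tolerance -- how much worse the score may get and still allow a drop
    """
    kept = list(variables)
    current_score = evaluate(kept)
    for name in sorted(variables, key=lambda v: p_values[v], reverse=True):
        if len(kept) == 1:
            break
        trial = [v for v in kept if v != name]
        trial_score = evaluate(trial)
        # Drop the variable unless the Decile Analysis gets noticeably worse.
        if trial_score >= current_score - tolerance:
            kept, current_score = trial, trial_score
    return kept, current_score


# Toy stand-in for refitting and scoring: two variables help, the rest hurt slightly.
contribution = {"recency": 3.0, "tenure": 2.0}

def toy_evaluate(names):
    return sum(contribution.get(name, -0.1) for name in names)

variables = ["recency", "tenure", "noise_a", "noise_b"]
p_values = {"recency": 0.001, "tenure": 0.01, "noise_a": 0.04, "noise_b": 0.03}

kept, score = prune_variables(variables, p_values, toy_evaluate)
print(kept, score)
```

In the toy run the two noise variables are pruned and the score improves, mirroring the observation that dropping unnecessary variables can sharpen the Decile Analysis rather than hurt it.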