Real World Modeling Concerns – It’s Not about Tools
By: David Shepard
This article first appeared in Direct Magazine
From time to time there are articles in the trade press and the academic press about the relative merits of neural nets, logistic regression, and what I’ll refer to as relatively simple but structured RFM analysis. The usual conclusion is that, given the decision to investigate a specific number of potential predictor variables, it’s not always true that neural nets will beat regression, or vice versa. The other conclusion is that both methods allow for consideration of more variables than RFM does, and by definition that’s true. However, if all you want to look at is RFM variables, then a simple RFM analysis may be fine, and an RFM analysis guided by a CHAID analysis is probably the best way to handle this option, although the 125-cell technique suggested by Arthur Hughes and others also works for many people.
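To make the 125-cell idea concrete, here is a minimal Python sketch under one common reading of it: score each customer from 1 to 5 on recency, frequency, and monetary value, giving 5 x 5 x 5 = 125 possible cells, then look at the response rate in each cell. The field names (`recency_days`, `frequency`, `monetary`), the function names, and the choice of independent rather than nested quintiles are assumptions made for illustration, not details taken from Hughes.

```python
from collections import defaultdict


def quintile_scores(values, higher_is_better=True):
    """Return a 1-5 score per value by rank; 5 marks the best fifth.

    Ties are broken by input order, which is a simplification.
    """
    n = len(values)
    # Order positions from worst to best so the best customers score 5.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // n + 1
    return scores


def rfm_cells(customers):
    """Map each customer id to an (R, F, M) cell.

    `customers` is a list of dicts with hypothetical keys:
    id, recency_days, frequency, monetary.
    """
    # Fewer days since the last purchase is better, so recency is reversed.
    r = quintile_scores([c["recency_days"] for c in customers], higher_is_better=False)
    f = quintile_scores([c["frequency"] for c in customers])
    m = quintile_scores([c["monetary"] for c in customers])
    return {c["id"]: (ri, fi, mi) for c, ri, fi, mi in zip(customers, r, f, m)}


def response_rate_by_cell(cells, mailed_ids, responder_ids):
    """Return the observed response rate for every cell that was mailed."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [mailed, responded]
    for cid in mailed_ids:
        counts[cells[cid]][0] += 1
        if cid in responder_ids:
            counts[cells[cid]][1] += 1
    return {cell: responded / mailed for cell, (mailed, responded) in counts.items()}
```

Cells with only a handful of mailed names will produce noisy rates, which is one reason a CHAID-guided grouping may be preferable to a fixed 125-cell grid.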
In reality, the decision to use one method over the other should
be strongly influenced by other considerations, such as ease of implementation, ease of explanation, the probability of detecting errors, and the expected robustness of each alternative. Of course ease of use and
ease of implementation will greatly depend on the skill sets of the individuals doing the work. What’s easy for a trained analyst or data processing person may be next to impossible for someone without that
training. (And, in my opinion, you have to be exceptionally careful about new tools that are promised to be more sophisticated than anything else on the market yet require no prior knowledge to operate.)
While measuring the relative performance of alternative techniques is fun to do and fun to read about, it really doesn’t get to the heart of the modeling problem. The core modeling problem is not about spreading an average response rate or an average order, given a data set based on one or more past promotions; that’s easy. It’s about developing a modeling methodology that can be adapted to a changing environment.
Some of the key questions that modelers have to answer
before choosing a tool to produce an equation or a series of equations to predict some outcome are listed below. As you go through them you’ll see that many of the questions have to do with the integrity of the data, while others have to do with your business processes and your marketing strategy. Questions about modeling tools pale in comparison to these concerns.
- Is the data you’re working with accurate? Not accurate in the sense of being right to the last decimal point, but rather: are the definitions correct? Are returns really returns, are claims really claims, and are customer start dates really start dates, or are they the date there was a conversion from one system to another, and so on?
- Is the mail file complete, or just what was available? Are all the responses or orders accounted for?
- Do the values of the predictive variables you are working with reflect the customer’s values at the time of name selection, or were they gathered at some later date (worst case, at or after the time the promotion was completed)?
- Which independent predictor variables have an effect on the dependent variable, the event we are trying to predict? Is the effect the same across all sub-segments of the file? If not, and if the differences are significant, then you may require multiple models, one for each segment, or you may have to build “interaction variables” into the model to capture the effect of the different sub-populations.
- Is the relationship between a suspected independent predictor variable and the dependent variable linear (best represented by a single straight line), or is it better represented by a curved line or a series of broken lines? If you want an accurate prediction, this question needs to be answered.
- Is the population on which the model will be used the same as the population from which the model was built? If, for example, it was built on
customers on the file for some time, it probably won’t apply to new customers.
- How is the model intended to be used? If customers with high scores are to be mailed or called as frequently as a model might suggest, have you
correctly taken fatigue and cannibalization into account?
- Has the offer changed significantly? A model built on a soft offer may not work on a hard offer, and probably won’t if the difference is
significant.
- How will categorical variables with lots of possibilities be handled? For example, product line purchases. If you use simple dummy variables to indicate purchase or non-purchase, the results become unstable when the number of categories exceeds five or six, and Principal Components Analysis, where patterns of purchase can be identified and quantified, is probably a better solution.
- If you are going to use household level demographic data, as opposed to geography based data, how are you going to handle non-matches and
missing values on matched records?
- If your business requires both a response and a conversion, should you model both separately, or try to model conversions directly?
Strategically it makes a difference.
- If you have lots and lots of data from many promotions, how do you use all of it? Average it, weight it, build seasonal models, or just use the last promotion and “forgetaboutit”? There are lots and lots of choices and there are no definitive solutions.
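As one illustration of the point-in-time question in the list above, the following Python sketch rebuilds a customer's predictor values using only orders placed before the name selection date. The function name, the `(customer_id, order_date, amount)` tuple layout, and the output keys are all assumptions made for this example; your own files will look different.

```python
from datetime import date


def predictors_as_of(orders, selection_date):
    """Summarise each customer's history as it stood on selection_date.

    `orders` is an iterable of (customer_id, order_date, amount) tuples,
    a hypothetical layout chosen for this sketch.
    """
    summary = {}
    for customer_id, order_date, amount in orders:
        if order_date >= selection_date:
            # This order was not knowable when names were selected, so it
            # must not leak into the predictor values.
            continue
        s = summary.setdefault(
            customer_id, {"orders": 0, "spend": 0.0, "last_order": None}
        )
        s["orders"] += 1
        s["spend"] += amount
        if s["last_order"] is None or order_date > s["last_order"]:
            s["last_order"] = order_date
    for s in summary.values():
        # Recency is measured back from the selection date, not from today.
        s["recency_days"] = (selection_date - s["last_order"]).days
    return summary
```

Customers with no orders before the selection date are simply absent from the result, which is itself a useful check: anyone who was mailed as a customer but has no prior history points to a data problem worth investigating before any modeling tool is chosen.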
So, these are the “real life” issues that you and your modelers
need to be thinking about. Once you answer these questions you can use any tool you like to arrive at a final equation or set of equations – provided you know how to use the tool you selected. Remember,
there’s nothing more dangerous than the wrong person using the right tool.