
The Bayesian Alternative - Or Another Way to Skin the Modeling Cat
By David Shepard
This Article First Appeared in Direct Magazine
If you’ve been following this column over the last year or so, you know that I’ve become somewhat obsessed with the issue of multiple models. For those who haven’t been paying close attention, meaning just about anyone with a life, here’s the problem. You want to build a model to predict some outcome: a response to a cross-sell mailing, attrition, lifetime value…whatever.
You’d like to come away with one simple-to-use equation that can be applied to your entire customer file. But you intuitively know that this might not be possible, or at least not easy. For example, Tenure, how long someone has been your customer, is certainly an important variable, but how does it relate to the other variables in your model? Consider this: demographics may be important predictors for customers who have been on your file for just a few months, but will they be important predictors, or as important, for customers who have been on the file for years, and about whom you have lots of transaction information and variables?
Or, how about this: an increasing percentage of your customers are now coming in over the Web. Will they behave the same way your traditional customers behave? Probably not. In fact, not all of your traditional customers behave the same way…they differ among themselves depending upon the traditional medium they came from: direct mail, print, broadcast, package enclosures, and so on. Not only that, but customers who were attracted to you because they received a traditional direct mail promotion may now respond via direct mail, or via the phone, or via the Internet. Will they all behave and respond the same way? Maybe not.
One more example. You use, I’ll make it up, six (6) different offers, and customers from the same major medium behave differently, depending upon the offer they came in on.
So, what’s the problem? Can’t we use the usual set of regression modeling strategies to deal with these complications? Why can’t we just use lots of dummy variables to identify the different major media promotion sources, another set of dummies to identify response vehicles, another set of dummies to represent offers, and even break up a continuous variable like Tenure into a categorical variable to represent different lengths of time on the file…and then, finally, to account for all of the “interactions” we’ve identified (e.g., the importance of age depends upon something else, such as gender or time on file), cross multiply all of the variables in the model?
Well, we could, but it gets very messy very fast. The other obvious alternative, creating separate models for separate segments, could also work, provided the segments are large. But as we keep dividing up the population, the segments get small pretty quickly.
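To put rough numbers on "messy" and "small," here is a minimal Python sketch that counts the dummy variables, the two-way interaction terms, and the segments you would end up with. Only the six offers come from the example above; the other level counts, the four tenure bands, and the 100,000-customer file are assumptions made up for illustration.

```python
from itertools import combinations
from math import prod

# Number of levels per categorical factor. Only the six offers come from
# the text; the other counts are illustrative assumptions.
levels = {
    "source_medium": 5,     # direct mail, print, broadcast, package enclosures, Web
    "response_vehicle": 3,  # mail, phone, Internet
    "offer": 6,
    "tenure_band": 4,       # hypothetical banding of Tenure
}

# A factor with k levels needs k - 1 dummy variables.
dummies = {name: k - 1 for name, k in levels.items()}
main_effects = sum(dummies.values())

# Two-way interactions: cross multiply the dummies of every pair of factors.
two_way = sum(dummies[a] * dummies[b] for a, b in combinations(dummies, 2))

# Separate models instead: one segment per combination of levels.
cells = prod(levels.values())

file_size = 100_000  # hypothetical customer file
avg_per_cell = file_size / cells

print(main_effects, two_way, cells, round(avg_per_cell))
```

Under these assumed counts, that is 14 dummies, 71 two-way interaction terms, and 360 segments averaging about 278 customers each, before any of the individual predictor variables are crossed with the dummies.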
Which gets us, finally, to the Bayesian Alternative. The Bayesian solution requires the analyst, in conjunction with the marketer, to identify all of the segments thought to be important in the dataset that will be used for modeling, and also to identify all of the variables that are thought to be important to all of the segments.
For example, Age might be thought to be important across all segments, but its importance (predictive power) may vary across segments -- important for men, but somewhat less important for women, important for customers responding over the Internet, but not as important for customers responding by mail. (When you read less important, what I mean is that the regression coefficients should be different for each group.)
Another example: the offer to which the customer responded may be very important in predicting the behavior of new customers, but less important, if important at all, in predicting the future behavior of customers who have been on the file more than 12 months.
Or suppose your file has already been segmented by some combination of demographic and behavioral data. These segments may be represented in a Bayesian Model as separate categories, meaning that variables such as age, recency of purchase, and amount spent per month on file, all of them strong individual behavior variables, are important predictors across all segments, but how much weight each one carries for an individual customer depends on that customer’s segment.
OK, now suppose you’ve agreed to break your file into, I’ll make it up, eight (8) different categories, and you’ve identified a couple of dozen individual predictor variables (continuous or categorical) that you think will be important predictors across all or at least many of the eight categories (you can call them groups or segments, if you prefer).
The next step is to run a Bayesian Regression Model (sometimes these are referred to as Mixed Models, or Random Effects Models). Through the usual modeling procedures you’ll prune down the number of predictor variables (using both statistical methods and decile analyses of the validation sample) to a significant few, say five (5) for the purpose of this article.
Here’s the difference. You won’t get five (5) regression coefficients. You’ll get eight (8) sets of five coefficients. What’s happened, in effect, is that the process has created five (5) regression coefficients that fit the entire dataset, and then gone back and “tinkered” with that set of coefficients, producing the best set for each category.
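The “tinkering” can be pictured with a deliberately simplified Python sketch. This is not a fitted Bayesian or mixed model: it uses one predictor instead of five, three made-up segments instead of eight, and an assumed constant, `PRIOR_STRENGTH`, standing in for the amount of pooling a real model would estimate from the data. Each segment’s coefficient is a weighted compromise between the coefficient fit on that segment alone and the coefficient fit on the whole file, and small segments lean harder on the whole-file value.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x (model includes an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical (tenure in years, spend) observations for three segments.
segments = {
    "mail":  ([1, 2, 3, 4, 5, 6], [10, 12, 14, 16, 18, 20]),
    "phone": ([1, 2, 3, 4], [5, 9, 13, 17]),
    "web":   ([1, 2], [8, 9]),  # a very small segment
}

PRIOR_STRENGTH = 4  # assumed; larger values pull segments harder toward the pooled fit

# Coefficient that fits the entire dataset.
all_x = [x for xs, _ in segments.values() for x in xs]
all_y = [y for _, ys in segments.values() for y in ys]
pooled = ols_slope(all_x, all_y)

own, weights, shrunk = {}, {}, {}
for name, (xs, ys) in segments.items():
    own[name] = ols_slope(xs, ys)                          # segment-only fit
    weights[name] = len(xs) / (len(xs) + PRIOR_STRENGTH)   # more data, more trust
    shrunk[name] = weights[name] * own[name] + (1 - weights[name]) * pooled

for name in segments:
    print(f"{name}: own={own[name]:.2f} pooled={pooled:.2f} adjusted={shrunk[name]:.2f}")
```

The n / (n + constant) weight is a common simplification. In a real mixed model the weights come out of the estimated variances, and all five coefficients in each set are adjusted together.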
What are the benefits? For starters, you have only one model, as opposed to the alternative of creating one model for each category, and the number of categories can get very large very quickly when you start to think of all the ways to slice and dice a dataset. What’s more, the “academic literature” on the subject suggests that this methodology produces better fits than would have been achieved with separate models. And, when compared to the alternatives that employ large numbers of dummy variables and lots of interactions between the dummies and the other variables in the model, the results are easier to interpret and score…and from a practical perspective may be more accurate, since there is less chance for error.
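As a rough illustration of why scoring stays simple, here is a minimal Python sketch: look up the coefficient set for the customer’s segment, then apply it like any ordinary regression equation. The segment names, variable names, and coefficient values are all hypothetical, the sets are cut to three predictors for brevity, and the score is a plain linear one (a response model would typically pass it through a link function).

```python
# Hypothetical coefficient sets: one per segment, same predictors in each.
COEFFICIENTS = {
    "web":  {"intercept": 1.0, "age": 0.02, "recency_months": -0.10, "spend_per_month": 0.05},
    "mail": {"intercept": 0.5, "age": 0.04, "recency_months": -0.05, "spend_per_month": 0.03},
}

def score(customer):
    """Linear score using the coefficient set for the customer's segment."""
    coefs = COEFFICIENTS[customer["segment"]]
    return coefs["intercept"] + sum(
        weight * customer[name]
        for name, weight in coefs.items()
        if name != "intercept"
    )

# The same customer profile scored under two different segments.
profile = {"age": 40, "recency_months": 3, "spend_per_month": 20}
web_score = score({"segment": "web", **profile})
mail_score = score({"segment": "mail", **profile})
print(web_score, mail_score)
```

There are no dummy variables or interaction terms to build at scoring time; the segment does all of that work by selecting which set of coefficients applies.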
Finally, I’d like to thank Ed Malthouse at Northwestern, who first pointed me in this direction, and my associates Asim Ansari and Rajeev Kohli at the Columbia Business School, who took the time to explain the process to me. You can’t make this stuff up.