David Shepard Associates, Inc. Database Marketing Consultants (Marketing Strategy, Analytics & Statistical Models, Marketing Database Systems)

Separate Models for Separate Segments?

By: David Shepard

This article first appeared in Direct Magazine


One way to improve your modeling results is to look for segments within your customer database that relate differently to potentially predictive variables such as Recency, Frequency, Monetary Value and Products Purchased.

The trick is to determine whether the relationship is equally strong across all segments or whether its strength differs from segment to segment.

For example, let's suppose you believe that your sales are correlated with two variables; we'll call them X1 and X2. What you might do is ask your statistician to draw a sample of data, create a Scatter Diagram so that you can see the relationship, and calculate the Correlation Coefficient so that you can quantify the relationship as well as visualize it. We did that for a dataset we created for this article.
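As a rough illustration of what the statistician computes, here is a minimal sketch of the correlation coefficient written out from its definition. The sample values and the function name pearson_r are hypothetical; they are not the dataset created for this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each variable around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample, not the article's data
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [12, 9, 20, 14, 25, 18, 30, 22]

r = pearson_r(x1, sales)
print(f"Correlation between X1 and sales: {r:.2f}")
```

The result always falls between -1 and 1; the closer its absolute value is to 1, the more tightly the points in the scatter diagram hug a straight line.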

So far so good. Your hunch was correct: your sales (Y) are positively correlated with X1 and also with X2. And while the correlations are not strong (.7 to .9), they are not weak (.1 to .3) either. They are moderate: .45 and .64. (The absolute value of a correlation coefficient cannot be less than 0 or more than 1.)

Now that you’ve discovered two variables that are related to sales, you would want to build a two-variable regression model of the form Y = A + b1*X1 + b2*X2. Using the same data set that produced the above results, you have your statistician run the data through a Regression procedure and produce the following results.

Y = 31.5 + 9.2*X1 + 6.7*X2 with an R-Squared of 59%.

Not bad. Our simple two-variable example produced an equation, or model, that explains 59% of the variation we see in our customers’ behavior.
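To make "explains 59%" concrete, the sketch below shows how R-Squared is usually computed: one minus the ratio of unexplained variation to total variation. The coefficients come from the equation above; the five customers and the function names are hypothetical, so the printed value will not match the article's 59%.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that `predicted` accounts for."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)              # total variation
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained variation
    return 1 - ss_resid / ss_total

def predict(x1, x2):
    # Coefficients from the article's fitted two-variable equation
    return 31.5 + 9.2 * x1 + 6.7 * x2

# Hypothetical customers: (X1, X2, observed sales)
customers = [(10, 5, 170), (20, 8, 250), (15, 12, 260), (30, 4, 320), (25, 10, 340)]
actual = [y for _, _, y in customers]
predicted = [predict(x1, x2) for x1, x2, _ in customers]

print(f"R-Squared = {r_squared(actual, predicted):.2f}")
```

An R-Squared of 1 means the predictions match every customer exactly; 0 means the model does no better than predicting the average for everyone.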

Suppose it now dawned upon you that while the sales of your customers were correlated with variables X1 and X2, your customer file was really made up of three distinct segments that you call Young, Middle and Old, and that you suspect that the relationship between sales and X1 and X2 might not be the same for each segment.

What could you do? Since you’ve identified three segments, you could use this information in your model. How? Have your statistician create two new “Dummy Variables”: DY, which equals 1 for your young customers and 0 for everyone else, and DM, which equals 1 for your middle-aged customers and 0 for everyone else. You don’t need a third variable, DO, for your old customers, because if they are not Young (DY = 1) or Middle (DM = 1), then they must be in the segment called Old. Your statistician runs the data through the regression program again and arrives at the following equation:

Y = 428 + 8.4*X1 + 7.6*X2 - 539.5*DY - 804.4*DM, and R-Squared goes to 86%.
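A short sketch of the dummy coding may help. It uses the coefficients from the equation above; the function names dummy_code and predict are hypothetical. Note what the dummies do in this form of the model: they shift the starting point (the intercept) for each segment, while the weights on X1 and X2 stay the same for everyone.

```python
def dummy_code(segment):
    """Return (DY, DM) for a segment; Old is the baseline, coded (0, 0)."""
    return (1 if segment == "Young" else 0,
            1 if segment == "Middle" else 0)

def predict(x1, x2, segment):
    dy, dm = dummy_code(segment)
    # Coefficients from the article's dummy-variable equation
    return 428 + 8.4 * x1 + 7.6 * x2 - 539.5 * dy - 804.4 * dm

for seg in ("Young", "Middle", "Old"):
    # With X1 = X2 = 0, the prediction is the segment's intercept
    intercept = predict(0, 0, seg)
    print(f"{seg}: Y = {intercept:.1f} + 8.4*X1 + 7.6*X2")
```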

Your hunch was correct: the segments do differ, and telling the model which segment each customer belongs to improves the fit. Your statistician now suggests that the results could be improved even more if you looked for the interaction between the segment identifiers and the individual variables themselves. You have no idea what this means, but it sounds good, so you try it, and this is what you come up with:

Y = 4 + 7*X1 + 13*X2 - 1*DY + 1*DM - 2*DY*X1 - 5*DY*X2 + 4*DM*X1 - 10*DM*X2, and R-Squared = 100%.

What happened? We discovered, in our made-up example, that each segment behaves differently with regard to variables X1 and X2, and that by understanding the relationship between X1 and X2 and sales in each segment we were able to build, in this artificial case, a perfect model! Of course, in real life you will never be able to build anything close to a perfect model.
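One way to see what the interaction equation is saying is to set the dummy variables to 0 or 1 for each segment and collect terms. The sketch below does that arithmetic with the coefficients from the equation above; the dictionary layout and the function name segment_equation are only illustrative.

```python
# Coefficients from the article's interaction equation
base = {"const": 4, "X1": 7, "X2": 13}      # applies to everyone; Old is the baseline
young = {"const": -1, "X1": -2, "X2": -5}   # DY, DY*X1, DY*X2
middle = {"const": 1, "X1": 4, "X2": -10}   # DM, DM*X1, DM*X2

def segment_equation(adjustment=None):
    """Add a segment's dummy and interaction terms to the baseline coefficients."""
    adjustment = adjustment or {"const": 0, "X1": 0, "X2": 0}
    return {key: base[key] + adjustment[key] for key in base}

equations = {
    "Young": segment_equation(young),
    "Middle": segment_equation(middle),
    "Old": segment_equation(),
}

for name, eq in equations.items():
    print(f"{name}: Y = {eq['const']} + {eq['X1']}*X1 + {eq['X2']}*X2")
# Young: Y = 3 + 5*X1 + 8*X2
# Middle: Y = 5 + 11*X1 + 3*X2
# Old: Y = 4 + 7*X1 + 13*X2
```

In other words, the single long equation is really three ordinary two-variable equations, one per segment, each with its own intercept and its own weights on X1 and X2.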

But the lesson to be learned is that if you suspect that different demographic or lifestyle or attitudinal segments might display different relationships with regard to your key performance variables, try building separate models for each segment.

Building separate models, as opposed to building one equation with all the dummy and interaction variables, as we did above, is a simpler solution, one that is more likely to be understood and less prone to implementation errors. (If you’d like a copy of the dataset used in this example, just e-mail me at dshepard@dsadirect.com.)
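For readers who want to try the separate-models approach, here is a minimal sketch that fits one small regression per segment, with no dummy or interaction variables. It is not the procedure or the dataset used for this article: the data are hypothetical, noise-free points generated from the per-segment equations implied by the interaction model, and fit_ols is a bare-bones least-squares routine written for this illustration. In practice your statistician would use a statistics package.

```python
def fit_ols(rows):
    """Least-squares fit of y = a + b1*x1 + b2*x2 via the normal equations.

    rows: list of (x1, x2, y) tuples. Returns [a, b1, b2].
    """
    design = [(1.0, x1, x2) for x1, x2, _ in rows]
    ys = [y for _, _, y in rows]
    k = 3
    # Augmented matrix [X'X | X'y]
    m = [[sum(d[i] * d[j] for d in design) for j in range(k)]
         + [sum(d[i] * y for d, y in zip(design, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

# Hypothetical data: (intercept, X1 weight, X2 weight) for each segment
truth = {"Young": (3, 5, 8), "Middle": (5, 11, 3), "Old": (4, 7, 13)}
points = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 7), (6, 2)]
data = {seg: [(x1, x2, a + b1 * x1 + b2 * x2) for x1, x2 in points]
        for seg, (a, b1, b2) in truth.items()}

# One model per segment
models = {seg: fit_ols(rows) for seg, rows in data.items()}

def score(segment, x1, x2):
    """Score a customer with the model for his or her own segment."""
    a, b1, b2 = models[segment]
    return a + b1 * x1 + b2 * x2

for seg, (a, b1, b2) in models.items():
    print(f"{seg}: Y = {a:.1f} + {b1:.1f}*X1 + {b2:.1f}*X2")
```

Each segment's equation can be read, checked and implemented on its own, which is the practical advantage of this approach over one long equation.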