## Modeling Product Purchases

May 3rd, 2011 by: DSA

After the big three modeling variables (Recency, Frequency, and Monetary Value), some analysts rank Product Purchase Data as the next most important potential predictive variable. I’m not sure that it’s number four on the modeling hit parade, but it’s certainly in the top ten, and for some businesses it ranks in the top five.

In any event, it’s an important source of customer information, which raises the question of how to deal with it. There are three or four choices:

1. Create a variable for each product and, on each customer’s record, code this variable a one (1) if the customer has purchased the product or a zero (0) if the customer has not. This is called the Dummy Variable approach. So if you have, say, forty products from which your customers can choose, you will set up forty Dummy Variables.

2. The second approach is similar, but makes more sense. Suppose your customers can buy from each product line, or each product, multiple times. Intuitively, it makes more sense to still set up forty variables, but instead of coding each variable a “1” or a “0”, count the number of times each customer bought each product and enter that count into the customer’s record.

3. A slight variation of this approach would be to record the dollars spent on each product, rather than just the count of the number of purchases. This approach would make more intuitive sense if the products differed significantly in price.

4. The last method is to use a technique called Principal Components Analysis, sometimes casually referred to as Factor Analysis, or as a particular type of Factor Analysis. To keep the purists happy we’ll just call it PCA.
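Before turning to PCA, here is a minimal sketch of the first three encodings, built from a toy transaction log. The customer IDs, product names, dollar amounts, and the `encode` helper are all made up for illustration; your own data layout will differ.

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, product, dollars_spent).
transactions = [
    ("c1", "books", 20.0),
    ("c1", "books", 15.0),
    ("c1", "music", 10.0),
    ("c2", "video", 40.0),
]
products = ["books", "music", "video"]  # fixed column order for every customer

def encode(transactions, products):
    """Return per-customer dummy, count, and dollar vectors (one slot per product)."""
    col = {p: i for i, p in enumerate(products)}
    counts = defaultdict(lambda: [0] * len(products))
    dollars = defaultdict(lambda: [0.0] * len(products))
    for customer, product, amount in transactions:
        counts[customer][col[product]] += 1        # approach 2: purchase counts
        dollars[customer][col[product]] += amount  # approach 3: dollars spent
    # Approach 1: a dummy is just "count greater than zero".
    dummies = {c: [1 if n > 0 else 0 for n in row] for c, row in counts.items()}
    return dummies, dict(counts), dict(dollars)

dummies, counts, dollars = encode(transactions, products)
```

Note that the dummy coding can always be derived from the counts, but not the other way around, which is one way to see why the counts carry more information.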

In a PCA of product data, the idea is to capture the product purchase behavior of a customer across the range of products offered. What we’re eventually hoping to discover is whether the purchase, or lack of purchase, of different combinations of products will give us a clue as to the future behavior of individual customers, or of groups of customers if we are doing the analysis at the source key or at some geographic (zip code) level.

Without getting too technical (you can skip this paragraph if you like), the PCA program creates a new set of Principal Component Variables, each a weighted combination of the original product variables, and related Principal Component Scores that can be used later on in a regular (linear) or logistic regression prediction model.

Again, let’s assume we’re working with forty product lines and we know how many times each customer has purchased each product. The program will initially generate forty Principal Components, but each PC will contain a different amount of information. In general, maybe four to eight of the forty Principal Components will contain most (70% or more) of the information, that is, the variance, contained in the entire set of PCs. And these four to eight PCs can be used in regression modeling just like any other “continuous” variable: Recency, Frequency, Monetary Value, Income, Age, etc.
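For readers who want to see the mechanics, here is a rough sketch of that step using NumPy. It is not the program used for the results in this post. The data is synthetic, the function names `pca` and `n_components_for` are our own, and we only mean-center the counts; standardizing each product column first is a common alternative that would give different components.

```python
import numpy as np

def pca(X):
    """PCA via SVD of the mean-centered data.

    Returns (scores, explained_ratio, components); row k of `components`
    holds the weights that define the k-th principal component.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S**2 / (X.shape[0] - 1)            # variance captured by each PC
    explained_ratio = variances / variances.sum()  # share of total variance
    scores = Xc @ Vt.T                             # one row of PC scores per customer
    return scores, explained_ratio, Vt

def n_components_for(explained_ratio, threshold=0.70):
    """Smallest number of leading PCs whose cumulative share reaches `threshold`."""
    cumulative = np.cumsum(explained_ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(explained_ratio))

# Synthetic stand-in for a customers-by-products count matrix (200 x 40).
rng = np.random.default_rng(0)
latent = rng.gamma(2.0, 1.0, size=(200, 4))     # 4 made-up buying "themes"
loadings = rng.uniform(0.0, 1.0, size=(4, 40))  # how each theme drives each product
X = rng.poisson(latent @ loadings).astype(float)

scores, explained_ratio, components = pca(X)
k = n_components_for(explained_ratio)  # PCs needed to reach the 70% mark
top_scores = scores[:, :k]             # these columns would feed the regression
```

The columns of `top_scores` are uncorrelated with one another by construction, which is part of what makes them convenient regression inputs.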

Obviously it takes some time to transform raw product purchase data into Principal Components that can be used in scoring models, and the scoring procedures become more complicated as well. So the question is: is it worth the extra effort to convert product purchase data into Principal Components?

To help answer this question we’ll look at some recent modeling results and you can decide for yourself. The problem was to predict the lifetime value of different customer groups based on all available customer data, including product purchase data.

As described above, we isolated and modeled just the product purchase data three (3) ways: (1) Using simple Dummy Variables, (2) Using Counts of the number of times each product line was purchased, and (3) Principal Components.

The quick answer is that the model built on just Dummy Variables had an R-Squared of 11% (the model explained 11% of the variation in lifetime value), the Count Approach had an R-Squared of 41%, and the Principal Components method produced an R-Squared of 53%.
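If you want to run this kind of comparison on your own data, the R-Squared calculation itself is short. The sketch below fits an ordinary least squares model with NumPy and reports R-Squared for two codings of the same purchases. The data is synthetic and the "lifetime value" is driven by the counts by construction, so the numbers it produces illustrate the calculation only and say nothing about the results reported above. The `r_squared` helper and all variable names are our own.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R-squared."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)         # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic data only: 500 customers, 5 products.
rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(500, 5)).astype(float)
dummies = (counts > 0).astype(float)
weights = np.array([5.0, 3.0, 2.0, 1.0, 4.0])     # made-up value per purchase
value = counts @ weights + rng.normal(0.0, 1.0, 500)

r2_dummies = r_squared(dummies, value)
r2_counts = r_squared(counts, value)
```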

In addition, a Decile Analysis of the Residual Errors produced by each approach (Table 1 below) argues for the Principal Components method over the Counting method and, most importantly, shows that simple Dummy Variables are not very effective in this type of application.
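For anyone unfamiliar with a decile analysis of residuals, here is one way it might be computed. We are assuming that customers are ranked by predicted value, highest first, and split into ten roughly equal groups, with the error defined as actual minus predicted; other conventions exist. The function name `decile_mean_errors` is ours.

```python
import numpy as np

def decile_mean_errors(actual, predicted):
    """Average error (actual - predicted) in each of ten groups.

    Groups are formed by ranking on the predicted value, highest first.
    Assumes at least ten observations so that no group is empty.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    order = np.argsort(-predicted, kind="stable")  # highest prediction first
    errors = (actual - predicted)[order]
    return [float(group.mean()) for group in np.array_split(errors, 10)]

# Tiny made-up example: the model under-predicts its top-ranked half by 2.
predicted = np.arange(20, dtype=float)
actual = predicted + np.where(predicted >= 10, 2.0, 0.0)
decile_errors = decile_mean_errors(actual, predicted)
```

A well-behaved model shows average errors close to zero in every decile; large errors concentrated in the top or bottom deciles mean the model is weakest exactly where ranking matters most.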

Table 1

Average Error In Each Decile For Three Modeling Techniques