What Direct Marketers Need To Know About Segmentation (Part 3)
June 9th, 2010 by: DSA
Part 3 of a Multi-Part Series
In our last article we concluded that segmentations based solely on demographic and behavioral data were relatively easy to build (using samples drawn from the customer file) and that it was relatively easy to project the results of the segmentation to the entire customer database.
Relatively easy as compared to what?
Relatively easy compared to segmentations based on surveys that attempt to get at the reasons why customers behave as they do. We also argued that while survey-based research was extremely valuable, it was by no means certain that we could find correlations between attitudes and behavior. If such correlations did not exist, it would be difficult, if not impossible, to accurately assign all of the customers in one’s database to the segments discovered by the research.
In this article we’ll start looking into the methods used to create segmentations, putting aside for the moment the assignment issue raised above. We can break the task down into two sub-tasks: (1) selecting the data that will be used in the segmentation and (2) choosing the methods that will be applied to that data to create the segments.
The Data
There are always two schools of thought about data and data mining. One school argues for letting whatever data mining tools you choose look at all of the available data and examine the patterns discovered by the process. The second school argues, especially with regard to segmentation, that one should first determine the variables upon which the segmentation should be based, and exclude variables you don’t want to see included in the segmentation solution.
Demographic Data
For example, demographic data generally includes such variables as age, income, years of education, presence of children, occupation, home value, time traveled to work, ethnicity, marital status and so on. Members of the first school would argue for throwing all of these variables into the modeling pot and seeing what comes to the top. Members of the second school would argue that if you are not interested in, say, “presence of children,” you should leave that variable out; if you are interested in it, leave it in; and if you are really very interested in “presence of children,” not only leave it in but make sure that it is weighted more heavily than the other demographic variables (the tools allow for this).
Behavioral Data
This idea of deciding which variables are of interest and which are of particular interest applies to behavioral data as well. Think about these performance variables: total sales to date, total sales within the last six months, total number of returned items, returned items as a percent of shipped items, average dollar sales, average time between sales, percent of sales at discounted price… the list goes on and on. Which ones do you want to include in your segmentation, and which ones deserve more than average weight? The decisions you make will determine the composition of the segmentation.
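To make the idea of including, excluding and weighting variables concrete, here is a minimal sketch in Python. The customer records, the variable names and the weights are all hypothetical, invented for illustration; it also assumes that the variables are standardized before any weights are applied, which is a common convention rather than something the tools require.

```python
from statistics import mean, pstdev

# Hypothetical customer records (illustrative values only):
# total sales ($), returns as % of shipped items, % of sales at discount
customers = {
    "A": [1200.0, 5.0, 10.0],
    "B": [300.0, 20.0, 60.0],
    "C": [900.0, 8.0, 15.0],
    "D": [150.0, 2.0, 80.0],
}
variables = ["total_sales", "return_rate", "discount_share"]

# Assumed outcome of a user meeting: return_rate gets double weight.
# A variable that is "out" would simply be dropped from both lists.
weights = {"total_sales": 1.0, "return_rate": 2.0, "discount_share": 1.0}


def standardize(columns):
    """Convert each column to z-scores so units do not act as hidden weights."""
    result = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        result.append([(x - m) / s for x in col])
    return result


def weighted_rows(records, variables, weights):
    """Return one standardized, weighted row per customer."""
    ids = list(records)
    columns = [[records[cid][j] for cid in ids] for j in range(len(variables))]
    z = standardize(columns)
    return {
        cid: [z[j][r] * weights[variables[j]] for j in range(len(variables))]
        for r, cid in enumerate(ids)
    }


prepared = weighted_rows(customers, variables, weights)
for cid, row in prepared.items():
    print(cid, [round(v, 2) for v in row])
```

One design point worth knowing: if the segmentation tool measures similarity with squared Euclidean distance, multiplying a standardized variable by 2 makes it count four times as much, not twice, so the chosen weights should be read with the distance measure in mind.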
The “User” Meeting
If you think that you are a member of the “let’s think about what we want to segment on” school, then what you’ll probably want to do is sit down with all of the potential users of the segmentation and decide which variables are in and which ones are out, whether any of the included variables should receive disproportionate weight, and if so, what the weights should be. This is one of those situations in which no two people will agree, and the same person might choose a different solution on a different day.
The Problem of Co-linearity
Let’s assume the user meeting is over and we’ve selected all of the demographic, behavioral and survey variables we want to consider in our segmentation.
The most immediate problem we have to face comes under the name of co-linearity or multi-co-linearity. Demographic and behavioral variables tend to be correlated with each other (income and home value, or occupation and years of education, for example), while the tools that we will use later on to create the segmentation are based on the assumption that the variables are not correlated. If they are, the variables that are correlated with each other will tend to receive more weight in the final solution than uncorrelated variables. That is an unintended consequence, to be avoided if possible.
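Before deciding how to deal with the problem, it helps to see how much correlation is actually present. The sketch below computes the Pearson correlation for every pair of variables and flags the pairs above a cutoff. The data values are made up for illustration, and the 0.7 cutoff is an arbitrary assumption, not a standard.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(data, threshold=0.7):
    """List variable pairs whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Illustrative values for six customers (not real data)
data = {
    "income": [40, 55, 70, 85, 100, 120],             # $000
    "home_value": [150, 210, 260, 330, 390, 450],      # $000
    "years_education": [12, 12, 14, 16, 16, 18],
    "returns": [3, 1, 4, 1, 5, 2],                     # items returned
}

flagged = correlated_pairs(data)
for a, b, r in flagged:
    print(f"{a} and {b} are correlated (r = {r})")
```

In this made-up sample the three socioeconomic variables are flagged against each other, while the returns variable is not flagged against any of them. Left as they are, the three flagged variables would effectively count one underlying trait three times.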
Principal Component Analysis
One obvious solution is to use only uncorrelated variables, but who wants to pick only one of the four demographic variables mentioned above and leave the other three out entirely? The better solution is to use a procedure called Principal Component Analysis (PCA), often (to the consternation of the purists) called Factor Analysis. (We won’t go into the differences.)
In a PCA all of the potential variables defined at the user meeting are “thrown into” the PCA pot. Suppose for the moment we had selected ten demographic variables, twenty behavioral variables and fifteen survey variables, for a total of 45 variables. Out of the PCA would come 45 principal components: 45 new and uncorrelated variables, each one a weighted combination of the original 45.
Equally if not more important than the elimination of co-linearity is the fact that each of the 45 principal components will have different “weights” or contain different percentages of the total amount of information contained in the entire data set. For example, the first principal component may contain not 1/45th of the total amount of information, but maybe 20% to 30% of the total information, and the first seven or eight principal components may contain 70% to 80% of the total information, with each of the remaining principal components containing less than their proportionate share of information.
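The two properties just described, uncorrelated components and unequal shares of information, can be seen in a small worked sketch. It uses four synthetic variables instead of 45, runs the PCA on the correlation matrix, and finds the eigenvalues with a hand-written Jacobi rotation routine so that nothing beyond the Python standard library is needed. The data, the variable names and the helper functions are all assumptions made for illustration; in practice a statistics package would do this step.

```python
import math
import random


def standardize(columns):
    """Z-score each variable (constant columns are not handled here)."""
    out = []
    for col in columns:
        n = len(col)
        m = sum(col) / n
        s = math.sqrt(sum((x - m) ** 2 for x in col) / n)
        out.append([(x - m) / s for x in col])
    return out


def correlation_matrix(z):
    """Correlation matrix of already standardized variables."""
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def jacobi_eigen(matrix, sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal entry
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q of a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def pca(columns):
    """Return each component's share of total variance and its scores."""
    z = standardize(columns)
    values, vectors = jacobi_eigen(correlation_matrix(z))
    order = sorted(range(len(values)), key=lambda i: -values[i])
    total = sum(values)
    shares = [values[i] / total for i in order]
    scores = [
        [sum(vectors[j][i] * z[j][r] for j in range(len(z))) for r in range(len(z[0]))]
        for i in order
    ]
    return shares, scores


# Synthetic data: three variables driven by one hidden trait, one unrelated
rng = random.Random(0)
affluence = [rng.gauss(0, 1) for _ in range(200)]
income = [t + 0.4 * rng.gauss(0, 1) for t in affluence]
home_value = [t + 0.4 * rng.gauss(0, 1) for t in affluence]
education = [t + 0.6 * rng.gauss(0, 1) for t in affluence]
returns = [rng.gauss(0, 1) for _ in range(200)]

shares, scores = pca([income, home_value, education, returns])
print("share of information per component:", [round(s, 3) for s in shares])
```

With four variables a proportionate share would be 25% each. Because three of the synthetic variables move together, the first component carries well over its proportionate share, and the component scores that come out are uncorrelated with one another.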
So what we have accomplished through the use of Principal Component Analysis is the elimination of co-linearity and, by keeping only the first several components that carry most of the information, a reduction in the number of variables that will go into the final step of a segmentation project: Cluster Analysis, which we’ll discuss in the next article.
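How many components to carry forward is a judgment call. One simple rule, sketched below, is to keep components until their cumulative share of information reaches a target. The 45 shares here are hypothetical: they are generated from a smooth decay curve chosen only so that the first component holds about 20% of the information, and the 80% target is likewise an assumption.

```python
from itertools import accumulate


def components_needed(shares, target=0.80):
    """Smallest number of leading components whose shares reach the target."""
    for k, cumulative in enumerate(accumulate(shares), start=1):
        if cumulative >= target:
            return k
    return len(shares)


# Hypothetical shares for 45 components, largest first (not real results)
raw = [0.8 ** k for k in range(45)]
shares = [x / sum(raw) for x in raw]

kept = components_needed(shares, target=0.80)
print(f"first component holds {shares[0]:.0%} of the information")
print(f"keep {kept} of {len(shares)} components to reach 80%")
```

Under these assumed shares, 8 of the 45 components are enough to pass the 80% mark, so the cluster analysis would work with 8 uncorrelated inputs rather than 45 correlated ones.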