
To Straighten or Not to Straighten: That Is the Question
By: David Shepard
This article first appeared in Direct Magazine.
If you’re a marketer who uses or commissions regression models, you need to understand non-linearity: what it is, why it is important, how correcting for it could improve your models, and why that correction doesn’t happen automatically. This article addresses all of these issues.
If you’ve built or used regression models to predict response or sales you know that a regression equation looks like this:
Y = a + b1*X1 + b2*X2 + b3*X3 + … + bn*Xn
In this equation, Y is the “thing” you’re trying to predict (the dependent variable) and the X’s represent the “things” (independent variables) you know about your customers or prospects that allow you to make the predictions. Typical independent variables include performance indicators such as recency, frequency and dollar sales; demographics such as age and income; and promotion history, such as the number of times called.
The “b’s” are called regression coefficients, and you can think of them as weights assigned to each variable in the model; the weights are generated by a regression program. The bn*Xn notation simply means that there could be up to some number (n) of variables in the model. The “a” is a constant that we can skip over for now.
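To make the equation concrete, here is a minimal Python sketch that scores one customer. The constant, the weights and the customer's values are all made up for illustration; they are not taken from any real model.

```python
def predict(a, coefficients, values):
    """Return a + b1*X1 + ... + bn*Xn for one customer."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one coefficient per variable")
    return a + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical constant and weights (assumptions, not real model output)
a = 50.0
b = [-2.0, 3.0, 0.1]         # weights for recency, frequency, prior sales
customer = [6, 4, 200.0]     # 6 months since last order, 4 orders, $200 prior sales

score = predict(a, b, customer)
print(f"Predicted sales: {score:.2f}")
```

Each variable's contribution is simply its weight times its value, which is why the equation can only describe straight-line relationships unless the variables are transformed first.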
The job of the statistician, working with a particular dataset such as the results of a past promotion, is to discover which independent variables have a significant effect on the dependent variable, and then feed this information to the regression program, which will produce the regression equation.
One of the keys to a “good,” long-lasting model is to find the right set of predictive variables among the hundreds, if not thousands, of potential predictors. But in addition to finding the right variables, it’s important to determine whether the relationship between a predictor variable such as AGE and a dependent variable such as SALES is best described by a simple straight-line relationship, or whether some other “non-linear” relationship makes for a better, more accurate prediction.
When a non-linear relationship exists, it’s the job of the modeler to try different transformations of the data to determine the best fit. You as a user can tell if this has been done in one of your models if you see something like this:
Sales = a + b1*Log of Recency + b2*Square Root of Prior Sales
What this equation tells you is that the modeler determined that the relationship between Sales and Recency is best described by replacing Recency (number of months since the last purchase) with the log of Recency, and that the relationship between Sales and Prior Sales is best described by replacing Prior Sales with the Square Root of Prior Sales.
Exhibits 1 and 2 show how the log transformation works to straighten the relationship between Sales and Recency. Exhibit 1 is a plot of Sales against Months Since Last Purchase; Exhibit 2 is a plot of Sales against the Log of the Months Since Last Purchase. The log transformation straightens the data and results in a better fit, as indicated by the R Squared value of 1 versus an R Squared value of .86 for the original, untransformed data.
Exhibit 1

Exhibit 2
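The same effect can be sketched in a few lines of Python. The data here are invented (sales that fall off exactly with the log of months since last purchase), so the numbers will not match the exhibits; the point is only that the straight-line fit improves once recency is replaced by its log.

```python
import math

def r_squared(x, y):
    """R squared of the least-squares straight line fitted to y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    # For a one-variable regression, R squared is the squared correlation.
    return (sxy * sxy) / (sxx * syy)

# Made-up data: 24 customers, one for each month of recency (an assumption)
months = list(range(1, 25))
sales = [100 - 30 * math.log(m) for m in months]

r2_raw = r_squared(months, sales)                          # sales vs. months
r2_log = r_squared([math.log(m) for m in months], sales)   # sales vs. log(months)

print(f"R squared, untransformed recency: {r2_raw:.2f}")
print(f"R squared, log of recency:        {r2_log:.2f}")
```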

If nothing else, the above equation (with transformations) certainly looks more impressive than the equation below, without the data transformations.
Sales = a + b1*Recency + b2*Prior Sales
But apart from looking impressive, the real question is: does finding the right shape of a relationship (also called correcting for non-linearity, or straightening; three ways of saying the same thing) really make a difference?
To answer this question, we created two data sets. Each data set has 400 observations representing 400 customers, each of whom responded to a mailing and purchased some amount of product. As is customary, the first data set will be used to build the model, the second to test or validate it.
But to make sure that we could prove our point, we cheated. Instead of searching for variables that had a non-linear relationship with sales and developing an equation, we started with the correct model!
In the Correct Model, each customer’s sales is determined by this formula:
Sales = 75 - 30 times the log of the number of days since last purchase + 5 times the square root of Prior Orders + .5 times the exponential value of Prior Sales/million + 6 if age is greater than 45 + a random error that ranges between -50 and +50.
To determine the effect of correcting for non-linearity we simply ran the data through an Excel spreadsheet and had the program calculate a regression model, using the four variables (Recency, Orders, Prior Sales and Age) but with no attempt to incorporate their known non-linear relationships.
The program produced the following equation.
Sales = 64 - .58*Recency + .20*Orders + 2.61*Prior Sales + .085*Age
The model had an R Squared of 33%. (In other words, the simple model explained 33% of the variation in Sales.)
Then we ran the data through the program again, this time substituting the correct form of the relationship for the original uncorrected data.
The same program produced the following equation.
Sales = 84 - 29.29*the log of the number of days since last purchase + 4.39*square root of Prior Orders + .48*the exponential value of Prior Sales/million + 4.05 if age is greater than 45
The model’s R Squared was 79%. (Even though we knew the correct form of the only four variables affecting Sales, the model was not perfect because of the random error.)
So it would appear that knowing the correct shape of the relationship between the independent variables and the dependent variable makes a huge difference, at least to a statistician. But how much difference does it make to a direct marketer?
To answer this question we applied both models to our second data set of 400 different customers and produced the two decile analyses shown in Tables 1 and 2.
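For readers who have not built one, a decile analysis can be sketched roughly as follows in Python: rank the customers by the model's predicted sales, cut the ranked list into ten equal groups, and report the average actual sales in each group. The function name and the toy data are assumptions for illustration; this is not the spreadsheet used for the tables.

```python
def decile_analysis(predicted, actual):
    """Average actual sales per decile, ranked by predicted sales (best first)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if len(predicted) < 10:
        raise ValueError("need at least 10 customers to form deciles")
    # Sort customers from highest to lowest predicted sales.
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    averages = []
    for d in range(10):
        group = [act for _, act in ranked[d * n // 10:(d + 1) * n // 10]]
        averages.append(sum(group) / len(group))
    return averages

# Toy validation file of 400 customers (made-up numbers, not the article's data)
predicted = list(range(400))
actual = [0.5 * p + 10 for p in predicted]

averages = decile_analysis(predicted, actual)
spread = averages[0] - averages[-1]   # top decile minus bottom decile
for rank, avg in enumerate(averages, start=1):
    print(f"Decile {rank:2d}: average actual sales {avg:7.2f}")
print(f"Spread between top and bottom decile: {spread:.2f}")
```

The larger the spread between the top and bottom deciles, the better the model separates good customers from poor ones.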
As you can see by comparing Tables 1 and 2, the Correct Model results in a greater spread and a closer fit and is therefore the better model. But don’t draw the wrong conclusions from this example. In the real world, the search for the correct relationship is not done just to get a better fit. In fact, that is a relatively weak reason for going through all the work that it takes to find and correct for non-linearity. In the real world, many relationships are so non-linear that these important variables will not appear in a regression model at all… unless their non-linearity is first identified and then corrected for.
Why is that? Because regression programs expect linear relationships, a relationship that is in fact very strong but very non-linear may be missed entirely by an analyst who just runs data through a regression program. (And, most importantly, the regression programs don’t make this correction by themselves; the work has to be done by an analyst working with the data.)
So, how does the analyst discover these non-linear relationships? By using a number of graphical techniques and/or CHAID. The lesson for the direct marketer is that these non-linear relationships exist. We find one or two in nearly every model we do. If you don’t see them in yours, that does not mean they are not there; they may simply have been overlooked, and your models could be significantly improved.
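A deliberately extreme Python sketch shows how this can happen. Assume, purely for illustration, that sales peak at age 45 and fall off symmetrically on either side. Age then completely determines sales, yet its straight-line correlation with sales is zero, so a linear screen would discard it. Replacing age by its squared distance from 45 recovers the relationship.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical inverted-U relationship (an assumption, not the article's data)
ages = list(range(25, 66))
sales = [100 - 0.2 * (age - 45) ** 2 for age in ages]

corr_raw = correlation(ages, sales)
corr_transformed = correlation([(age - 45) ** 2 for age in ages], sales)

print(f"Correlation of sales with age:           {corr_raw:+.3f}")
print(f"Correlation of sales with (age - 45)^2:  {corr_transformed:+.3f}")
```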
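One simple stand-in for the graphical approach (it is not CHAID) is to sort customers by a predictor, cut them into equal-sized bins, and compare average sales from bin to bin. If the slope between neighboring bins keeps changing, a straight line is probably the wrong shape. The Python sketch below uses invented data and a hypothetical helper, binned_means.

```python
import math

def binned_means(x, y, bins=5):
    """Sort by x, cut into equal-sized bins, return (mean x, mean y) per bin."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    result = []
    for b in range(bins):
        chunk = pairs[b * n // bins:(b + 1) * n // bins]
        mean_x = sum(p[0] for p in chunk) / len(chunk)
        mean_y = sum(p[1] for p in chunk) / len(chunk)
        result.append((mean_x, mean_y))
    return result

# Made-up data: sales decline with the log of months since last purchase
months = list(range(1, 26))
sales = [100 - 30 * math.log(m) for m in months]

table = binned_means(months, sales, bins=5)
# Slope of average sales between each pair of neighboring bins
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(table, table[1:])]

for mean_x, mean_y in table:
    print(f"avg months {mean_x:5.1f}   avg sales {mean_y:6.1f}")
print("slopes between bins:", [round(s, 2) for s in slopes])
```

With a truly linear relationship the slopes would be roughly equal; here they shrink steadily toward zero, which is the signature of a relationship that needs straightening.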
One last note: correcting for non-linearity is a central part of what statisticians call Exploratory Data Analysis (EDA). This practice is recommended even when the modeling technique does not assume that the relationships it’s being asked to analyze are linear. For example, artificial neural net solutions do not assume linear relationships. Nevertheless, straightening complicated non-linear relationships before submitting data to the neural net is a commonly recommended procedure. It makes it easier for the net to arrive at a reliable solution, and there’s nothing wrong with that.
If you would like a copy of the data sets that were developed for this article just send an e-mail to me at dshepard@dsadirect.com