When linear regressions fail (and what to do about it)

Summary: Machine learning for marketers

In this post we argue that direct marketers need to get familiar with machine learning. You can think of machine learning as black-box predictive algorithms that are more sophisticated, and require more computing power, than linear regression. We’ve open-sourced the data and code for this study so that you (or your analytics team) can replicate our analysis. Here are the key takeaways: 

  • Machine learning can improve your marketing campaign ROI by 15 – 50%. If you’re not using machine learning in your predictive analytics stack, you’re leaving money on the table.
  • Conventional regression-based forecasting is especially prone to breaking down when used as a predictive response modeling tool. In fact, forecasts from linear models can do more harm than good.
  • By randomly sampling both the rows and columns of a training data set multiple times, analysts can build a series of linear models that substantially improves on the performance of a single regression.
  • Random forests (a classic machine learning algorithm) significantly outperform linear models.  

If you need more convincing, read on. If you’d like to talk with our team about how you can start benefiting from machine learning today, check out our website or get in touch with us at info@polhooper.com or +1 (619) 365 – 4231. We will replicate the analysis presented here gratis on your own past campaign data to prove the value of this concept.

Study methodology

The data set used for this study is an anonymized version of actual customer data. Conventionally, analysts validate and tune models using random samples of the training set. However, we want to know how models trained on past data perform when applied to future campaigns. To test this, we trained predictive models on the first 50% of marketing transactions recorded to date in 2015 and used them to predict outcomes in the subsequent 50%.
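A time-ordered split like the one described above can be sketched in a few lines. This is a minimal illustration, not the study's actual code: the `chronological_split` helper, the `date` field, and the toy records are all hypothetical.

```python
from datetime import date


def chronological_split(records, date_key="date", train_fraction=0.5):
    """Sort records by date, then split into an earlier and a later part."""
    ordered = sorted(records, key=lambda r: r[date_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy records standing in for marketing transactions.
records = [
    {"date": date(2015, 3, 9), "converted": 0},
    {"date": date(2015, 1, 5), "converted": 1},
    {"date": date(2015, 4, 20), "converted": 1},
    {"date": date(2015, 2, 14), "converted": 0},
]

# Train on the earlier half, evaluate on the later half.
train, test = chronological_split(records)
```

Unlike a random split, every training record here predates every test record, which is closer to how a model is used on a future campaign.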

We present two graphics that should be familiar to most marketers: lift charts and cumulative gains charts.
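For readers who have not built these charts by hand, here is a rough sketch of the numbers behind them. The `gains_and_lift` function and the toy scores are our own illustration (not taken from the study code): leads are ranked by predicted score, and for each contact depth we compute the share of all responders captured (gain) and that share divided by the fraction contacted (lift).

```python
def gains_and_lift(y_true, scores, n_bins=10):
    """Return a list of (fraction_contacted, cumulative_gain, lift) tuples."""
    # Rank outcomes from the highest to the lowest predicted score.
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    ranked = [y for _, y in pairs]
    total_positives = sum(ranked)
    n = len(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        k = max(1, (n * b) // n_bins)  # number of leads contacted at this depth
        frac = k / n
        gain = sum(ranked[:k]) / total_positives
        rows.append((frac, gain, gain / frac))
    return rows


# Hypothetical toy example: 10 leads, 3 of which responded.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
rows = gains_and_lift(y_true, scores, n_bins=5)
```

A model that ranks leads at random has an expected lift of 1 at every depth, so a lift below 1 means the model is doing worse than random selection at that depth.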

All models are built using the scikit-learn Python library, which, along with Vowpal Wabbit and R, makes up the bulk of our predictive analytics stack here at Polhooper.

The trouble with normal linear regression

Data sets have two types of correlations. The first is “signal,” real relationships between the variables that apply broadly to the problem being studied. The second source of correlation is “noise,” relationships between variables that, while statistically significant in a particular data set, are not generally valid. The trouble with regular linear regression is that it fits both the signal and the noise in a set of training data. When used to forecast outcomes in new data, this tendency to fit to noise can lead to worse-than-random prediction performance. 

To improve the performance of the base linear regression, we use a technique called bootstrap aggregation (bagging), combined here with random sampling of the features. By randomly sampling both the rows and columns of a training data set, training a linear model, and repeating this process multiple times, we can improve model performance by reducing the tendency of a single model to fit noise in the data. With bootstrap aggregation we build a large number of models on subsampled data and average their predictions to get a single, final prediction.

(Aggregating a bunch of linear models looks something like this)

For our ensemble we trained 250 linear models, using 50% subsamples of the original data and 25% subsamples of the original feature set for each model. Here’s what the gain and lift charts, respectively, looked like for the simple linear regression (green) compared to the bagged regression ensemble (purple):


Key take-aways:

  • A single linear regression fit on the whole data set (green line) actually becomes worse than random at selecting qualified leads, depending on how many leads you contact.
  • The ensemble of linear models with the randomly-sampled data improves prediction performance by 15 – 20% over the single linear model.
  • In general, linear regression isn’t that great at forecasting customer engagement for this problem.
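The bagged ensemble described above (250 linear models, 50% of rows and 25% of features per model) could be set up in scikit-learn roughly as follows. This is a sketch under assumptions, not necessarily how the study code is written: the synthetic data is hypothetical, and using `BaggingRegressor` with its default sampling of rows with replacement is our choice.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data standing in for the campaign data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(float)

# Earlier half for training, later half for evaluation.
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

ensemble = BaggingRegressor(
    LinearRegression(),   # base model trained on each subsample
    n_estimators=250,     # number of linear models in the ensemble
    max_samples=0.5,      # 50% of the rows per model
    max_features=0.25,    # 25% of the columns per model
    random_state=0,
)
ensemble.fit(X_train, y_train)

# The ensemble prediction is the average of the individual model predictions.
scores = ensemble.predict(X_test)
```

The resulting `scores` can be used to rank leads for a lift or gains chart in the same way as the output of a single regression.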

Adding in some machine learning

Random forests are a classic and very powerful machine learning algorithm (theory nerds can check out the original paper by Breiman, 2001). The algorithm uses essentially the same process that we described above for linear models, except that instead of training linear regressions it trains a bunch of classification and regression trees, and it resamples the candidate features at each split of a tree rather than once per model. We added a random forest model (yellow line) and a logistic regression (orange line) to the lift and gain analysis for comparison:


Key take-aways:

  • A random forest improves prediction performance over the second-best model by 15 – 30%, depending on the number of leads contacted. This is a 30 – 50% improvement over a single linear model. 
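A comparison of this kind can be sketched with scikit-learn as shown here. Everything in it is illustrative: the synthetic data, the deliberately nonlinear target, the model settings, and the `top_fraction_hit_rate` helper are our assumptions and say nothing about the study's data or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with a nonlinear (interaction) response pattern.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Earlier half for training, later half for evaluation.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
logit = LogisticRegression().fit(X_train, y_train)

# Score each test lead with its predicted probability of responding.
forest_scores = forest.predict_proba(X_test)[:, 1]
logit_scores = logit.predict_proba(X_test)[:, 1]


def top_fraction_hit_rate(y_true, scores, fraction=0.2):
    """Response rate among the top-scoring fraction of leads."""
    k = int(len(scores) * fraction)
    top = np.argsort(scores)[::-1][:k]
    return y_true[top].mean()


forest_rate = top_fraction_hit_rate(y_test, forest_scores)
logit_rate = top_fraction_hit_rate(y_test, logit_scores)
```

Comparing `forest_rate` with `logit_rate` at a given contact depth is the same comparison a lift chart makes visually.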

Putting this in dollars-and-cents terms, let’s say that in a given month this customer planned to contact 200,000 of the 1,000,000 leads available, and to select those leads with a customer response model. Also assume that 20% of contacted leads convert into customers, and that the lifetime value of a customer averages $1,000. In this case, machine learning-based targeting grosses $800K in revenue, while the logistic/linear ensemble grosses $676K and the linear regression grosses $568K.

In other words, not using machine learning for targeting in this example would have cost the company $124K – $232K!
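The gap between the models follows directly from the revenue figures quoted above. The short sketch below takes those three figures as given (it does not re-derive them) and computes each model's shortfall against the random forest; the variable names are ours.

```python
# Gross revenue per targeting model, as quoted in the example above (USD).
revenue = {
    "random forest": 800_000,
    "logistic/linear ensemble": 676_000,
    "single linear regression": 568_000,
}

best = revenue["random forest"]

# Revenue left on the table by each non-ML model.
shortfall = {
    name: best - value
    for name, value in revenue.items()
    if name != "random forest"
}

# Relative improvement of the random forest over each alternative.
improvement = {name: best / revenue[name] - 1 for name in shortfall}

for name in shortfall:
    print(f"{name}: ${shortfall[name]:,} shortfall, "
          f"{improvement[name]:.0%} improvement with the random forest")
```

The relative improvements land at roughly 18% and 41%, inside the 15 – 30% and 30 – 50% ranges quoted in the take-aways.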


Machine learning makes predictions better.

Better predictions mean better targeting and stronger marketing ROIs.

If you or your data provider aren’t using machine learning, you’re missing out.

To start a conversation check out our website or get in touch with us at info@polhooper.com or +1 (619) 365 – 4231.

We’ve open-sourced the data for this study via Dropbox and the code via our GitHub account. Check out the README file before you begin for instructions on setting things up. Email us with any questions, suggestions, or, especially, results that are better than what we’ve put out here.

