Big Data question: what to save for how long?

As tools in the big data world emerge and mature, the question becomes how much of the data to save in high versus low resolution.  The answer depends on the uses of the data.  Recently, I had lunch with someone from Yahoo, where they were doing modeling on full-resolution data and claimed that you need big-data tools (Hadoop, Mahout) to build predictive models.

The problem of predictive algorithms requiring more data only arises when the number of predictive independent variables is large.  More variables require larger datasets to train classification models (see the curse of dimensionality, coined by Richard Bellman, the godfather of dynamic programming).  In any case, big-data tools give us 1-2 orders of magnitude more processing power, which only buys a few more variables, since the volume of data required grows exponentially with each new variable.
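The exponential growth can be made concrete with a back-of-the-envelope sketch (the bin count and variable counts below are illustrative choices, not figures from this post): if each variable's range is covered at a fixed resolution of k bins, the number of cells to populate with data grows as k to the power of the number of variables.

```python
# Illustrative sketch of the curse of dimensionality: the number of
# grid cells (and hence the data volume needed to populate them)
# grows exponentially with the number of variables.

def cells_needed(num_variables: int, bins_per_variable: int = 10) -> int:
    """Cells in a grid covering num_variables dimensions at a fixed resolution."""
    return bins_per_variable ** num_variables

for d in (2, 4, 6, 8):
    print(f"{d} variables -> {cells_needed(d):,} cells to populate")
```

With 10 bins per variable, every added variable multiplies the required data tenfold, which is why 1-2 orders of magnitude more processing power translates into only 1-2 extra variables.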

Perhaps the more important question to ask is why we need the data, and how much of it we need to do what we need to do.  We provide marketing analytics to our clients, so our focus is marketing.  In the case of mining web analytics logs, there are four main uses:

  1. Revenue Attribution
  2. Modeling
  3. Triggering marketing actions
  4. Building temporal statistics on customer actions

These four uses each impose different requirements along two dimensions:

  1. Length of time
  2. Resolution

Here is a simple depiction of the uses by resolution and data retention.

Determining how much to keep after the initial 90 days or so depends on the modeling uses.  If the models being built have a natural 3-4% response rate, you need to keep roughly double that proportion of data so that negative-outcome events are properly represented (in effect, oversampling success events).  This level of retention is enough for most propensity and event modeling exercises, since the resulting dataset is still quite large.
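One common way to act on this retention policy is to keep all responders and a matched slice of non-responders. A minimal sketch, assuming each log event is a `(features, label)` pair (the function name and the 50/50 target balance are illustrative, not prescribed by the post):

```python
import random

def build_training_sample(events, target_positive_rate=0.5, seed=42):
    """Keep every responder (label 1) and downsample non-responders
    (label 0) so positives make up roughly target_positive_rate of
    the resulting training set."""
    rng = random.Random(seed)
    positives = [e for e in events if e[1] == 1]
    negatives = [e for e in events if e[1] == 0]
    # Negatives needed to hit the desired class balance.
    n_neg = int(len(positives) * (1 - target_positive_rate) / target_positive_rate)
    sample = positives + rng.sample(negatives, min(n_neg, len(negatives)))
    rng.shuffle(sample)
    return sample
```

On a log with a 3% response rate this discards the vast majority of negative events while keeping enough of them to represent the negative outcome, which is the retention trade-off described above.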

How to forecast using customer ensemble dynamics

Forecasting sales always comes up as CFOs continue to push for accountability from marketing departments.  From a marketing perspective, forecasting provides focus, goals and budgets.  At a high level, a marketing department's goals to acquire, grow and retain customers map one-to-one to the sales forecast.  Looking at forecasting through a customer lens (causal, ensemble) rather than a time-series lens (non-causal) uncovers the causal reasons behind results, hence providing metrics to monitor and correct.

For example, number of customers, orders per customer, revenue per order and margin % per order are simple factors that yield sales, and each is a metric a marketer knows and manages.  Multiplying them together (over time) yields a revenue and margin forecast.
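The factor multiplication is trivial but worth writing down, since each input is a lever a marketer owns. A sketch with hypothetical quarterly numbers (all values below are made up for illustration):

```python
def forecast_revenue_and_margin(num_customers, orders_per_customer,
                                revenue_per_order, margin_pct):
    """Multiply the marketer-managed factors into revenue and margin."""
    revenue = num_customers * orders_per_customer * revenue_per_order
    margin = revenue * margin_pct
    return revenue, margin

# Hypothetical quarter: 10,000 customers, 1.5 orders per customer,
# $40 revenue per order, 30% margin per order.
revenue, margin = forecast_revenue_and_margin(10_000, 1.5, 40.0, 0.30)
print(f"revenue=${revenue:,.0f}  margin=${margin:,.0f}")
```

Because each factor is managed separately, a miss in the forecast can be traced back to the specific lever (acquisition, frequency, order size, or margin) that moved.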

In a B2B environment, predictions are easier, based on each salesperson's quota and book of business.  In B2C, the challenge is different, since there is no official account assignment and account management is done at a macro level by marketing.  So how do you forecast based on an ensemble of customers?

One answer is ensemble forecasting, borrowed from the similar problem of forecasting weather patterns.  Given the weather and its underlying dynamics at a point in time, the models forecast the next period, and the process is iterated over time.

Ensemble forecasting tries to predict the future state of a dynamic system.  In this case, the dynamic system is a collection of customers, each with a different buying pattern and relationship with the company.  New customers, high-value customers and single-category buyers all exhibit different behavior.  Forecasting individual behavior is a common area of modeling, usually referred to as response modeling.  Here the idea is to predict how an ensemble of individuals will behave.

Ensemble forecasting is a numerical method, a form of Monte Carlo simulation, that uses probability distributions, varying initial conditions and varying external assumptions to produce a range of likely outcomes rather than a single point estimate.

In the specific case of forecasting customer dynamics and sales, new customer acquisition and retention rates are probabilistic inputs to the system.
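A minimal sketch of such a simulation, assuming acquisition and retention rates are drawn from Gaussian distributions each period (the distribution choice, parameter values and function name are illustrative assumptions, not taken from this post):

```python
import random

def simulate_customer_count(start_customers, periods, acq_mean, acq_sd,
                            ret_mean, ret_sd, n_runs=1000, seed=0):
    """Monte Carlo ensemble forecast of customer count.

    Each run draws a retention rate and an acquisition rate per period,
    iterates the customer count forward, and the function returns the
    per-period mean across all runs."""
    rng = random.Random(seed)
    totals = [0.0] * periods
    for _ in range(n_runs):
        customers = float(start_customers)
        for t in range(periods):
            # Retention clipped to [0, 1]; acquisition floored at 0.
            retention = min(max(rng.gauss(ret_mean, ret_sd), 0.0), 1.0)
            acquisition = max(rng.gauss(acq_mean, acq_sd), 0.0)
            customers = customers * retention + customers * acquisition
            totals[t] += customers
    return [tot / n_runs for tot in totals]

# Hypothetical inputs: 10,000 customers, 5% +/- 1% acquisition,
# 90% +/- 2% retention, forecast four periods ahead.
path = simulate_customer_count(10_000, 4, acq_mean=0.05, acq_sd=0.01,
                               ret_mean=0.90, ret_sd=0.02)
```

Multiplying the resulting customer path by the per-customer factors from the earlier example turns the ensemble output into a revenue and margin forecast, and the spread across runs gives a sense of the forecast's uncertainty.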