Big Data question: what to save for how long?
March 26, 2011
As tools in the big data world emerge and mature, the question is how much of the data to save in high versus low resolution. The answer depends on the uses of this data. Recently, I had lunch with someone from Yahoo, where they were doing modeling on full-resolution data and claimed that you need big-data tools (Hadoop, Mahout) to build predictive models.
The problem of predictive algorithms requiring more data only arises if the number of independent variables that are predictive is large. A higher number of variables requires larger datasets to train classification models (see the curse of dimensionality, coined by Richard Bellman, the father of dynamic programming). In any case, big-data tools give us 1–2 orders of magnitude more processing power, which buys only a few more variables, since the volume of data required grows exponentially with each new variable.
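To make the exponential-growth point concrete, here is a small sketch (not from the original post) assuming a model needs a fixed number of samples along each independent variable to keep the same sampling density of the feature space:

```python
# Curse-of-dimensionality sketch: if we want `points_per_dim` samples
# along each independent variable, the total data needed to cover the
# feature space at constant density grows exponentially with the
# number of variables.

def samples_needed(num_variables, points_per_dim=10):
    """Data volume required to cover the feature space at fixed density."""
    return points_per_dim ** num_variables

# With 5 predictive variables we need 10^5 = 100,000 samples.
base = samples_needed(5)

# Two orders of magnitude (100x) more capacity...
bigger_capacity = base * 100

# ...covers only two additional variables at the same density.
print(samples_needed(7) == bigger_capacity)  # True
```

The base of 10 points per dimension is an illustrative assumption; any base greater than 1 gives the same conclusion, since capacity gains that are polynomial cannot keep up with data requirements that are exponential in the variable count.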
Perhaps the more important questions to ask are why we need the data and how much of it we need for the task at hand. We provide marketing analytics to our clients, so our focus is marketing. In the case of mining web analytics logs, there are four main uses:
- Revenue Attribution
- Modeling
- Triggering marketing actions
- Building temporal statistics on customer actions
Each of these four uses requires data to be saved at a different
- Length of time
- Resolution
Here is a simple depiction of the uses by resolution and data retention.
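As one way to think about such a depiction, here is a hypothetical sketch encoding the uses as a resolution/retention matrix. The specific assignments below are illustrative assumptions, not the actual values from the depiction:

```python
# Hypothetical sketch: encoding the uses-by-resolution/retention matrix.
# The resolution and retention values assigned here are illustrative
# assumptions for the sake of example, not the post's actual figures.
from dataclasses import dataclass

@dataclass
class RetentionPolicy:
    use: str
    resolution: str   # "full" (event-level) or "aggregated"
    retention: str    # how long the data must be kept

POLICIES = [
    RetentionPolicy("Revenue attribution", "full", "months"),
    RetentionPolicy("Modeling", "full", "months"),
    RetentionPolicy("Triggering marketing actions", "full", "days"),
    RetentionPolicy("Temporal statistics on customer actions",
                    "aggregated", "years"),
]

for p in POLICIES:
    print(f"{p.use}: {p.resolution} resolution, retained for {p.retention}")
```

The design point such a matrix captures is that full-resolution data tends to be needed for shorter windows (e.g. triggering actions), while long-horizon uses can often get by with aggregated data, which is what makes tiered retention policies attractive.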