This seminal article highlights the dangers of reckless applications and scaling of data science techniques that have worked well for small, medium-size and large data. We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than what hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short and contains both high level material for decision makers, as well as deep technical explanations when needed.
In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be
- entirely caused by chance (just like someone who wins at the lottery wins purely by chance) and
- not replicable,
- having no predictive power,
- but obscuring weaker patterns that are ignored yet have a strong predictive power.
The questions is: how do you discriminate between a real and accidental signal in vast amounts of data?
Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will loose. Keep in mind that analyzing cross-correlations on all metrics is one of the very first step statisticians do at a beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provide a much more robust mechanism to identify true relationships.
To illustrate the issue, let's say that you have k time series, each with n observations, for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over a same time period consisting of n days. For instance, you want to detect patterns such as "When Google stock price goes up, Facebook goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds depending on whether you do daily trading or extremely fast intraday, high frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (when your competitors detect the patterns in question, it stops becoming profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year time period). In some cases n might be much larger, but in that case the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.
Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.
Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?
Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p=P(r>a) with P representing a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has of correlation above a=0.80 is thus equal to 1-[(1-p)^m], that is, 1 minus (1-p) at power m.
- If n=20, m=10,000 (10,000 paired time series each with 20 observations), then the chance that your conclusion is wrong (that is, a=0.80) is 90.93%.
- If n=20, m=100,000 (still a relatively small value for m), then the chance that your conclusion is VERY wrong (that is, a=0.90) is 98.17%.
Now in practice the way it works is as follows: you have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations, that is m = k*(k-1)/2. However the assumption of independence between the m paired time series is now violated, thus concentrating correlations further away from a very high value such as a=0.90. But also, your data is not random numbers, it's not white noise. So the theoretical correlations are much above absolute 0, maybe around 0.15 when n=20. Also m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say one time series for each stock price). These three factors (non independence, theoretical r different from 0, very large m) balance out and make my above computations still quite accurate when applied to a real typical big data problem. Note that I computed my probabilities using the online calculator stattrek.
Conclusion: hire the right data scientist before attacking big data problems. He/she does not need to be highly technical, but able to think in a way similar to my above argumentation, to identify possible causes of model failures before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need to have advanced knowledge of stats. Being a computer scientist also helps to scale your algorithms and make them simple and efficient. Being a MBA analyst also helps to understand the problem that needs to be solved. Being the three types of guy at the same time is even far better. And yes these guys do exist and are not that rare.
Let's say you have 3 random variables X, Y, Z with corr(X,Y)=0.70, corr(X,Z)=0.80. What is the minimum value for corr(Y,Z). Can this correlation be negative?
- Correlation and R-Squared for Big Data
- Eight worst predictive modeling techniques
- Another example of misuse of statistical science
- The curse of dimensionality (it got worse with big data)
- Data Science eBook
- Data Science Apprenticeship
- Debunking lack of analytic talent
- Causation vs. Correlation
- Data Science dictionary
- How and why to build a data dictionary
- Data Science tools
- A new random number generator
- Assessing efficiency of approximate vs. exact algorithms (coming soon)
- Statistical comic strip