*Guest blog post by Vincent Granville*

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, process each small data set separately (but simultaneously) on different servers or computers, then gather and aggregate the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.

However, when you need to process variables or data sets jointly, that is 2 by 2 or or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come with a more sophisticated solution.

**The Problem**

Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.

You have to compute k * (k-1) /2 correlations to solve this problem, despite the fact that you only have k=10,000 stock symbols. You can not spit your 10,000 stock symbols in 1,000 clusters, each containing 10 stock symbols, then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster, and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-clusters computations makes MapReduce useless in this case. The same issue arises if you replace the word "correlation" by any other function, say f, computed on two variables, rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.

**Three Solutions**

Here I propose three solutions:

**1. Sampling**

Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increase the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide with variables to keep, which ones to add, to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).

I'll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million time less computer resources than finding the global optimum, yield very good approximations with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4 etc. to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

**2. Binning**

If you can bin your variables in a way that makes sense, and if n is small (say=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, there are at most 512 potential pairs of value. An example of such a pair is {(up, up, down, up, down), (up, up, up,down, down)} where the first 5 values correspond to a particular stock, and the last 5 values to another stock. It is thus easy to pre-compute all 512 correlations. You will still have to browse all k * (l-1) / 2 pairs of stocks to solve you problem, but now it's much faster: for each pair you get the correlation from the lookup table - no computation required, only accessing a value in a hash table or an array with 512 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs migh be faster than using lookup tables to access 512 pre-computed correlations. However, the principle works regardless as to whether you compute a correlation, or much more complicated function f.

**3. Classical data reduction**

Traditional reduction techniques can also be used: forward or backward step-wise techniques where (in turn) you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy, and conversely for variables being removed. Entropy can be measured in various ways. In a nutshell, if you have two data subsets (from the same large data set),

- A set A with 100 variables, which is 1.23 GB when compressed,
- A set B with 500 variables, including the 100 variables from set A, which is 1.25 GB when compressed

Then you can say that the extra 400 variables (e.g. stocks symbols) in set B don't bring any extra predictive power and can be ignored. Or in other words, the lift obtained with the set B is so small that it's probably smaller than the noise inherent to these stock price signals.

Note: An interesting solution consists of using a combination of the three previous strategies. Also, be careful to make sure that the high correlations found are not an artifact caused by the "curse of big data" (see reference article below for details).

**Another example where MapReduce is of no use**

Building a keyword taxonomy:

**Step 1**:

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword, and for each "keyword pair". A "keyword pair" is two keywords found on a same web page, or close to each other on a same web page. Also by keyword, I mean stuff like "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many million keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A="California insurance", B="home insurance", x=543, y=998, z=11

where

- x is the number of occurrences of keyword A in all the web pages that you crawled
- y is the number of occurrences of keyword B in all the web pages that you crawled
- z is the number of occurences where A and B form a pair (e.g. they are found on a same page)

This "keyword pair" table can indeed be very easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B do not form a "keyword pair", in other words, z=0. So by ignoring these null entries, your "keyword pair" table is still manageable, and might contain as little as 50 million entries.

**Step 2**:

To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a dissimilarity d(A,B) between two keywords A, B. For instances d(A, B) = z / SQRT(x * y), although other choices are possible. The higher d(A, B), the closer keywords A and B are to each other. Now the big problem is to perform clustering - any kind of clustering, e.g. hierarchical - on the "keyword pair" table, using any kind of dissimilarity. This problem, just like the correlation problem, can not be split into sub-problems (followed by a merging step) using MapReduce. Why? Which solution would you propose in this case?

**Related articles**:

## Comments