
Guest blog post by Michael Walker

The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is its soul.


Let's define our terms:


Internet of Things (IOT): equipping all physical and organic things in the world with identifying, intelligent devices that allow near real-time collection and sharing of data between machines and humans. The IOT era has already begun, albeit in its first primitive stage.
Data Science: the discipline of extracting knowledge and insight from data. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.
Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time. 
We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels. 
To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines. 
Billions of machines, products and things from the physical and organic world will merge with the digital world, allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allow us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or other intelligent machines and networks), at any time, to any device. When smart machines and products can communicate, they help us and other machines understand what is happening, so we can make better decisions, act fast, save time and money, and improve products and services.
The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).
The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.
Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.
Read more…

Join us April 14th at 9am PDT for the latest in our DSC Webinar Series: The Science of Segmentation: What Questions Should You Be Asking Your Data?, sponsored by Pivotal.

Space is limited.

Reserve your Webinar seat now

Enterprise companies starting the transformation into a data-driven organization often wonder where to start. Companies have traditionally collected large amounts of data from sources such as operational systems. With the rise of big data, big data technologies and the Internet of Things (IoT), additional sources – such as sensor readings and social media posts – are rapidly becoming available. In order to effectively utilize both traditional sources and new ones, companies first need to join and view the data in a holistic context. After establishing a data lake to bring all data sources together in a single analytics environment, one of the first data science projects worth exploring is segmentation, which automatically identifies patterns.

In this DSC webinar, two Pivotal data scientists will discuss:

  • What segmentation is
  • Traditional approaches to segmentation
  • How big data technologies are enabling advances in this field

They will also share some stories from past data science engagements, outline best practices and discuss the kinds of insights that can be derived from a big data approach to segmentation using both internal and external data sources.

Grace Gee, Data Scientist -- Pivotal
Jarrod Vawdrey, Data Scientist -- Pivotal

Hosted by:
Tim Matteson, Co-Founder -- Data Science Central

Again, space is limited, so please register early:
Reserve your Webinar seat now


After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Guest blog post by Fari Payandeh



Aug 12, 2013

May I say at the outset that I know the phrase “Data Suction Appliance” sounds awkward at its best and downright awful at its worst. Honestly, I don’t feel that bad! These are some of the words used in Big Data products or company names: story, genie, curve, disco, rhythm, deep, gravity, yard, rain, hero, opera, karma… I won’t be surprised if I come across a  start-up named WeddingDB next week.

Although there is so much hype surrounding social media data, the real goldmine is in existing RDBMS databases and, to a lesser degree, in mainframes. The reason is obvious: generally speaking, data capture has been driven by business requirements, not by random tweets about where to meet for dinner. In short, the database vendors are sitting on top of the most valuable data.

Oracle, IBM, and Microsoft "own" most of the data in the world. By that I mean that if you run a query in any part of the world, it's very likely you are reading the data from a database owned by one of them. The larger the volume of data, the greater the degree of ownership; just ask anyone who has attempted to migrate 20 TB of data from Oracle to DB2. In short, they own the data because the customers are locked in. Moreover, the real value of data is much greater than the revenues generated from database licenses. In all likelihood the customer will buy other software and applications from the same vendor, since it's the safe choice. From the database vendors' standpoint, the database is a gift that keeps on giving. Although they have competed for new customers, in the absence of external threats (non-RDBMS technology) they have enjoyed a growing market that has kept them happy. Teradata, MySQL (non-Oracle flavors), Postgres, and Sybase have a small share of the overall database market.

The birth of Hadoop and NoSQL technology represented a seismic shift that shook the RDBMS market, not in terms of revenue loss or gain, but in offering businesses an alternative. The database vendors moved quickly to jockey for position, and contrary to what some believe, I don't think they were afraid of a meltdown. After all, who was going to take their data? They responded to the market lest they be deprived of the Big Data windfall.

IBM spent $16 billion on its Big Data portfolio and launched PureData for Hadoop, a hardware/software system composed of the IBM Big Data stack. It introduced SmartCloud and recently backed Pivotal's Cloud Foundry. Cloud Foundry is "like an operating system for the cloud," said Andy Piper, developer advocate for Cloud Foundry at Pivotal.

Microsoft's HDInsight products integrate with SQL Server 2012, System Center, and other Microsoft products; the Azure cloud-based version integrates with Azure cloud storage and Azure SQL Database.

Oracle introduced its Big Data Appliance bundle, comprising Oracle NoSQL Database, Oracle Linux, Cloudera Hadoop, and the HotSpot Java Virtual Machine. It also offers Oracle cloud computing services.

What is a Data Suction Appliance? There is a huge market for a high-performance data migration tool that can copy the data stored in RDBMS databases to Hadoop. Currently there are no fast ways of transferring data to Hadoop; performance is sluggish. What I envision is data transfer at the storage layer, not the database layer. Storage vendors such as EMC and NetApp have an advantage in finding a solution while working with data integration vendors like Informatica. Informatica recently partnered with VelociData, a provider of hyper-scale/hyper-speed engineered solutions. Is it possible? I would think so. I know that I am simplifying the process, but this is a high-level view of what I see as a possible solution.

Database objects are stored at specific disk addresses. It starts with the address of an instance, within which the information about the root tablespace or dbspace is kept. Once the root tablespace is identified, the information about the rest of the objects (non-root tablespaces, tables, indexes, ...) is available in data dictionary tables and views. This information includes the addresses of the data files. Data file headers store the addresses of free/used extents, and we continue on that path until the data blocks containing the target rows are identified. Next, the Data Suction Appliance bypasses the database and bulk-copies the data blocks from storage to Hadoop. Some transformations may be needed during the transfer to bring the data into a form that NoSQL databases can understand, but that can be achieved through an interface that allows administrators to specify data transfer options. The future will tell whether I am dreaming or, as cousin Vinny said, "the argument holds water".

Read more…

One of the most valuable tools that I've used, when performing exploratory analysis, is building a data dictionary. It offers the following advantages:

  • Identify areas of sparsity and areas of concentration in high-dimensional data sets
  • Identify outliers and data glitches
  • Get a good sense of what the data contains, and where to spend time (or not) in further data mining

What is a data dictionary

A data dictionary is a table with 3 or 4 columns. The first column represents a label: that is, the name of a variable, or a combination of multiple (up to 3) variables. The second column is the value attached to the label: the first and second columns actually constitute a name-value pair. The third column is a frequency count: it measures how many times the value (attached to the label in question) is found in the data set. You can add a fourth column that tells the dimension of the label (1 if it represents one variable, 2 if it represents a pair of two variables, etc.)

Typically, you include all labels of dimension 1 and 2 with count > threshold (e.g. threshold = 5), but no values, or only a very few (those with a high count), for labels of dimension 3. Labels of dimension 3 should be explored after building the dictionary for dimensions 1 and 2, by drilling down on dimension-2 label/value pairs that have a high count.

Example of dictionary entry

category~keyword travel~Tokyo 756 2

In this example, the entry corresponds to a label of dimension 2 (as indicated in column 4), and the simultaneous combination of the two values (travel, Tokyo) is found 756 times in the data set.

The first thing you want to do with a dictionary is to sort it using a compound key: column 4, then column 1, then column 3. Then look at the data and find patterns.

How do you build a dictionary

Browse your data set sequentially. For each observation, store all label/value pairs of dim 1 and dim 2 as hash table keys, and increment the count by 1 for each pair. In Perl, this can be done with code such as $hash{"$label\t$value"}++.

If the hash table grows very large, stop, save the hash table to a file, delete it from memory, and resume where you paused with a new hash table. At the end, merge the hash tables, ignoring entries where the count is too small.
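As a sketch, the whole procedure (counting dim-1 and dim-2 label/value pairs, thresholding, the fourth dimension column, and the sort) might look like this in Python; the function and field names are illustrative, not from the original post:

```python
from collections import Counter
from itertools import combinations

def build_dictionary(records, min_count=5):
    """Build a data dictionary of dim-1 and dim-2 label/value pairs.

    `records` is an iterable of dicts mapping variable name -> value.
    Returns rows of (label, value, count, dimension), the 4-column
    layout described above.
    """
    counts = Counter()
    for rec in records:
        items = sorted(rec.items())
        for name, value in items:                          # dimension 1
            counts[(name, str(value), 1)] += 1
        for (n1, v1), (n2, v2) in combinations(items, 2):  # dimension 2
            counts[(f"{n1}~{n2}", f"{v1}~{v2}", 2)] += 1
    rows = [(label, value, c, dim)
            for (label, value, dim), c in counts.items() if c >= min_count]
    # Sort by dimension, then label, then frequency (columns 4, 1, 3).
    rows.sort(key=lambda r: (r[3], r[0], r[2]))
    return rows
```

For example, 756 records with category "travel" and keyword "Tokyo" would produce the dictionary entry ("category~keyword", "travel~Tokyo", 756, 2) shown earlier.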

Originally posted in Data Science Central by Vincent Granville

Read more…

Guest blog post by Vincent Granville

Theo: One idea is that you must purchase a number of transactions before using the paid service, and add dollars regularly. A transaction is a call to the API.

The service is accessed via an HTTP call.

When the request is executed,

  • First, the script checks whether the client has enough credits (dollars)
  • If yes, it fetches the data from the client's web server: the URL for the source data is yyy
  • Then the script checks whether the source data is OK or invalid, or the client server unreachable
  • Then it executes the service zzz, typically a predictive scoring algorithm
  • The parameter field tells whether you are training your predictor (data = training set) or using it for actual predictive scoring (data outside the training set)
  • Then it processes the data very fast (a few secs for 1MM observations for the training step)
  • Then it sends an email to the client when done, with the location (on the datashaping server) of the results (the location can be specified in the API call, as an additional field, with a mechanism in place to prevent file collisions from happening)
  • Then it updates the client's budget

Note that all of this can be performed without any human interaction. Retrieving the scored data can be done with a web robot, and then integrated into the client's database (again, automatically). Training the scores would be charged much more than scoring one observation outside the training set. Scoring one observation is a transaction, and could be charged as little as $0.0025.

This architecture is for daily or hourly processing, but could be used for real time if the parameter is not set to "training". However, when designing the architecture, my idea was to process large batches of transactions, maybe 1MM at a time.
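To make the flow concrete, here is a hypothetical, heavily simplified Python sketch of the server-side steps; the function names, the training price multiplier, and the toy scoring rule are all assumptions for illustration, not the actual DataShaping implementation:

```python
# Hypothetical sketch of the server-side flow described above.
PRICE_PER_ROW = 0.0025      # scoring one observation (a transaction)
TRAINING_MULTIPLIER = 100   # training charged much more (assumed factor)

budgets = {"client42": 50.00}  # prepaid credits, in dollars

def handle_request(client_id, source_rows, mode="score"):
    """Run the billing/processing steps listed above, in order."""
    unit = PRICE_PER_ROW * (TRAINING_MULTIPLIER if mode == "train" else 1)
    cost = unit * len(source_rows)
    # 1. Check that the client has enough credits.
    if budgets.get(client_id, 0.0) < cost:
        return {"status": "error", "reason": "insufficient credits"}
    # 2-3. Fetch and validate the source data (stubbed here).
    if not source_rows:
        return {"status": "error", "reason": "source data invalid or unreachable"}
    # 4-6. Execute the service: train or score (toy scoring rule).
    results = [sum(row) for row in source_rows] if mode == "score" else "model"
    # 7. In production, email the client the location of the results.
    # 8. Update the client's budget.
    budgets[client_id] -= cost
    return {"status": "ok", "cost": round(cost, 4), "results": results}
```

For instance, scoring a batch of two observations would debit 2 × $0.0025 = $0.005 from the client's prepaid balance.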
Read more…

R in your browser

This blog was originally posted in Data Science Central by Mirko Krivanek

You have to see it to believe it. Go to the site and you can enter R commands in the browser-embedded console. I wonder how easy it would be to run R from a browser on your iPad. I'm not sure how you would import data files, but I suppose R offers the possibility of opening a file located on a web or FTP server, rather than a local file stored on your desktop. Or does it not? Also, it would be cool to have Python in a browser.

Related article


Read more…

Big Data, IOT and Security - OH MY!

Guest blog post by Carla Gentry

While we aren’t exactly “following the yellow brick road” these days, you may be feeling a bit like Dorothy from “The Wizard of Oz” when it comes to these topics. No, my friend, you aren’t in Kansas anymore! As seen above from Topsy, these three subjects are extremely popular these days, and over the last 30 days they seem to follow a similar pattern (coincidence?).


The internet of things is not just a buzzword and is no longer a dream; sensors abound. The world is on its way to becoming totally connected, although it will take time to work out a few kinks here and there (with a great foundation, you create a great product; this foundation is what will take the most time). Your appliances will talk to you in your “smart house,” and your “self-driving car” will take you to your super-tech office where you will work with ease thanks to all the wonders of technology. But let’s step back to reality and think: how is all this going to come about, what will we do with all the data collected, and how will we protect it?


First things first: all the sensors have to be put in place, and many questions have to be addressed. Does a door lock by one vendor communicate with a light switch by another vendor? Do you want the thermostat to be part of the conversation? Will anyone else be able to see my info or get into my home?

How will all the needed sensors be installed, and will there be any “human” interaction? It will take years to put all the needed sensors in place, but some organizations are already engaging in the IOT here in the US. Hotels, as one example among many, are using sensors connected to products available for sale in each room. This is great, but I recently had an experience that showed how “people” are a vital part of the “IOT”. I went to check out of a popular hotel in Vegas and was asked if I had drunk one of the coffees in the room. I replied, “no, why?” and was told that the sensor showed I had either drunk or moved the coffee. The hotel clerk verified that I had “moved” and not “drunk” the coffee, but without her, I would have been billed and had to refute the charge. Refuting charges is not exactly good for business, and customer service having to handle “I didn’t purchase this” disputes 24/7 wouldn’t exactly make anyone’s day, so thank goodness for human interaction right there on the spot.


“The Internet of Things” is not just a US effort; Asia, in my opinion, is far ahead of the US as far as the internet of things is concerned. Commuters waiting in a Korean subway station can browse and scan the QR codes of products, which will later be delivered to their homes (Source: Tesco). Transport for London’s central control centers use aggregated sensor data to deploy maintenance teams, track equipment problems, and monitor goings-on in the massive, sprawling transportation system. Telent’s Steve Pears said in a promotional video for the project that "We wanted to help rail systems like the London Underground modernize the systems that monitor its critical assets—everything from escalators to lifts to HVAC control systems to CCTV and communication networks." The new smart system creates a computerized and centralized replacement for a public transportation system that in many cases relied on notebooks and pens.


But isn't the Internet of Things too expensive to implement? Many IoT devices rely on multiple sensors to monitor the environment around them. The cost of these sensors declined 50% in the past decade, according to Goldman Sachs, and prices are expected to continue dropping at a steady rate, leading to even more cost-effective sensors.



The Internet of Things is not just about gathering data but also about analyzing and using it. All this data generated by the internet of things, when used correctly, will help us in our everyday lives as consumers, and will help companies keep us safer by predicting, and thus avoiding, issues that could harm or delay us. Not to mention the costs that could be reduced by finding patterns in data for transportation, healthcare and banking; the possibilities are endless.


Let’s talk about security and data breaches. Now you may be thinking: I’m in analytics or data science, so why should I be concerned with security? Let’s take a look at several breaches that have made the headlines lately.


Target recently suffered a massive security breach thanks to attackers infiltrating a third party, and so did Home Depot. PCWorld said, “Data breach trends for 2015: Credit cards, healthcare records will be vulnerable.”



Sony was hit by hackers on Nov. 24, resulting in a company-wide computer shutdown and the leak of corporate information, including the multimillion-dollar pre-bonus salaries of executives and the Social Security numbers of rank-and-file employees. A group calling itself the Guardians of Peace has taken credit for the attacks.


So how do we protect ourselves in a world of BIG DATA and the IOT?
Why should I, as a data scientist or analyst, be worried about security? That’s not really part of my job, is it? Well, if you are a consultant or own your own business, it is! Say you download secure data from your clients and then YOU get hacked; guess who is liable if sensitive information is leaked or gets into the wrong hands? What if you develop a platform where the client’s customers can log in and check their accounts, with credit card info and purchase histories stored on the system? If that data is stolen, it can set you up for a lawsuit. If you are a corporation, you are protected to some extent, but if you operate as a sole proprietor, you could lose your home, company and reputation. Still think security when dealing with big data isn’t important?

Organizations need to get better at protecting themselves and at discovering when they’ve been breached, and we, the consultants, need to do a better job of protecting our own data. That means you can’t use “password” as a password! Let’s not make it easy for the hackers, and let’s be sure that when we collect sensitive data (yes, even the data collected from cool technology toys connected to the internet) we are security-minded: check your statements, logs and security messages, and verify everything! When building your database, use all the security features available (masking, obfuscation, encryption) so that if someone does gain access, what they steal is NOT usable!


Be safe and enjoy what tech has to offer with peace of mind, and at all costs, protect your DATA.


I’ll leave you with a few things to think about:

“Asset management critical to IT security”
"A significant number of the breaches are often caused by vendors but it's only been recently that retailers have started to focus on that," said Holcomb. "It's a fairly new concept for retailers to look outside their walls." (Source:


“Data Scientist: Owning Up to the Title”
Enter the Data Scientist; a new kind of scientist charged with understanding these new complex systems being generated at scale and translating that understanding into usable tools. Virtually every domain, from particle physics to medicine, now looks at modeling complex data to make our discoveries and produce new value in that field. From traditional sciences to business enterprise, we are realizing that moving from the "oil" to the "car", will require real science to understand these phenomena and solve today's biggest challenges. (Source:



Forget about data (for a bit): what’s your strategic vision to address your market?

Where are the opportunities given global trends and drivers? Where can you carve out new directions based on data assets? What is your secret sauce? What do you personally do on an everyday basis to support that vision? What are your activities? What decisions do you make as a part of those activities? Finally what data do you use to support these decisions?


Read more…

What MapReduce can't do

Guest blog post by Vincent Granville

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, then gathers and aggregates the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.
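That word-count workflow can be simulated in a few lines of Python; the two tiny "chunks" below stand in for the 1,000 one-gigabyte subsets:

```python
from collections import Counter

def map_count(chunk):
    """Map step: compute a keyword frequency table for one chunk."""
    return Counter(chunk.split())

def reduce_counts(partial_tables):
    """Reduce step: aggregate the per-chunk tables into a final table."""
    total = Counter()
    for table in partial_tables:
        total.update(table)
    return total

# Stand-in for 1,000 one-gigabyte subsets: each chunk is mapped
# independently (in a real cluster, on a different server).
chunks = ["big data is big", "data science is science"]
word_freq = reduce_counts(map_count(c) for c in chunks)
```

Each map step touches only its own chunk, which is exactly why this problem distributes so cleanly.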

However, when you need to process variables or data sets jointly, that is 2 by 2 or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come up with a more sophisticated solution.

The Problem

Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.

You have to compute k * (k-1) / 2 correlations to solve this problem, despite the fact that you only have k = 10,000 stock symbols. You cannot split your 10,000 stock symbols into 1,000 clusters, each containing 10 stock symbols, and then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster and another in a different cluster (because you have far more correlations to compute than you have clusters). These cross-cluster computations make MapReduce useless in this case. The same issue arises if you replace the word "correlation" with any other function, say f, computed on two variables rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.
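A quick back-of-the-envelope check (Python 3.8+ for math.comb) shows just how little of the work any such clustering could keep local:

```python
from math import comb

k, clusters = 10_000, 1_000
per_cluster = k // clusters                    # 10 symbols per cluster

total_pairs = comb(k, 2)                       # correlations required
within = clusters * comb(per_cluster, 2)       # computable inside a cluster
cross_fraction = (total_pairs - within) / total_pairs
```

With 10,000 symbols split into 1,000 clusters of 10, only 45,000 of the 49,995,000 required correlations (under 0.1%) can be computed within a cluster; over 99.9% cross cluster boundaries.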

Three Solutions

Here I propose three solutions:

1. Sampling

Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increasing the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide which variables to keep and which ones to add to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs of variables.

I'll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million times less computing resources than finding the global optimum, yield very good approximations, with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4, etc., to find very high-quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

2. Binning

If you can bin your variables in a way that makes sense, and if n is small (say n = 5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, the variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n = 5, there are only 2^5 = 32 possible signal vectors per stock, and thus only about 512 distinct unordered pairs of vectors. An example of such a pair is {(up, up, down, up, down), (up, up, up, down, down)}, where the first 5 values correspond to one stock and the last 5 values to another. It is thus easy to pre-compute the correlations for all of these pairs. You will still have to browse all k * (k-1) / 2 pairs of stocks to solve your problem, but now it's much faster: for each pair, you get the correlation from the lookup table with no computation required, only a single access to a hash table or an array with about 512 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs might be faster than using lookup tables to access the pre-computed correlations. However, the principle works regardless of whether you compute a correlation or a much more complicated function f.

3. Classical data reduction

Traditional reduction techniques can also be used: forward or backward stepwise techniques where, in turn, you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy, and conversely for variables being removed. Entropy can be measured in various ways. In a nutshell, if you have two data subsets (from the same large data set):

  • A set A with 100 variables, which is 1.23 GB when compressed, 
  • A set B with 500 variables, including the 100 variables from set A, which is 1.25 GB when compressed

Then you can say that the extra 400 variables (e.g. stock symbols) in set B don't bring any extra predictive power and can be ignored. In other words, the lift obtained with set B is so small that it's probably smaller than the noise inherent to these stock price signals.
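The compression idea can be demonstrated with a toy experiment; here the "new" variables are exact copies of the old ones (an extreme case), and zlib's compressed size stands in for entropy:

```python
import random
import zlib

def compressed_size(columns):
    """Proxy for joint entropy: compressed size of the serialized variables."""
    blob = "\n".join(",".join(map(str, col)) for col in columns).encode()
    return len(zlib.compress(blob, 9))

random.seed(0)
# Set A: 5 variables of random up/down signals.
set_a = [[random.choice((1, -1)) for _ in range(1000)] for _ in range(5)]
# Set B: set A plus 5 "new" variables that are exact copies of set A.
set_b = set_a + [list(col) for col in set_a]

size_a = compressed_size(set_a)
size_b = compressed_size(set_b)
# size_b is barely larger than size_a: the duplicated variables add
# almost no information, so they can be ignored.
```

Real redundant variables are rarely exact copies, but the same comparison applies: if adding them barely grows the compressed size, they carry little extra predictive power.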

Note: An interesting solution consists of using a combination of the three previous strategies. Also, be careful to make sure that the high correlations found are not an artifact caused by the "curse of big data" (see reference article below for details).

Another example where MapReduce is of no use

Building a keyword taxonomy:

Step 1:

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword and for each "keyword pair". A "keyword pair" is two keywords found on the same web page, or close to each other on the same web page. Also, by keyword I mean something like "California insurance": a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many millions of keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A="California insurance", B="home insurance", x=543, y=998, z=11


  • x is the number of occurrences of keyword A in all the web pages that you crawled
  • y is the number of occurrences of keyword B in all the web pages that you crawled
  • z is the number of occurrences where A and B form a pair (e.g. they are found on the same page)

This "keyword pair" table can indeed be very easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B never form a "keyword pair"; in other words, z = 0. By ignoring these null entries, your "keyword pair" table is still manageable, and might contain as few as 50 million entries.

Step 2:

To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a similarity d(A, B) between two keywords A and B, for instance d(A, B) = z / SQRT(x * y), although other choices are possible. The higher d(A, B), the closer keywords A and B are to each other. Now the big problem is to perform clustering (any kind of clustering, e.g. hierarchical) on the "keyword pair" table, using this similarity. This problem, just like the correlation problem, cannot be split into sub-problems (followed by a merging step) using MapReduce. Why? Which solution would you propose in this case?
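As a trivial sketch, the similarity for the sample entry above can be computed directly:

```python
from math import sqrt

def keyword_similarity(x, y, z):
    """d(A, B) = z / sqrt(x * y): how often A and B co-occur,
    normalized by their individual frequencies."""
    return z / sqrt(x * y)

# The entry from the example table above:
# A = "California insurance", B = "home insurance", x=543, y=998, z=11.
d = keyword_similarity(543, 998, 11)   # roughly 0.015
```

Computing d for each entry is embarrassingly parallel; it is the clustering over all entries at once that resists a MapReduce split.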

Related articles:

Read more…

Originally posted in Data Science Central by Mirko Krivanek

Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It is developed by Vladimir Agafonkin with a team of dedicated contributors. Weighing just about 33 KB of JS, it has all the features most developers ever need for online maps.

Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box, taking advantage of HTML5 and CSS3 on modern browsers while still being accessible on older ones. It can be extended with a huge amount of plugins, has a beautiful, easy-to-use and well-documented API, and a simple, readable source code that is a joy to contribute to.

In this basic example, we create a map with tiles of our choice, add a marker and bind a popup with some text to it:

For an interactive map and source code in text format, click here.

Learn more with the quick start guide, check out other tutorials, or head straight to the API documentation.
If you have any questions, take a look at the FAQ first.


Guest blog post by Don Philip Faithful

The idea of environmental determinism once made a lot of sense. Hostile climates and habitats prevented the expansion of human populations. The conceptual opposite of determinism is called possibilism. These days, human populations can be found living in many inhospitable habitats. This isn't because humans have physically evolved; rather, it is because we normally occupy built-environments. We exist through our technologies and advanced forms of social interaction: a person might not be able to build a house, but he or she can arrange for financing to have a house constructed. "Social possibilism" has enabled our survival in inhospitable conditions. Because humans today almost always live within or in close proximity to built-environments, data is among the most important factors affecting human life. The systems that support human society make use of data in all of its multifarious forms; this being the case, data science is important to our continuation and development as a species. This blog is a discussion highlighting the need for a universal data model. I find that the idea of "need" is highly subjective, and perhaps the tendency is to focus on organizational needs specifically. I don't dispute the importance of such a perspective. But I hope that readers consider the role of data on a more abstract level in relation to social possibilism. It is this role that the universal data model is meant to support. Consider some barriers or obstacles that underline the need for a model, listed below.

Barriers to Confront

I certainly don't suggest in this blog that I am introducing the authoritative data model to end all models. Quite the contrary: I feel that my role is to help promote discussion. I imagine that even in the list of barriers, there might be some disagreement among data scientists.

(1) Proxy reductionism triggered by instrumental needs: I believe some areas of business management have attempted to address highly complex phenomena through the simplification of proxies (i.e. data). The nominal representation of reality facilitates production, but also insulates an organization from its environment. Thus production can occur disassociated from surrounding phenomena. I feel that this nominalism is due to the lack of a coherent model to connect the use of data to theory. We gain the illusion of progress through greater disassociation, exercising masterful control over data while failing to take into account and influence real-life conditions.

(2) Impairment from structurally inadequate proxies: Upon reducing a problem through the use of primitive proxies, an organization might find development less accessible. I believe that a data model can help in the process of diagnosis and correction. I offer some remedial actions likely applicable to a number of organizations: i) collection of adequate amounts of data; ii) collection of data of greater scope; and iii) ability to retain the contextual relevance of data.

Social Disablement Perspective

My graduate degree is in critical disability studies - a program that probably seems out of place in relation to data science. Those studying traditional aspects of disability might argue that this discipline doesn't seem to involve big data, algorithms, or analytics. Nonetheless, I believe that disablement is highly relevant in the context of data science, albeit perhaps in a conceptual sense. While there might not be people with apparent physical or mental disabilities, there are still disabling environments. Organizations suffering from an inability to extract useful insights from their data might not be any more disabled than the data scientist surrounded by tools and technologies disassociated from their underlying needs. Conversely, those in the field of disability might discuss the structural entrenchment of disablement without ever targeting something as pervasive as data systems. However, for those open to different perspectives, I certainly discuss aspects of social disablement in my blogs all the time. Here, I will be arguing that at its core, data is the product of two forces in a perpetual tug-of-war: disablement and participation. So there you go. I offer some cartoon levity as segue.

I recently learned that the term "stormtroopers" has been used to describe various military forces. For the parable, assume that I mean Nazi shock troops. I'm uncertain how many of my peers have the ability to write computer programs. I create applications from scratch using a text editor. Another peculiarity of mine is the tendency to construct and incorporate elaborate models into my programming. It is never enough for a program to do something. I search for a supporting framework. Programming for me is as much about research through framework-development as it is about creating and running code. In the process of trying to communicate models to the general public, I sometimes come up with examples that I admit are a bit offbeat. Above in the "Parable of the Stormtrooper and the Airstrip," I decided to create personifications to explain my structural conceptualization of data. The stormtrooper on the left would normally be found telling people what to do. Physical presence or presence by physical proxy is rather important. (I will be using the term "proxy" quite frequently in this blog.) He creates rules or participates in structures to impose those rules. He hollers at planes to land on his airstrip. I chose this peculiar behaviour deliberately. Command for the soldier is paramount, effectiveness perhaps less so. In relation to the stormtrooper, think social disablement; this is expressed on the drawing as "projection."

On the other side of the equation is a person who sort of resembles me, and whom I have identified as me, although this is a personification of an aspect of data. He is not necessarily near or part of the enforcement regime. His objective, rather than to compel compliance, is to make sense of phenomena: he aims to characterize and convey it, especially those aspects of reality that might be associated with, but not necessarily resulting from, the activities of the stormtrooper. There are no rules for this individual to impose. Nor does he create structures to assert his presence over the underlying phenomena. In his need to give voice to phenomena, he seeks out "ghosts" through technology. If this seems a bit far-fetched, at least think of him as a person with all sorts of tools designed to detect events that are highly evasive. Perhaps his objective is to monitor trends, consumer sentiment, heart palpitations, or patterns leading to earthquakes. Participation is indicated on the drawing as "articulation."

So how is a model extracted from this curious scene? I added to the drawing what I will refer to as the "eye": data appears in the middle surrounded by projection and articulation. Through this depiction, I am saying that data is never just plain data. It is a perpetual struggle between the perceiver and perceived. I think that many people hardly give "data" much thought: e.g. here is a lot of data; here is my analysis; and here are the results. But let us consider the idea that data is actually quite complicated from a theoretical standpoint. I will carry on this discussion using an experiment. The purpose of this experiment is not to arrive at a conclusion but rather to perceive data in abstract terms.

An Experiment with No Conclusion

A problem when discussing data on an abstract level is the domain expertise of individuals. I realize this is an ironic position to take given so many calls for greater domain expertise in data science. The perspective of a developer is worth considering: he or she often lacks domain expertise, and yet this person is responsible for how different computer applications make use of data. Consequently, in order to serve the needs of the market, it is necessary for the developer to consider how "people" regard the data. Moreover, the process of coding imposes distance or abstraction since human mental processes and application processes are not necessarily similar. A human does not generate strings from bytes and store information at particular memory addresses. But a computer must operate within its design parameters. The data serves human needs due to the developer's transpositional interpretation of the data. The developer shapes the manner of conveyance, defines the structural characteristics of the data, and deploys it to reconstruct reality.

I have chosen an electrical experiment. There is just a single tool, a commercial-grade voltmeter designed to detect low voltages. The voltage readings on this meter often jump erratically when I move it around a facility full of electrical devices; this behaviour occurs when the probes aren't touching anything. Now, the intent in this blog is not to determine the cause of the readings. I just want readers to consider the broader setting. Here is the experiment: with the probes sitting idle on a table, I took a series of readings at two different times of the day. The meter detected voltage - at first registering negative then becoming positive after about a minute. As indicated below on the illustration, these don't appear to be random readings. Given that there is data, what does it all mean? The meter is reading electrical potential, and this is indeed the data. What is the data in more abstract terms, regardless of the cause?

Being a proxy is one aspect of data. Data is a constructed representation of its underlying phenomena: the electrical potential is only a part of the reality captured in this case by the meter. The readings from the meter define and constrain the meaning of the data such that it only relates to output of the device. In other words, what is the output of the device? It is the data indicated on the meter. It is a proxy stream; this is what we might recognize in the phenomena; for this is what we obtain from the phenomena using the meter. From the experiment itself, we actually gain little understanding of the phenomena. We only know its electrical readings. So although the data is indeed some aspect of the articulated reality, this data is more than anything a projection of how this reality is perceived. It is not my intention to dismiss the importance of the meter readings. However, we would have to collect far more data to better understand the phenomena. Our search cannot be inspired by the meter readings alone; it has to be driven by the phenomena itself.

Another problem relates to how the meter readings are interpreted. Clearly the readings indicate electrical potential; so one might suggest that the data provides us with an understanding of how much potential is available. The meter provides data not just relating to electrical potential alone but also dynamic aspects of the phenomena: its outcomes, impacts, and consequences. This is not to say that electrical potential is itself the outcome or antecedent of an outcome; but it is part of the reality of which the device is designed to provide only readings of potential. We therefore should distinguish between data as a proxy and the underlying phenomena, of which the data merely provides a thin connection or conduit. There is a structure or organizational environment inherent in data that affects the nature and extent to which the phenomena is expressed. The disablement aspect confines phenomena to contexts that ensure the structure fulfills instrumental requirements. Participation releases the contextual relevance of data.

Initial Conceptualization

I have met people over the years that refuse to question pervasive things. I am particularly troubled by the expression "no brainer." If something is a no-brainer, it hardly makes sense to discuss it further; so I imagine these people sometimes avoid deliberating over the nature of things. This strategy is problematic from a programming standpoint, where it is difficult to hide a fundamental lack of knowledge. It then becomes apparent that the "no brainer" might be the person perceiving the situation as such. Keeping this interpretation of haters and naysayers in mind, let's consider the possibility that it actually takes all sorts of brains to characterize data - that in fact the task can incapacitate both people and supercomputers. If somebody says, "Hey, that's a no brainer" to me or anybody else, my response will be, "You probably mean that space in your head!"  (Shakes fist at air.)

I provide model labels on the parable image: projection, data, and articulation. I generally invoke proper terms for aspects of an actual model. "Disablement" can be associated with "projection" on the model; and "participation" with the term "articulation." The conceptual opposition is indicated on the image below as point #1. Although the parable makes use of personifications, there can sometimes be entities in real life doing the projection: e.g. the oppressors. There can also be real people being oppressed. In an organizational context, the issue of oppression is probably much less relevant, but the dynamics still persist between the definers and those being defined: e.g. between corporate strategists and consumers. Within my own graduate research, I considered the objectification of labourers and workers. As production environments have developed over the centuries, labour has become commodified. In the proxy representation, workers have been "defined" using the most reified metrics; but there is a counterforce also, for self-definition or some level of autonomy. Data exists within a context of competing interests, as indicated at point #2.

From the experiment I indicated how data is like a continuum formed by phenomena and its radiating consequences: I said that readings can be taken of dynamic processes. This is a bit like throwing stones in a lake and having sensors detect ripples and counter-ripples. An example would be equity prices in the stock market where a type of algorithmic lattice can bring to light the dynamic movement of capital. Within this context, it is difficult to say whether what we are measuring is more consequence or antecedent; but really it is both. I believe it is healthy to assume that the data-logger or reading device offers but the smallest pinhole to view the universe on the other side. Point #3 shows these additional dynamics. There is a problem here in terms of graphical portrayal - how to bring together all three points into a coherent model. I therefore now introduce the universal data model. I also call this the Exclamation Model or the Model! The reasons will be apparent shortly.


The Exclamation Model visually resembles an exclamation mark, as shown on the image below. For the purpose of helping readers navigate, I describe the upper portion of the model as "V" and the lower part as "O," or "the eye" as I mentioned previously, since it resembles a human eye. The model attempts to convey all of the different things that people tend to bundle up in data, perhaps at times subconsciously. An example I often use in my blogs is sales data, which doesn't actually tell us much about why consumers buy products. There might be high demand one year followed by practically no demand the next; yet analysts try to plot sales figures as if the numbers follow some sort of intrinsic force or built-in natural pattern. Sales figures do not represent an articulation of the underlying phenomena; rather, they cause externally desired aspects of the phenomena to conform to an interpretive framework. Within any organizational context, there is a battle to dictate the meaning of data. If an organization commits itself to the collection of sales data and nothing beyond this to understand its market, it would be difficult at a later time to find a suitable escape route leading away from the absence of information. The eye is inherent in the structure of data, extending in part from the authority and control of those initiating its collection.

As one goes up the V, both projection and articulation transform to accommodate the increasing complexity of the phenomena; but also while going up, there is greater likelihood of separation between the articulated reality (e.g. employee stress) and the instrumental projection (e.g. performance benchmarks) resulting in different levels of alienation. As one travels down the V, there is less detachment amid declining complexity, which improves the likelihood of inclusion. In this explanation, I am not suggesting that alienation or inclusion is directly affected by the level of sophistication in the data. The V can become narrower or wider depending on design. Complexity itself does not cause alienation between data and its phenomena; but there is greater need for design to take complexity into account due to the risk of alienation. It might be tempting to apply this model to social phenomena directly, but actually this is all meant for the data that is conveying phenomena. Data can be alienated from complex phenomena.

Rooted in Systems Theory

I realize that the universal data model doesn't resemble a standard input-process-output depiction of a system; but actually it is systemic. Projection provides the arrow for productive processes sometimes portrayed in a linear fashion: input, process, and output. Articulation represents what has often been described as "feedback." Consequently, the eye suggests that the entire system is a source of data. In another blog, I support this argument by covering three major data types that emerge in organizations: data related to projection resulting from metrics of criteria; data from routine operations as part of production processes; and data from articulation from the metrics of phenomena. The eye is rather like a conventional system viewed from a panoramic lens. The V provides an explanation of the association between proxies and phenomena under different circumstances.

Arguments Regarding Evidence

The simplification movement has mostly been about simplification of proxies and not the underlying phenomena. Data as a proxy is made simpler in relation to what it is meant to represent. Consider a common example: although employees might have many attributes and capabilities, in terms of data they are frequently reduced to hours worked. The number of hours worked is a metric intended to indicate the cost of labour. A data system might retain data focused on the most instrumental aspects of production thereby giving the illusion that an organization is only responsible for production. I feel that as society becomes more complex and the costs associated with data start to decline in relation to the amount of data collected, the obligation that an organization has to society will likely increase. This obligation will manifest itself in upgrades to data systems and not only this but improved methodologies surrounding the collection and handling of data. The model provides a framework to examine the extent to which facts could and should have been collected. Consider a highly complex problem such as infection rates in a hospital. The hospital might address this issue by collecting data on the number of hours lost through illness and sick days used. But this alone does not provide increased understanding of infections; some might argue therefore that such inadequate efforts represent a deliberate form of negligence apparent in the alienation of proxies.

Relation to Computer Coding

I have a habit of inventing terms to describe things particularly in relation to application development. Experience tells me that if I fail to invent a term and dwell on its meaning, the thing that I am attempting to describe or understand will fade away. I am about to make use of a number of terms that have meaning to me in my own projects; and I just want to explain that I claim no exclusive rights or authority over these terms. In this blog, I have described data as "proxy" for "phenomena." I make use of a functional prototype called Tendril to examine many different types of data. Using Tendril, there are special terms to describe particular types of proxies: events, contexts, systems, and domains. These proxies all represent types of data or more specifically the organization of aspects of phenomena that we customarily refer to as data.

The most basic type of proxy is an event. I believe that when most people use the term "data," they mean a representation quite close to a tangible aspect of phenomena. I make no attempt to confine the meaning of phenomena. There can be hundreds of events describing different aspects of the same underlying reality. I consider the search for events a fluid process that occurs mostly on a day-to-day level rather than during design. Another type of proxy - i.e. a different level of data - is called a context. Phenomena can "participate" in events. The "relevance" of events to different contexts is established on Tendril using something called a relevancy algorithm. I placed a little person on the illustration to show what I consider to be the comfort zone for most people in relation to data. I would say that people tend to focus on the relevance of events to different contexts.

The idea of "causality" takes on special meaning in relation to the above conceptualization. Consider the argument that poverty is associated with diabetes. Two apparently different domains are invoked: social sciences and medicine. Thus, the events pertaining to social phenomena are being associated with a medical context. The social phenomena might relate to unemployment, stress, poor nutrition, inaccessible education, violence, homelessness, inadequate medical care: any reasonable person even without doing research could logically infer adverse physiological and psychological consequences. Yet the connection might not be made I believe because the proxy seems illegitimate. How can a doctor prescribe treatment? If human tolerance for social conditions has eroded, one approach is to treat the problem as if it were internal to the human body. Yet the whole point of the assertion is to identify the importance of certain external determinants. Society has come to interpret diabetes purely as a medical condition internal to the body. This is an example of how data as a proxy can become alienated from complex underlying phenomena. We say that people are diseased, failing to take into account the destructive and unsustainable environment that people have learned to tolerate.

Since there is no ceiling or floor on the distribution of proxies in real life, the focus (on contexts and events) does not necessarily limit the data that people use but rather the way that they interpret it, not being machines. I feel that due to its abundance, people habitually choose their place in relation to data; and they train themselves to ignore data that falls outside their preferred scope. Moreover, the data that enters their scope becomes contextually predisposed. Consequently, it might seem unnecessary to make use of massive amounts of data and many different contexts (e.g. in relation to other interests). But this predisposition is like choice of attire. The fact that data might fall outside of scope does not negate its broader relevance; nor does its presence within scope mean that it is relevant only in a single way.

The Phantom Zone

It is not through personal strength or resources that a person can get a road fixed. One calls city hall. There is no need to build shelter. One rents an apartment or buys a house. In human society, there are systems in place to handle different forms of data. These systems operate in the background at times without our knowledge enabling our existence in human society and offering comfort. Our lack of awareness does not mean that the systems do not exist. Nor does our lack of appreciation for the data mean that the structure of the data is unimportant. In fact, I suggest that the data can enable or disable the extent to which these systems serve the public good. Similarly, the way in which organizations objectify and proxy phenomena can lead to survivorship outcomes. An organization can bring about its own deterministic conditions.

The universal data model - really just "introduced" in this blog - is meant to bring to light the power dynamics inherent in data: the tug-of-war between disablement and participation. I have discussed how an elaborate use of proxies can help to reduce alienation (of the data from its underlying phenomena) and accommodate greater levels of complexity to support future development. This blog was inspired to some extent by my own development projects where I actually make creative use of proxies to examine phenomena. However, this is research-by-proxy - to understand through the manipulation of data structures the existence of ghosts - entities that are not necessarily material in nature. I attempt to determine the embodiment of things that have no bodies - the material impacts of the non-material - the ubiquity of the imperceptible. It might seem that humans have overcome many hostile environments. While we have certainly learned to conquer the material world, there are many more hazards lurking in the chasms of our data awaiting discovery. However, before we can effectively detect passersby in the invisible and intangible world, we need to accept how our present use of data is optimized for quite the opposite. Our evolution as a species will depend on our ability to combat things beyond our natural senses.


Data Scientists vs. Data Engineers

Guest blog post by Michael Walker

More and more frequently we see organizations make the mistake of mixing and confusing team roles on a data science or "big data" project - resulting in over-allocation of responsibilities assigned to data scientists. For example, data scientists are often tasked with the role of data engineer, leading to a misallocation of human capital. Here the data scientist wastes precious time and energy finding, organizing, cleaning, sorting and moving data. The solution is adding data engineers, among others, to the data science team.
Data scientists should be spending time and brainpower on applying data science and analytic results to critical business issues - helping an organization turn data into information - information into knowledge and insights - and valuable, actionable insights into better decision making and game changing strategies.
Data engineers are the designers, builders and managers of the information or "big data" infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it. And they make sure those systems are performing smoothly.
Data science is a team sport. There are many different team roles, including: 
Business architects;
Data architects;
Data visualizers;
Data change agents.
Moreover, data scientists and data engineers are part of a bigger organizational team including business and IT leaders, middle management and front-line employees. The goal is to leverage both internal and external data - as well as structured and unstructured data - to gain competitive advantage and make better decisions. To reach this goal an organization needs to form a data science team with clear roles.

Join us for the latest DSC Webinar on March 24, 2015
Space is limited.
Reserve your Webinar seat now
Please join us March 24, 2015 at 9am PST for the latest in DSC's Webinar Series: Data Lakes, Reservoirs, and Swamps: A Data Science and Engineering Perspective, sponsored by Think Big, a Teradata Company.

In the fast-paced and ever-changing landscape of Hadoop-based data lakes, there tend to be varying definitions of what constitutes a data lake and how it should be used for business benefit—especially in leveraging data science.

In this webinar, Think Big will share their perspective on Hadoop data lakes from their many consulting engagements. Drawing from their experience across multiple industries, Daniel Eklund and Dan Mallinger will share stories of data lake challenges and successes. You will also learn how data scientists are leveraging Hadoop data lakes to discover, document, and enable new business insights.

Finally, the presenters will discuss skills needed for data science success and how to grow your skills if you want to become a data scientist. 

Daniel Eklund, Data Science Practice Manager, Think Big
Dan Mallinger, Engineering Practice Manager, Think Big

Hosted by: Tim Matteson, Cofounder, Data Science Central
Title:  Data Lakes, Reservoirs, and Swamps: A Data Science and Engineering Perspective
Date:  March 24, 2015
Time:  9:00 AM - 10:00 AM PT
Again, space is limited, so please register early:
Reserve your Webinar seat now
After registering you will receive a confirmation email containing information about joining the Webinar.

Guest blog post by Deepak Kumar

Before going into the details of what big data is, let's take a moment to look at the slides below by Hewlett-Packard.

By going through these slides, you must have realized how much data we are generating every second, of every minute, of every hour, of every day, of every month, of every year.

A phrase that is really popular nowadays, and a true one: more than 90% of all data has been generated in the last two years alone.

And data generation is growing exponentially, day by day, with the increasing use of devices and digitization across the globe.

So what is the problem with these huge amounts of data?

When common database management systems were built, they were designed with a certain scale in mind. Organizations were simply not prepared for the scale of data that we are producing nowadays.

As the requirements of these organizations have increased over time, they have had to rethink and reinvest in their infrastructure. And the cost of the resources involved in scaling up that infrastructure increases exponentially.

Further, there are limits on how far factors like machine size, CPU, and RAM can be scaled up. These traditional systems cannot support the scale required by most companies.

Why can't traditional data management tools and technologies handle these numbers?

The data coming to us can be characterized in terms of VOLUME, VELOCITY and VARIETY. And this is where the problems start.

  • Volume: Today, organizations like NASA, Facebook, Google and many other such companies are producing enormous amounts of data per day. This data needs to be stored, analyzed and processed in order to learn about the market, trends, and customers and their problems, along with the solutions.
  • Variety: We are generating data from different sources in different forms - videos, text, images, emails, binaries and lots more - and most of this data is unstructured or semi-structured. The traditional data systems we know all work on structured data, so it is quite difficult for those systems to handle the quality and quantity of data we are producing nowadays.
  • Velocity: Take the example of a simple query where you want to fetch the name of a person from millions of records. While the data stays in the millions or billions of records, we are fine with traditional systems; but beyond that, even the simplest of queries takes a long time to execute. And here we are talking about the analysis and processing of data in the range of hundreds and thousands of petabytes, or exabytes and beyond. To analyze it, we have to develop systems that process data at much higher speed and with high scalability.

Volume, velocity and variety, popularly known as the 3 Vs, are addressed using the solutions provided by BigData. So before going into the details of how BigData handles these complex situations, let's try to create a short definition of BigData.

What is Big Data?

A dataset whose volume, velocity, variety, and complexity are beyond the ability of commonly used tools to capture, process, store, manage, and analyze can be termed BIG DATA.

How does Big Data handle these complex situations?

Most Big Data tools and framework architectures are built with the following characteristics in mind:
  • Data distribution: The large data set is split into chunks, or smaller blocks, and distributed over N nodes or machines. The data is thus spread across several nodes and becomes ready for parallel processing. In the Big Data world this kind of distribution is done with the help of a Distributed File System (DFS).
  • Parallel processing: The distributed data gains the combined power of the N servers and machines on which it resides, which work in parallel on its processing and analysis. After processing, the partial results are merged into the final required result. The process is known as MapReduce, adopted from Google's MapReduce research work.
  • Fault tolerance: Generally we keep more than one replica of each block (or chunk) of data. Even if one server or machine goes down completely, we can still get our data from a different machine or data center. We might think that replicating data costs a lot of space, and here the next point comes to the rescue.
  • Use of commodity hardware: Most Big Data tools and frameworks run on commodity hardware, so we don't need specialized hardware with special RAID as a data container. This reduces the cost of the total infrastructure.
  • Flexibility and scalability: It is quite easy to add more rack space to the cluster as the demand for storage increases, and the way these architectures are designed fits that scenario very well.
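To make the data distribution and fault tolerance points concrete, here is a toy C# sketch, not real DFS code: the block size, replication factor, and node names are all invented for illustration. It splits data into fixed-size blocks and assigns each block to more than one node, round-robin, so losing any single node still leaves a copy of every block.

```csharp
using System;
using System.Collections.Generic;

static class DfsSketch
{
    const int BlockSize = 4;         // toy block size; real DFSs use e.g. 128 MB
    const int ReplicationFactor = 2; // keep each block on 2 different nodes

    static readonly string[] nodes = { "node1", "node2", "node3" };

    // Split the data into blocks and map each block to ReplicationFactor nodes.
    public static Dictionary<int, List<string>> Distribute(byte[] data)
    {
        var placement = new Dictionary<int, List<string>>();
        int blockCount = (data.Length + BlockSize - 1) / BlockSize;
        for (int b = 0; b < blockCount; b++)
        {
            var replicas = new List<string>();
            for (int r = 0; r < ReplicationFactor; r++)
                replicas.Add(nodes[(b + r) % nodes.Length]); // round-robin placement
            placement[b] = replicas;
        }
        return placement;
    }

    static void Main()
    {
        foreach (var entry in Distribute(new byte[10])) // 10 bytes -> 3 blocks
            Console.WriteLine($"block {entry.Key} -> {string.Join(", ", entry.Value)}");
    }
}
```

Each block lives on two of the three nodes, so any single node failure is survivable; real systems add rebalancing, heartbeats, and rack awareness on top of this idea.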

    Well, these are just a few examples of the complex problems that are being solved using Big Data solutions.

    Again, this article covers only a glass of water from the entire ocean. So get started and take a deep dive into the Big Data world, or, if I may say so, the Big Data Planet :)

    The article First appeared on

    If you like what you just read and want to continue learning about Big Data, you can subscribe to our email list and like our Facebook page.

    MapReduce / Map Reduction Strategies Using C#

    Guest blog post by Jake Drew

    A Brief History of Map Reduction

    Map and Reduce functions can be traced all the way back to functional programming languages such as Haskell and its polymorphic map function known as fmap.  Even before fmap there was the Haskell map command, used primarily for processing lists.  I am sure there are experts out there on the very long history of MapReduce who could provide all sorts of interesting information on that topic and on the influences of both Map and Reduce functions in programming.  However, the purpose of this article is to discuss effective strategies for performing highly parallel map reductions in the C# programming language.  There are many large-scale packages out there for performing map reductions of just about any size.  Google's MapReduce and Apache's Hadoop platforms are two of the most well known, but there are many competitors in this space.  MapReduce concepts are claimed by some to be around 25 years old.  Strangely enough, the patent for MapReduce is currently held by Google, which filed for it in 2004.  Google says that MapReduce was developed for "processing large amounts of raw data, for example, crawled documents or web request logs".

    Understanding Map Reduction

    In more complex forms, map reduction jobs are broken into individual, independent units of work and spread across many servers, typically commodity hardware units, in order to transform a very large and complicated processing task into something that is much less complicated and easily managed by many computers connected together in a cluster.  In layman's terms, when the task at hand is too big for one person, then a crew needs to be called in to complete the work.  Typically, a map reduction "crew" would consist of one or more multi-processor nodes (computers) and some type of master node or program that manages the effort of dividing up the work between nodes (mapping) and the aggregation of the final results across all the worker nodes (reduction).  The master node or program could be considered the map reduction crew's foreman.  In actuality, this explanation is an over-simplification of most large map reduction systems.  In these larger systems, many additional indexing, i/o, and other data management layers could be required depending on individual project requirements.  However, the benefits of map reduction can also be realized on a single multi-processor computer for smaller projects.

    The primary benefits of any map reduction system come from dividing up work across many processors and keeping as much data in memory as possible during processing.  Elimination of disk i/o (reading data from and writing data to disk) represents the greatest opportunity for performance gains in most typical systems.  Commodity hardware machines each provide additional processors and memory for data processing when they are used together in a map reduction cluster.  When a cluster is deployed however, additional programming complexity is introduced.  Input data must be divided up (mapped) across the cluster's nodes (computers) in an equal manner that still produces accurate results and easily lends itself to the aggregation of the final results (reduction).  The mapping of input data to specific cluster nodes is in addition to the mapping of individual units of input data work to individual processors within a single node.  Reduction across multiple cluster nodes also requires additional programming complexity.  In all map reduction systems, some form of parallel processing must be deployed when multiple processors are used.  Since parallel processing is always involved during map reduction, thread safety is a primary concern for any system.  Input data must be divided up into individual independent units of work that can be processed by any worker thread at any time during the various stages of both mapping and reduction.  Sometimes this requires substantial thought during the design stages since the input data is not necessarily processed in a linear fashion.  When processing text data for instance, the last sentence of a document could be processed before the first sentence of a document since multiple worker threads simultaneously work on all parts of the input data.

    The following figure illustrates a map reduction process running on a single multi-processor computer.  During this process, multiple worker threads are simultaneously mapped to various portions of the input data placing the mapping results into a centralized location for further downstream processing by other reduction worker threads.  Since this process occurs on a single machine, mapping is less complex because the input data is only divided between worker threads and processors that all reside on the same computer and typically within the same data store.

    Map Reduction On a Single Computer

    When multiple computers are used in a map reduction cluster, additional complexity is introduced into the process.  Input data must be divided between each node (computer) within the cluster by a master node or program during processing.  In addition, reduction results are typically divided across nodes and indexed in some fashion so mapping results can quickly be routed to the correct reduction node during processing.  The need for clustering typically occurs when input data, mapping results, reduction results, or all three are too large to fit into the memory of a single computer.  Once any part of the map reduction process requires disk i/o (reading data from and writing data to disk), a huge performance hit occurs.  It is very important to stress that this performance hit is severe: disk access is orders of magnitude slower than memory access.  If you are still skeptical, please stop reading and take a quick lesson from the famous Admiral Grace Hopper here.  Obviously, some form of disk i/o is required to permanently save results from any program.  In a typical map reduction system however, disk i/o should be minimized or totally removed from all mapping and reduction processing and used only to persist or save the final results to disk when needed.

    The following figure illustrates a map reduction cluster running on four machines.  In this scenario, one master node is used to divide up (map) input data between three data processing nodes for eventual reduction.  One common challenge when designing clusters is that not all of the reduction data can reside in memory on one physical machine.  In an example map reduction system that processes text data as input and counts unique words, all words beginning with A-I might be stored in node 1, J-R in node 2, and S-Z in node 3.  This means that additional routing logic must be used to get each word to the correct node for a final reduction based on each word's first letter.

    Map Reduction Cluster

    When reduction or mapping results are located on multiple clustered machines, additional programming logic must be added to access and aggregate results from each machine as needed.  In addition, units of work must be allocated (mapped) to these machines in a manner that does not impact the final results.  During the identification of phrases, for instance, one sentence should not be split across multiple nodes, since phrases could then span nodes and be missed during phrase identification.

    Map reduction systems can range in size from one computer to literally thousands of clustered computers in some enterprise level processes.  The C# programming language provides a suite of thread-safe objects that can easily and quickly be used to create map reduction style programs.  The following sections describe some of these objects and show examples of how to implement robust parallel map reduction processes using them.

    Understanding Map Reduction Using C#

    The C# programming language provides many features and classes that can be used to successfully perform map reduction processing as described in the sections above.  In fact, certain forms of parallel map reduction in C# can be performed by individuals having a minimal knowledge of thread pools or hardware specific thread management practices.  Other lower level tools however, require a great knowledge of both.  Regardless of the tools chosen, great care must be taken to avoid race conditions when parallel programs are deployed within a map reduction system.  This means that the designer must be very familiar with best demonstrated practices for both locking and multi-threaded programming when creating the portions of mapping and reduction programs that will be executed in parallel.  For those who need assistance in this area, a detailed article on threading in C# can be located here.

    One of the most important things to remember is that just because a particular C# object is considered "thread safe", the commands used to calculate a value that is passed to the "thread safe" object, or the commands passed within a delegate to the "thread safe" object, are not necessarily "thread safe" themselves.  If a particular variable's scope extends outside the critical section of parallel execution, then some form of locking strategy must be deployed during updates to avoid race conditions.  One of the easiest ways to test for race conditions or threading errors in a map reduction program is to simply execute the program using the same set of input data multiple times.  Typically, the program's results will vary when a race condition or threading error is present.  However, the error might not present itself after only a few executions.  It is important to exhaustively test the program using many different sets of test data as input, and then execute the program many times against each input data set, checking for output data variations each time.

    The specific C# classes described later in this document do not represent the only alternatives for performing parallel map reductions in the language.  The selected classes merely represent a few of the viable techniques worth consideration.  For instance, one available approach that is not covered in this document is PLINQ, which is C#'s parallel implementation of LINQ to Objects.  Numerous other C# tools are available as well.  It is important to mention that the map reduction patterns described above are sometimes referred to as, and are very similar to, what are known as producer / consumer pipelines.  Many great articles can be located on the internet when producer / consumer pipelines and C# are used together as search terms.

    The Map Reduction Nuts and Bolts

    Using the pattern described earlier, several basic C# components can be repeatedly used (and sometimes extended) to create a map reduction system of virtually any size.  The following high level C# components and classes will be used as the "nuts and bolts" of this particular system:

    • Parallel.For and Parallel.ForEach -  These two members of the System.Threading.Tasks namespace can be used to quickly create mapping functions that execute in parallel.  The commands executed within these "For" blocks must be thread safe.  Parallel mapping functions can be used to break apart input data into mapping results that are placed in a Blocking Collection for further downstream processing.
    • Blocking Collections - Blocking Collections are members of the System.Collections.Concurrent namespace and provide a centralized, thread safe location for multiple threads to add and remove objects during processing.  These collections can be implemented using concurrent bag (not ordered), stack (LIFO), or queue (FIFO) collections.  Thread safe versions of each collection are provided within the System.Collections.Concurrent namespace.  Once the Blocking Collection has been wrapped around the appropriate bag, stack, or queue, it will manage timing differences between various producers and consumers using the collection.  When the collection is empty it can block until new items are added or stop processing once all items have been processed and the collection is marked as complete.
    • Concurrent Dictionary - The thread safe Concurrent Dictionary will act as a key-value pair repository for the final reduction results in our process.  Although a database could be used for this part of the process, the Concurrent Dictionary is an ideal reduction repository candidate for several basic reasons that are explained in detail within the reduction section and examples below.  This dictionary is also a member of the System.Collections.Concurrent namespace.


    Parallel Mapping Using C#

    One of the easiest ways to implement a parallel map function in C# is by using the Parallel class.  Specifically Parallel.For or Parallel.ForEach can be used to quickly map (in parallel) independent units of work into a centralized, thread-safe collection (we will get to thread-safe collections in a second) for further downstream processing.  In addition, the class can perform parallel mappings with no lower level thread management programming required.  The Parallel class is hardware intelligent and scales threads based on the current platform it is executing on.  However, it also has the MaxDegreeOfParallelism option for those who want more control over how many threads a particular Parallel class process is using.  The primary purpose of the map function is to break apart input data producing one or more key-value pairs that require reduction processing by other downstream worker threads.  While mapping worker threads are producing these key-value pairs as output, reduction worker threads are simultaneously consuming (reducing) them.  Depending on the size of the map reduction process, other intermediary processes such as a partitioning process might occur between a mapping and its final reduction.

    Yield Return Blocks

    In a C# example application that counts unique words from large amounts of input text, one or more stages of mapping could be used to produce the final mapping key-value pair data output.  In order for both mapping and reduction worker threads to work in tandem, all phases of the mapping process must be executed asynchronously.  This can be accomplished by using either background threads, some form of Yield Return processing, or both.

    The following code demonstrates the use of Yield Return processing to break up input text into blocks of 250 characters or less using the space character as a word delimiter:
    Yield Return Mapping
    As each 250 character (or less) block of text is identified, the "yield return" command causes the process to "yield" and immediately return the identified text block to the calling process.  Under normal circumstances, all identified blocks of text would be returned at one time when the entire process was completed.  This would also mean that other downstream worker threads could not begin work on mapping the blocks of text into individual words until all text block identification was complete.  The delay would slow down the entire process greatly.  A yield return method for producing text blocks is not necessarily required for counting unique words in a map reduction system.  However, this code will be used to demonstrate how yield return can be used and subsequently called using Parallel.ForEach to complete the mapping of text to individual words.
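    Since the original post embeds its code as images, here is a minimal sketch of such a yield-return block producer. The method name produceWordBlocks and the 250-character limit come from the article; the exact implementation details below are assumptions.

```csharp
using System;
using System.Collections.Generic;

static class Mapper
{
    // Yields blocks of input text no longer than 250 characters,
    // split on spaces so words are not cut in half where possible.
    public static IEnumerable<string> produceWordBlocks(string text)
    {
        int start = 0;
        while (start < text.Length)
        {
            int length = Math.Min(250, text.Length - start);
            if (start + length < text.Length)
            {
                // back up to the last space so the block ends on a word boundary
                int lastSpace = text.LastIndexOf(' ', start + length - 1, length);
                if (lastSpace > start) length = lastSpace - start + 1;
            }
            // "yield return" hands this block to the caller immediately,
            // before the rest of the text has been scanned
            yield return text.Substring(start, length);
            start += length;
        }
    }
}
```

Because the method is lazy, a downstream Parallel.ForEach can begin consuming the first block while later blocks are still being identified.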

    Blocking Collections

    When all stages of mapping are completed, the final results are added to a mapping results Blocking Collection.  Using our C# example application that counts unique words from large amounts of input text, a mapping results Blocking Collection is created called "wordChunks".  This particular Blocking Collection uses a ConcurrentBag as its base collection.  Since words are added to and removed from the ConcurrentBag in no particular order, using a "bag" yields performance gains over a "stack" or "queue" which must internally keep track of processing order.  The following code shows how the "wordChunks" Blocking Collection is created:

    Blocking Collection
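    The article shows this declaration as a code image; a minimal sketch (the wordChunks name comes from the article, the rest is a plain reconstruction) might be:

```csharp
using System.Collections.Concurrent;

// A thread-safe, unordered collection shared by mapping producers and
// reduction consumers; a ConcurrentBag is used because word order does
// not matter when counting frequencies, which avoids the bookkeeping
// overhead of an ordered stack or queue.
BlockingCollection<string> wordChunks =
    new BlockingCollection<string>(new ConcurrentBag<string>());
```
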

    Technically a mapping function's output should be a key-value pair.  The key-value pair typically contains some form of key and associated numeric value that are both used during the reduction process.  In many cases, the key will only be contained one time within the final reduction results.  The values for each duplicate key encountered during mapping will be "reduced" by either summation or another mathematical calculation that results in one final reduction number value that is representative of the single key contained in the final key-value pair reduction results.  In our example word counting map reduction, a key-value pair is not even required for the mapping stage. The wordChunks bag can contain any number of words (given your current memory constraint).  These words can also be duplicates.  Since we are only counting the occurrence of words, each word in our bag is considered to have a default frequency of 1.  However, the ConcurrentBag could have just as easily been created as a collection of key-value pairs (ConcurrentBag<KeyValuePair<string,int>>), if needed.

    Parallel.ForEach Processing

    The next program demonstrates a Parallel.ForEach mapping function using the Yield Return method created before.  This process uses multiple worker threads to identify and clean words from the blocks of input text provided by the Yield Return method.   The Parallel.ForEach mapping process begins as soon as the first block of text is identified since Yield Return is being used.


    The program above uses Parallel.ForEach to call the text block production program "produceWordBlocks".  This program immediately yield returns blocks of text less than 250 characters in length and delimited by spaces as they are identified.  Parallel.ForEach worker threads simultaneously process these text blocks, identifying individual words which are also delimited by spaces.  The program also removes any whitespace, punctuation, or control characters located within the words.  Obviously, this is an example program, and many other character filtering or inclusion enhancements could be made depending on your particular requirements.  In an alternative implementation, the Yield Return method could be removed entirely and its functionality included in a single Parallel.ForEach mapping program.  This may or may not produce better performance results depending on your code, the input data, and the requirements of your system.

    Once all individual words have been identified from all word blocks, the wordChunks Blocking Collection is notified that no more words will be added to the collection.  This notification is very important for any downstream worker threads that are simultaneously reducing / removing words from the collection.  If the Blocking Collection becomes empty during processing, collection consumers will continue to "block" or "wait" until either the CompleteAdding() method is called or additional items are added into the collection.  The Blocking Collection is able to successfully manage differences in the object production and consumption speeds of various worker threads using this approach.  In addition, a Blocking Collection Bounding Capacity can be added to ensure that no more than a maximum number of objects can be added to the collection at any given time.
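    Putting the description above together, a hedged reconstruction of such a mapping method might look like this. The names mapWords and produceWordBlocks and the call to CompleteAdding() follow the article; the character filtering and signatures are assumptions, and a trivial stand-in producer is included so the sketch is self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class Mapper
{
    // minimal stand-in for the article's yield-return text block producer
    static IEnumerable<string> produceWordBlocks(string text)
    {
        yield return text;
    }

    public static void mapWords(string text, BlockingCollection<string> wordChunks)
    {
        // worker threads start consuming blocks as soon as the first
        // one is yield-returned by produceWordBlocks
        Parallel.ForEach(produceWordBlocks(text), block =>
        {
            foreach (string raw in block.Split(' '))
            {
                // strip whitespace, punctuation, and control characters
                string word = new string(raw.Where(char.IsLetterOrDigit).ToArray());
                if (word.Length > 0)
                    wordChunks.Add(word);
            }
        });

        // tell downstream reduction consumers that no more words are coming,
        // so they stop blocking once the collection drains
        wordChunks.CompleteAdding();
    }
}
```
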

    Parallel Reduction Using C#

    The process of parallel reduction is very similar to mapping with regards to the use of Parallel.ForEach to facilitate its process in our example application.  Where reduction differs however, is in its use of one or more data storage components that provide very fast access to a particular set of reduction key-value pairs.  When all data storage components are combined, the reduction key-value pairs eventually become the final output for map reduction once all input data has been processed. In a system where multiple map reduction computers are used together in a map reduction cluster, multiple data storage components could be used to store a very large number of key-value pairs across several computers.  In the example word counting map reduction process, the reduction key-value pairs consist of a unique list of words which act as keys for the reduction key-value pairs.  Each word key contains a value that represents the total frequency of occurrences for that particular word within the input text.  The final result becomes a mapping of all words contained in the input data to a reduced list of unique words and the frequency that each unique word occurs within the input data.

    The Concurrent Dictionary

    The C# ConcurrentDictionary is basically a thread safe hash table that is well suited for acting as the data store component within the example application.  The ConcurrentDictionary holds key-value pairs in memory until there is either no room left in memory or the dictionary already contains the maximum number of elements.  Since the ConcurrentDictionary takes a 32-bit hash of each key, the maximum number of elements is the same as the Int32.MaxValue or 2,147,483,647.  Most computers and processes will run out of memory prior to triggering the ConcurrentDictionary's overflow exception due to exceeding the maximum number of elements.  In situations where very large amounts of data are being map reduced, the map reduction data components (in this case Concurrent Dictionaries) can be sharded across several computers within a cluster.  However, sharding requires a slightly more complex map reduction process since key partitioning logic must be developed to manage the associated sharding challenges such as what nodes(computers) will contain what keys, node load balancing, node additions, node removals, and node failures.

    Some readers may obviously be asking why not use a database at this point, and this is a very valid question.  The short answer is that almost any data management solution could be used.  Databases, NoSQL databases, in-memory databases, or some form of key-value datastore could be implemented.  The most important thing to consider, however, is that most relational databases rely heavily on i/o to complete their work.  Any form of i/o during map reduction processing will most likely defeat the purpose of map reduction altogether.  So whatever data management solution is chosen, just make sure that your data is being stored in memory.  Even most relational databases now have some form of in-memory tables and clustering abilities.

    Parallel.ForEach Reduction Processing

    During reduction processing Parallel.ForEach is used once again to create multiple worker threads that simultaneously consume mapping results as they are being created and added to the "wordChunks" Blocking Collection.  Worker threads reduce all mapping results in the example application by looking up each mapped word within the reduction data store component (one or more ConcurrentDictionaries in this case).  If a word already exists in the data store, then the mapping word is reduced by incrementing the existing reduction word's key-value pair value by 1.  Otherwise, the mapping word is reduced by creating a new key-value pair entry in the reduction data store with a starting value of 1.  The following code demonstrates how this process works:

    Reduction Parallel.ForEach

    A ConcurrentDictionary was created to hold our final reduction processing results.  Furthermore, the AddOrUpdate() method was taken advantage of during Parallel.ForEach processing.  It is important to mention once again that delegates provided to a thread safe object are not necessarily thread safe themselves.  In this case, the AddOrUpdate method accepts a delegate that provides the update commands to execute when a key is present in the wordStore dictionary.  To ensure that the update is performed in a thread safe manner, Interlocked.Increment is used to increment existing values by 1 as an atomic operation each time they are encountered.  The Parallel.ForEach process executes against the wordChunks Blocking Collection, removing mapping results (words) from the collection until all words have been processed.  The Blocking Collection will also cause the Parallel.ForEach reduction process to "block" or wait for additional words when the collection becomes empty and the CompleteAdding() method has not yet been called by the producer (the mapWords method in our example program).  Using the Blocking Collection's GetConsumingEnumerable() method in the Parallel.ForEach loop is one way to trigger the blocking behavior.
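    A hedged reconstruction of the reduction loop described above (the names wordChunks and wordStore follow the article; the exact method signature is an assumption):

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class Reducer
{
    public static void reduceWords(BlockingCollection<string> wordChunks,
                                   ConcurrentDictionary<string, int> wordStore)
    {
        // GetConsumingEnumerable() blocks while the collection is empty and
        // finishes only after the producer calls CompleteAdding()
        Parallel.ForEach(wordChunks.GetConsumingEnumerable(), word =>
        {
            wordStore.AddOrUpdate(
                word,
                1, // new key: first occurrence of this word
                   // existing key: atomic increment inside the update delegate
                (key, existing) => Interlocked.Increment(ref existing));
        });
    }
}
```
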

    The C# Map Reduction Summary

    The previous figure of a map reduction process running on a single multi-processor computer can now be updated to reflect the C# objects and classes discussed in our example application.  Using only a few key C# components, parallel map reduction can be performed with minimal effort when compared to creating parallel map reduction processes in other languages.  The figure below represents a map reduction process written in C# and running on a single multi-processor computer:

    MapReduce In C#

    Once all components of the map reduction process have been created, a small mapReduce method is written to bring the entire process together.  One of the most important parts of the mapReduce method is to create a background process for execution of the mapping function.  While the mapping function populates the Blocking Collection with mapping results in the background, the reduction function simultaneously removes / reduces the mapping results into the reduction data store.  Since all of this processing occurs in memory, the mapReduce method is extremely fast.  The following mapReduce program ties together the entire map reduction process for our example application:
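    The wiring described above might look roughly like this self-contained sketch. Using Task.Run for the background mapping stage is an assumption consistent with the description, and the mapping and reduction stages are inlined so the example compiles on its own.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class WordCounter
{
    public static ConcurrentDictionary<string, int> mapReduce(string text)
    {
        var wordChunks = new BlockingCollection<string>(new ConcurrentBag<string>());
        var wordStore = new ConcurrentDictionary<string, int>();

        // mapping runs in the background, filling wordChunks with cleaned words...
        Task mapping = Task.Run(() =>
        {
            Parallel.ForEach(text.Split(' '), raw =>
            {
                string word = new string(raw.Where(char.IsLetterOrDigit).ToArray());
                if (word.Length > 0) wordChunks.Add(word);
            });
            wordChunks.CompleteAdding(); // reducers finish once the bag drains
        });

        // ...while reduction simultaneously drains wordChunks into wordStore
        Parallel.ForEach(wordChunks.GetConsumingEnumerable(), word =>
            wordStore.AddOrUpdate(word, 1, (k, v) => Interlocked.Increment(ref v)));

        mapping.Wait();
        return wordStore;
    }
}
```

Because both stages share the Blocking Collection, no intermediate results ever touch disk, which is the whole point of the in-memory design described above.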

    MapReduce Method

    Printing the results of your map reduction is also as simple as printing the contents of your wordStore dictionary:

    Display MapReduce Results
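    A sketch of such a printer (the wordStore type follows the article's ConcurrentDictionary; the method name is invented):

```csharp
using System;
using System.Collections.Concurrent;

static class Output
{
    // dump every unique word and its frequency from the final reduction store
    public static void printResults(ConcurrentDictionary<string, int> wordStore)
    {
        foreach (var pair in wordStore)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```
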


    Map reduction processing provides an innovative approach to the rapid consumption of very large and complex data processing tasks.  The C#  language is also very well suited for map reduction processing.  This type of processing is described by some in the C# community as a more complex form of producer / consumer pipelines. Typically, the largest potential constraint for any map reduction system is i/o.  During map reduction processing, i/o should be avoided at all costs and used only for saving final map reduction results, if possible.  When the processes described in this document are combined with data store sharding and map reduction partitioning processes, data processing tasks of most any size can be accommodated.  If you actually read this far, I'm impressed.  Thank you!



    Analysts use tools to perform various types of spatial analysis such as:

    • Where is the cheapest home insurance within 50 miles of Dallas?
    • What locations are most attractive given the income, population, and other demographics of places within a 25-mile radius of New York City?
    • Which zip codes have the highest crime rates within 25 miles of Chicago?

    However, when we try to convert this analysis into a real-life operational system with high traffic, low latency, and little room for error, most tools that perform well offline don't live up to expectations.

    Amazon Web Services, the cloud computing platform from Amazon, is one of the leaders in providing cloud-based hosting solutions. Over the years, Amazon has steadily added software services that take advantage of its hardware platform. One such search service is "AWS Cloud Search", which is itself used to power Amazon's high-performance e-commerce search. Now the same technology can be used by other customers for searching.

    The case study below shows how AWS Cloud Search can be used to perform geo-searching and spatial analytics to find the cheapest home insurance within a given area. The heat map below shows the varying home property values (and with them the corresponding home insurance prices) across the nation.


    The sample data underlying the above heat map is given in the table below.

    The challenge, from a real-time operational geo-analytical spatial search perspective, is to find the cheapest home insurance within a 200-mile radius of San Francisco.

    To do this with AWS Cloud Search, we first need to set up the search domain based on the following broad activities:

    • Search domain creation and configuration
    • Data upload to your search domain; the indexed fields in our data include home insurance rates, home prices, zip codes, and lat/long details
    • Data search within AWS Cloud Search and control of the search results

    For calculating distances between places, we use Cosine search. Details on the math & logic behind the cosine search can be found here.

    Once we have indexed the documents and want to return all places within a 200-mile radius of San Francisco, the query is as follows.

    dis_rank="&rank-dis=acos(sin(37.7833)*sin(3.141*lat/(1000000*180))%2Bcos(37.7833)*cos(3.141*lat/(1000000*180))*cos(-122.4167-(-3.141*(long-18100000)/(100000*180))))*6371*0.6214" ;threshold=”&t-dis=..200”

    The query uses the San Francisco lat/long coordinates given below.

    • latitude =37.7833
    • longitude = -122.4167
    • radius = 200
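    For reference, the great-circle math that the rank expression above encodes can be sketched in plain code. The 6371 km earth radius and the 0.6214 km-to-miles factor come from the query itself; the helper below is a generic illustration, not AWS API code, and unlike the query it converts degrees to radians explicitly.

```csharp
using System;

static class GeoDistance
{
    // great-circle distance in miles via the spherical law of cosines,
    // mirroring acos(sin*sin + cos*cos*cos(delta long)) * 6371 * 0.6214
    public static double Miles(double lat1, double lon1, double lat2, double lon2)
    {
        double rad = Math.PI / 180.0;
        double x = Math.Sin(lat1 * rad) * Math.Sin(lat2 * rad) +
                   Math.Cos(lat1 * rad) * Math.Cos(lat2 * rad) *
                   Math.Cos((lon1 - lon2) * rad);
        // guard against floating-point drift pushing the argument past 1
        return Math.Acos(Math.Min(1.0, x)) * 6371 * 0.6214;
    }
}
```

Any candidate whose distance from (37.7833, -122.4167) comes out at 200 miles or less would pass the &t-dis=..200 threshold in the query.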

    When you pass this query to AWS Cloud Search, the speed of the response is on par with the best search engines in the world. The tuning and maintenance needed to operationalize such performance would take teams years to deliver. So, when you do a cost-benefit analysis on operationalizing your real-time spatial analytics, consider outsourcing key parts of your infrastructure to the search infrastructure that powers "Earth's Largest Store"!


This analysis was written by a home insurance data analytics service.

Join us March 3, 2015 at 9am PST for the latest in DSC's Webinar Series: Better Risk Management with Apache Hadoop and Red Hat, sponsored by Hortonworks and Red Hat. Financial firms are operating under significantly increased regulatory requirements in the wake of the 2008 financial crisis. The risk management systems each firm operates must not only respond to new reporting requirements but also handle ever-growing amounts of data to perform more comprehensive analysis. Existing systems that aren't designed to scale to today's requirements can't finish reporting before the start of trading. And many of these systems are inflexible and expensive to operate.

    The curse of big data

    Originally posted on Data Science Central, by Dr. Vincent Granville

This seminal article highlights the dangers of recklessly applying and scaling data science techniques that have worked well for small, medium-size and large data. We illustrate the problem with flaws in big data trading, and propose solutions. We also believe that expert data scientists are more abundant (but very different) than hiring companies claim (see the related articles in the original post for details). This article is written in simple English, is very short, and contains both high-level material for decision makers and deeper technical explanations where needed.

In short, the curse of big data is this: when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power. Even worse, the strongest patterns might be:

• entirely caused by chance (just as a lottery winner wins purely by chance),
• not replicable,
• devoid of predictive power,
• yet obscuring weaker patterns that are ignored but have strong predictive power.

The question is: how do you discriminate between a real and an accidental signal in vast amounts of data?

Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000 * 999 / 2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google two days ago), then we are dealing with many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose. Keep in mind that analyzing cross-correlations across all metrics is one of the very first steps statisticians take at the beginning of any project - it is part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provides a much more robust mechanism for identifying true relationships.

To illustrate the issue, let's say that you have k time series, each with n observations - for instance, price deltas (price increases or decreases) computed for k different stock symbols, with various time lags, over the same time period of n days. For instance, you want to detect patterns such as "when Google stock price goes up, Facebook goes down one day later". To detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds, depending on whether you do daily trading or extremely fast intraday, high-frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (once your competitors detect a pattern, it stops being profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year period). In some cases n might be much larger, but then the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k is moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.

    Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.

    Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?

Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is, approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p = P(r > a), computed under a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has a correlation above a=0.80 is thus 1 - (1-p)^m, that is, 1 minus (1-p) raised to the power m.

    For instance,

• If n=20 and m=10,000 (10,000 paired time series, each with 20 observations), the chance that at least one correlation exceeds a=0.80 by luck alone - that is, the chance that your conclusion is wrong - is 90.93%.
• If n=20 and m=100,000 (still a relatively small value for m), the chance that at least one correlation exceeds a=0.90 - making your conclusion VERY wrong - is 98.17%.
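These probabilities follow directly from the formula above (p = P(r > a) with r approximately N(0, 1/(n-1)), then 1 - (1-p)^m). A quick Python check using only the standard library; the exact tail values differ slightly from the article's calculator-based figures:

```python
# Chance that at least one of m independent correlation estimates exceeds a,
# when the true correlation is 0: r is approximately N(0, 1/(n-1)).
from math import erfc, sqrt

def prob_spurious_correlation(n, m, a):
    """P(max of m estimated correlations > a) for white-noise series of length n."""
    # Tail probability of N(0, 1/(n-1)) beyond a, via the complementary error function:
    p = 0.5 * erfc(a * sqrt(n - 1) / sqrt(2))
    return 1 - (1 - p) ** m

print(prob_spurious_correlation(20, 10_000, 0.80))   # ~0.91, matching the first bullet
print(prob_spurious_correlation(20, 100_000, 0.90))  # ~0.99, close to the second bullet
```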

Now, in practice, it works as follows: you have k metrics or variables, each providing a time series computed at n different time intervals. You compute all cross-correlations, so m = k*(k-1)/2. However, the assumption of independence between the m paired time series is now violated, concentrating correlations further away from a very high value such as a=0.90. Also, your data is not random numbers - it is not white noise - so the theoretical correlations are well above 0 in absolute value, maybe around 0.15 when n=20. And m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say, one time series per stock price). These three factors (non-independence, theoretical r different from 0, very large m) balance out and make the above computations still quite accurate when applied to a typical real big data problem. Note that I computed my probabilities using the online calculator at Stat Trek.

Conclusion: hire the right data scientist before attacking big data problems. He or she does not need to be highly technical, but must be able to think along the lines of the above argument, identifying possible causes of model failure before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need advanced knowledge of statistics. Being a computer scientist also helps, to scale your algorithms and make them simple and efficient. Being an MBA analyst helps, too, in understanding the problem that needs to be solved. Being all three at once is far better - and yes, such people exist and are not that rare.


Exercise: let X, Y, Z be three random variables with corr(X,Y) = 0.70 and corr(X,Z) = 0.80. What is the minimum possible value of corr(Y,Z)? Can this correlation be negative?
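One way to check this exercise: the 3x3 correlation matrix of (X, Y, Z) must be positive semidefinite, and setting its determinant to zero gives the extreme values of corr(Y,Z). A Python sketch (the helper name is mine):

```python
# For a correlation matrix [[1,a,b],[a,1,c],[b,c,1]], det >= 0 forces
# c in [a*b - sqrt((1-a^2)(1-b^2)), a*b + sqrt((1-a^2)(1-b^2))].
from math import sqrt

def corr_yz_bounds(xy, xz):
    """Feasible range of corr(Y,Z) given corr(X,Y)=xy and corr(X,Z)=xz."""
    spread = sqrt((1 - xy**2) * (1 - xz**2))
    return xy * xz - spread, xy * xz + spread

lo, hi = corr_yz_bounds(0.70, 0.80)
print(round(lo, 4))  # 0.1315 -> the minimum is positive, so corr(Y,Z) cannot be negative
```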



    Guest blog post by ajit jaokar

    Often, Data Science for IoT differs from conventional data science due to the presence of hardware.

Hardware can be involved in integration with the Cloud, or in processing at the Edge (which Cisco and others have called Fog Computing).

Alternatively, we see entirely new classes of hardware specifically involved in Data Science for IoT (such as IBM's SyNAPSE chip for Deep Learning).

    Hardware will increasingly play an important role in Data Science for IoT.

A good example is from a company called CogniMem, which natively implements classifiers in hardware (unfortunately, the company no longer appears to be active, judging by its Twitter feed).

    In IoT, speed and real time response play a key role. Often it makes sense to process the data closer to the sensor.

This allows a limited/summarized data set to be sent to the server if needed, and also allows localized decision making. This architecture leads to a flow of information out of the Cloud and to the storage of information at nodes that may not reside in the physical premises of the Cloud.

In this post, I try to explore the various hardware touchpoints where data analytics and IoT work together.

    Cloud integration: Making decisions at the Edge

The Intel Wind River edge management system is certified to work with the Intel stack and includes capabilities such as data capture, rules-based data analysis and response, configuration, file transfer, and remote device management.

The integration of Google Analytics into Lantronix hardware allows sensors to send real-time data to any node on the Internet or to a cloud-based application.

Microchip's integration with Amazon Web Services uses an embedded application with the Amazon Elastic Compute Cloud (EC2) service, based on a Wi-Fi Client Module Development Kit. Languages like Python or Ruby can be used for development.

The integration of Freescale and Oracle consolidates data collected from multiple appliances from multiple Internet of Things service providers.


Libraries are another avenue for analytics engines to be integrated into products - often at the point of creation of the device. Xively Cloud Services is an example of this strategy, through the Xively libraries.


In contrast, some services provide APIs that let IoT devices create their own analytics engines (for example, the Pebble smartwatch) without locking equipment providers into a particular data architecture.

    Specialized hardware

We see increasing deployment of specialized hardware for analytics - for example, Egburt from Camgian, which uses sensor fusion technologies for IoT.

In the Deep Learning space, GPUs are widely used, and more specialized hardware is emerging, such as IBM's SyNAPSE chip. Even more interesting hardware platforms are appearing, such as Nervana Systems, which creates hardware specifically for neural networks.

    Ubuntu Core and IFTTT spark

Two more initiatives on my radar deserve a space of their own - even though neither currently has an analytics engine: Ubuntu Core (Docker containers plus a lightweight Linux distribution as an IoT OS) and the IFTTT Spark initiative.

    Comments welcome

This post is leading toward a vision for a Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in February.

Image source: CogniMem


    Internet of Things and Bayesian Networks

    Guest blog post by Punit Kumar Mishra

As big data becomes more of a cliché with every passing day, do you feel the Internet of Things is the next marketing buzzword set to grip our lives?

So what exactly is the Internet of Things (IoT), and why are we going to hear more about it in the coming days?

The Internet of Things (IoT) today denotes advanced connectivity of devices, systems and services that goes beyond machine-to-machine communications and covers a wide variety of domains and applications, particularly in manufacturing and in the power, oil and gas utilities.

An application of IoT can be an automobile with built-in sensors that alert the driver when tyre pressure is low, or built-in sensors on equipment in a power plant that transmit real-time data and thereby enable better transmission planning and load balancing. In the oil and gas industry, it can help plan better drilling and track cracks in gas pipelines.

IoT will lead to better predictive maintenance in manufacturing and utilities, and this will in turn lead to better control, tracking, monitoring and backup of processes. Even a small percentage improvement in machine performance can significantly benefit a company's bottom line.

IoT, in some ways, is going to make our machines more brilliant and reactive.

According to GE, $150 billion in waste across major industries can be eliminated by IoT.

A natural question is how IoT differs from the SCADA (supervisory control and data acquisition) systems used extensively in the manufacturing industries.

IoT can be considered an evolution of the data acquisition part of SCADA systems.

SCADA systems have basically been silos, with data accessible to few people and not leading to long-term benefit.

    IoT starts with embedding advanced sensors in machines and collecting the data for advanced analytics.

As we start receiving data from the sensors, one important aspect that needs focus is whether the transmitted data is correct or erroneous.

How do we validate data quality?

We are dealing with uncertainty here.

    One of the most commonly used methods for modelling uncertainty is Bayesian networks.

A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph.

Bayesian networks can be used extensively in Internet of Things projects to validate the data transmitted by sensors.
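As a minimal illustration of the idea, a two-node network (Fault -> AnomalousReading) already lets us ask how likely a sensor fault is given an anomalous reading. All probabilities below are assumptions for illustration, not measured values:

```python
# Minimal two-node Bayesian network: Fault -> AnomalousReading.
# All probabilities are illustrative assumptions.
P_FAULT = 0.01              # prior: sensor transmits erroneous data
P_ANOM_GIVEN_FAULT = 0.90   # a faulty sensor produces an anomalous reading
P_ANOM_GIVEN_OK = 0.05      # a healthy sensor occasionally looks anomalous

def p_fault_given_anomaly():
    """Posterior P(fault | anomalous reading), via Bayes' rule."""
    p_anom = (P_ANOM_GIVEN_FAULT * P_FAULT
              + P_ANOM_GIVEN_OK * (1 - P_FAULT))
    return P_ANOM_GIVEN_FAULT * P_FAULT / p_anom

print(round(p_fault_given_anomaly(), 3))  # 0.154
```

Even a highly anomalous reading only raises the fault probability to about 15% here, because false alarms from healthy sensors dominate - exactly the kind of reasoning a full Bayesian network extends to many interdependent sensors.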


Here I will discuss a general framework to process web traffic data; the concept of Map-Reduce will arise naturally. Let's say you want to design a system to score Internet clicks - to measure the chance that a click converts, or the chance that it is fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, with a conversion rate above 10%; some conversions are hard, for instance a credit card purchase, with a conversion rate below 1%). Here, for now, we just ignore the conversion data and focus on the low-hanging fruit: click data. Another valuable data source is impression data (for instance, a click not associated with an impression is very suspicious), but impression data is huge - 20 times bigger than click data - so we ignore it here.

Here, we work with complete click data collected over a 7-day period. Let's assume the data set holds 50 million clicks. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, involves clusters (small and large) of affiliates, and tons of IP addresses with few clicks per IP per day (low frequency).

The data set (ideally a tab-separated text file, since CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword, blended together, argh...), referral (actual referral domain or ad exchange domain, blended together, argh...), user agent (UA, a long string; the UA is also known as the browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city, and a bunch of other parameters.

    The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:

    • IP address
    • Day
    • UA (user agent) ID - so we created a look-up table for UA's
    • Partner ID
    • Affiliate ID

    These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.

    Building a summary table: the Map step

    The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:

    • number of clicks
    • number of unique UA's
    • list of UA

    The list of UA's, for a specific bucket, looks like ~6723|9~45|1~784|2, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.
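A sketch of how such an encoded UA list could be parsed (the helper name is mine, not from the original article):

```python
# Parsing the compact UA-list encoding described above, e.g. "~6723|9~45|1~784|2":
# three browsers, 12 clicks, browser 6723 responsible for 9 of them.
def parse_ua_list(encoded):
    """Return {ua_id: click_count} from a '~id|count~id|count' string."""
    pairs = [item.split("|") for item in encoded.split("~") if item]
    return {int(ua_id): int(count) for ua_id, count in pairs}

ua_counts = parse_ua_list("~6723|9~45|1~784|2")
print(len(ua_counts))           # 3 unique UA's in this bucket
print(sum(ua_counts.values()))  # 12 clicks in total
```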

In Perl, these computations are easily performed as you browse the data sequentially: a hash table, $hash_clicks, keyed on the (IP, Day, UA ID, Partner ID, Affiliate ID) bucket, is incremented to update the click count.

Updating the list of UA's associated with a bucket is a bit less easy, but still almost trivial.

The problem is that at some point the hash table becomes too big and will slow your Perl script to a crawl. The solution is to split the big data set into smaller data sets (called subsets), and perform this operation separately on each subset. This is the Map step in Map-Reduce. You need to decide which fields to use for the mapping. Here, the IP address is a good choice because it is very granular (good for load balancing) and the most important metric. We can split the IP address field into 20 ranges based on the first byte of the IP address, resulting in 20 subsets. The split is easily done by browsing the big data set sequentially with a Perl script, looking at the IP field, and throwing each observation into the right subset based on its IP address.
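The article does this in Perl; the split can be sketched in Python as follows (the tab-separated field order and sample rows are illustrative):

```python
# Map step: split the click log into 20 subsets keyed on the first byte of
# the IP address, so that each subset's hash table stays small.
def subset_index(ip, n_subsets=20):
    """Map an IPv4 address to one of n_subsets ranges by its first byte."""
    return int(ip.split(".")[0]) * n_subsets // 256

def map_step(lines, n_subsets=20):
    """Throw each observation into the right subset based on its IP address."""
    subsets = [[] for _ in range(n_subsets)]
    for line in lines:
        ip = line.split("\t", 1)[0]  # IP address is the first field here
        subsets[subset_index(ip, n_subsets)].append(line)
    return subsets

clicks = ["12.7.5.1\t1\t6723\t44\t901", "231.9.8.4\t1\t45\t44\t902"]
subsets = map_step(clicks)  # first click lands in subset 0, second in subset 18
```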

    Building a summary table: the Reduce step

Now, after producing the 20 summary tables (one per subset), we need to merge them together. We can't simply use a hash table here, because it would grow too large and it won't work - the very reason we used the Map step in the first place.

Here's the workaround:

Sort each of the 20 subsets by IP address, then merge the sorted subsets to produce a big summary table T. Merging sorted data is very easy and efficient: loop over the 20 sorted subsets with an inner loop over the observations in each sorted subset, keeping 20 pointers - one per sorted subset - to track where you are in each subset at any given iteration.

Now you have a big summary table T, with multiple occurrences of the same atomic bucket, for many atomic buckets. Multiple occurrences of the same atomic bucket must be aggregated. To do so, browse table T sequentially (stored as a text file). You are going to use hash tables again, but small ones this time. Say you are in the middle of a block of rows corresponding to the same IP address (remember, T is ordered by IP address). Use a small hash table, $hash_clicks_small, keyed on (Day, UA ID, Partner ID, Affiliate ID), to update (that is, aggregate) the click count for each atomic bucket. Note one big difference between $hash_clicks and $hash_clicks_small: the IP address is not part of the key in the latter, resulting in hash tables millions of times smaller. When you hit a new IP address while browsing T, save the stats stored in $hash_clicks_small and its satellite small hash tables for the previous IP address, free the memory used by these hash tables, and re-use them for the next IP address found in table T, until you reach the end of T.
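The merge-and-aggregate logic can be sketched in Python (the article uses Perl; the tuple layout and sample rows here are illustrative):

```python
# Reduce step: merge the sorted subsets in one pass, aggregating click counts
# per atomic bucket and flushing a small hash table whenever the IP changes.
# Row layout (illustrative): (ip, day, ua_id, partner_id, affiliate_id), one per click.
import heapq
from collections import defaultdict

def reduce_step(sorted_subsets):
    """Yield ((ip, bucket_key), click_count); bucket_key excludes the IP."""
    current_ip, small_hash = None, defaultdict(int)
    for row in heapq.merge(*sorted_subsets):        # merged stream stays sorted by IP
        ip, bucket_key = row[0], row[1:]
        if ip != current_ip and current_ip is not None:
            for key, clicks in small_hash.items():  # flush the previous IP's stats
                yield (current_ip, key), clicks
            small_hash.clear()                      # re-use the (small) hash table
        current_ip = ip
        small_hash[bucket_key] += 1
    for key, clicks in small_hash.items():          # flush the final IP block
        yield (current_ip, key), clicks

s1 = [("1.2.3.4", 1, 6723, 44, 901), ("1.2.3.4", 1, 6723, 44, 901)]
s2 = [("1.2.3.4", 2, 45, 44, 901), ("9.9.9.9", 1, 784, 45, 902)]
table_s = dict(reduce_step([s1, s2]))  # 3 atomic buckets; the first holds 2 clicks
```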

Now you have the summary table you wanted to build; let's call it S. The initial data set had 50 million clicks and dozens of fields, some occupying a lot of space. The summary table is much more manageable and compact, though still far too large to fit in Excel.

    Creating rules

The rule set for fraud detection will be created based only on data found in the final summary table S (and additional high-level summary tables derived from S alone). An example of a rule is "IP address is active 3+ days over the last 7 days". Computing the number of clicks and analyzing this aggregated click bucket is straightforward using table S. Indeed, table S can be seen as a "cube" (in the database sense), and the rules you create simply narrow down some of the dimensions of this cube. In many ways, creating a rule set consists in building less granular summary tables on top of S, and testing.
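The example rule above can be sketched directly against S (the row layout here is an assumed flattening of the summary table, not the article's exact format):

```python
# Rule: flag an IP as suspicious if it is active on 3+ distinct days within
# the 7-day window. Illustrative row layout for S:
# (ip, day, ua_id, partner_id, affiliate_id, clicks).
from collections import defaultdict

def ips_active_3plus_days(summary_rows):
    """Return the set of IPs seen on at least 3 distinct days."""
    days_per_ip = defaultdict(set)
    for ip, day, *_ in summary_rows:
        days_per_ip[ip].add(day)
    return {ip for ip, days in days_per_ip.items() if len(days) >= 3}

rows = [("1.2.3.4", d, 6723, 44, 901, 5) for d in (1, 3, 6)] + \
       [("9.9.9.9", 2, 45, 44, 902, 1)]
print(ips_active_3plus_days(rows))  # {'1.2.3.4'}
```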


IP addresses can be mapped to an IP category, and IP category should become a fundamental metric in your rule system. You can compute summary statistics by IP category. See details in my article on Internet topology mapping. Finally, automated nslookups should be performed on thousands of test IP addresses (both bad and good, both large and small in volume).

Likewise, UA's (user agents) can be categorized - a nice taxonomy problem in itself. At the very least, use three UA categories: mobile, (nice) crawlers that identify themselves as crawlers, and other. The purpose of the UA list, such as ~6723|9~45|1~784|2 (see above), for each atomic bucket is to identify schemes based on multiple UA's per IP, as well as the type of IP proxy (good or bad) we are dealing with.
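The minimal three-way taxonomy above could start as simple substring matching (the token lists here are illustrative, not exhaustive):

```python
# Minimal UA taxonomy: mobile / self-identified crawler / other.
# Crawler check runs first, so "Googlebot ... Mobile" still counts as a crawler.
def ua_category(user_agent):
    ua = user_agent.lower()
    if any(token in ua for token in ("bot", "crawler", "spider")):
        return "crawler"
    if any(token in ua for token in ("mobile", "android", "iphone")):
        return "mobile"
    return "other"

print(ua_category("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # crawler
print(ua_category("Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)"))  # mobile
print(ua_category("Mozilla/5.0 (Windows NT 6.1; WOW64)"))  # other
```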

Historical note: interestingly, the first time I encountered a Map-Reduce framework was when I worked at Visa in 2002, processing rather large files of credit card transactions. These files contained 50 million observations. SAS could not sort them; it would crash because of the many large temporary files SAS creates to do a big sort - essentially filling the hard disk. Remember, this was 2002, with an earlier version of SAS (I think version 6; version 8 and above are far superior). Anyway, to solve this sort issue - an O(n log n) problem in terms of computational complexity - we used the "split / sort subsets / merge and aggregate" approach described in this article.


I showed you how to extract and summarize data from large log files using Map-Reduce, and then how to create a hierarchical database with multiple, hierarchical levels of summarization, starting with a granular summary table S containing all the information needed at the atomic level (atomic data buckets), all the way up to high-level summaries corresponding to rules. In the process, only text files are used. You could call this a NoSQL Hierarchical Database (NHD). The granular table S (and the way it is built) is similar to the Hadoop architecture.

    Originally posted on DataScienceCentral, by Dr. Granville.

