
All Posts (76)

Guest blog post by Yuanjen Chen

At CES 2015, I was fascinated by all sorts of possible applications of IoT – socks with sensors, mattresses with sensors, smart watches, smart everything – it seems like a scene from a sci-fi movie has just come true. People are eager to learn more about what’s happening around them, and now they can.


While I was there, I attended a talk given by David Pogue – he is awesome. He pointed out that the prevalence of smartphones is the key to the realization of the phenomenon called “Quantified Self.” I agree with him. Smartphones play a vital role as a hub where all our personal data converges and is presented seamlessly. The fact that you carry your smartphone around all the time, and that its screen is large enough to show all that information, makes it a catalyst for wearable devices and IoT or, as we like to call it, Intelligence of Things.


It’s all related: Big Data, IoT, Wearables, Cloud Computing… While most data is uploaded to the cloud, client devices are generally powerful enough that the computing can be decentralized. That said, small data (client side) and big data (server side) form an ecosystem in which small data triggers the knowledge base cultivated by big data and drives predictive analysis and decision making in a timely manner. Furthermore, your smartphone gathers many kinds of data and is able to analyze cross-app data to personalize your application settings. For example, what about optimizing navigation based on my physical condition? Or how about suggesting the best route according to my health along with the weather? These individual data records might be small, but collectively they enrich the analysis and contribute some amazing value. We at BigObject really appreciate this context of Big Data.


Marc Andreessen once said, “I think we are all underestimating the impact of aggregated big data across many domains of human behavior, surfaced by smartphone apps.” For us here at BigObject, the next big thing in big data is to find a methodology that can link multiple data sources together and identify the meaningful connections between them. Most importantly, it must be responsive enough to deliver actionable insight and simple enough for people to adopt. That is the key to fulfilling the promise of a connected world.

Read more…

Guest blog post by derick.jose

Each blowout costs the Oil and Gas industry a billion dollars.

Can we avoid it? Can we see it coming and take action? 

We also know that each operating rig consists of thousands of sensors. This sensor data can be used to analyse and considerably reduce HSE (Health, Safety and Environment) risk, and to save dollars.

The current situation on the upstream side of the oil and gas industry is that there is a plethora of fragmented data streams which have not been viewed in a holistic fashion. They consist of Reserves Geospatial data, MWD (Measurement While Drilling) data / Remotely steerable down hole tools - RPM, Down hole pressures from fibre optic sensors, Temperature sensors, Circulation solids, SCADA data from Valve events and Pump events, Asset operating parameters, Out of condition alarms, Safety Incident data pools, Seismic Survey Data, Identity management logs (Swipe ins, Swipe outs), Contractor data points and Ambient conditions. These can be broadly segmented into 2 data classes:

  1. Velocity (Real time streaming MWD, LWD, SCADA data streams) and
  2. Variety (Unstructured Reserves data, Geospatial data, Safety incident notes, Surveillance Video streams)

So what's the cost of data fragmentation in the last mile? This fragmentation prevents risk/safety specialists from seeing the risk context holistically, and the right lenses are not in place to get the context and triangulate early warning patterns.

See the full blog on Oil n Gas Big Data Use Cases at


Read more…

Originally posted on the pandas documentation website

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

The book can be downloaded here


Read more…

Guest blog post by Michael Walker

The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is its soul.


Let's define our terms:


  • Internet of Things (IOT): equipping all physical and organic things in the world with identifying intelligent devices, allowing the near real-time collecting and sharing of data between machines and humans. The IOT era has already begun, albeit in its first primitive stage.
  • Data Science: the analysis of data creation. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.
  • Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels. 
To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines. 
Billions of machines, products and things from the physical and organic world will merge with the digital world, allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allow us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or other intelligent machines and networks), any time, to any device. When smart machines and products can communicate, they help us and other machines understand, so we can make better decisions, act fast, save time and money, and improve products and services.
The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).
The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.
Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.
Read more…

Join us April 14th at 9am PDT for the latest in DSC's Webinar Series: The Science of Segmentation: What Questions You Should be Asking Your Data?, sponsored by Pivotal.

Space is limited.

Reserve your Webinar seat now

Enterprise companies starting the transformation into a data-driven organization often wonder where to start. Companies have traditionally collected large amounts of data from sources such as operational systems. With the rise of big data, big data technologies and the Internet of Things (IoT), additional sources – such as sensor readings and social media posts – are rapidly becoming available. In order to effectively utilize both traditional sources and new ones, companies first need to join and view the data in a holistic context. After establishing a data lake to bring all data sources together in a single analytics environment, one of the first data science projects worth exploring is segmentation, which automatically identifies patterns.

In this DSC webinar, two Pivotal data scientists will discuss:

  • What segmentation is
  • Traditional approaches to segmentation
  • How big data technologies are enabling advances in this field

They will also share some stories from past data science engagements, outline best practices and discuss the kinds of insights that can be derived from a big data approach to segmentation using both internal and external data sources.

Grace Gee, Data Scientist -- Pivotal
Jarrod Vawdrey, Data Scientist -- Pivotal

Hosted by:
Tim Matteson, Co-Founder -- Data Science Central



After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Pivotal Big Data Roadshow

Join data technology experts from Pivotal to get the latest perspective on how big data analytics and applications are transforming organizations across industries.

This event provides an opportunity to learn about new developments in the rapidly-changing world of big data and understand best practices in creating Internet of Things (IoT) applications.

Additionally, attendees will engage in hands-on data science and application development training using the market-leading Pivotal Big Data Suite.

Who Should Attend
In addition to addressing industry trends in big data, the sessions and hands-on workshop cover topics of interest to developers, architects, data scientists and technical managers.

8:00 AM – 9:00 AM: Check-In & Breakfast
9:00 AM – 10:15 AM: The Journey to Becoming a Data-Driven Enterprise
10:15 AM – 10:30 AM: Coffee Break
10:30 AM – 12:00 PM: Internet of Things Demo and Architecture Overview
12:00 PM – 1:00 PM: Lunch & Birds of a Feather Discussion
1:00 PM – 3:00 PM: Hands-On Technical Workshop

Register now for this exclusive event. The workshop is FREE; space is limited.

March 24, 2015
Bridgewater Marriott
700 Commons Way
Bridgewater, NJ 08807

Pivotal offers a modern approach to technology that organizations need to thrive in a new era of business innovation. Our solutions intersect cloud, big data and agile development, creating a framework that increases data leverage, accelerates application delivery, and decreases costs, while providing enterprises the speed and scale they need to compete. More at

© Pivotal, and the Pivotal logo are registered trademarks or trademarks of Pivotal Software, Inc. in the United States and other countries. All other trademarks used herein are the property of their respective owners. © 2015 Pivotal Software, Inc. All rights reserved. Published in the USA.

Read more…

Guest blog post by Fari Payandeh


Aug 12, 2013

May I say at the outset that I know the phrase “Data Suction Appliance” sounds awkward at its best and downright awful at its worst. Honestly, I don’t feel that bad! These are some of the words used in Big Data products or company names: story, genie, curve, disco, rhythm, deep, gravity, yard, rain, hero, opera, karma… I won’t be surprised if I come across a  start-up named WeddingDB next week.

 Although there is so much hype surrounding social media data, the real goldmine is in the existing RDBMS Databases and to a lesser degree in Mainframes. The reason is obvious. Generally speaking data capture has been driven by business requirements, and not by some random tweets about where to meet for dinner.  In short, the Database vendors are sitting on top of the most valuable data.

 Oracle, IBM, and Microsoft “own” most of the data in the world. By that I mean if you run a query in any part of the world,  it’s very likely that you are reading the data from a Database owned by them. The larger the volume of data, the greater the degree of ownership; just ask anyone who has attempted to migrate 20 TB of data from Oracle to DB2. In short, they own the data because the customers are locked-in. Moreover, the real value of data is much greater than the revenues generated from the Database licenses. In all likelihood the customer will buy other software/applications from the same vendor since it’s a safe choice. From the Database vendors’ standpoint the Database is a gift that keeps on giving. Although they have competed for new customers, due to absence of external threats (Non-RDBMS technology), they have enjoyed being in a growing market that has kept them happy. Teradata, MySql (Non-Oracle flavors), Postgres, and Sybase have a small share of the overall Database market.

 The birth of Hadoop and NoSql technology represented a seismic shift that shook the RDBMS market, not in terms of revenue loss/gain, but in offering an alternative to businesses. The Database vendors moved quickly to jockey for position and, contrary to what some believe, I don’t think they were afraid of a meltdown. After all, who was going to take their data? They responded to the market lest they be deprived of the Big Data windfall.

 IBM spent $16 billion on its Big Data portfolio and launched PureData for Hadoop, a hardware/software system composed of the IBM Big Data stack. It introduced SmartCloud and recently backed Pivotal’s Cloud Foundry. Cloud Foundry is "like an operating system for the cloud," according to Andy Piper, developer advocate for Cloud Foundry at Pivotal.

 Microsoft HDInsight products integrate with Sql Server 2012, System Center, and other Microsoft products; the Azure cloud-based version integrates with Azure cloud storage and Azure Database.

 Oracle introduced a Big Data appliance bundle comprising Oracle NoSql Database, Oracle Linux, Cloudera Hadoop, and the Hotspot Java Virtual Machine. It also offers Oracle Cloud Computing.

 What is a Data Suction Appliance? There is a huge market for a high performance data migration tool that can copy the data stored in RDBMS Databases to Hadoop. Currently there are no fast ways of transferring data to Hadoop; performance is sluggish. What I envision is data transfer at the storage layer and not the Database layer. Storage vendors such as EMC and NetApp have an advantage in finding a solution while working with Data Integration vendors like Informatica. Informatica recently partnered with VelociData, a provider of hyper-scale/hyper-speed engineered solutions.

 Is it possible? I would think so. I know that I am simplifying the process, but this is a high level view of what I see as a possible solution. Database objects are stored at specific disk addresses. The process starts with the address of an instance, within which the information about the root Tablespace or Dbspace is kept. Once the root Tablespace is identified, the information about the rest of the objects (Non-root Tablespaces, tables, indexes, …) is available in Data Dictionary tables and views. This information includes the addresses of the data files. Data file headers store the addresses of free/used extents, and we continue on that path until the data blocks containing the target rows are identified. Next, the Data Suction Appliance bypasses the Database and bulk copies the data blocks from storage to Hadoop. Some transformations may be needed during data transfer in order to bring in the data in a way that NoSql Databases can understand, but that can be achieved through an interface that allows the Administrators to specify the data transfer options.

 The future will tell if I am dreaming or, as cousin Vinny said, "The argument holds water".

Read more…

Guest blog post by Vincent Granville

Theo: One idea is that you must purchase a number of transactions before using the paid service, and add dollars regularly. A transaction is a call to the API.

The service is accessed via an HTTP call that looks like

When the request is executed,

  • First the script checks if client has enough credits (dollars)
  • If yes it fetches the data on the client web server: the URL for the source data is yyy
  • Then the script checks if source data is OK or invalid, or client server unreachable
  • Then it executes the service zzz, typically, a predictive scoring algorithm
  • The parameter field tells whether you train your predictor (data = training set) or whether you use it for actual predictive scoring (data outside the training set)
  • Then it processes data very fast (a few secs for 1MM observations for the training step)
  • Then it sends an email to client when done, with the location (on the datashaping server) of the results (the location can be specified in the API call, as an additional field, with a mechanism in place to prevent file collisions from happening)
  • Then it updates client budget

Note that all of this can be performed without any human interaction. Retrieving the scored data can be done with a web robot, and the data can then be integrated into the client's database (again, automatically). Training the scores would be charged much more than scoring one observation outside the training set. Scoring one observation is a transaction, and could be charged as little as $0.0025.

This architecture is for daily or hourly processing, but could be used for real time if the parameter is not set to "training". However, when designing the architecture, my idea was to process large batches of transactions, maybe 1MM at a time.
Read more…

One of the most valuable tools that I've used, when performing exploratory analysis, is building a data dictionary. It offers the following advantages:

  • Identify areas of sparsity and areas of concentration in high-dimensional data sets
  • Identify outliers and data glitches
  • Get a good sense of what the data contains, and where to spend time (or not) in further data mining

What is a data dictionary

A data dictionary is a table with 3 or 4 columns. The first column represents a label: that is, the name of a variable, or a combination of multiple (up to 3) variables. The second column is the value attached to the label: the first and second columns actually constitute a name-value pair. The third column is a frequency count: it measures how many times the value (attached to the label in question) is found in the data set. You can add a fourth column that gives the dimension of the label (1 if it represents one variable, 2 if it represents a pair of variables, etc.).

Typically, you include all labels of dimension 1 and 2 with count > threshold (e.g. threshold = 5), but no or only very few values (the ones with high count) for labels of dimension 3. Labels of dimension 3 should be explored after having built the dictionary for dim 1 and 2, by drilling down on label/value of dim 2, that have a high count.

Example of dictionary entry

category~keyword travel~Tokyo 756 2

In this example, the entry corresponds to a label of dimension 2 (as indicated in column 4), and the simultaneous combination of the two values (travel, Tokyo) is found 756 times in the data set.

The first thing you want to do with a dictionary is to sort it using the following 3-dim index: column 4, then column 1, then column 3. Then look at the data and find patterns.
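
As a small illustration of that sort, here is a Python sketch. Only the travel~Tokyo entry comes from the example above; the other rows are made up for the illustration, and sorting counts in descending order within a label is my assumption, since the sort direction is not specified.

```python
# Each entry: (label, value, count, dimension) = columns 1 to 4.
entries = [
    ("category~keyword", "travel~Tokyo", 756, 2),   # entry from the example
    ("keyword", "Tokyo", 1210, 1),                   # hypothetical
    ("category", "travel", 5400, 1),                 # hypothetical
    ("category~keyword", "travel~Paris", 912, 2),    # hypothetical
]

# Sort by column 4 (dimension), then column 1 (label), then column 3 (count).
# The count is negated so that the most frequent values come first.
sorted_entries = sorted(entries, key=lambda e: (e[3], e[0], -e[2]))

for label, value, count, dim in sorted_entries:
    print(label, value, count, dim, sep="\t")
```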

How do you build a dictionary

Browse your data set sequentially. For each observation, store all label/value of dim 1 and dim 2 as hash table keys, and increment count by 1 for each of these label/value. In Perl, it can be performed with code such as $hash{"$label\t$value"}++. 

If the hash table grows very large, stop, save the hash table on file then delete it in memory, and resume where you paused, with a new hash table. At the end, merge hash tables after ignoring hash entries where count is too small.
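
For readers who prefer Python to Perl, the procedure could be sketched as follows. The function name `build_dictionary` and the `max_keys` limit are hypothetical. To keep the example self-contained, full tables are set aside in a list rather than written to file, and the count threshold is applied after the merge so that totals are exact; filtering each partial table before merging, as described above, saves space at the cost of possibly dropping entries that are spread across several tables.

```python
from collections import Counter
from itertools import combinations


def build_dictionary(observations, threshold=5, max_keys=1_000_000):
    """Count label/value pairs of dimension 1 and 2.

    observations: iterable of dicts mapping variable name -> value.
    Returns {(label, value, dimension): count} for counts > threshold.
    """
    partial_tables = []   # stands in for hash tables saved to file
    table = Counter()
    for obs in observations:
        items = sorted(obs.items())
        for name, value in items:                           # dimension 1
            table[(name, str(value), 1)] += 1
        for (n1, v1), (n2, v2) in combinations(items, 2):   # dimension 2
            table[(f"{n1}~{n2}", f"{v1}~{v2}", 2)] += 1
        if len(table) >= max_keys:   # table too large: set it aside
            partial_tables.append(table)
            table = Counter()
    partial_tables.append(table)

    merged = Counter()
    for partial in partial_tables:
        merged.update(partial)       # Counter.update adds the counts
    return {key: n for key, n in merged.items() if n > threshold}
```

Calling it on a list of observations such as `{"category": "travel", "keyword": "Tokyo"}` yields keys like `("category~keyword", "travel~Tokyo", 2)`, matching the layout of the dictionary entry shown earlier.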

Originally posted in by Vincent Granville

Read more…

R in your browser

This blog was originally posted in by Mirko Krivanek

You have to see it to believe it. Go to and you can enter R commands in the browser-embedded console. I wonder how easy it would be to run R from a browser on your iPad. I'm not sure how you would import data files, but I suppose R offers the possibility to open a file located on a web or FTP server, rather than a local file stored on your desktop. Or does it not? Also, it would be cool to have Python in a browser.



Read more…

Big Data, IOT and Security - OH MY!

Guest blog post by Carla Gentry

While we aren’t exactly “following the yellow brick road” these days, you may be feeling a bit like Dorothy from the “Wizard of Oz” when it comes to these topics. No my friend, you aren’t in Kansas anymore! As seen above from Topsy, these three subjects are extremely popular these days and for the last 30 days seem to follow a similar pattern (coincidence?).


The internet of things is not just a buzzword and is no longer a dream, with sensors abounding. The world is on its way to becoming totally connected, although it will take time to work out a few kinks here and there (with a great foundation, you create a great product; this foundation is what will take the most time). Your appliances will talk to you in your “smart house” and your “self-driving car” will take you to your super tech office where you will work with ease thanks to all the wonders of technology. But let’s step back to reality and think: how is all this going to come about, what will we do with all the data collected, and how will we protect it?


First things first: all the sensors have to be put in place, and many questions have to be addressed. Does a door lock by one vendor communicate with a light switch by another vendor? Do you want the thermostat to be part of the conversation? And will anyone else be able to see my info or get into my home?

How will all the needed sensors be installed, and will there be any “human” interaction? It will take years to put all the needed sensors in place, but some organizations are already engaging in the IOT here in the US. Hotels (as one example, but not the only ones investing in IOT) are using sensors connected to the products that are available for sale in each room. That is great, but I recently had an experience that showed how “people” are the vital part of “IOT”. I went to check out of a popular hotel in Vegas and was asked if I had drunk one of the coffees in the room. I replied, “No, why?” and was told that the sensor showed that I had either drunk or moved the coffee. The hotel clerk verified that I had “moved” and not “drunk” the coffee, but without her I would have been billed and would have had to dispute the charge. Disputed charges are not exactly good for business, and customer service having to handle “I didn’t purchase this” disputes 24/7 wouldn’t exactly make anyone’s day, so thank goodness for human interaction right there on the spot.


“The Internet of Things” is not just a US effort - Asia, in my opinion, is far ahead of the US as far as the internet of things is concerned. Commuters waiting in a Korean subway station can browse and scan the QR codes of products, which will later be delivered to their homes. (Source: Tesco) Transport for London’s central control centers use the aggregated sensor data to deploy maintenance teams, track equipment problems, and monitor goings-on in the massive, sprawling transportation system. Telent’s Steve Pears said in a promotional video for the project that "We wanted to help rail systems like the London Underground modernize the systems that monitor its critical assets - everything from escalators to lifts to HVAC control systems to CCTV and communication networks." The new smart system creates a computerized and centralized replacement for a public transportation system that used notebooks and pens in many cases.


But isn't the Internet of Things too expensive to implement? Many IoT devices rely on multiple sensors to monitor the environment around them. The cost of these sensors declined 50% in the past decade, according to Goldman Sachs. We expect prices to continue dropping at a steady rate, leading to an even more cost-effective sensor.



The Internet of Things is not just about the gathering of data but also about the analysis and use of data. All this data generated by the internet of things, when used correctly, will help us in our everyday lives as consumers and help companies keep us safer by predicting and thus avoiding issues that could harm or delay us, not to mention the costs that could be reduced by finding patterns in data for transportation, healthcare and banking. The possibilities are endless.


Let’s talk about security and data breaches – Now you may be thinking I’m in analytics or data science why should I be concerned with security? Let’s take a look at several breaches that have made the headlines lately.


Target recently suffered a massive security breach thanks to an attacker infiltrating a third party, and so did Home Depot. PC World said: “Data breach trends for 2015: Credit cards, healthcare records will be vulnerable.”



Sony was hit by hackers on Nov. 24, resulting in a company wide computer shutdown and the leak of corporate information, including the multimillion-dollar pre-bonus salaries of executives and the Social Security numbers of rank-and-file employees. A group calling itself the Guardians of Peace has taken credit for the attacks.


So how do we protect ourselves in a world of BIG DATA and the IOT?
Why should I, as a data scientist or analyst, be worried about security? That’s not really part of my job, is it? Well, if you are a consultant or own your own business, it is! Say you download secure data from your clients and then YOU get hacked: guess who is liable if sensitive information is leaked or gets into the wrong hands? What if you develop a platform where the client’s customers can log in and check their accounts? Credit card info and purchase histories are stored on this system, and if they are stolen, it can set you up for a lawsuit. If you are a corporation, you are protected to some extent, but what if you operate as a sole proprietor? You could lose your home, company and reputation. Still think security when dealing with big data isn’t important?

Organizations need to get better at protecting themselves and at discovering that they’ve been breached. We, the consultants, also need to do a better job of protecting our own data, and that means you can’t use “password” as a password! Let’s not make it easy for the hackers. When we collect sensitive data (and yes, even the data collected from cool technology toys connected to the internet), let’s be security minded: check your statements, logs and security messages, and verify everything! When building your database, use all the security features available (masking, obfuscation, encryption) so that if someone does gain access, what they steal is NOT usable!


Be safe, enjoy what tech has to offer with peace of mind and, at all costs, protect your DATA.


I’ll leave you with a few things to think about:

“Asset management critical to IT security”
"A significant number of the breaches are often caused by vendors but it's only been recently that retailers have started to focus on that," said Holcomb. "It's a fairly new concept for retailers to look outside their walls." (Source:


“Data Scientist: Owning Up to the Title”
Enter the Data Scientist; a new kind of scientist charged with understanding these new complex systems being generated at scale and translating that understanding into usable tools. Virtually every domain, from particle physics to medicine, now looks at modeling complex data to make our discoveries and produce new value in that field. From traditional sciences to business enterprise, we are realizing that moving from the "oil" to the "car", will require real science to understand these phenomena and solve today's biggest challenges. (Source:



Forget about data (for a bit) what’s your strategic vision to address your market?

Where are the opportunities given global trends and drivers? Where can you carve out new directions based on data assets? What is your secret sauce? What do you personally do on an everyday basis to support that vision? What are your activities? What decisions do you make as a part of those activities? Finally what data do you use to support these decisions?


Read more…

What MapReduce can't do

Guest blog post by Vincent Granville

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, then gathers and aggregates the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets up to 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.
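The keyword-frequency case can be sketched in a few lines of Python. This is only a single-machine illustration of the map and reduce steps, not a distributed implementation; the function names `map_count` and `reduce_counts` and the three toy chunks are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_count(chunk: str) -> Counter:
    """Map step: build a keyword frequency table for one subset of the text."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two frequency tables by adding their counts."""
    return left + right


# Toy stand-ins for the 1,000 one-gigabyte subsets (hypothetical data).
chunks = [
    "big data big ideas",
    "data science and big data",
    "science of data",
]

# Each chunk is processed independently, so this step could run on separate servers.
partial_tables = [map_count(chunk) for chunk in chunks]

# Aggregating the partial tables gives the final frequency table.
final_table = reduce(reduce_counts, partial_tables, Counter())
print(final_table.most_common(3))
```

The point is that no map call needs to see any other chunk, which is exactly what makes the problem easy to distribute.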

However, when you need to process variables or data sets jointly, that is 2 by 2 or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come up with a more sophisticated solution.

The Problem

Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.

You have to compute k * (k-1) / 2 correlations to solve this problem, despite the fact that you only have k=10,000 stock symbols. You cannot split your 10,000 stock symbols into 1,000 clusters, each containing 10 stock symbols, then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster, and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-cluster computations make MapReduce useless in this case. The same issue arises if you replace the word "correlation" by any other function, say f, computed on two variables, rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.
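A quick calculation shows how lopsided the split is. The sketch below uses the figures from the text (k=10,000 symbols, 1,000 clusters of 10); the helper name `pair_count` is an assumption for illustration.

```python
def pair_count(k: int) -> int:
    """Number of unordered pairs among k variables: k * (k - 1) / 2."""
    return k * (k - 1) // 2


k = 10_000                      # stock symbols
clusters = 1_000                # clusters of equal size
per_cluster = k // clusters     # 10 symbols per cluster

total_pairs = pair_count(k)                          # all correlations to compute
within_pairs = clusters * pair_count(per_cluster)    # pairs inside a single cluster
cross_pairs = total_pairs - within_pairs             # pairs spanning two clusters
cross_share = cross_pairs / total_pairs

print(total_pairs, within_pairs, cross_pairs, round(cross_share, 4))
```

Only 45,000 of the 49,995,000 pairs stay inside one cluster, so more than 99.9% of the work crosses cluster boundaries.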

Three Solutions

Here I propose three solutions:

1. Sampling

Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increasing the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide which variables to keep, and which ones to add, to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).
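A minimal sketch of the plain random-sampling variant follows (not the incremental or simulated-annealing refinements). The data is synthetic, the sizes k=200 and n=50 are scaled down for the example, and the helpers `pearson` and `sample_pairs` are hypothetical names introduced here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant (a convention)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def sample_pairs(k, m, rng):
    """Draw m distinct unordered pairs (i, j) with i < j, uniformly at random."""
    pairs = set()
    while len(pairs) < m:
        i, j = rng.sample(range(k), 2)
        pairs.add((min(i, j), max(i, j)))
    return pairs


rng = random.Random(42)
k, n = 200, 50   # toy sizes: 200 variables, 50 observations each
signals = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

total = k * (k - 1) // 2
m = max(1, int(0.001 * total))   # fraction suggested in the text
sampled = sample_pairs(k, m, rng)

# Strongest correlation (in absolute value) among the sampled pairs only.
best = max((abs(pearson(signals[i], signals[j])), i, j) for i, j in sampled)
print(m, best)
```

To follow the incremental strategy described above, one would repeat the sampling step with a growing m and stop once `best[0]` stops improving.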

I'll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million times fewer computer resources than finding the global optimum, yield very good approximations with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4 etc. to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

2. Binning

If you can bin your variables in a way that makes sense, and if n is small (say n=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, each stock has 2^5 = 32 possible signal vectors, so there are at most 32 x 32 = 1,024 potential pairs of vectors (about half that many if you exploit the symmetry of correlation). An example of such a pair is {(up, up, down, up, down), (up, up, up, down, down)} where the first 5 values correspond to a particular stock, and the last 5 values to another stock. It is thus easy to pre-compute all 1,024 correlations. You will still have to browse all k * (k-1) / 2 pairs of stocks to solve your problem, but now it's much faster: for each pair you get the correlation from the lookup table - no computation required, only accessing a value in a hash table or an array with 1,024 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs might be faster than using lookup tables to access the pre-computed correlations. However, the principle works regardless of whether you compute a correlation or a much more complicated function f.
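The lookup-table idea can be sketched as follows for n=5, encoding up as 1 and down as 0. The table here has one entry per ordered pair of 5-value signal vectors, that is 32 x 32 = 1,024 entries; exploiting the symmetry of correlation would roughly halve that. Storing `None` for pairs involving a constant vector (where correlation is undefined) is an assumption of this sketch, as is the helper name `pearson`.

```python
import math
from itertools import product

n = 5
vectors = list(product((0, 1), repeat=n))   # all 32 up/down patterns (1 = up, 0 = down)


def pearson(xs, ys):
    """Pearson correlation, or None when either vector is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


# Pre-compute every possible correlation once.
table = {(a, b): pearson(a, b) for a in vectors for b in vectors}

# The example pair from the text.
stock_a = (1, 1, 0, 1, 0)   # up, up, down, up, down
stock_b = (1, 1, 1, 0, 0)   # up, up, up, down, down
corr = table[(stock_a, stock_b)]   # a dictionary access, no arithmetic
print(len(table), corr)
```

The same pattern applies to any function f of two binned variables: only the body of the pre-computation changes, not the lookup.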

3. Classical data reduction

Traditional reduction techniques can also be used: forward or backward step-wise techniques where (in turn) you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy, and conversely for variables being removed. Entropy can be measured in various ways. In a nutshell, if you have two data subsets (from the same large data set),

  • A set A with 100 variables, which is 1.23 GB when compressed, 
  • A set B with 500 variables, including the 100 variables from set A, which is 1.25 GB when compressed

Then you can say that the extra 400 variables (e.g. stock symbols) in set B don't bring any extra predictive power and can be ignored. Or in other words, the lift obtained with the set B is so small that it's probably smaller than the noise inherent to these stock price signals.
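As a rough illustration of using compressed size as an entropy proxy, the sketch below compares three synthetic sets with Python's standard `zlib` module. Everything here is an assumption made for the example: the sizes are scaled down to 10 and 50 variables, set B's extra variables are exact copies of variables in A (an extreme form of redundancy), and set C is added as a contrast where the extra variables are independent.

```python
import random
import zlib


def random_column(rng, n_bytes):
    """One binary variable with 8 * n_bytes observations, packed 8 per byte."""
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


def compressed_size(columns):
    """Size in bytes of the concatenated columns after zlib compression."""
    return len(zlib.compress(b"".join(columns), 9))


rng = random.Random(0)
n_bytes = 250   # 2,000 up/down observations per variable

set_a = [random_column(rng, n_bytes) for _ in range(10)]
# Set B: A plus 40 redundant variables (copies of variables already in A).
set_b = set_a + [rng.choice(set_a) for _ in range(40)]
# Set C: A plus 40 genuinely new, independent variables.
set_c = set_a + [random_column(rng, n_bytes) for _ in range(40)]

size_a = compressed_size(set_a)
size_b = compressed_size(set_b)
size_c = compressed_size(set_c)
print(size_a, size_b, size_c)
```

Set B compresses to nearly the same size as set A, while set C grows roughly in proportion to the number of variables. Real data would of course show partial rather than exact redundancy, so the gap would be less clear-cut.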

Note: An interesting solution consists of using a combination of the three previous strategies. Also, be careful to make sure that the high correlations found are not an artifact caused by the "curse of big data" (see reference article below for details).

Another example where MapReduce is of no use

Building a keyword taxonomy:

Step 1:

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword, and for each "keyword pair". A "keyword pair" is two keywords found on the same web page, or close to each other on the same web page. Also by keyword, I mean stuff like "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many million keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A="California insurance", B="home insurance", x=543, y=998, z=11


  • x is the number of occurrences of keyword A in all the web pages that you crawled
  • y is the number of occurrences of keyword B in all the web pages that you crawled
  • z is the number of occurrences where A and B form a pair (e.g. they are found on the same page)

This "keyword pair" table can indeed be very easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B do not form a "keyword pair", in other words, z=0. So by ignoring these null entries, your "keyword pair" table is still manageable, and might contain as few as 50 million entries.
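Step 1 can be sketched as a per-page map followed by a merge of counts. The pages below are hypothetical, each page is simplified to a set of already-extracted keywords (so a keyword counts at most once per page), and the function name `map_page` is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical crawl output: the set of keywords found on each page.
pages = [
    {"california insurance", "home insurance", "car insurance"},
    {"california insurance", "home insurance"},
    {"home insurance", "mortgage rates"},
]


def map_page(keywords):
    """Map step: count keywords and co-occurring keyword pairs on one page."""
    singles = Counter(keywords)
    pairs = Counter(combinations(sorted(keywords), 2))   # sorted so (A, B) is canonical
    return singles, pairs


# Reduce step: merge the per-page counts. Pairs with z=0 are never created.
keyword_counts, pair_counts = Counter(), Counter()
for singles, pairs in map(map_page, pages):
    keyword_counts += singles
    pair_counts += pairs

a, b = "california insurance", "home insurance"
x, y, z = keyword_counts[a], keyword_counts[b], pair_counts[(a, b)]
print(x, y, z)
```

Because each page is handled independently and the merge is a simple sum, this step distributes well, unlike the clustering in Step 2.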

Step 2:

To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a similarity d(A, B) between two keywords A, B. For instance d(A, B) = z / SQRT(x * y), although other choices are possible. The higher d(A, B), the closer keywords A and B are to each other (so strictly speaking d measures similarity; a dissimilarity can be derived from it, e.g. as 1 - d). Now the big problem is to perform clustering - any kind of clustering, e.g. hierarchical - on the "keyword pair" table, using any kind of similarity or dissimilarity. This problem, just like the correlation problem, cannot be split into sub-problems (followed by a merging step) using MapReduce. Why? Which solution would you propose in this case?
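For the sample entry given in Step 1 (x=543, y=998, z=11), the formula works out as follows. The function name `similarity` is an assumption for illustration.

```python
import math


def similarity(x: int, y: int, z: int) -> float:
    """d(A, B) = z / sqrt(x * y): higher values mean the keywords are closer."""
    return z / math.sqrt(x * y)


# A = "California insurance", B = "home insurance"
d = similarity(543, 998, 11)
print(round(d, 4))
```

The value is about 0.015. Computing d for one pair is trivial; the hard part is that clustering needs these values across pairs that cannot be cleanly partitioned.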

Related articles:


Guest blog post by Don Philip Faithful

The idea of environmental determinism once made a lot of sense. Hostile climates and habitats prevented the expansion of human populations. The conceptual opposite of determinism is called possibilism. These days, human populations can be found living in many inhospitable habitats. This isn't because humans have physically evolved, but rather because we normally occupy built-environments. We exist through our technologies and advanced forms of social interaction: a person might not be able to build a house, but he or she can arrange for financing to have a house constructed. "Social possibilism" has enabled our survival in inhospitable conditions. Because humans today almost always live within or in close proximity to built-environments, among the most important factors affecting human life today is data. The systems that support human society make use of data in all of its multifarious forms; this being the case, data science is important to our continuation and development as a species. This blog represents a discussion highlighting the need for a universal data model. I find that the idea of "need" is highly subjective; and perhaps the tendency is to focus on organizational needs specifically. I don't dispute the importance of such a perspective. But I hope that readers consider the role of data on a more abstract level in relation to social possibilism. It is this role that the universal data model is meant to support. Consider some barriers or obstacles that underline the need for a model, listed below.

Barriers to Confront

I certainly don't suggest that in this blog I am introducing the authoritative data model to end all models. Quite the contrary, I feel that my role is to help promote discussion. I imagine even in the list of barriers, there might be some disagreement among data scientists.

(1) Proxy reductionism triggered by instrumental needs: I believe some areas of business management have attempted to address highly complex phenomena through the simplification of proxies (i.e. data). The nominal representation of reality facilitates production, but also insulates an organization from its environment. Thus production can occur disassociated from surrounding phenomena. I feel that this nominalism is due to lack of a coherent model to connect the use of data to theory.  We gain the illusion of progress through greater disassociation, exercising masterful control over data while failing to take into account and influence real-life conditions.

(2) Impairment from structurally inadequate proxies: Upon reducing a problem through the use of primitive proxies, an organization might find development less accessible. I believe that a data model can help in the process of diagnosis and correction. I offer some remedial actions likely applicable to a number of organizations: i) collection of adequate amounts of data; ii) collection of data of greater scope; and iii) ability to retain the contextual relevance of data.

Social Disablement Perspective

My graduate degree is in critical disability studies - a program that probably seems out-of-place in relation to data science. Those studying traditional aspects of disability might argue that this discipline doesn't seem to involve big data, algorithms, or analytics. Nonetheless, I believe that disablement is highly relevant in the context of data science albeit perhaps in a conceptual sense. While there might not be people with apparent physical or mental disabilities, there are still disabling environments. Organizations suffering from an inability to extract useful insights from their data might not be any more disabled than the data scientist surrounded by tools and technologies disassociated from their underlying needs. Conversely, those in the field of disability might discuss the structural entrenchment of disablement without ever targeting something as pervasive as data systems. However, for those open to different perspectives, I certainly discuss aspects of social disablement in my blogs all the time. Here, I will be arguing that at its core, data is the product of two forces in a perpetual tug-of-war: disablement and participation. So there you go. I offer some cartoon levity as segue.

I recently learned that the term "stormtroopers" has been used to describe various military forces. For the parable, assume that I mean Nazi shock troops. I'm uncertain how many of my peers have the ability to write computer programs. I create applications from scratch using a text editor. Another peculiarity of mine is the tendency to construct and incorporate elaborate models into my programming. It is never enough for a program to do something. I search for a supporting framework. Programming for me is as much about research through framework-development as it is about creating and running code. In the process of trying to communicate models to the general public, I sometimes come up with examples that I admit are a bit offbeat. Above in the "Parable of the Stormtrooper and the Airstrip," I decided to create personifications to explain my structural conceptualization of data. The stormtrooper on the left would normally be found telling people what to do. Physical presence or presence by physical proxy is rather important. (I will be using the term "proxy" quite frequently in this blog.) He creates rules or participates in structures to impose those rules. He hollers at planes to land on his airstrip. I chose this peculiar behaviour deliberately. Command for the soldier is paramount, effectiveness perhaps less so. In relation to the stormtrooper, think social disablement; this is expressed on the drawing as "projection."

On the other side of the equation is this person that sort of resembles me and who I have identified as me although this is a personification of an aspect of data. He is not necessarily near or part of the enforcement regime. His objective rather than to compel compliance is to make sense of phenomena: he aims to characterize and convey it especially those aspects of reality that might be associated with but not necessarily resulting from the activities of the stormtrooper. There are no rules for this individual to impose. Nor does he create structures to assert his presence over the underlying phenomena. In his need to give voice to phenomena, he seeks out "ghosts" through technology. If this seems a bit far-fetched, at least think of him as a person with all sorts of tools designed to detect events that are highly evasive. Perhaps his objective is to monitor trends, consumer sentiment, heart palpitations, or patterns leading to earthquakes. Participation is indicated on the drawing as "articulation."

So how is a model extracted from this curious scene? I added to the drawing what I will refer to as the "eye": data appears in the middle surrounded by projection and articulation. Through this depiction, I am saying that data is rarely just plain data. It is a perpetual struggle between the perceiver and perceived. I think that many people hardly give "data" much thought: e.g. here is a lot of data; here is my analysis; and here are the results. But let us consider the idea that data is actually quite complicated from a theoretical standpoint. I will carry on this discussion using an experiment. The purpose of this experiment is not to arrive at a conclusion but rather to perceive data in its abstract terms.

An Experiment with No Conclusion

A problem when discussing data on an abstract level is the domain expertise of individuals. I realize this is an ironic position to take given so many calls for greater domain expertise in data science. The perspective of a developer is worth considering: he or she often lacks domain expertise, and yet this person is responsible for how different computer applications make use of data. Consequently, in order to serve the needs of the market, it is necessary for the developer to consider how "people" regard the data. Moreover, the process of coding imposes distance or abstraction since human mental processes and application processes are not necessarily similar. A human does not generate strings from bytes and store information at particular memory addresses. But a computer must operate within its design parameters. The data serves human needs due to the developer's transpositional interpretation of the data. The developer shapes the manner of conveyance, defines the structural characteristics of the data, and deploys it to reconstruct reality.

I have chosen an electrical experiment. There is just a single tool, a commercial grade voltmeter designed to detect low voltages. The voltage readings on this meter often jump erratically when I move it around a facility full of electrical devices; this behaviour occurs when the probes aren't touching anything. Now, the intent in this blog is not to determine the cause of the readings. I just want readers to consider the broader setting. Here is the experiment: with the probes sitting idle on a table, I took a series of readings at two different times of the day. The meter detected voltage - at first registering negative then becoming positive after about a minute. As indicated below on the illustration, these don't appear to be random readings. Given that there is data, what does it all mean? The meter is reading electrical potential, and this is indeed the data. What is the data in more abstract terms regardless of the cause?

Being a proxy is one aspect of data. Data is a constructed representation of its underlying phenomena: the electrical potential is only a part of the reality captured in this case by the meter. The readings from the meter define and constrain the meaning of the data such that it only relates to output of the device. In other words, what is the output of the device? It is the data indicated on the meter. It is a proxy stream; this is what we might recognize in the phenomena; for this is what we obtain from the phenomena using the meter. From the experiment itself, we actually gain little understanding of the phenomena. We only know its electrical readings. So although the data is indeed some aspect of the articulated reality, this data is more than anything a projection of how this reality is perceived. It is not my intention to dismiss the importance of the meter readings. However, we would have to collect far more data to better understand the phenomena. Our search cannot be inspired by the meter readings alone; it has to be driven by the phenomena itself.

Another problem relates to how the meter readings are interpreted. Clearly the readings indicate electrical potential; so one might suggest that the data provides us with an understanding of how much potential is available. The meter provides data not just relating to electrical potential alone but also dynamic aspects of the phenomena: its outcomes, impacts, and consequences. This is not to say that electrical potential is itself the outcome or antecedent of an outcome; but it is part of the reality of which the device is designed to provide only readings of potential. We therefore should distinguish between data as a proxy and the underlying phenomena, of which the data merely provides a thin connection or conduit. There is a structure or organizational environment inherent in data that affects the nature and extent to which the phenomena is expressed. The disablement aspect confines phenomena to contexts that ensure the structure fulfills instrumental requirements. Participation releases the contextual relevance of data.

Initial Conceptualization

I have met people over the years that refuse to question pervasive things. I am particularly troubled by the expression "no brainer." If something is a no-brainer, it hardly makes sense to discuss it further; so I imagine these people sometimes avoid deliberating over the nature of things. This strategy is problematic from a programming standpoint where it is difficult to hide fundamental lack of knowledge. It then becomes apparent that the "no brainer" might be the person perceiving the situation as such. Keeping this interpretation of haters and naysayers in mind, let's consider the possibility that it actually takes all sorts of brains to characterize data - that in fact the task can incapacitate both people and supercomputers. If somebody says, "Hey, that's a no brainer" to me or anybody else, my response will be, "You probably mean that space in your head!"  (Shakes fist at air.)

I provide model labels on the parable image: projection, data, and articulation. I generally invoke proper terms for aspects of an actual model. "Disablement" can be associated with "projection" on the model; and "participation" with the term "articulation." The conceptual opposition is indicated on the image below as point #1. Although the parable makes use of personifications, there can sometimes be entities in real-life doing the projection: e.g. the oppressors. There can also be real people being oppressed. In an organizational context, the issue of oppression is probably much less relevant, but the dynamics still persist between the definers and those being defined: e.g. between corporate strategists and consumers. Within my own graduate research, I considered the objectification of labourers and workers. As production environments have developed over the centuries, labour has become commodified. In the proxy representation, workers have been "defined" using the most reified metrics; but there is a counterforce also, for self-definition or some level of autonomy. Data exists within a context of competing interests as indicated on point #2.

From the experiment I indicated how data is like a continuum formed by phenomena and its radiating consequences: I said that readings can be taken of dynamic processes. This is a bit like throwing stones in a lake and having sensors detect ripples and counter-ripples. An example would be equity prices in the stock market where a type of algorithmic lattice can bring to light the dynamic movement of capital. Within this context, it is difficult to say whether what we are measuring is more consequence or antecedent; but really it is both. I believe it is healthy to assume that the data-logger or reading device offers but the smallest pinhole to view the universe on the other side. Point #3 shows these additional dynamics. There is a problem here in terms of graphical portrayal - how to bring together all three points into a coherent model. I therefore now introduce the universal data model. I also call this the Exclamation Model or the Model! The reasons will be apparent shortly.


The Exclamation Model visually resembles an exclamation mark, as shown on the image below. For the purpose of helping readers navigate, I describe the upper portion of the model as "V" and the lower part as "O," or "the eye" as I mentioned previously since it resembles a human eye. The model attempts to convey all of the different things that people tend to bundle up in data perhaps at times subconsciously. An example I often use in my blogs is sales data, which doesn't actually tell us much about why consumers buy products. There might be high demand one year followed by practically no demand the next; yet analysts try to plot sales figures as if the numbers follow some sort of intrinsic force or built-in natural pattern. Sales figures do not represent an articulation of the underlying phenomena; rather, they cause externally desired aspects of the phenomena to conform to an interpretive framework. Within any organizational context, there is a battle to dictate the meaning of data. If an organization commits itself to the collection of sales data and nothing beyond this to understand its market, it would be difficult at a later time to find a suitable escape route leading away from the absence of information. The eye is inherent in the structure of data extending in part from the authority and control of those initiating its collection.

As one goes up the V, both projection and articulation transform to accommodate the increasing complexity of the phenomena; but also while going up, there is greater likelihood of separation between the articulated reality (e.g. employee stress) and the instrumental projection (e.g. performance benchmarks) resulting in different levels of alienation. As one travels down the V, there is less detachment amid declining complexity, which improves the likelihood of inclusion. In this explanation, I am not suggesting that alienation or inclusion is directly affected by the level of sophistication in the data. The V can become narrower or wider depending on design. Complexity itself does not cause alienation between data and its phenomena; but there is greater need for design to take complexity into account due to the risk of alienation. It might be tempting to apply this model to social phenomena directly, but actually this is all meant for the data that is conveying phenomena. Data can be alienated from complex phenomena.

Rooted in Systems Theory

I realize that the universal data model doesn't resemble a standard input-process-output depiction of a system; but actually it is systemic. Projection provides the arrow for productive processes sometimes portrayed in a linear fashion: input, process, and output. Articulation represents what has often been described as "feedback." Consequently, the eye suggests that the entire system is a source of data. In another blog, I support this argument by covering three major data types that emerge in organizations: data related to projection resulting from metrics of criteria; data from routine operations as part of production processes; and data from articulation from the metrics of phenomena. The eye is rather like a conventional system viewed from a panoramic lens. The V provides an explanation of the association between proxies and phenomena under different circumstances.

Arguments Regarding Evidence

The simplification movement has mostly been about simplification of proxies and not the underlying phenomena. Data as a proxy is made simpler in relation to what it is meant to represent. Consider a common example: although employees might have many attributes and capabilities, in terms of data they are frequently reduced to hours worked. The number of hours worked is a metric intended to indicate the cost of labour. A data system might retain data focused on the most instrumental aspects of production thereby giving the illusion that an organization is only responsible for production. I feel that as society becomes more complex and the costs associated with data start to decline in relation to the amount of data collected, the obligation that an organization has to society will likely increase. This obligation will manifest itself in upgrades to data systems and not only this but improved methodologies surrounding the collection and handling of data. The model provides a framework to examine the extent to which facts could and should have been collected. Consider a highly complex problem such as infection rates in a hospital. The hospital might address this issue by collecting data on the number of hours lost through illness and sick days used. But this alone does not provide increased understanding of infections; some might argue therefore that such inadequate efforts represent a deliberate form of negligence apparent in the alienation of proxies.

Relation to Computer Coding

I have a habit of inventing terms to describe things particularly in relation to application development. Experience tells me that if I fail to invent a term and dwell on its meaning, the thing that I am attempting to describe or understand will fade away. I am about to make use of a number of terms that have meaning to me in my own projects; and I just want to explain that I claim no exclusive rights or authority over these terms. In this blog, I have described data as "proxy" for "phenomena." I make use of a functional prototype called Tendril to examine many different types of data. Using Tendril, there are special terms to describe particular types of proxies: events, contexts, systems, and domains. These proxies all represent types of data or more specifically the organization of aspects of phenomena that we customarily refer to as data.

The most basic type of proxy is an event. I believe that when most people use the term "data," they mean a representation quite close to a tangible aspect of phenomena. I make no attempt to confine the meaning of phenomena. There can be hundreds of events describing different aspects of the same underlying reality. I consider the search for events a fluid process that occurs mostly on a day-to-day level rather than during design. Another type of proxy - i.e. a different level of data - is called a context. Phenomena can "participate" in events. The "relevance" of events to different contexts is established on Tendril using something called a relevancy algorithm. I placed a little person on the illustration to show what I consider to be the comfort zone for most people in relation to data. I would say that people tend to focus on the relevance of events to different contexts.

The idea of "causality" takes on special meaning in relation to the above conceptualization. Consider the argument that poverty is associated with diabetes. Two apparently different domains are invoked: social sciences and medicine. Thus, the events pertaining to social phenomena are being associated with a medical context. The social phenomena might relate to unemployment, stress, poor nutrition, inaccessible education, violence, homelessness, inadequate medical care: any reasonable person even without doing research could logically infer adverse physiological and psychological consequences. Yet the connection might not be made I believe because the proxy seems illegitimate. How can a doctor prescribe treatment? If human tolerance for social conditions has eroded, one approach is to treat the problem as if it were internal to the human body. Yet the whole point of the assertion is to identify the importance of certain external determinants. Society has come to interpret diabetes purely as a medical condition internal to the body. This is an example of how data as a proxy can become alienated from complex underlying phenomena. We say that people are diseased, failing to take into account the destructive and unsustainable environment that people have learned to tolerate.

Since there is no ceiling or floor on the distribution of proxies in real life, the focus (on contexts and events) does not necessarily limit the data that people use but rather the way that they interpret it, not being machines. I feel that due to its abundance, people habitually choose their place in relation to data; and they train themselves to ignore data that falls outside their preferred scope. Moreover, the data that enters their scope becomes contextually predisposed. Consequently, it might seem unnecessary to make use of massive amounts of data and many different contexts (e.g. in relation to other interests). But this predisposition is like choice of attire. The fact that data might fall outside of scope does not negate its broader relevance; nor does its presence within scope mean that it is relevant only in a single way.

The Phantom Zone

It is not through personal strength or resources that a person can get a road fixed. One calls city hall. There is no need to build shelter. One rents an apartment or buys a house. In human society, there are systems in place to handle different forms of data. These systems operate in the background at times without our knowledge enabling our existence in human society and offering comfort. Our lack of awareness does not mean that the systems do not exist. Nor does our lack of appreciation for the data mean that the structure of the data is unimportant. In fact, I suggest that the data can enable or disable the extent to which these systems serve the public good. Similarly, the way in which organizations objectify and proxy phenomena can lead to survivorship outcomes. An organization can bring about its own deterministic conditions.

The universal data model - really just "introduced" in this blog - is meant to bring to light the power dynamics inherent in data: the tug-of-war between disablement and participation. I have discussed how an elaborate use of proxies can help to reduce alienation (of the data from its underlying phenomena) and accommodate greater levels of complexity to support future development. This blog was inspired to some extent by my own development projects where I actually make creative use of proxies to examine phenomena. However, this is research-by-proxy - to understand through the manipulation of data structures the existence of ghosts - entities that are not necessarily material in nature. I attempt to determine the embodiment of things that have no bodies - the material impacts of the non-material - the ubiquity of the imperceptible. It might seem that humans have overcome many hostile environments. While we have certainly learned to conquer the material world, there are many more hazards lurking in the chasms of our data awaiting discovery. However, before we can effectively detect passersby in the invisible and intangible world, we need to accept how our present use of data is optimized for quite the opposite. Our evolution as a species will depend on our ability to combat things beyond our natural senses.


Originally posted in Data Science Central by Mirko Krivanek

Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It is developed by Vladimir Agafonkin with a team of dedicated contributors. Weighing just about 33 KB of JS, it has all the features most developers ever need for online maps.

Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box, taking advantage of HTML5 and CSS3 on modern browsers while still being accessible on older ones. It can be extended with a huge number of plugins, has a beautiful, easy-to-use and well-documented API, and a simple, readable source code that is a joy to contribute to.

In this basic example, we create a map with tiles of our choice, add a marker and bind a popup with some text to it:

For an interactive map and source code in text format, click here.

Learn more with the quick start guide, check out other tutorials, or head straight to the API documentation.
If you have any questions, take a look at the FAQ first.


Data Scientists vs. Data Engineers

Guest blog post by Michael Walker

More and more frequently we see organizations make the mistake of mixing and confusing team roles on a data science or "big data" project, resulting in over-allocation of responsibilities assigned to data scientists. For example, data scientists are often tasked with the role of data engineer, leading to a misallocation of human capital. Here the data scientist wastes precious time and energy finding, organizing, cleaning, sorting and moving data. The solution is adding data engineers, among others, to the data science team.
Data scientists should be spending time and brainpower on applying data science and analytic results to critical business issues - helping an organization turn data into information - information into knowledge and insights - and valuable, actionable insights into better decision making and game changing strategies.
Data engineers are the designers, builders and managers of the information or "big data" infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it. And they make sure those systems are performing smoothly.
Data science is a team sport. There are many different team roles, including:
Business architects;
Data architects;
Data visualizers;
Data change agents.
Moreover, data scientists and data engineers are part of a bigger organizational team including business and IT leaders, middle management and front-line employees. The goal is to leverage both internal and external data - as well as structured and unstructured data - to gain competitive advantage and make better decisions. To reach this goal an organization needs to form a data science team with clear roles.

Join us for the latest DSC Webinar on March 24, 2015
Space is limited.
Reserve your Webinar seat now
Please join us March 24, 2015 at 9am PST for our latest DSC's Webinar Series: Data Lakes, Reservoirs, and Swamps: A Data Science and Engineering Perspective sponsored by Think Big, a Teradata Company.

In the fast paced and ever changing landscape of Hadoop based data lakes, there tends to be varying definitions of what constitutes a data lake and how they should be used for business benefit—especially in leveraging data science.

In this webinar, Think Big will share their perspective on Hadoop data lakes from their many consulting engagements. Drawing from their experience across multiple industries, Daniel Eklund and Dan Mallinger will share stories of data lake challenges and successes. You will also learn how data scientists are leveraging Hadoop data lakes to discover, document, and enable new business insights.

Finally, the presenters will discuss skills needed for data science success and how to grow your skills if you want to become a data scientist. 

Daniel Eklund, Data Science Practice Manager, Think Big
Dan Mallinger, Engineering Practice Manager, Think Big

Hosted by: Tim Matteson, Cofounder, Data Science Central
Title:  Data Lakes, Reservoirs, and Swamps: A Data Science and Engineering Perspective
Date:  March 24, 2015
Time:  9:00 AM - 10:00 AM PT
After registering you will receive a confirmation email containing information about joining the Webinar.

Guest blog post by Deepak Kumar

Before going into the details of what big data is, let’s take a moment to look at the slides below by Hewlett-Packard.

Going through these slides, you must have realized how much data we are generating every second of every minute, of every hour, of every day, of every month, of every year.

A phrase that is really popular nowadays, and that rings true: more than 90% of the world’s data has been generated in the last two years alone.

And it keeps growing exponentially with the increasing usage of devices and digitization across the globe.

So what is the problem with these huge amounts of data?

Earlier, when common database management systems were built, they were designed with a certain scale in mind. Organizations themselves were not prepared for the scale of data we are producing nowadays.

Since the requirements of these organizations have increased over time, they have to rethink and reinvest in their infrastructure. The cost of the resources involved in scaling up that infrastructure increases by an exponential factor.

Further, there is a limit to how far factors such as machine size, CPU and RAM can be scaled up. These traditional systems would not be able to support the scale required by most companies.

Why can’t traditional data management tools and technologies handle these numbers?

Incoming data can be categorized with respect to VOLUME, VELOCITY and VARIETY. And the problem starts here.

  • Volume: Today organizations like NASA, Facebook, Google and many others are producing enormous amounts of data per day. This data needs to be stored, analyzed and processed in order to understand the market, trends, customers and their problems, along with the solutions.
  • Variety: We are generating data from different sources in different forms, like videos, text, images, emails, binaries and lots more, and most of this data is unstructured or semi-structured. The traditional data systems that we know all work on structured data, so it is quite difficult for those systems to handle the quality and quantity of data we are producing nowadays.
  • Velocity: Take the example of a simple query where you want to fetch the name of a person from millions of records. As long as the data is in the millions or billions of records, we are fine with the traditional systems, but beyond that even the simplest query takes a long time to execute. And here we are talking about the analysis and processing of data in the range of hundreds and thousands of petabytes, exabytes and more. To analyze it, we have to develop a system that processes the data at much higher speed and with high scalability.

These three characteristics, volume, velocity and variety, popularly known as the 3 Vs, are addressed by the solutions that BigData provides. So before going into the details of how BigData handles these complex problems, let’s try to create a short definition of BigData.

What is Big Data?

A dataset whose volume, velocity, variety and complexity are beyond the ability of commonly used tools to capture, process, store, manage and analyze can be termed BIGDATA.

How does BigData handle these complex situations?

Most BigData tools and framework architectures are built with the following characteristics in mind:
  • Data distribution: The large data set is split into chunks or smaller blocks and distributed over N nodes or machines. The data is thus spread across several nodes and becomes ready for parallel processing. In the Big Data world this kind of data distribution is done with the help of a Distributed File System, or DFS.
  • Parallel processing: The distributed data gets the power of the N servers and machines on which it resides, which work in parallel on processing and analysis. After processing, the data gets merged into the final required result. This process is known as MapReduce, adopted from Google’s MapReduce research work.
  • Fault tolerance: Generally we keep more than one replica of each block (or chunk) of data. Hence even if one of the servers or machines is completely down, we can get our data from a different machine or data center. We might think that replicating data costs a lot of space, and here the fourth point comes to the rescue.
  • Use of commodity hardware: Most BigData tools and frameworks run on commodity hardware. So we don’t need specialized hardware with special RAID as the data container. This reduces the cost of the total infrastructure.
  • Flexibility and scalability: It is quite easy to add more rack space to the cluster as the demand for space increases, and the way these architectures are designed fits this scenario very well.

    Well, these are just a few examples from the BigData reservoir of the complex problems that are being solved using BigData solutions.

    Again, this article covers only a glass of water from the entire ocean. Get started and take a deep dive into the BigData world or, if I may say so, BigData Planet :)

    The article First appeared on


    MapReduce / Map Reduction Strategies Using C#

    Guest blog post by Jake Drew

    A Brief History of Map Reduction

    Map and Reduce functions can be traced all the way back to functional programming languages such as Haskell and its Polymorphic Map function known as fmap.  Even before fmap there was the Haskell map command used primarily for processing against lists.  I am sure there are experts out there on the very long history of MapReduce who could provide all sorts of interesting information on that topic and the influences of both Map and Reduce functions in programming.  However, the purpose of this article is to discuss effective strategies for performing highly parallel map reductions in the C# programming language.  There are many large-scale packages out there for performing map reductions of just about any size.  Google's MapReduce and Apache's Hadoop platforms are two of the most well known.  However, there are many competitors in this space.  Here are just a few references.  MapReduce concepts are claimed to be around 25 years old by some.  Strangely enough, the patent for MapReduce is currently held by Google; it was filed in 2004 and granted in 2010.  Google says that MapReduce was developed for "processing large amounts of raw data, for example, crawled documents or web request logs".

    Understanding Map Reduction

    In more complex forms, map reduction jobs are broken into individual, independent units of work and spread across many servers, typically commodity hardware units, in order to transform a very large and complicated processing task into something that is much less complicated and easily managed by many computers connected together in a cluster.  In layman's terms, when the task at hand is too big for one person, then a crew needs to be called in to complete the work.  Typically, a map reduction "crew" would consist of one or more multi-processor nodes (computers) and some type of master node or program that manages the effort of dividing up the work between nodes (mapping) and the aggregation of the final results across all the worker nodes (reduction).  The master node or program could be considered the map reduction crew's foreman.  In actuality, this explanation is an over-simplification of most large map reduction systems.  In these larger systems, many additional indexing, i/o, and other data management layers could be required depending on individual project requirements.  However, the benefits of map reduction can also be realized on a single multi-processor computer for smaller projects.

    The primary benefits of any map reduction system come from dividing up work across many processors and keeping as much data in memory as possible during processing.  Elimination of disk i/o (reading data from and writing data to disk) represents the greatest opportunity for performance gains in most typical systems.  Commodity hardware machines each provide additional processors and memory for data processing when they are used together in a map reduction cluster.  When a cluster is deployed however, additional programming complexity is introduced.  Input data must be divided up (mapped) across the cluster's nodes (computers) in an equal manner that still produces accurate results and easily lends itself to the aggregation of the final results (reduction).  The mapping of input data to specific cluster nodes is in addition to the mapping of individual units of input data work to individual processors within a single node.  Reduction across multiple cluster nodes also requires additional programming complexity.  In all map reduction systems, some form of parallel processing must be deployed when multiple processors are used.  Since parallel processing is always involved during map reduction, thread safety is a primary concern for any system.  Input data must be divided up into individual independent units of work that can be processed by any worker thread at any time during the various stages of both mapping and reduction.  Sometimes this requires substantial thought during the design stages since the input data is not necessarily processed in a linear fashion.  When processing text data for instance, the last sentence of a document could be processed before the first sentence of a document since multiple worker threads simultaneously work on all parts of the input data.

    The following figure illustrates a map reduction process running on a single multi-processor computer.  During this process, multiple worker threads are simultaneously mapped to various portions of the input data placing the mapping results into a centralized location for further downstream processing by other reduction worker threads.  Since this process occurs on a single machine, mapping is less complex because the input data is only divided between worker threads and processors that all reside on the same computer and typically within the same data store.

    Map Reduction On a Single Computer

    When multiple computers are used in a map reduction cluster, additional complexity is introduced into the process.  Input data must be divided between each node (computer) within the cluster by a master node or program during processing.  In addition, reduction results are typically divided across nodes and indexed in some fashion so mapping results can quickly be routed to the correct reduction node during processing.  The need for clustering typically occurs when input data, mapping results, reduction results, or all three are too large to fit into the memory of a single computer.  Once any part of the map reduction process requires disk i/o (reading data from and writing data to disk), a huge performance hit occurs.  It is very important to stress how severe this performance hit is: disk access is orders of magnitude slower than memory access.  If you are still skeptical, please stop reading and take a quick lesson from the famous Admiral Grace Hopper here.  Obviously, some form of disk i/o is required to permanently save results from any program.  In a typical map reduction system however, disk i/o should be minimized or totally removed from all mapping and reduction processing and used only to persist or save the final results to disk when needed.

    The following figure illustrates a map reduction cluster running on four machines.  In this scenario, one master node is used to divide up (map) input data between three data processing nodes for eventual reduction.  One common challenge when designing clusters is that not all of the reduction data can reside in memory on one physical machine.  In an example map reduction system that processes text data as input and counts unique words, all words beginning with A-I might be stored in node 1, J-R in node 2, and S-Z in node 3.  This means that additional routing logic must be used to get each word to the correct node for a final reduction based on each word's first letter.
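The routing rule in this example can be sketched in a few lines of C#.  This is only an illustration: the names `WordRouter` and `NodeFor` are hypothetical and do not come from the original article, and the sketch assumes every word starts with a letter from A to Z.

```csharp
using System;

public static class WordRouter
{
    // Hypothetical routing rule taken from the example in the text:
    // A-I -> node 1, J-R -> node 2, S-Z -> node 3.
    public static int NodeFor(string word)
    {
        if (string.IsNullOrEmpty(word))
            throw new ArgumentException("word must not be empty", nameof(word));

        char first = char.ToUpperInvariant(word[0]);
        if (first < 'A' || first > 'Z')
            throw new ArgumentException("word must start with a letter A-Z", nameof(word));

        if (first <= 'I') return 1;
        if (first <= 'R') return 2;
        return 3;
    }
}
```

A mapping worker would call `WordRouter.NodeFor(word)` to decide which reduction node receives each word.  A fixed alphabetical split like this is easy to reason about, but it can load nodes unevenly because some first letters are far more common than others.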

    Map Reduction Cluster

    When reduction or mapping results are located on multiple clustered machines, additional programming logic must be added to access and aggregate results from each machine as needed.  In addition, units of work must be allocated (mapped) to these machines in a manner that does not impact the final results.  During the identification of phrases for instance, one sentence should not be split across multiple nodes since this could cause phrases to be split across nodes and subsequently missed during phrase identification as a result.

    Map reduction systems can range in size from one computer to literally thousands of clustered computers in some enterprise level processes.  The C# programming language provides a suite of thread-safe objects that can easily and quickly be used to create map reduction style programs.  The following sections describe some of these objects and show examples of how to implement robust parallel map reduction processes using them.

    Understanding Map Reduction Using C#

    The C# programming language provides many features and classes that can be used to successfully perform map reduction processing as described in the sections above.  In fact, certain forms of parallel map reduction in C# can be performed by individuals having a minimal knowledge of thread pools or hardware specific thread management practices.  Other lower level tools however, require a great knowledge of both.  Regardless of the tools chosen, great care must be taken to avoid race conditions when parallel programs are deployed within a map reduction system.  This means that the designer must be very familiar with best demonstrated practices for both locking and multi-threaded programming when creating the portions of mapping and reduction programs that will be executed in parallel.  For those who need assistance in this area, a detailed article on threading in C# can be located here.

    One of the most important things to remember is that just because a particular C# object is considered "thread safe", the commands used to calculate a value that is passed to the "thread safe" object or the commands passed within a delegate to the "thread safe" object are not necessarily "thread safe" themselves.  If a particular variable's scope extends outside the critical section of parallel execution, then some form of locking strategy must be deployed during updates to avoid race conditions.  One of the easiest ways to test for race conditions or threading errors in a map reduction program is to simply execute the program using the same set of input data multiple times.  Typically, the program's results will vary when a race condition or threading error is present.  However, the error might not present itself after only a few executions.  It is important to exhaustively test the program using many different sets of test data as input, and then execute the program many times against each input data set checking for output data variations each time.
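As a small illustration of these two points, the sketch below updates a variable declared outside the parallel loop in two safe ways (an atomic `Interlocked.Add` and a `lock`), and adds a helper that reruns a computation and checks that the result never changes.  The class and method names are hypothetical and only serve the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SharedTotal
{
    // Sums 1..n in parallel; the shared total is updated atomically.
    public static long SumWithInterlocked(int n)
    {
        long total = 0;
        Parallel.For(1, n + 1, i => Interlocked.Add(ref total, i));
        return total;
    }

    // Same computation, protecting the shared total with a lock instead.
    public static long SumWithLock(int n)
    {
        long total = 0;
        object gate = new object();
        Parallel.For(1, n + 1, i =>
        {
            lock (gate) { total += i; }
        });
        return total;
    }

    // Reruns a computation and reports whether every run produced the same
    // result.  Varying results suggest a race condition; identical results
    // do not prove that none exists.
    public static bool IsRepeatable(Func<long> run, int repetitions)
    {
        long first = run();
        for (int i = 1; i < repetitions; i++)
        {
            if (run() != first) return false;
        }
        return true;
    }
}
```

Replacing either update with a bare `total += i` would compile, but it would be exactly the kind of unprotected update to a shared variable that the paragraph above warns about.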

    The specific C# classes described later in this document do not represent the only alternatives for performing parallel map reductions in the language.  The selected classes merely represent a few of the viable techniques worth consideration.  For instance, one available approach that is not covered in this document is PLINQ which is C#'s parallel implementation of LINQ to Objects.  Numerous other C# tools are available as well.  It is important to mention that the map reduction patterns described above are very similar to, and sometimes referred to as, producer / consumer pipelines.  Many great articles can be located on the internet when producer / consumer pipelines and C# are used together as search terms.

    The Map Reduction Nuts and Bolts

    Using the pattern described earlier, several basic C# components can be repeatedly used (and sometimes extended) to create a map reduction system of virtually any size.  The following high level C# components and classes will be used as the "nuts and bolts" of this particular system:

    • Parallel.For and Parallel.ForEach -  These two members of the System.Threading.Tasks namespace can be used to quickly create mapping functions that execute in parallel.  The commands executed within these "For" blocks must be thread safe.  Parallel mapping functions can be used to break apart input data into mapping results that are placed in a Blocking Collection for further downstream processing.
    • Blocking Collections - Blocking Collections are members of the System.Collections.Concurrent namespace and provide a centralized, thread safe location for multiple threads to add and remove objects during processing.  These collections can be implemented using concurrent bag (not ordered), stack (LIFO), or queue (FIFO) collections.  Thread safe versions of each collection are provided within the System.Collections.Concurrent namespace.  Once the Blocking Collection has been wrapped around the appropriate bag, stack, or queue, it will manage timing differences between various producers and consumers using the collection.  When the collection is empty it can block until new items are added or stop processing once all items have been processed and the collection is marked as complete.
    • Concurrent Dictionary - The thread safe Concurrent Dictionary will act as a key-value pair repository for the final reduction results in our process.  Although a database could be used for this part of the process, the Concurrent Dictionary is an ideal reduction repository candidate for several basic reasons that are explained in detail within the reduction section and examples below.  This dictionary is also a member of the System.Collections.Concurrent namespace.


    Parallel Mapping Using C#

    One of the easiest ways to implement a parallel map function in C# is by using the Parallel class.  Specifically Parallel.For or Parallel.ForEach can be used to quickly map (in parallel) independent units of work into a centralized, thread-safe collection (we will get to thread-safe collections in a second) for further downstream processing.  In addition, the class can perform parallel mappings with no lower level thread management programming required.  The Parallel class is hardware intelligent and scales threads based on the current platform it is executing on.  However, it also has the MaxDegreeOfParallelism option for those who want more control over how many threads a particular Parallel class process is using.  The primary purpose of the map function is to break apart input data producing one or more key-value pairs that require reduction processing by other downstream worker threads.  While mapping worker threads are producing these key-value pairs as output, reduction worker threads are simultaneously consuming (reducing) them.  Depending on the size of the map reduction process, other intermediary processes such as a partitioning process might occur between a mapping and its final reduction.
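A minimal sketch of such a parallel mapping, assuming the input is an array of text lines and the mapping results are plain words, could look like this.  The names `ParallelMapSketch` and `MapLines` are hypothetical; `ParallelOptions.MaxDegreeOfParallelism` is the option mentioned above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ParallelMapSketch
{
    // Maps each input line to its words in parallel and collects the
    // words in a thread-safe bag.  maxThreads caps the worker count.
    public static ConcurrentBag<string> MapLines(string[] lines, int maxThreads)
    {
        var words = new ConcurrentBag<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        Parallel.ForEach(lines, options, line =>
        {
            // Each line is an independent unit of work.
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                words.Add(word);
            }
        });

        return words;
    }
}
```

Leaving out the `options` argument lets the Parallel class choose the number of threads itself.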

    Yield Return Blocks

    In a C# example application that counts unique words from large amounts of input text, one or more stages of mapping could be used to produce the final mapping key-value pair data output.  In order for both mapping and reduction worker threads to work in tandem, all phases of the mapping process must be executed asynchronously.  This can be accomplished by using either background threads, some form of Yield Return processing, or both.

    The following code demonstrates the use of Yield Return processing to break up input text into blocks of 250 characters or less using the space character as a word delimiter:
    Yield Return Mapping
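The original listing is not reproduced here.  The following is a minimal sketch of the technique rather than the author's code: it assumes the input is a single string, breaks each block at the last space that fits within the limit, and cuts a word only when that single word is longer than the limit.  The names `TextBlocks` and `ProduceWordBlocks` are chosen to echo the text and are otherwise hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TextBlocks
{
    // Lazily yields blocks of at most maxLength characters.  Each block is
    // handed to the caller as soon as it is found, before the rest of the
    // text has been scanned.
    public static IEnumerable<string> ProduceWordBlocks(string text, int maxLength = 250)
    {
        if (maxLength < 1) throw new ArgumentOutOfRangeException(nameof(maxLength));

        int start = 0;
        while (true)
        {
            // Skip the space delimiters between blocks.
            while (start < text.Length && text[start] == ' ') start++;
            if (start >= text.Length) yield break;

            int length = Math.Min(maxLength, text.Length - start);
            if (start + length < text.Length)
            {
                // Search backwards from just past the window for a space,
                // so the block ends on a word boundary when possible.
                int lastSpace = text.LastIndexOf(' ', start + length, length + 1);
                if (lastSpace > start) length = lastSpace - start;
            }

            yield return text.Substring(start, length);
            start += length;
        }
    }
}
```

For example, `ProduceWordBlocks("the quick brown fox", 10)` yields the two blocks "the quick" and "brown fox".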
    As each 250 character (or less) block of text is identified, the "yield return" command causes the process to "yield" and immediately return the identified text block to the calling process.  Without yield return, all identified blocks of text would be returned at one time when the entire process was completed.  This would also mean that other downstream worker threads could not begin work on mapping the blocks of text into individual words until all text block identification was complete.  The delay would slow down the entire process greatly.  A yield return method for producing text blocks is not necessarily required for counting unique words in a map reduction system.  However, this code will be used to demonstrate how yield return can be used and subsequently called using Parallel.ForEach to complete the mapping of text to individual words.

    Blocking Collections

    When all stages of mapping are completed, the final results are added to a mapping results Blocking Collection.  Using our C# example application that counts unique words from large amounts of input text, a mapping results Blocking Collection is created called "wordChunks".  This particular Blocking Collection uses a ConcurrentBag as its base collection.  Since words are added to and removed from the ConcurrentBag in no particular order, using a "bag" yields performance gains over a "stack" or "queue" which must internally keep track of processing order.  The following code shows how the "wordChunks" Blocking Collection is created:

    Blocking Collection
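The original listing is not reproduced here.  As a minimal sketch, a Blocking Collection backed by a ConcurrentBag can be created as shown below.  The wrapper class `MapResults` and its method name are hypothetical; only the collection name "wordChunks" comes from the text.

```csharp
using System.Collections.Concurrent;

public static class MapResults
{
    // Creates the unordered, thread-safe store for mapping results.
    // Each entry is one word occurrence; duplicates are allowed.
    public static BlockingCollection<string> CreateWordChunks()
    {
        return new BlockingCollection<string>(new ConcurrentBag<string>());
    }
}
```

Passing a `ConcurrentQueue<string>` or `ConcurrentStack<string>` to the same constructor would give FIFO or LIFO ordering instead.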

    Technically a mapping function's output should be a key-value pair.  The key-value pair typically contains some form of key and associated numeric value that are both used during the reduction process.  In many cases, the key will only be contained one time within the final reduction results.  The values for each duplicate key encountered during mapping will be "reduced" by either summation or another mathematical calculation that results in one final reduction number value that is representative of the single key contained in the final key-value pair reduction results.  In our example word counting map reduction, a key-value pair is not even required for the mapping stage. The wordChunks bag can contain any number of words (given your current memory constraint).  These words can also be duplicates.  Since we are only counting the occurrence of words, each word in our bag is considered to have a default frequency of 1.  However, the ConcurrentBag could have just as easily been created as a collection of key-value pairs (ConcurrentBag<KeyValuePair<string,int>>), if needed.

    Parallel.ForEach Processing

    The next program demonstrates a Parallel.ForEach mapping function using the Yield Return method created before.  This process uses multiple worker threads to identify and clean words from the blocks of input text provided by the Yield Return method.   The Parallel.ForEach mapping process begins as soon as the first block of text is identified since Yield Return is being used.
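The original listing is not reproduced here.  The sketch below shows one way such a mapping function could be written; it is not the author's program.  It assumes the text blocks arrive as an `IEnumerable<string>` (for instance from a yield return method) and that a word is "clean" once punctuation, whitespace and control characters are removed.  The names `WordMapper`, `MapWords` and `Clean` are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class WordMapper
{
    // Splits each text block into words in parallel, cleans every word and
    // adds it to the shared collection, then marks the collection complete.
    public static void MapWords(IEnumerable<string> textBlocks, BlockingCollection<string> wordChunks)
    {
        Parallel.ForEach(textBlocks, block =>
        {
            foreach (string rawWord in block.Split(' '))
            {
                string word = Clean(rawWord);
                if (word.Length > 0)
                    wordChunks.Add(word);
            }
        });

        // Tell downstream consumers that no more words will arrive.
        wordChunks.CompleteAdding();
    }

    // Drops punctuation, whitespace and control characters from a word.
    private static string Clean(string rawWord)
    {
        var cleaned = new StringBuilder(rawWord.Length);
        foreach (char c in rawWord)
        {
            if (!char.IsPunctuation(c) && !char.IsWhiteSpace(c) && !char.IsControl(c))
                cleaned.Append(c);
        }
        return cleaned.ToString();
    }
}
```

Because the blocks are consumed as an enumerable, the worker threads can start on the first block as soon as the producer yields it.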


    The program above uses Parallel.ForEach to call the text block production program "produceWordBlocks".  This program immediately yield returns blocks of text of 250 characters or less, delimited by spaces, as they are identified.  Parallel.ForEach worker threads simultaneously process these text blocks identifying individual words which are also delimited by spaces.  The program also removes any whitespace, punctuation, or control characters located within the words.  Obviously, this is an example program and many other character filtering or inclusion enhancements could be made depending on your particular requirements.  In an alternative implementation, the Yield Return method could be removed entirely and its functionality included into a single Parallel.ForEach mapping program.  This may or may not produce better performance results depending on your code, the input data, and the requirements of your system.

    Once all individual words have been identified from all word blocks, the wordChunks Blocking Collection is notified that no more words will be added to the collection.  This notification is very important for any downstream worker threads that are simultaneously reducing / removing words from the collection.  If the Blocking Collection becomes empty during processing, collection consumers will continue to "block" or "wait" until either the CompleteAdding() method is called or additional items are added into the collection.  The Blocking Collection is able to successfully manage differences in the object production and consumption speeds of various worker threads using this approach.  In addition, a Blocking Collection Bounding Capacity can be added to ensure that no more than a maximum number of objects can be added to the collection at any given time.
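The blocking behavior described here can be seen in a small producer / consumer sketch.  It is an illustration under simple assumptions (one producer task, one consumer loop, integer items), and the names `BlockingDemo` and `ProduceAndConsume` are hypothetical.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class BlockingDemo
{
    // Produces itemCount items into a bounded collection while a consumer
    // drains it, and returns how many items the consumer received.
    public static int ProduceAndConsume(int itemCount, int boundedCapacity)
    {
        using (var items = new BlockingCollection<int>(new ConcurrentBag<int>(), boundedCapacity))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    for (int i = 0; i < itemCount; i++)
                        items.Add(i);        // blocks while the collection is full
                }
                finally
                {
                    items.CompleteAdding();  // lets the consumer loop finish
                }
            });

            int consumed = 0;
            // Waits while the collection is empty; the loop ends only when
            // the collection is empty and has been marked as complete.
            foreach (int item in items.GetConsumingEnumerable())
                consumed++;

            producer.Wait();
            return consumed;
        }
    }
}
```

If `CompleteAdding()` were never called, the consumer loop would wait forever once the collection ran empty, which is why the call sits in a `finally` block.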

    Parallel Reduction Using C#

    The process of parallel reduction is very similar to mapping with regards to the use of Parallel.ForEach to facilitate its process in our example application.  Where reduction differs however, is in its use of one or more data storage components that provide very fast access to a particular set of reduction key-value pairs.  When all data storage components are combined, the reduction key-value pairs eventually become the final output for map reduction once all input data has been processed. In a system where multiple map reduction computers are used together in a map reduction cluster, multiple data storage components could be used to store a very large number of key-value pairs across several computers.  In the example word counting map reduction process, the reduction key-value pairs consist of a unique list of words which act as keys for the reduction key-value pairs.  Each word key contains a value that represents the total frequency of occurrences for that particular word within the input text.  The final result becomes a mapping of all words contained in the input data to a reduced list of unique words and the frequency that each unique word occurs within the input data.

    The Concurrent Dictionary

    The C# ConcurrentDictionary is basically a thread safe hash table that is well suited for acting as the data store component within the example application.  The ConcurrentDictionary holds key-value pairs in memory until there is either no room left in memory or the dictionary already contains the maximum number of elements.  Since the ConcurrentDictionary takes a 32-bit hash of each key, the maximum number of elements is the same as the Int32.MaxValue or 2,147,483,647.  Most computers and processes will run out of memory prior to triggering the ConcurrentDictionary's overflow exception due to exceeding the maximum number of elements.  In situations where very large amounts of data are being map reduced, the map reduction data components (in this case Concurrent Dictionaries) can be sharded across several computers within a cluster.  However, sharding requires a slightly more complex map reduction process since key partitioning logic must be developed to manage the associated sharding challenges such as which nodes (computers) will contain which keys, node load balancing, node additions, node removals, and node failures.
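One very simple form of key partitioning logic is to hash each key and take the remainder by the number of shards.  The sketch below assumes a fixed shard count and uses an FNV-1a style hash over the key's UTF-16 characters, because `string.GetHashCode` is not guaranteed to return the same value in different processes.  The names `ShardRouter` and `ShardFor` are hypothetical, and the sketch does not address load balancing, node additions, node removals or node failures.

```csharp
using System;

public static class ShardRouter
{
    // Returns a shard index in the range [0, shardCount) that is the same
    // for a given key on every machine and in every process.
    public static int ShardFor(string key, int shardCount)
    {
        if (key == null) throw new ArgumentNullException(nameof(key));
        if (shardCount < 1) throw new ArgumentOutOfRangeException(nameof(shardCount));

        unchecked
        {
            uint hash = 2166136261;      // FNV offset basis
            foreach (char c in key)
            {
                hash ^= c;               // mix in the next UTF-16 code unit
                hash *= 16777619;        // FNV prime
            }
            return (int)(hash % (uint)shardCount);
        }
    }
}
```

With plain modulo partitioning, changing the shard count moves most keys to a different shard, so a cluster that expects nodes to come and go would need a more elaborate scheme.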

    Some readers may be asking at this point why a database is not used, and this is a very valid question.  The short answer is that almost any data management solution could be used: relational databases, NoSQL databases, in-memory databases, or some form of key-value datastore.  The most important thing to consider, however, is that most relational databases rely heavily on I/O to complete the work.  Any form of I/O during map reduction processing will most likely defeat the purpose of map reduction altogether.  So whatever data management solution is chosen, make sure that your data is stored in memory.  Even most relational databases now offer some form of in-memory tables and clustering abilities.

    Parallel.ForEach Reduction Processing

    During reduction processing Parallel.ForEach is used once again to create multiple worker threads that simultaneously consume mapping results as they are being created and added to the "wordChunks" Blocking Collection.  Worker threads reduce all mapping results in the example application by looking up each mapped word within the reduction data store component (one or more ConcurrentDictionaries in this case).  If a word already exists in the data store, then the mapping word is reduced by incrementing the existing reduction word's key-value pair value by 1.  Otherwise, the mapping word is reduced by creating a new key-value pair entry in the reduction data store with a starting value of 1.  The following code demonstrates how this process works:

    [Code listing: Reduction Parallel.ForEach]

    A ConcurrentDictionary was created to hold our final reduction processing results, and its AddOrUpdate() method is used during Parallel.ForEach processing.  It is important to mention once again that delegates provided to a thread safe object are not necessarily thread safe themselves.  In this case, the AddOrUpdate method accepts a delegate that supplies the update to execute when a key is already present in the wordStore dictionary.  To ensure that the update is performed in a thread safe manner, Interlocked.Increment is used to increment existing values by 1 as an atomic operation each time they are encountered.  The Parallel.ForEach process executes against the wordChunks Blocking Collection, removing mapping results (words) from the collection until all words have been processed.  The Blocking Collection will also cause the Parallel.ForEach reduction process to "block", or wait for additional words, when the collection becomes empty and the CompleteAdding() method has not yet been called by the producer (the mapWords method in our example program).  Using the Blocking Collection's GetConsumingEnumerable() method in the Parallel.ForEach loop is one way to trigger the blocking behavior.
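
    A minimal sketch of the reduction step described above might look like the following.  The names wordChunks and wordStore come from the text; the WordReducer class and ReduceWords method are hypothetical names chosen for illustration.  The sketch uses a plain oldValue + 1 in the update delegate instead of Interlocked.Increment, on the assumption that AddOrUpdate re-runs the delegate when another thread has changed the value in the meantime.

    ```csharp
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public static class WordReducer
    {
        // Consumes mapped words until the producer calls CompleteAdding()
        // and the collection is empty, then returns the word counts.
        public static ConcurrentDictionary<string, int> ReduceWords(BlockingCollection<string> wordChunks)
        {
            var wordStore = new ConcurrentDictionary<string, int>();

            // GetConsumingEnumerable() removes items as they are read and
            // blocks while the collection is empty but not yet completed.
            Parallel.ForEach(wordChunks.GetConsumingEnumerable(), word =>
            {
                // New word: store 1. Existing word: add 1 to its count.
                wordStore.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);
            });

            return wordStore;
        }
    }
    ```

    If the producer never calls CompleteAdding(), this method never returns, so the mapping side is responsible for signalling completion.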

    The C# Map Reduction Summary

    The previous figure of a map reduction process running on a single multi-processor computer can now be updated to reflect the C# objects and classes discussed in our example application.  Using only a few key C# components, parallel map reduction can be performed with minimal effort when compared to creating parallel map reduction processes in other languages.  The figure below represents a map reduction process written in C# and running on a single multi-processor computer:

    [Figure: MapReduce In C#]

    Once all components of the map reduction process have been created, a small mapReduce method is written to bring the entire process together.  One of the most important parts of the mapReduce method is to create a background process for execution of the mapping function.  While the mapping function populates the Blocking Collection with mapping results in the background, the reduction function simultaneously removes and reduces the mapping results into the reduction data store.  Since all of this processing occurs in memory, the mapReduce method is extremely fast.  The following mapReduce program ties together the entire map reduction process for our example application:
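
    As a rough sketch of how such a mapReduce method could be wired together, the example below runs mapping on a background task while reduction consumes the results.  The WordCountMapReduce class is a hypothetical name, and its MapWords method is only a simplified stand-in (whitespace splitting, lower-casing) for the mapping function discussed earlier in the article, not the author's implementation.

    ```csharp
    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public static class WordCountMapReduce
    {
        // Simplified stand-in for the mapping step: emits one lower-cased
        // word at a time, then signals that no more words will arrive.
        private static void MapWords(string text, BlockingCollection<string> wordChunks)
        {
            try
            {
                var separators = new[] { ' ', '\t', '\r', '\n' };
                foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    wordChunks.Add(word.ToLowerInvariant());
                }
            }
            finally
            {
                // Always complete, otherwise the reducer would wait forever.
                wordChunks.CompleteAdding();
            }
        }

        public static ConcurrentDictionary<string, int> MapReduce(string text)
        {
            var wordStore = new ConcurrentDictionary<string, int>();
            using (var wordChunks = new BlockingCollection<string>())
            {
                // Mapping runs in the background...
                Task mapping = Task.Run(() => MapWords(text, wordChunks));

                // ...while reduction consumes results as they appear.
                Parallel.ForEach(wordChunks.GetConsumingEnumerable(), word =>
                {
                    wordStore.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);
                });

                // Surface any exception thrown by the mapping task.
                mapping.Wait();
            }
            return wordStore;
        }
    }
    ```

    Calling WordCountMapReduce.MapReduce("the quick the lazy the") would yield a dictionary with three keys, where "the" maps to 3.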

    [Code listing: MapReduce Method]

    Printing the results of your map reduction is as simple as printing the contents of your wordStore dictionary:

    [Code listing: Display MapReduce Results]
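
    One possible way to print the dictionary is sketched below.  The WordStorePrinter class and Format method are hypothetical names, and sorting by descending count (then alphabetically) is an added assumption to make the output stable; the text only says that the contents are printed.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    public static class WordStorePrinter
    {
        // Builds one "word: count" line per entry, most frequent words first.
        // A ConcurrentDictionary<string, int> can be passed in directly.
        public static string Format(IEnumerable<KeyValuePair<string, int>> wordStore)
        {
            var output = new StringBuilder();
            var ordered = wordStore
                .OrderByDescending(pair => pair.Value)
                .ThenBy(pair => pair.Key, StringComparer.Ordinal);

            foreach (KeyValuePair<string, int> pair in ordered)
            {
                output.AppendLine($"{pair.Key}: {pair.Value}");
            }
            return output.ToString();
        }
    }
    ```

    With a populated wordStore, Console.Write(WordStorePrinter.Format(wordStore)) writes the results to the console.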


    Map reduction processing provides an innovative approach to the rapid consumption of very large and complex data processing tasks.  The C# language is also very well suited for map reduction processing.  This type of processing is described by some in the C# community as a more complex form of producer / consumer pipelines.  Typically, the largest potential constraint for any map reduction system is I/O.  During map reduction processing, I/O should be avoided wherever possible and used only for saving final map reduction results.  When the processes described in this document are combined with data store sharding and map reduction partitioning processes, data processing tasks of almost any size can be accommodated.  If you actually read this far, I'm impressed.  Thank you!



    Analysts use tools to perform various types of spatial analysis such as:

    • Where is the cheapest home insurance within 50 miles of Dallas?
    • What locations are most amenable given the income, population, and other demographics of a place within a 25-mile radius of New York City?
    • What zip codes have the highest crime rates within 25 miles of Chicago?

    However, when we try to convert this analysis into a real-life operational system with high traffic, low latency, and little room for error, most tools that perform well offline don’t live up to expectations.

    Amazon Web Services (AWS), the cloud computing platform from Amazon, is one of the leaders in providing cloud-based hosting solutions. Over the years, it has steadily added software services that take advantage of its hardware platform. One such search service is “AWS Cloud Search”, which itself powers Amazon’s high-performance e-commerce search. The same technology is now available to other customers for their own search needs.

    The case study below shows how AWS Cloud Search can be used to perform geo-searching and spatial analytics to find the cheapest home insurance within a given area. The heat map below shows the varying home property values (and with them the corresponding home insurance prices) across the nation.


    The sample data underlying the heat map above is given in the table below.

    The challenge, from the perspective of a real-time operational geo-analytical spatial search, is to find the cheapest home insurance within a 200-mile radius of San Francisco.

    To do this with AWS Cloud Search, we first need to set up the search domain, which involves the following broad activities:

    • Search domain creation and configuration
    • Data upload to your search domain. The indexed fields in our data will include home insurance rates, home prices, zip code, and latitude/longitude details.
    • Data search within AWS Cloud Search and control of search results

    For calculating distances between places, we use a cosine-based distance calculation (the spherical law of cosines). Details on the math and logic behind the cosine search can be found here.

    Once the documents have been indexed, the query below returns all places within a 200-mile radius of San Francisco.

    dis_rank="&rank-dis=acos(sin(37.7833)*sin(3.141*lat/(1000000*180))%2Bcos(37.7833)*cos(3.141*lat/(1000000*180))*cos(-122.4167-(-3.141*(long-18100000)/(100000*180))))*6371*0.6214"; threshold="&t-dis=..200"

    The query uses the following latitude/longitude coordinates for San Francisco, together with the search radius in miles:

    • latitude = 37.7833
    • longitude = -122.4167
    • radius = 200
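
    To show the arithmetic behind the rank expression outside of the search engine, here is a small sketch of the same spherical law of cosines calculation.  The GeoDistance class and its method names are hypothetical; the Earth radius of 6371 km and the 0.6214 kilometers-to-miles factor are taken from the query above.  The sketch converts every degree value to radians before calling the trigonometric functions, which is what Math.Sin and Math.Cos expect, and it works on plain decimal degrees and does not reproduce the scaled integer lat and long fields used in the index.

    ```csharp
    using System;

    public static class GeoDistance
    {
        private const double EarthRadiusKm = 6371.0; // same constant as the query
        private const double MilesPerKm = 0.6214;    // same constant as the query

        private static double ToRadians(double degrees) => degrees * Math.PI / 180.0;

        // Great-circle distance in miles via the spherical law of cosines.
        public static double MilesBetween(double lat1Deg, double lon1Deg, double lat2Deg, double lon2Deg)
        {
            double lat1 = ToRadians(lat1Deg);
            double lat2 = ToRadians(lat2Deg);
            double deltaLon = ToRadians(lon2Deg - lon1Deg);

            double cosAngle = Math.Sin(lat1) * Math.Sin(lat2)
                            + Math.Cos(lat1) * Math.Cos(lat2) * Math.Cos(deltaLon);

            // Rounding can push the value slightly outside [-1, 1],
            // which would make Math.Acos return NaN.
            cosAngle = Math.Max(-1.0, Math.Min(1.0, cosAngle));

            return Math.Acos(cosAngle) * EarthRadiusKm * MilesPerKm;
        }

        // Mirrors the threshold part of the query: keep places within the radius.
        public static bool IsWithinRadius(double centerLatDeg, double centerLonDeg,
                                          double latDeg, double lonDeg, double radiusMiles)
        {
            return MilesBetween(centerLatDeg, centerLonDeg, latDeg, lonDeg) <= radiusMiles;
        }
    }
    ```

    For example, GeoDistance.IsWithinRadius(37.7833, -122.4167, 38.5816, -121.4944, 200) checks whether a point near Sacramento falls inside the 200-mile radius around San Francisco.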

    When you pass this query to AWS Cloud Search, the speed of the response is on par with the best search engines in the world. Tuning and maintaining a system to deliver such performance would take teams years.  So, when you do a cost-benefit analysis on operationalizing your real-time spatial analytics, consider outsourcing key parts of your infrastructure to a search infrastructure that powers “Earth’s Largest Store”!


    This analysis was written by a home insurance data analytics service.

    Join us March 3, 2015 at 9am PST for the latest in DSC's Webinar Series: Better Risk Management with Apache Hadoop and Red Hat, sponsored by Hortonworks and Red Hat. Firms are operating under significantly increased regulatory requirements in the wake of the 2008 financial crisis. The risk management systems that each firm operates must not only respond to new reporting requirements but also handle ever-growing amounts of data to perform more comprehensive analysis. Existing systems that aren't designed to scale for today’s requirements can't finish reporting in time for the start of trading. Many of these systems are also inflexible and expensive to operate.