
How to Collect Big Data from Social Media

What's the best way to gather big data from social media?

You know that social media data is a marketer's dream. But you're not Mark Zuckerberg, so you don't have all that data at your fingertips. And you probably don't have the models and complex systems in place for collecting and analyzing large amounts of data.

Don't fret. There are straightforward strategies you can implement now to begin collecting big data intelligence about your fans and followers.

1. Analyze social post performance

It's time to start. You'll learn what your customers love by studying the performance of your individual social media posts. Then measure post performance on an aggregate basis.

You'll know better than anyone which topics and products your customers really love by monitoring your most popular social content. Make it a habit to share those insights across the marketing team. Surfacing this knowledge will help your team members make insightful choices about email campaigns, ecommerce promotions, TV advertisements and much more. You'll make a larger impact on marketing performance, and earn some major brownie points to boot.
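One simple way to measure post performance on an aggregate basis is to group posts by topic and average their engagement. A minimal Python sketch; the posts and field names here are invented, and in practice the numbers would come from your platform's analytics export or API:

```python
from collections import defaultdict

# Hypothetical post metrics pulled from a social analytics export.
posts = [
    {"topic": "running shoes", "likes": 120, "shares": 30},
    {"topic": "running shoes", "likes": 80,  "shares": 10},
    {"topic": "yoga mats",     "likes": 45,  "shares": 5},
]

def engagement_by_topic(posts):
    """Average engagement (likes + shares) per topic."""
    totals, counts = defaultdict(int), defaultdict(int)
    for p in posts:
        totals[p["topic"]] += p["likes"] + p["shares"]
        counts[p["topic"]] += 1
    return {t: totals[t] / counts[t] for t in totals}

print(engagement_by_topic(posts))
```

Sorting the resulting dictionary by value gives you the "most loved" topics to share with the rest of the marketing team.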

2. Collect contact info through social media marketing campaigns

You have amassed an enormous base of followers and fans, but do you know who they are? Find out more by running social media campaigns with data collection components.

Leverage campaigns to gather demographic information like gender, age and location. Collect contact information like mobile phone number, mailing address and email address. Use this information to reach campaign participants in other channels and convert them. Don't forget to ask for permission to contact them!

Within your campaign, you should also be collecting preference information in tandem with contact info. A few examples:

Ask participants about their favorite products, then use that information to serve up tailored ads with coupons for each fan's favorite pick.
For instance, collect emails through your "marathon pride" photo contest, and send entrants promotions for workout equipment.


Guest blog post by Bernard Marr

In a meeting with Airbus last week I found out that their forthcoming A380-1000 – the supersized airliner capable of carrying up to 1,000 passengers – will be equipped with 10,000 sensors in each wing.

The current A350 model has a total of close to 6,000 sensors across the entire plane and generates 2.5 Tb of data per day, while the newer model – expected to take to the skies in 2020 – will capture more than triple that amount.

In an industry as driven by technology as the aviation industry, it’s hardly surprising that every element of an aircraft’s performance is being monitored for the potential to make adjustments which could save millions on fuel bills and, more importantly, save lives by improving safety.

So I thought this would be a good opportunity to explore how the aviation industry, just like every other industry, is putting data science to work.

There are 5,000 commercial aircraft in the sky at any one time over the US alone, and 35 million departures each year. In other words the aviation industry is big. And given that every single passenger on each of those flights is putting their life in the hands of not just the pilot, but the technology, the safety measures and regulations in place are extremely complex.

This means that the data it generates is big, and complex too. But airlines have discovered that with the right analytical systems, it can be used to eliminate inefficiencies due to redundancy, predict routes their passengers are likely to need, and improve safety.

Engines are equipped with sensors capturing details of every aspect of their operation, meaning that the impact of humidity, air pressure and temperature can be assessed more accurately. It is far cheaper for a company to be able to predict when a part will fail and have a replacement ready, than to wait for it to fail and take the equipment offline until repairs can be completed.
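To make the predictive-maintenance idea concrete, here is a deliberately simplified sketch in Python: flag a part for replacement when recent sensor readings drift above the long-run baseline. The readings, window and threshold are all invented for illustration; production systems use far richer statistical models:

```python
def predict_failure(readings, window=3, threshold=1.2):
    """Flag a part when the average of the most recent `window` readings
    exceeds `threshold` times the long-run baseline of all readings."""
    baseline = sum(readings) / len(readings)
    recent = sum(readings[-window:]) / window
    return recent > threshold * baseline

# Hypothetical vibration readings: steady vs. trending upward as a part wears.
healthy = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0]
wearing = [1.0, 1.1, 0.9, 1.4, 1.6, 1.8]
print(predict_failure(healthy))  # False
print(predict_failure(wearing))  # True
```

Catching the upward trend early is what lets an airline have the replacement part ready before the equipment ever has to come offline.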

In fact, Aviation Today reported that it can often take airlines up to six months to source a replacement part, due to inefficient prediction of failures leading to a massive backlog with manufacturers.

On top of this fuel usage can be economized by ensuring engines are always running at optimal efficiency. This not only cuts fuel costs but minimizes environmentally damaging emissions.

In the case of Airbus, they partnered with IBM to develop their own Smarter Fuel system, specifically to target this area of their operation with Big Data and analytics.

Additionally, airlines closely monitor arrival and departure data, correlating it with weather and related data to predict when delays or cancellations are likely – meaning alternative arrangements can be made to get their passengers where they need to be.

Before they even take off, taxi times between the departure gates and runways are also recorded and analyzed, allowing airlines and airport operators to further optimize operational efficiency – meaning fewer delays and fewer unhappy passengers.

This sort of predictive analysis is common across all areas of industry but is particularly valuable in commercial aviation, where delays of a few hours can cost companies millions in rearrangements, backup services and lost business (The FAA estimates that delayed flights cost the US aviation industry $22 million per year).  

Specialist service providers have already cropped up – masFlight is one – aiming to help airlines and airports make the most of the data they have available to them.

They aggregate data sets including weather information, departure times, radar flight data and submitted flight plans, monitoring 100,000 flights every day, to enable operators to more efficiently plan and deliver their services.

In marketing, too, airlines are beginning to follow the lead of companies such as Amazon by collecting data on their customers, monitoring everything from customer feedback to how they behave when visiting websites to make bookings.

Now that we are used to generating and presenting tickets and boarding cards through our smartphones, more information about our journey through the airport, from the time we enter to the time we board our flight, can also be tracked. This is useful both to airport operators, managing the flow of people through their facilities, and to airlines, who will gather more information on who we are and how we behave.

So businesses in the aviation industry, including Airbus, are making significant steps towards using data to cut waste, improve safety and enhance the customer experience. 10,000 sensors in one wing may sound excessive but with so much at stake – both in terms of profits and human lives – it’s reassuring that nothing will be overlooked.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About : Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.


Is Spark The Data Platform Of The Future?

Originally posted on Data Science Central

Hadoop has been the foundation for data programmes since Big Data hit the big time. It has been the launching point for data programmes at almost every company that is serious about its data offerings.

However, as we predicted, the rise of in-memory databases has created a need for companies to adopt frameworks that harness this power effectively.

It was therefore no surprise when Apache launched Spark, a new framework that uses in-memory primitives to deliver performance around 100 times faster than Hadoop's two-stage, disk-based MapReduce.

This kind of product has become increasingly important as we move into a world where the amount and speed of data are increasing exponentially.

So is Spark going to be the Hadoop beater that it seems to be?


Technology that allows us to make decisions more quickly, and with more data, is something companies will be clamouring for.

It is not simply in principle that this platform will bring about change, either. As an open source platform, it has more developers working on it than any other Apache product.

This suggests that people support the idea through their willingness to dedicate their time to it. It is common knowledge that many of the data scientists working on Apache products are the same ones using them in their day-to-day roles at different companies, which suggests they may well adopt this system in the future.


One of the main reasons for the success of Hadoop in the last few years has been not only its ease of use, but also that companies can get it for nothing. This is because you can run the basics of Hadoop on a regular system and only need to upgrade when you ramp up your data programmes.

Spark runs in memory, which requires high-performance systems, something that companies new to data initiatives are unlikely to invest in.

So which is it more likely to be?

In my opinion, Hadoop will always be the foundation of data programmes and with more companies looking at adopting it as the basis for their implementations, this is unlikely to change.

Spark may well become the upgrade that companies who move to a stage where they want, or need, improved performance will adopt. As Spark can work alongside Hadoop this seems to have also been in the minds of the guys at Apache when coming up with the product in the first place.

Therefore, it is unlikely to be a Hadoop beater, but will instead become more like its big brother: capable of doing more, but at increased cost and only necessary at certain data volumes and velocities, it is not going to be a replacement.


Machine Learning at Scale with Spark

Originally posted on Data Science Central

In my last post, I covered setting up the basic tools to start doing machine learning (Python, NumPy, Matplotlib and Scikit-Learn).  Now, you are probably wondering how to do this on a very large scale, involving terabytes (maybe even petabytes) of data and across several server nodes.

The best answer is Apache Spark!  Spark is an in-memory analytics engine which runs on top of HDFS and also unifies many other data sources, e.g. NoSQL databases like MongoDB or even CSV files. Spark is also a much faster and simpler replacement for Hadoop's original processing model, MapReduce.  IBM has announced plans to include Spark in all its analytics platforms and has committed 3,500+ developers to Spark-related projects.

The picture below shows how Spark plays across applications, data sources and environments. 

A ton of material is already available on the benefits of Spark, so I will keep it short and simple.  Spark is great because it is free and open source, general purpose, scales up massively (I mean up to 8,000 nodes and petabytes of data), is amazingly fast (~100x faster than traditional MapReduce; details here!) and comes with a delightfully elegant programming model!  BTW, it is the programming model, with its deep roots in functional programming, that won me over. Details here!

Spark is also the hottest project in Apache.


How do we get started? See this course taught by Professor Anthony Joseph of Berkeley.   The lectures are broken into small bite-size videos (3 to 4 minutes maximum) which are simple and very nicely explained.  Some are followed by quiz questions which help validate the knowledge immediately.  The entire course environment is provided in a virtual machine (you need VirtualBox, Vagrant, and the image) which is runnable on your laptop.  The best part of the course was the 4 labs.  The labs came as iPython notebooks with sample exercises.  Each exercise was followed by tests which gave a pass/fail result immediately.

The first lab was meant to count the most frequent words in ALL of Shakespeare's plays.  The second lab provided web server logs from NASA and asked students to parse the Apache Common Log Format, create Spark RDDs (Resilient Distributed Datasets), and analyze how many requests succeeded (2xx responses), how many failed, which resources failed and when!
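In Spark, word count is written with RDD transformations such as flatMap, map and reduceByKey. The same map/reduce logic can be sketched in plain Python, no cluster required, to see what those steps actually do:

```python
from collections import Counter
from itertools import chain

lines = [
    "to be or not to be",
    "that is the question",
]

# Map: split each line into words (like flatMap over the RDD of lines).
words = chain.from_iterable(line.split() for line in lines)

# Reduce: count occurrences per word (like reduceByKey with addition).
counts = Counter(words)

print(counts.most_common(2))  # [('to', 2), ('be', 2)]
```

On a cluster, Spark runs the map step on each partition in parallel and shuffles matching keys together for the reduce step; the logic per word is identical.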

A screenshot of a section of this lab to visualize 404 responses by hour of day is shown below.

The third lab provided product listings from Google and Amazon and the objective was to use a Bag-of-Words technique to break up product descriptions into tokens and compute similarity between products.  A TF-IDF (Term Frequency and Inverse Document Frequency) technique was used to compute similarity between documents of product descriptions.  Learning such powerful text analysis techniques to do entity resolution would be a real asset for solving live problems.
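The TF-IDF weighting itself is compact enough to sketch in a few lines of Python. The product-description tokens below are invented; the point is that a term appearing in many documents gets a low weight, while a rare, distinctive term scores higher:

```python
import math

# Hypothetical tokenized product descriptions (the "bag of words").
docs = [
    ["ipod", "nano", "8gb", "silver"],
    ["ipod", "touch", "32gb"],
    ["usb", "cable", "silver"],
]

def tf_idf(term, doc, docs):
    """Term frequency in one doc times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)       # docs containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "ipod" appears in 2 of 3 docs, so it is weighted low;
# "nano" appears in only 1, so it is more distinctive.
print(tf_idf("ipod", docs[0], docs))
print(tf_idf("nano", docs[0], docs))
```

Comparing documents by the cosine of their TF-IDF vectors is then what lets you decide that two differently worded listings describe the same product.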

The fourth lab was the icing on the cake.  It analyzed  movies from IMDB and came up with historical ratings of movies.  A dataset of 500K ratings came along with the VM.   

A screenshot of a section of this lab to retrieve the highest-ever-rated movies is shown below.

It is a lot of fun to see the highest rated movies and be able to run your own queries on the RDD :-)

The lab used Spark's MLlib (Machine Learning Library) for Collaborative Filtering.   This is a method to make automatic predictions about a user's interests using the preferences of many users (collaboration).  The basic assumption is that if person A has the same opinion as person B on one item x, then they are more likely to share an opinion on another item y than a random person would be. CF was combined with the Alternating Least Squares (ALS) technique to predict movie ratings.  Finally, the lab asked the user to rate a small sample of movies in order to make personalized movie recommendations.  You will love this lab!
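MLlib's ALS factorizes the whole ratings matrix, which is more than fits in a short example, but the core collaborative-filtering assumption can be illustrated with a simpler neighbour-based sketch. The users, movies and ratings here are invented:

```python
import math

# Hypothetical user -> {movie: rating} data.
ratings = {
    "alice": {"Casablanca": 5, "Alien": 1, "Amelie": 4},
    "bob":   {"Casablanca": 4, "Alien": 2, "Heat": 5},
    "carol": {"Casablanca": 1, "Alien": 5, "Heat": 2},
}

def cosine_sim(a, b):
    """Cosine similarity over the movies two users have both rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    na = math.sqrt(sum(a[m] ** 2 for m in common))
    nb = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (na * nb)

# Alice's tastes align with Bob's far more than with Carol's, so a
# recommender would push Bob's favourites (e.g. "Heat") to Alice.
print(cosine_sim(ratings["alice"], ratings["bob"]))
print(cosine_sim(ratings["alice"], ratings["carol"]))
```

ALS reaches the same kind of conclusion by learning low-dimensional taste vectors for users and movies instead of comparing raw rating rows, which scales to the 500K-rating dataset in the lab.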

Additionally, the course discussion group was full of questions and very supportive responses from the staff and other students.

I found this to be an excellent course, at the right level of difficulty and helpful in  de-mystifying Spark for a beginner and putting it to actual use.   

I am looking forward to the next in the series - "Scalable Machine Learning".

Best wishes and best regards,



For those coming in late, IoT is the network of physical objects or "things" embedded with electronics, software, sensors and connectivity that enable these objects to achieve greater value and service by exchanging data with the manufacturer, operator and/or other connected devices. Each thing is uniquely identifiable through its embedded computing system but is able to interoperate within the existing Internet infrastructure.

Source for picture: Wikipedia IoT article

The key underlying theme of the IoT sector is the amount of data that will be generated by these inter-connected devices. Given the growth of these connected devices & the IoT sector, we decided to look at career & professional opportunities in this sector.

Here are the top companies hiring for IoT related positions:

1 PTC - The Product Development Company: PTC provides technology solutions that transform how products are created and serviced, helping companies achieve product and service advantage
2 Amazon: The leader in e-commerce and cloud computing
3 Continental: Continental is a leading automotive supplier worldwide
4 Savi Group: Savi Technology provides sensor-based analytics, software and hardware for managing and securing supply chain assets
5 Intel: Intel is one of the world's largest and highest valued semiconductor chip makers
6 Ayla Networks: Ayla Networks enables manufacturers and service providers to bring connected products to market quickly and securely
7 HP: HP provides hardware, software and services to consumers, small- and medium-sized businesses, large enterprises, governments
8 LogMeIn, Inc: LogMeIn provides SaaS and cloud-based remote connectivity services for collaboration, IT management and customer engagement
9 Red Hat, Inc: Red Hat provides open-source software products to the enterprise community
10 Honeywell: Produces a variety of commercial and consumer products, engineering services, and aerospace systems
11 IBM: IBM manufactures and markets computer hardware and software, and offers infrastructure, hosting and consulting services in areas ranging from mainframe computers to nanotechnology
12 Renesas: Renesas is a semiconductor manufacturer
13 Cisco Systems, Inc : Designs, manufactures, and sells networking equipment
14 Dell: Develops, sells, repairs and supports computers and related products and services
15 InterDigital: InterDigital develops wireless technologies for mobile devices, networks, and services


Apart from these companies other large companies like Booz Allen Hamilton, Informatica, Bosch Software, Verizon Wireless, Accenture are also hiring for various IoT related positions.

Given below are the top locations where companies are hiring for IoT positions:


1 Santa Clara, CA
2 Seattle, WA
3 Chicago, IL
4 Boston, MA
5 Austin, TX
6 Washington, DC
7 Alexandria, VA
8 San Francisco, CA
9 Sunnyvale, CA
10 Glendale, CA

Here is a more comprehensive list of job titles that companies are looking to hire for:


Data related jobs

Big Data Lead (IoT)

Data Scientist - IoT

Data Engineer - Sensors and IoT

Data Engineer Sensors and IoT Applications


Director level positions

Director of Data Engineering Sensors, Sensor Analytics and IOT

Director of DevOps - Sensors, Sensor Analytics, and IoT

Director of Product Management, Industrial IoT

Director of Sales - IoT/RFID/GPS/AutoID/Sensor Technologies

Director, Business Development, IoT (The Internet of Things)

Internet of Things (IOT) & Connectivity / Mobility Sales & Director

Product Marketing Director: IoT Platform Startup

Research Director: Information Assurance and IoT Security

Sales Director (IoT)

Internet of Things (IoT) Worldwide Sales Leader


Architect level positions

Azure Cloud Architect (IoT)

Digital Operations - IoT Consultant/Architect

Internet of Things (IoT) / Cloud Architect

IoT Fog Architect

IoT Software Architect "Internet of Things" Cloud

IoT Solutions Architect

Senior Electrical Architect for IoT

System Architect / IoT/Emerging Technologies


Project and Product manager level positions

Sr. Project Manager-IoT

Healthcare Facilities IOT- Project Manager (C)

Senior Product Manager - IoT Operating System

IOT Platform Product Manager

Software Product Manager/Product Owner Wearables/IoT

Product Development Program Manager - Wearables - IoT

Product Manager - Internet of Things (IOT) Smart Cities



Marketing Manager positions

Product Marketing Manager, IoT Solutions

Strategic Marketing Manager, IoT

Segment Marketing Manager - IoT Technologies

Marketing Manager, Demand Generation (IoT - PaaS)

Staff Product Marketing Manager - IoT Content Specialist


Business Development Manager level positions

Business Development Manager (IoT)

Embedded and IoT Market Development Manager

Integrated Operations Team (IOT) Business Manager

IoT Strategic Business Development Manager

Product Business Manager - Wearables - IoT

Strategic Business Development Manager - IoT


Other manager Level Positions

Connected Spaces IoT - Manager

Manager Emerging Technology (IoT Mobile NLP Big Data)

IoT/Cloud Infrastructure Program Manager

IOT Manager

Technology Manager (Mobile, IoT, NLP)


Global Strategic Partnerships Manager - IoT

Regional Sales Manager - Regional Sales Manager, IOT


Consultant Positions

Connected Spaces IoT Consultant

Digital Operations - IoT Consultant/Architect


Software Engineering Positions

Software Engineer Mobile Apps - IoT

Senior Software Engineer, Cloud Services (IoT PaaS)

Staff/Senior Staff Software Engineer, Internet of Things

Software Engineer - IoT

Associate Software Engineer - IoT

Applications Engineer- IoT mbed

Software Engineer Sensors and IoT Applications

Software Engineer IoT and Sensors

C++ Software Developer (Smart Lighting IoT)

IoT Developer

Senior Development Engineer, Mobile (IoT PaaS)


Java Developer Positions

Java Developer for Internet of Things (IoT)

Java Developer (IoT & M2M)

Java Developer Consultant - IoT


Mobile Developer Positions

IoT Mobile Application Engineer

Senior Mobile QA Engineer (IoT, PaaS)

IoT Android Engineer

Senior Android Developer - IoT


Test Engineer Positions

IOT Test Engineer with QXDM

LTE IOT Test Engineer

Senior SW Test Engineer-IoT

Sr Test Engineer III (IOT)


Technical Support Positions

Technical Support Representative, IoT

Technician IoT Devices Support

IoT Application Support Engineer


Intern Positions

Intern/Co-op - M2M/IoT Security

IOT Software Developer Intern - (IOTG Intern)

SSG - Graduate Intern (IoT and UPM Support)

IoT Product Marketing Intern

IOT Intern



Other Interesting Positions

Transition Planner IOT (C)

IoT Certification Manager

IoT Competitive Specialist (IOTG)

Originally posted on Data Science Central



Lambda Architecture for Big Data Systems

Big data analytical ecosystem architecture is in early stages of development. Unlike traditional data warehouse / business intelligence (DW/BI) architecture which is designed for structured, internal data, big data systems work with raw unstructured and semi-structured data as well as internal and external data sources. Additionally, organizations may need both batch and (near) real-time data processing capabilities from big data systems.

Lambda architecture - developed by Nathan Marz - provides a clear set of architecture principles that allows both batch and real-time (stream) data processing to work together while building immutability and recomputation into the system. Batch processing handles high volumes of data where a group of transactions is collected over a period of time: data is collected, entered and processed, and then batch results are produced. Batch processing requires separate programs for input, process and output; payroll and billing systems are examples. In contrast, real-time data processing involves a continual input, process and output of data, which must be processed within a small time window (in near real-time). Customer service systems and bank ATMs are examples.

Lambda architecture has three layers:

  • Batch Layer
  • Serving Layer
  • Speed Layer

Batch Layer (Apache Hadoop)

Hadoop is an open source platform for storing massive amounts of data. Lambda architecture provides "human fault-tolerance" which allows simple data deletion (to remedy human error) where the views are recomputed (immutability and recomputation).

The batch layer stores the master data set (HDFS) and computes arbitrary views (MapReduce). Computing views is continuous: new data is aggregated into views when recomputed during MapReduce iterations. Views are computed from the entire data set and the batch layer does not update views frequently resulting in latency.

Serving Layer (Real-time Queries)

The serving layer indexes the precomputed views and exposes them for ad hoc queries with low latency. Open source real-time Hadoop query implementations like Cloudera Impala, Hortonworks Stinger, Dremel (Apache Drill) and Spark Shark can query the views immediately. Hadoop can store and process large data sets, and these tools can query the data fast. At this time Spark Shark outperforms the others thanks to its in-memory capabilities, and has greater flexibility for machine learning functions.

Note that MapReduce is high latency and a speed layer is needed for real-time.

Speed Layer (Distributed Stream Processing)

The speed layer compensates for the batch layer's high latency by computing real-time views in open source distributed stream processing solutions like Storm and S4. These provide:

  • Stream processing
  • Distributed continuous computation
  • Fault tolerance
  • Modular design

In the speed layer, real-time views are incremented as new data is received. Lambda architecture provides "complexity isolation": real-time views are transient and can be discarded, so the most complex part of the system is confined to the layer whose results are only temporary.
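The interplay of the three layers can be sketched as a toy page-view counter in Python: the batch view is recomputed wholesale from the immutable master data set, the speed layer keeps a small transient delta for events that arrived after the last batch run, and a query merges the two. All names and data here are invented:

```python
from collections import Counter

# Batch layer: the immutable master data set, and a view fully
# recomputed from it (in Lambda this is a MapReduce job over HDFS).
master = ["page_a", "page_b", "page_a"]
batch_view = Counter(master)

# Speed layer: a transient view incremented as new data arrives;
# it can be discarded after the next batch recomputation absorbs it.
realtime_view = Counter()

def on_new_event(page):
    realtime_view[page] += 1

on_new_event("page_a")
on_new_event("page_c")

def query(page):
    # Serving layer: merge the precomputed batch view with the delta.
    return batch_view[page] + realtime_view[page]

print(query("page_a"))  # 3
```

Because the real-time view only ever covers the short window since the last batch run, its complexity stays isolated and its errors are washed out on the next recomputation.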

The decision to implement Lambda architecture depends on need for real-time data processing and human fault-tolerance. There are significant benefits from immutability and human fault-tolerance as well as precomputation and recomputation.

Lambda implementation issues include finding the talent to build a scalable batch processing layer. At this time there is a shortage of professionals with the expertise and experience to work with Hadoop, MapReduce, HDFS, HBase, Pig, Hive, Cascading, Scalding, Storm, Spark Shark and other new technologies.

Originally posted on Data Science Central

Quantifying the Value of a NoSQL Project

Originally posted on Data Science Central

Summary:  If you’re making the decision to use NoSQL, how do you quantify the value of the investment?

If you are exploring NoSQL, once you become educated on the basics there are two questions that will rapidly move to the top of your list of considerations.  

  • What does it cost? 
  • What’s the dollar payoff?

The cost side is more easily addressed since you can gather the various cost elements of hardware and software and the additional costs of direct and indirect manpower and add these up.  Less straightforward is the issue of estimating the dollar benefit.

In this article we’ll assume that you’re looking at the possibility of storing large quantities of data to supplement your existing transactional files.  This could be geographic, RFID, sensor, text, or any of the other types of unstructured and semi-structured data for which NoSQL is ideal.  In broad terms we’re talking about key-value-stores or document-oriented DBs, less about graph or columnar DBs.  Think Hadoop, Mongo, Cloudera, or one of the other competitors in this space.  What you’ve already figured out or at least strongly suspect is that this is a whole lot more data than you currently have probably by a factor of 10X to 1,000X, and it’s pretty clear that your RDBMS is not the place to put it.

There are two broad categories of benefit in NoSQL that each need to be considered separately.  One is easy to quantify, the other one less so.

Distributed Storage

You’ve got to put the data somewhere and that hardware costs money.  Your current RDBMS data warehouse and transactional systems reside on high reliability (expensive) servers and the cost of the RDBMS software can be equally as costly depending on which brand name you’re using.

The well-known benefit of NoSQL is that many of these like Hadoop and its variants are open source and therefore quite inexpensive compared to brand names like IBM, Oracle, SAP, and the like.  Further, because of some unique architecting that we discussed in previous posts, NoSQL can safely be run on commodity hardware which is significantly less expensive than high reliability servers.

What does all this add up to dollar-wise?  Brian Garrett, the Hadoop lead at SAS offers these approximate comparative numbers:

  • $15,000 per terabyte to store data on an appliance.
  • $5,000 per terabyte to store data on a SAN (storage area network)
  • Less than $1,000 per terabyte to store data on an open source NoSQL DB like Hadoop.
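Using those approximate per-terabyte figures, the gap compounds quickly with scale. A trivial, purely illustrative calculation:

```python
# Approximate storage cost per terabyte, from the figures quoted above.
COST_PER_TB = {"appliance": 15_000, "san": 5_000, "nosql": 1_000}

def storage_cost(terabytes, platform):
    """Total storage cost in dollars for a given platform."""
    return terabytes * COST_PER_TB[platform]

# At 100 TB: appliance $1.5M, SAN $500K, NoSQL under $100K.
for platform in COST_PER_TB:
    print(platform, storage_cost(100, platform))
```

At the 10X to 1,000X data growth discussed above, that per-terabyte difference is what makes the storage side of the business case easy to quantify.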

Distributed Processing

The dollar savings from distributed NoSQL storage are pretty straightforward.  The savings or value of distributed processing however depends a lot on your business and is a much tougher question.  It’s particularly tough because you may need to value the benefit of types of analysis that your company isn’t or can’t do right now.

For example, let’s say your company is already using predictive analytics to guide profitable marketing campaigns.  You’ve got your transactional data whipped into pretty good shape in a RDBMS data warehouse and have a team of analysts and data scientists on board and they’re doing a good job.  What’s the added value of being able to analyze much more massive data sets?

There are two scenarios at work here.  One is that you have a very large number of customers, say several million, and so far it’s been too cumbersome to run analytics across all of them at once.  The second is that you have a smaller and manageable number of customers but you want to add say text or geographic data to make your analytics more accurate and therefore more profitable.  Or you may have both of these at once.

The factor that works against you is time.  In some companies with very large customer bases it can take anywhere from overnight to many days to extract or model against the whole data set.  Here's an example from financial services, where data scientists work to refine the best models for predicting the success of retail marketing campaigns.

Before adopting NoSQL this team was able to prepare about one iteration of their predictive model every 5 hours.  Basically the I/O from a single source database was the bottleneck.  NoSQLs like Hadoop, Mongo, or Cloudera however let you do a portion of the processing on each of the separate nodes then combine the results.  This parallel processing makes analytics much faster. 
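"Do a portion of the processing on each node, then combine the results" is the essence of the speed-up. Here is a stand-in sketch using threads on a single machine; a real cluster spreads the partitions across servers, and the NoSQL engine handles the partitioning and shuffling for you:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """The work each node performs independently on its own partition."""
    return sum(chunk)

data = list(range(1000))
partitions = [data[i::4] for i in range(4)]   # split the data 4 ways

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))  # per-partition work

total = sum(partials)   # combine the partial results
print(total == sum(data))  # True
```

Because each partition is processed where its data lives, the single-database I/O bottleneck described above disappears, which is what collapsed the model iteration time in the example that follows.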

After implementation they reduced the model iteration time to 6 minutes meaning that they could run 50 times as many experimental models in the same time that used to be required to do just one. 

The second factor at work here is that combining the newly available unstructured data into their models almost doubled the lift (accuracy) of their models.  The result of much higher throughput and the availability of wholly new types of data was that the productivity of this group skyrocketed, and the average profitability based on their better refined and more informed models also increased dramatically.

There’s an important caveat here.  The analytic platform that your data scientists are going to use to access and process the NoSQL data stores must be able to benefit from the distributed processing capabilities of the NoSQL DB.  In the old model, data is extracted to a separate data store where the analytics take place.  For big data, the I/O will kill efficiency.  Many of the major analytic platforms like SAS, Alteryx, and Pivotal/Greenplum are specifically designed to move as much processing as possible back into the NoSQL database which is the only way to really benefit from this speed improvement.

In terms of finally estimating value, the storage cost savings can be straightforward to calculate.  The benefit from the new data types and distributed processing, however, requires that you have a good idea of how the data will be used, and that you use current use cases as benchmarks for this increase in efficiency and accuracy as it applies to your situation.  That's certainly more challenging, but worth it to be able to track the return from your project against your expectations.


October 17, 2014

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.


About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

The original blog can be viewed at:


By Ajit Jaokar (@ajitjaokar). Please connect with me on LinkedIn if you want to stay in touch and receive future updates.

Cross posted from my blog - I look forward to discussion/feedback here

Note: The paper below is best read as a pdf which you can download from the blog for free

Background and Abstract

This article is part of an evolving theme. Here, I explain the basics of Deep Learning and how Deep learning algorithms could apply to IoT and Smart city domains. Specifically, as I discuss below, I am interested in complementing Deep learning algorithms using IoT datasets. I elaborate these ideas in the Data Science for Internet of Things program, which enables you to work towards being a Data Scientist for the Internet of Things (modelled on the course I teach at Oxford University and UPM – Madrid). I will also present these ideas at the International Conference on City Sciences at Tongji University in Shanghai and at the Data Science for IoT workshop at the IoT World event in San Francisco.


Deep Learning

Deep learning is often thought of as a set of algorithms that ‘mimics the brain’. A more accurate description would be an algorithm that ‘learns in layers’. Deep learning involves learning through layers which allows a computer to build a hierarchy of complex concepts out of simpler concepts.

The obscure world of deep learning algorithms came into the public limelight when Google researchers fed 10 million random, unlabeled images from YouTube into their experimental Deep Learning system. They then instructed the system to recognize the basic elements of a picture and how these elements fit together. The system, comprising 16,000 CPUs, was able to identify images that shared similar characteristics (such as images of cats). This canonical experiment showed the potential of Deep learning algorithms. Deep learning algorithms apply to many areas including Computer Vision, Image recognition, pattern recognition, speech recognition, behaviour recognition, etc.


How does a Computer Learn?

To understand the significance of Deep Learning algorithms, it’s important to understand how computers think and learn. Since the early days, researchers have attempted to create computers that think. Until recently, this effort was rules-based, adopting a ‘top-down’ approach. The top-down approach involved writing enough rules for all possible circumstances, but it is obviously limited by the number of rules and by its finite rule base.

To overcome these limitations, a bottom-up approach was proposed. The idea here is to learn from experience, where the experience is provided by ‘labelled data’. Labelled data is fed to a system and the system is trained based on the responses. This approach works for applications like spam filtering. However, most data (pictures, video feeds, sounds, etc.) is not labelled, and when it is, it’s not labelled well.

The other issue is in handling problem domains which are not finite. For example, the problem domain in chess is complex but finite, because there are a finite number of primitives (32 chess pieces) and a finite set of allowable actions (on 64 squares). But in real life, at any instant, we face a potentially infinite number of alternatives. The problem domain is thus very large.

A problem like playing chess can be ‘described’ to a computer by a set of formal rules.  In contrast, many real-world problems are easily understood by people (intuitive) but not easy to describe (represent) to a computer (unlike chess). Examples of such intuitive problems include recognizing words or faces in an image. Such problems are hard to describe to a computer because the problem domain is not finite. Thus, the problem description suffers from the curse of dimensionality, i.e. as the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. Computers cannot be trained on sparse data: there is not enough data to adequately represent the combinations of values across the dimensions. Nevertheless, such ‘infinite choice’ problems are common in daily life.
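The sparsity effect is easy to demonstrate. In the sketch below (illustrative numbers only), each feature is discretized into 10 bins and we count how many cells of the feature space a fixed 1,000-example training set actually touches as the number of dimensions grows:

```python
import random

random.seed(0)
n_samples, bins = 1000, 10

fractions = []
for dims in (2, 4, 6):
    cells = bins ** dims                        # size of the feature space
    # Cells of the grid that hold at least one training example.
    occupied = {tuple(random.randrange(bins) for _ in range(dims))
                for _ in range(n_samples)}
    fractions.append(len(occupied) / cells)
    print(f"{dims}D: {len(occupied)}/{cells} cells occupied "
          f"({len(occupied)/cells:.2%})")
```

With 2 dimensions virtually every cell is covered; by 6 dimensions fewer than 0.1% of cells contain any data at all.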

How do Deep learning algorithms learn?

Deep learning addresses ‘hard’, intuitive problems which have few or no rules and high dimensionality. Here, the system must learn to cope with unforeseen circumstances without knowing the rules in advance. Many existing systems, like Siri’s speech recognition and Facebook’s face recognition, work on these principles.  Deep learning systems are possible to implement now because of three reasons: high CPU power, better algorithms and the availability of more data. Over the next few years, these factors will lead to more applications of Deep learning systems.

Deep Learning algorithms are modelled on the workings of the brain. The brain may be thought of as a massively parallel analog computer which contains about 10^10 simple processors (neurons), each of which requires a few milliseconds to respond to input. To model the workings of the brain, in theory, each neuron could be designed as a small electronic device with a transfer function similar to a biological neuron’s. We could then connect each neuron to many other neurons to imitate the workings of the brain. In practice, it turns out that this model is not easy to implement and is difficult to train.

So, we make some simplifications in the model mimicking the brain. The resultant neural network is called a “feed-forward back-propagation network”.  The simplifications/constraints are: we arrange the neurons in distinct layers, with each neuron in one layer connected to every neuron in the next layer; signals flow in only one direction; and we simplify the neuron design to ‘fire’ based on simple, weight-driven inputs from other neurons. Such a simplified network (the feed-forward neural network model) is more practical to build and use.


a) Each neuron receives a signal from the neurons in the previous layer.

b) Each of those signals is multiplied by a weight value.

c) The weighted inputs are summed and passed through a limiting function which scales the output to a fixed range of values.

d) The output of the limiter is then broadcast to all of the neurons in the next layer.

Image and parts of the description in this section adapted from: the Seattle Robotics site.
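Steps (a) to (d) can be sketched for a single simplified neuron in a few lines; tanh serves here as the limiting function, and the input signals and weights are illustrative:

```python
import math

def neuron(inputs, weights, limiter=math.tanh):
    """One simplified neuron: (a) receive signals from the previous layer,
    (b) multiply each by its weight, (c) sum and pass through a limiting
    function, (d) the result is what gets broadcast to the next layer."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return limiter(weighted_sum)

# Signals from three neurons in the previous layer, with illustrative weights.
out = neuron([0.5, -1.0, 0.25], [0.8, 0.2, -0.5])
```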

The most common learning algorithm for artificial neural networks is called Back Propagation (BP), which stands for “backward propagation of errors”. To use the neural network, we apply the input values to the first layer, allow the signals to propagate through the network and read the output. A BP network learns by example, i.e. we must provide a learning set that consists of some input examples and the known correct output for each case. We use these input-output examples to show the network what type of behaviour is expected.

The BP algorithm allows the network to adapt by adjusting the weights, propagating the error value backwards through the network. Each link between neurons has a unique weighting value, and the ‘intelligence’ of the network lies in the values of these weights. With each iteration of the errors flowing backwards, the weights are adjusted, and the whole process is repeated for each of the example cases.

Thus, to detect an object, programmers would train a neural network by rapidly sending across many digitized versions of data (for example, images) containing those objects. If the network did not accurately recognize a particular pattern, the weights would be adjusted. The eventual goal of this training is to get the network to consistently recognize the patterns that we recognize (e.g. cats).
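The training loop just described can be sketched in a few lines of NumPy: a small feed-forward network learns the XOR function from input-output examples, with the error propagated backwards to adjust the weights on each iteration. The network size, learning rate and iteration count here are illustrative choices, not canonical ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning set: input examples and the known correct output (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer; the weights hold the network's "intelligence".
W1, b1 = rng.normal(0.0, 1.0, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0.0, 1.0, (4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses, lr = [], 1.0
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)            # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)          # forward pass: output layer
    losses.append(float(((out - y) ** 2).mean()))
    d_out = (out - y) * out * (1 - out)       # error at the output...
    d_h = (d_out @ W2.T) * h * (1 - h)        # ...propagated backwards
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
```

After training, the error has fallen from its initial value and the outputs typically approach the 0/1 targets.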

How does Deep Learning help solve intuitive problems?

The whole objective of Deep Learning is to solve ‘intuitive’ problems i.e. problems characterized by High dimensionality and no rules.  The above mechanism demonstrates a supervised learning algorithm based on a limited modelling of Neurons – but we need to understand more.

Deep learning allows computers to solve intuitive problems because:

  • With Deep learning, Computers can learn from experience but also can understand the world in terms of a hierarchy of concepts – where each concept is defined in terms of simpler concepts.
  • The hierarchy of concepts is built ‘bottom up’ without predefined rules by addressing the ‘representation problem’.

This is similar to the way a child learns ‘what a dog is’, i.e. by understanding the sub-components of the concept, e.g. the behaviour (barking), the shape of the head, the tail, the fur, etc., and then combining these into one bigger idea: the dog itself.

The (knowledge) representation problem is a recurring theme in Computer Science.

Knowledge representation incorporates theories from psychology which look to understand how humans solve problems and represent knowledge.  The idea is that if, like humans, computers could gather knowledge from experience, this would avoid the need for human operators to formally specify all of the knowledge the computer needs to solve a problem.

For a computer, the choice of representation has an enormous effect on the performance of machine learning algorithms. For example, based on the pitch of a sound, it is possible to know whether the speaker is a man, woman or child. However, for many applications, it is not easy to know what set of features represents the information accurately. For example, to detect pictures of cars in images, a wheel may be circular in shape, but actual pictures of wheels have variants (spokes, metal parts, etc.). So, the idea of representation learning is to find both the mapping and the representation.

If we can find representations and their mappings automatically (i.e. without human intervention), we have a flexible design to solve intuitive problems. We can adapt to new tasks, and we can even infer new insights without observation. For example, based on the pitch of a sound we can infer an accent and hence a nationality. The mechanism is self-learning.

Deep learning applications are best suited for situations which involve large amounts of data and complex relationships between different parameters. Training a neural network involves repeatedly showing it that “given this input, this is the correct output”. If this is done enough times, a sufficiently trained network will mimic the function you are simulating. It will also learn to ignore inputs that are irrelevant to the solution. Conversely, it will fail to converge on a solution if you leave out critical inputs. This model can be applied to many scenarios, as we see below in a simplified example.

An example of learning through layers

Deep learning involves learning through layers which allows a computer to build a hierarchy of complex concepts out of simpler concepts. This approach works for subjective and intuitive problems which are difficult to articulate.

Consider image data. Computers cannot understand the meaning of a collection of pixels. Mappings from a collection of pixels to a complex Object are complicated.

With deep learning, the problem is broken down into a series of hierarchical mappings – with each mapping described by a specific layer.

The input (representing the variables we actually observe) is presented at the visible layer. Then a series of hidden layers extracts increasingly abstract features from the input, with each layer concerned with a specific mapping. Note, however, that this process is not predefined, i.e. we do not specify what the layers select.

For example:

  • From the pixels, the first hidden layer identifies the edges.
  • From the edges, the second hidden layer identifies the corners and contours.
  • From the corners and contours, the third hidden layer identifies the parts of objects.
  • Finally, from the parts of objects, the fourth hidden layer identifies whole objects.

Image and example source: Yoshua Bengio book – Deep Learning

Implications for IoT

To recap:

  • Deep learning algorithms apply to many areas including Computer Vision, Image recognition, pattern recognition, speech recognition, behaviour recognition etc
  • Deep learning systems are possible to implement now because of three reasons: High CPU power, Better Algorithms and the availability of more data. Over the next few years, these factors will lead to more applications of Deep learning systems.
  • Deep learning applications are best suited for situations which involve large amounts of data and complex relationships between different parameters.
  • Solving intuitive problems: Training a Neural network involves repeatedly showing it that: “Given an input, this is the correct output”. If this is done enough times, a sufficiently trained network will mimic the function you are simulating. It will also ignore inputs that are irrelevant to the solution. Conversely, it will fail to converge on a solution if you leave out critical inputs. This model can be applied to many scenarios

In addition, we have limitations in the technology. For instance, we have a long way to go before a Deep learning system can figure out that you are sad because your cat died (although it seems CogniToys, based on IBM Watson, is heading in that direction). The current focus is more on identifying photos or guessing age from photos (as with Microsoft’s Project Oxford API).

And we do indeed have a way to go, as Andrew Ng reminds us when he compares Artificial Intelligence to building a rocket ship:

“I think AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If you have a large engine and a tiny amount of fuel, you won’t make it to orbit. If you have a tiny engine and a ton of fuel, you can’t even lift off. To build a rocket you need a huge engine and a lot of fuel. The analogy to deep learning [one of the key processes in creating artificial intelligence] is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.”

Today, we are still limited by technology from achieving scale. Google’s neural network that identified cats had 16,000 nodes. In contrast, a human brain has an estimated 100 billion neurons!

There are some scenarios where back-propagation neural networks are well suited:

  • A large amount of input/output data is available, but you’re not sure how the inputs relate to the outputs. We have a large number of “given this input, this is the correct output” examples which can be used to train the network, because it is easy to create many examples of correct behaviour.
  • The problem appears to have overwhelming complexity. The complexity arises from Low rules base and a high dimensionality and from data which is not easy to represent.  However, there is clearly a solution.
  • The solution to the problem may change over time, within the bounds of the given input and output parameters (i.e., today 2+2=4, but in the future we may find that 2+2=3.8) and Outputs can be “fuzzy”, or non-numeric.
  • Domain expertise is not strictly needed because the output can be purely derived from inputs: This is controversial because it is not always possible to model an output based on the input alone. However, consider the example of stock market prediction. In theory, given enough cases of inputs and outputs for a stock value, you could create a model which would predict unknown scenarios if it was trained adequately using deep learning techniques.
  • Inference:  We need to infer new insights without observation. For example, based on the pitch of the sound – we can infer an accent and hence a nationality

Given an IoT domain, we could consider the top-level questions:

  • What existing applications can be complemented by Deep learning techniques by adding an intuitive component (e.g. in smart cities)?
  • What metrics are being measured and predicted? And how could we add an intuitive component to the metric?
  • What applications exist in Computer Vision, Image recognition, pattern recognition, speech recognition, behaviour recognition, etc. which also apply to IoT?

Now, extending more deeply into the research domain, here are some areas of interest that I am following.

Complementing Deep Learning algorithms with IoT datasets

In essence, these techniques/strategies complement Deep learning algorithms with IoT datasets.

1)      Deep learning algorithms and Time series data : Time series data (coming from sensors) can be thought of as a 1D grid taking samples at regular time intervals, and image data can be thought of as a 2D grid of pixels. This allows us to model Time series data with Deep learning algorithms (most sensor / IoT data is time series).  It is relatively less common to explore Deep learning and Time series – but there are some instances of this approach already (Deep Learning for Time Series Modelling to predict energy loads using only time and temp data  )

2)      Multiple modalities: multimodality in deep learning algorithms is being explored. In particular, cross-modality feature learning, where better features for one modality (e.g., video) can be learned when multiple modalities (e.g., audio and video) are present at feature-learning time.

3)      Temporal patterns in Deep learning: In their recent paper, Ph.D. student Huan-Kai Peng and Professor Radu Marculescu, from Carnegie Mellon University’s Department of Electrical and Computer Engineering, propose a new way to identify the intrinsic dynamics of interaction patterns at multiple time scales. Their method involves building a deep-learning model that consists of multiple levels; each level captures the relevant patterns of a specific temporal scale. The newly proposed model can be also used to explain the possible ways in which short-term patterns relate to the long-term patterns. For example, it becomes possible to describe how a long-term pattern in Twitter can be sustained and enhanced by a sequence of short-term patterns, including characteristics like popularity, stickiness, contagiousness, and interactivity. The paper can be downloaded HERE
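Point (1), the 1D-grid view of sensor data, can be illustrated directly: the same sliding-window (convolution) operation used on 2D pixel grids applies to a time series. In the sketch below a hand-picked smoothing kernel stands in for the filters a deep network would learn:

```python
import numpy as np

# A sensor stream sampled at regular intervals is a 1D grid of values,
# just as an image is a 2D grid of pixels.
t = np.arange(0, 10, 0.1)
signal = np.sin(t) + np.random.default_rng(1).normal(0, 0.3, t.size)

# A deep network would *learn* its filters; this 5-point moving-average
# kernel just illustrates sliding a filter along the 1D grid.
kernel = np.ones(5) / 5.0
smoothed = np.convolve(signal, kernel, mode="same")
```

The smoothed series has the same shape as the input but with the sensor noise averaged down, which is exactly what one layer of a 1D convolutional network does with learned weights.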

Implications for Smart cities

I see Smart cities as an application domain for the Internet of Things. Many definitions exist for Smart cities/future cities. From our perspective, Smart cities refer to the use of digital technologies to enhance performance and wellbeing, to reduce costs and resource consumption, and to engage more effectively and actively with their citizens (adapted from Wikipedia). Key ‘smart’ sectors include transport, energy, health care, water and waste. A more comprehensive list of Smart City/IoT application areas is: intelligent transport systems (including autonomous vehicles), medical and healthcare, environment, waste management, air quality, water quality, accident and emergency services, and energy (including renewables). In all these areas we could find applications to which we could add an intuitive component based on the ideas above.

Typical domains will include Computer Vision, Image recognition, pattern recognition, speech recognition and behaviour recognition. Of special interest are new areas such as self-driving cars, e.g. the Lutz pod, and even larger vehicles such as self-driving trucks.


Deep learning involves learning through layers, which allows a computer to build a hierarchy of complex concepts out of simpler concepts. Deep learning is used to address intuitive applications with high dimensionality.  It is an emerging field and, over the next few years, due to advances in technology, we are likely to see many more applications in the Deep learning space. I am specifically interested in how IoT datasets can be used to complement deep learning algorithms. This is an emerging area with some examples shown above. I believe that it will have widespread applications, many of which we have not fully explored (as in the Smart city examples).

I see this article as part of an evolving theme. Future updates will explore how Deep learning algorithms could apply to IoT and Smart city domains. Also, I am interested in complementing Deep learning algorithms using IoT datasets.

I elaborate these ideas in the Data Science for Internet of Things program  (modelled on the course I teach at Oxford University and UPM – Madrid). I will also present these ideas at the International conference on City Sciences at Tongji University in Shanghai  and the Data Science for IoT workshop at the Iotworld event in San Francisco


Originally posted on Data Science Central


Caltrain Quantified: An Exploration in IoT

Guest blog post by Cameron Turner

Executive Summary

Though often the focus of the urban noise debate, Caltrain is one of many contributors to overall sound levels along the Bay Area’s peninsula corridor. In this investigation, Cameron Turner of Palo Alto’s The Data Guild takes a look at this topic using a custom-built Internet of Things (IoT) sensor atop the Helium networking platform.


If you live in (or visit) the Bay Area, chances are you have experience with the Caltrain. Caltrain is a commuter line which travels 77.4 miles between San Francisco and San Jose, carrying over 50,000 passengers on over 70 trains daily.[1]

I’m lucky to live two blocks from the Caltrain line, and enjoy the convenience of the train. My office, The Data Guild, is just one block away. The Caltrain and its rhythms, bells and horns are a part of our daily life; they connect us to the City and, via connections to BART, Amtrak, SFO and SJC, to the rest of the world.

Over the holidays, my 4-year-old daughter and I undertook a project to quantify the Caltrain through a custom-built sensor and reporting framework, to get some first-hand experience in the so-called Internet of Things (IoT). This project also aligns with The Data Guild’s broader ambition to build out custom sensor systems atop network technologies to address global issues. (More on this here.)

Let me note here that this project was an exploration, and was not conducted in a manner (in goals or methodology) to provide fodder for either side of the many ongoing caltrain debates: the electrification project, quiet zone, or tragic recent deaths on the tracks.


My interest in such a project began with an article published in the Palo Alto Daily in October 2014. The article addressed the call for a quiet zone in downtown Palo Alto, following complaints from residents of buildings closest to the tracks. Many subjective complaints were made by residents based on personal experience.

According to the Federal Railroad Administration (FRA), which sets the rules by which Caltrain operates, train engineers “must begin to sound train horns at least 15 seconds, and no more than 20 seconds, in advance of all public grade crossings.”

Additionally: “Train horns must be sounded in a standardized pattern of 2 long, 1 short and 1 long blasts.” and “The maximum volume level for the train horn is 110 decibels which is a new requirement. The minimum sound level remains 96 decibels.“


Given the numeric nature of the rules, and the subjective nature of the current analysis/discussion, it seemed an ideal problem to address with data. Some of the questions we hoped to address, on and beyond this issue:

  • Timing: Are train horns sounded at the appropriate time?
  • Schedule: Are Caltrains coming and going on time?
  • Volume: Are the Caltrain horns sounding at the appropriate level?
  • Relativity: How do Caltrain horns contribute to overall urban noise levels?


Our methodology to address these topics included several steps:

  1. Build a custom sensor equipped to capture ambient noise levels
  2. Leverage an uplink capability to receive data from the sensor in near real-time
  3. Deploy sensor then monitor sensor output and test/modify as needed
  4. Develop a crude statistical model to convert sensor levels (voltage) to sound levels (dB)
  5. Analysis and reporting


We developed a simple sensor based on the Arduino platform. A baseline Uno board, equipped with a local ATmega328 processor, was wired to an Adafruit Electret Microphone/Amplifier 4466 with adjustable gain.

We were lucky to be introduced through the O’Reilly Strata NY event to a local company: Helium. Backed by Khosla Ventures et al, Helium is building an internet of things platform for smart machines. They combine a wireless protocol optimized for device and sensor data with cloud-based tooling for working with the data and building applications.

We received a Beta Kit which included an Arduino shield for uplink to their bridge device, which then connects via GSM to the Internet. Here is our sensor (left) with the Helium bridge device (right).


With our instrument ready for deployment, we sought to find a safe location to deploy. By good fortune, a family friend (and member of the staff of the Stanford Statistics department, where I am completing my degree) owns a home immediately adjacent to a Caltrain crossing, where Caltrain operators are required to sound their horn.

Conductors might also be particularly sensitive to this crossing, Churchill St., due to its proximity to Palo Alto High School and the recent tragic train-related death of a teen.

From a data standpoint, this location was ideal as it sits approximately half-way between the Palo Alto and California Avenue stations.

We deployed our sensor outdoors facing the track in a waterproof enclosure and watched the first data arrive.


Through a connector to Helium’s fusion platform, we were able to see data in near real-time. (note the “debug” window on the right, where microphone output level arrives each second).

We used another great service provided by Librato (now a part of SolarWinds), a San Francisco-based monitoring and metrics company. Using Librato, we enabled visualization of the sound levels as they were generated, and were able to view each reading relative to its history. This was a powerful capability as we worked to fine-tune the power and amplifier.

Note the spike in the middle of the image above, which we could map to a train horn heard ourselves during the training period.

Data Preparation

Next, we took a weekday (January 7, 2015), which appeared typical of a non-holiday weekday relative to the entire month of data collected. For this period, we were able to construct a 24-hour data set at 1-second sample intervals for our analysis.

Data was accessed through the Librato API, downloaded as JSON, converted to CSV and cleansed.
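That JSON-to-CSV step can be sketched with only the standard library; note that the payload shape and field names below are illustrative, not the actual Librato response format:

```python
import csv
import io
import json

# Hypothetical payload shaped like a metrics-API response; the field
# names here are illustrative, not Librato's actual schema.
payload = json.loads("""
{"measurements": [
  {"measure_time": 1420594800, "value": 1.92},
  {"measure_time": 1420594801, "value": 2.31},
  {"measure_time": 1420594802, "value": 1.88}
]}
""")

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["measure_time", "value"])
for m in payload["measurements"]:
    writer.writerow([m["measure_time"], m["value"]])

csv_text = buf.getvalue()
```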


First, to gain intuition, we took a sample recording gathered at the sensor site of a typical train horn.

Click HERE to hear the sample sound.

Using matplotlib within an ipython notebook, we are able to “see” this sound, in both its raw audio form and as a spectrogram showing frequency:
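The plotting step can be sketched as follows; a synthetic 440 Hz tone stands in for the horn recording, since the original audio isn't reproduced here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic stand-in for the horn recording: a 440 Hz tone plus noise.
fs = 8000                                  # sample rate, Hz
t = np.arange(0, 1.0, 1 / fs)
audio = (np.sin(2 * np.pi * 440 * t)
         + 0.1 * np.random.default_rng(2).normal(size=t.size))

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 5))
ax_wave.plot(t, audio, linewidth=0.5)               # raw audio form
spectrum, freqs, bins, im = ax_spec.specgram(audio, NFFT=256, Fs=fs)
fig.savefig("horn_spectrogram.png")
```

The spectrogram panel shows energy concentrated in the band around the tone's frequency, which is how the horn shows up against broadband background noise.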

Next, we look at our entire 24 hours of data, beginning on the evening of January 6, and concluding 24 hours later on the evening of January 7th. Note the quiet “overnight” period, about a quarter of the way across the x axis.

To put this into context, we overlay the Caltrain schedule. Given the sensor sits between the Palo Alto and California Avenue stations, and given the variance in stop times, we mark northbound trains using the scheduled stop at Palo Alto (red), and southbound trains using the scheduled stop at California Ave (green).

Initially, we can make two contrasting observations: many peak sound events lie quite close to these stop times, as expected. However, many of the sound events (including the maximum recorded value, the nightly ~11pm freight train service) occur independent of the scheduled Caltrains.

Conversion to Decibels

On the Y axis above, the sound level is reported in the raw voltage output from the Microphone. To address the questions above we needed a way to convert these values to decibel units (dB).

To do so, a low-cost sound meter was obtained from Fry’s. Then an on-site calibration was performed to map decibel readings from the sensor to the voltage output uploaded from our microphone.

Within R Studio, these values were plotted and a crude estimation function was derived to create a linear mapping between voltage and dB:

The goal of doing a straight line estimate vs. log-linear was to compensate for differences in apparatus (dB meter vs. microphone within casing) and overall to maintain conservative approximations. Most of the events in question during the observation period were between 2.0 and 2.5 volts, where we collected several training points (above).
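That calibration fit can be sketched in Python; the voltage/dB pairs below are illustrative, not the actual meter readings. A least-squares line gives a function mapping microphone voltage to decibels:

```python
import numpy as np

# Illustrative calibration pairs (voltage from the mic, dB from the
# hand-held meter); the actual on-site readings are not published here.
volts = np.array([2.0, 2.1, 2.2, 2.3, 2.4, 2.5])
db    = np.array([58.0, 63.0, 69.0, 74.0, 80.0, 85.0])

# Straight-line fit: dB ~ slope * volts + intercept
slope, intercept = np.polyfit(volts, db, 1)

def to_db(v):
    """Convert a raw microphone voltage to an estimated dB level."""
    return slope * v + intercept
```

A least-squares line always passes through the mean of the calibration points, so the fit is most trustworthy near the middle of the calibrated range and should be treated as a crude extrapolation outside it.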

A challenge in this process was the slight lag between readings and data collection with unknown variance. As such, only “peak” and “trough” measurements could be used reliably to build the model.

With this crude conversion estimator in hand, we would now replot the data above with decibels on the y axis.

Clearly the “peaks” above are of interest as outliers from the baseline noise level at this site. In fact, there are 69 peaks (>82 dB) observed (at a 1-second sample rate), and 71 scheduled trains for this same period. Though this location was about 100 yards removed from the tracks, the horns are quieter than the 96 dB to 110 dB range required by the FRA. (With the caveat above re: the crude approximator.)

Interesting also that we’re not observing the “two long, one short, one long” pattern. Though some events are lost to the sampling rate, qualitatively this does not seem to be a standard practice followed by the engineers. Those who live in Palo Alto also know this to be true, qualitatively.

Also worth noting is the high variance of ambient noise, the central horizontal blue “cloud” above, ranging from ~45 dB to ~75 dB. We sought to understand the nature of this variance and whether it contained structure.

Looking more closely at just a few minutes of data during the Jan 7 morning commute, we can see that indeed there is a periodic structure to the variance.

In comparing to on-site observations, we could determine that this period was defined by the traffic signal which sits between the sensor and the train tracks, on Alma St. Additionally, we often observe an “M” structure (bimodal peak) indicating the southbound traffic accelerating from the stop line when the light turned green, followed by the passing northbound traffic seconds later.

Looking at a few minutes of the same morning commute, we can clearly see when the train passed and sounded its horn. Here again, green indicates a southbound train, red indicates a northbound train.

In this case, the southbound train passed slightly before its scheduled arrival time at the California Avenue station, and the Northbound train passed within its scheduled arrival minute, both on time. Note also the peak unassociated with the train. We’ll discuss this next.

Perhaps a more useful summary of the data collected is shown as a histogram, where the decibels are shown on the X axis and the frequency (count) is shown on the Y axis.

We can clearly see a bimodal distribution, where sound is roughly normally distributed, with a second distribution at the higher end. The question still remained: why did several of the peak observed values fall nowhere near a scheduled train time?

The answer here requires no sensors: airplanes, sirens and freight trains are frequent noise sources in Palo Alto. These factors, coupled with a nearby residential construction project accounted for the non-regular noise events we observed.

Click HERE to hear a sample sound.

Finally, we subsetted the data into three groups: non-train minutes, northbound-train minutes and southbound-train minutes. The mean dB levels were 52.13, 52.18 and 52.32 respectively. While the ordering here makes sense, these samples bury the outcome, since a horn blast may account for only one second of a train-minute. The difference between northbound and southbound is consistent with on-site observation: given that the sensor lies on the northeast corner of the crossing, horn blasts from southbound trains were more pronounced.
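The dilution effect is easy to see with a toy train-minute (figures illustrative): 59 seconds of ~52 dB ambient noise plus a single 85 dB horn second barely moves the minute's mean.

```python
import statistics

# Synthetic stand-in for one train-minute: 60 one-second dB samples,
# 59 of ambient noise plus a single horn blast second.
ambient = [52.0] * 59
train_minute = ambient + [85.0]

minute_mean = statistics.mean(train_minute)
print(minute_mean)  # the horn second barely moves the minute's mean
```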


Before making any conclusions it should be noted again that these are not scientific findings, but rather an attempt to add some rigor to the discussion around Caltrain and noise pollution. Further study, with a longer period of analysis and replication of data collection, would be required to state these conclusions with statistical confidence.

That said, we can readdress the topics in question:

Timing: Are train horns sounded at the appropriate time?

The FRA recommends engineers sound their horn between 15 and 20 seconds before a crossing. Given the tight urban nature of this crossing, that recommendation seems a misfit. Caltrain engineers are sounding their horns within 2-3 seconds of the crossing, which seems more appropriate.

Schedule: Are Caltrains coming and going on time?

Though not explored in depth here, generally we can observe that trains are passing our sensor prior to their scheduled arrival at the upcoming station.

Volume: Are the Caltrain horns sounding at the appropriate level?

As discussed above, the apparent dB level at a location very close to the track was well below the FRA recommended levels.

Relativity: How do Caltrain horns contribute to overall urban noise levels?

The Caltrain horns add roughly 10dB to peak baseline noise levels, including periodic traffic events at the intersection observed.
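To put that figure in perspective, decibels are a logarithmic scale; a one-line calculation (in terms of sound power) shows that a 10 dB increase corresponds to a tenfold increase:

```python
# Decibel differences map to power ratios via ratio = 10 ** (dB / 10),
# so the ~10 dB added by a horn is roughly ten times the baseline sound power.
delta_db = 10
power_ratio = 10 ** (delta_db / 10)
```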


Due to their regular frequency and physical presence, trains are an easy target when it comes to urban sound attenuation efforts. However, the regular oscillations of traffic, sirens, airplanes and construction create a very high, if not predictable baseline above which trains must be heard.

Considering the importance of safety to this system, which operates just inches from bikers, drivers and pedestrians, there is a tradeoff to be made between supporting quiet zone initiatives and the capability of speeding trains to be heard.

In Palo Alto, as we move into an era of electric cars, improved bike systems and increased pedestrian access, the oscillations of noise created by non-train activities may indeed subside over time. And this, in turn, might provide an opportunity to lower the “alert sounds”, such as sirens and train horns, required to deliver these services safely. Someday much of our everyday activity might be accomplished quietly.

Until then, we can only appreciate these sounds which must rise above our noisy baseline, as a reminder of our connectedness to the greater bay area through our shared focus on safety and convenient public transportation.


Sincere thanks to Helen T. and Nick Parlante of Stanford University, Mark Phillips of Helium and Nik Wekwerth/Jason Derrett/Peter Haggerty of Librato for their help and technical support.

Thanks also to my peers at The Data Guild, Aman, Chris, Dave and Sandy and the Palo Alto Police IT department for their feedback.

And thanks to my daughter Tallulah for her help soldering and moral support.


Originally posted on LinkedIn. 


5 levels of machine learning

Guest blog post by derick.jose

There is a seismic shift underway in the engineering industries. The decreasing cost of sensors, the increasing instrumentation of assets and the need for new revenue streams are forcing engineering firms to reimagine their business models. The fusion of “atoms with bytes” promises to unlock previously unrecognized value and generate additional revenue streams predicated on intelligence derived from the data. As machines increasingly become nodes in vast industrial networks, value is shifting toward the intelligence that controls them. The intelligent platformization of machines has begun.

Keeping in mind this fundamental shift in value from atoms to intelligence, Flutura has defined 5 levels of maturity to assess the machine intelligence quotient of an engineering organisation. The highest level of maturity is the "Facebook of machines", with ubiquitous sensor connectivity; the lowest is an asset which is "unplugged", where the device is offline. As organisations embark on a journey to intensify the intelligence layer in their IoT offering, it makes sense to map their current state of maturity.

5 Levels of IoT Intelligence

The 5 levels of machine intelligence, with specific illustrative examples, are outlined below.


This is the lowest level in the maturity map. At this level of maturity, the device or sensor is 'unplugged' from the network. There are no “eyes” to see the state of the machine at any point in time. The machine is offline to the engineering organisation. A vast majority of engineering firms manufacture assets which fall into this category. For example, a wide variety of industrial pumps are still completely mechanical devices with no sensors to instrument them.


This is the next level of machine intelligence on the maturity curve. At this level, the device is connected to the network, and rudimentary intelligence exists on the device to take corrective, self-healing action. Examples of assets with edge intelligence include cars which can alert drivers to basic conditions that need intervention, and a boiler which can switch valves on or off based on steam pressure.



Space is limited.

Reserve your Webinar seat now

Join us June 23, 2015 at 9am PDT for our latest DSC Webinar Series: Deriving Analytic Insights from Machine Data and IoT Sensors - Case Studies, sponsored by Teradata and Hortonworks.

Hadoop and the Internet of Things have enabled data-driven companies to leverage new data sources and apply new analytical techniques in creative ways that provide competitive advantage. Beyond clickstream data, companies are finding transformational insights in machine data and telemetry that are radically improving operational efficiencies and yielding new, actionable customer insights.

We will discuss real-world case studies from the field that describe the strategies, architectures, and results from forward-thinking Fortune 500 organizations across a variety of verticals, including insurance, healthcare, media & entertainment, communications, and manufacturing.

Chad Meley, Vice President of Product & Services, Teradata
John Kreisa, Vice President of Marketing Strategy, Hortonworks

Hosted by
Bill Vorhies, Contributing Editor, Data Science Central & Data Magnum
Title:  Deriving Analytic Insights from Machine Data and IoT Sensors - Case Studies
Date:  Tuesday, June 23, 2015
Time:  9:00 AM - 10:00 AM PDT
Again, space is limited, so please register early:
Reserve your Webinar seat now
After registering you will receive a confirmation email containing information about joining the Webinar.

Space is limited.

Reserve your Webinar seat now

Please join us on July 21, 2015 at 9am PDT for our latest Data Science Central Webinar Series: IoT: How Data Science-Driven Software is Eating the Connected World, sponsored by Pivotal.

The Internet of Things (IoT) will forever change the way businesses interact with consumers and each other. To derive true value from these devices, and ultimately drive the next fundamental shift in how we live and operate, requires the ability to pool this data and build models that drive real and significant actions.

In this DSC webinar, one of Pivotal's principal data scientists will present a series of use cases illustrating how such devices, and the data from them, drive real impact across industries. From smart sensors to connected hospitals, each example will highlight the fundamental concepts for success.

You will learn about:

  • Starting with the basics: How data science drives action and outcomes
  • Avoiding the obstacles: How to avoid the pitfalls that prevent models from driving real action
  • Building your toolbox: What tools are available

The DSC webinar will provide a unique look at new developments in the rapidly-changing world of IoT and data science.

Sarah Aerni, Senior Data Scientist​ -- Pivotal​

Hosted by
Bill Vorhies, Senior Contributing Editor, Data Science Central
Title:  IoT: How Data Science-Driven Software is Eating the Connected World
Date:  Tuesday, July 21, 2015
Time:  9:00 AM - 10:00 AM PDT
Again, space is limited, so please register early:
Reserve your Webinar seat now
After registering you will receive a confirmation email containing information about joining the Webinar.

What is the Internet of Everything (IoE)?

Originally posted by Vincent Granville on Data Science Central

Guest blog post by Peter Diamandis, chairman and CEO of the X PRIZE Foundation, best known for its $10 million Ansari X PRIZE for private spaceflight.  Today the X PRIZE leads the world in designing and operating large-scale global competitions to solve market failures.

Every month I hold a webinar for my Abundance 360 executive mastermind members that focuses on different exponential technologies impacting billion-person problems.

This week I interviewed Padma Warrior, CTO and Chief Strategist of Cisco, to discuss the Internet of Everything (IOE).

Padma is a brilliant and visionary person, one of the most important female leaders of this decade.

She first got my attention when she quoted a recent Cisco study placing the value of IoE as a $19 trillion opportunity.

This blog is about how you can tap into that $19 Trillion.

What is the Internet of Everything (IoE)?

The Internet of Everything describes the networked connections between devices, people, processes and data.

By 2020, the IoE has the potential to connect 50 billion people, devices and things.

In the next 10 years, Cisco is projecting IoE will generate $19 trillion of value – $14 trillion from the private sector, and $5 trillion from governments and public sectors (initiatives like smart cities and infrastructure).

Imagine a Connected World

Let me try to paint an IoE picture for you.

Imagine a world in which everything is connected and packed with sensors.

50+ billion connected devices, loaded with a dozen or more sensors, will create a trillion-sensor ecosystem.

These devices will create what I call a state of perfect knowledge, where we'll be able to know what we want, where we want, when we want.

Combined with the power of data mining and machine learning, the value that you can create and the capabilities you will have as an individual and as a business will be extraordinary.

Here are a few basic examples to get you thinking:

  • Retail: Beyond knowing what you purchased, stores will monitor your eye gaze, knowing what you glanced at… what you picked up and considered, and put back on the shelf. Dynamic pricing will entice you to pick it up again.
  • City Traffic: Cars looking for parking cause 40% of traffic in city centers. Parking sensors will tell your car where to find an open spot.
  • Lighting: Streetlights and house lights will only turn on when you're nearby.
  • Vineyards/Farming: Today IoE enables winemakers to monitor the exact condition (temperature, humidity, sun) of every vine and recommend optimal harvest times. IoE can follow details of fermentation and even assure perfect handling through distribution and sale to the consumer at the wine store.
  • Dynamic pricing: In the future, everything will have dynamic pricing, where supply and demand drive the price. Uber already knows when demand is high, or when I'm stuck miles from my house, and can charge more as a result.
  • Transportation: Self-driving cars and IoE will make ALL traffic a thing of the past.
  • Healthcare: You will be the CEO of your own health. Wearables will be tracking your vitals constantly, allowing you and others to make better health decisions.
  • Banking/Insurance: Research shows that if you exercise and eat healthy, you're more likely to repay your loan. Imagine a variable interest rate (or lower insurance rate) depending on exercise patterns and eating habits?
  • Forests: With connected sensors placed on trees, you can make urban forests healthier and better able to withstand -- and even take advantage of -- the effects of climate change.
  • Office Furniture: Software and sensors embedded in office furniture are being used to improve office productivity, ergonomics and employee health.
  • Invisibles: Forget wearables, the next big thing is sensor-based technology that you can't see, whether they are in jewelry, attached to the skin like a bandage, or perhaps even embedded under the skin or inside the body. By 2017, 30% of wearables will be "unobtrusive to the naked eye," according to market researcher Gartner.

The Biggest Business Opportunities Will Be in Making Systems More Efficient

The Internet of Everything will become the nervous system of the human economy.

Entrepreneurs who capitalize on this will drive massive value, enable better decisions, and reduce inefficiencies.

If you are an entrepreneur or running a business, you need to do two key things:

1. Digitize: Determine which of your processes are not yet digitized and find a way to digitize them. Then, collect data and analyze that data. Go from your old-style manual process (or data collection system) to an autonomous digital version.

2. Skate to the Puck: Brainstorm with the smartest members of your team (or find some local Singularity University alumni to join you) and ask yourselves the following questions:

  • What kind of sensors will exist in 3 years' time, and what kind of data could we be collecting?
  • In three years, which of our "things" will be connected and joining the Internet of Everything?

With the answers to these two basic questions, come up with the business opportunities that will exist in three years… and begin developing the business models, developing the software and planning out your domination.

This is the sort of content and conversations we discuss at my 250-person executive mastermind group called Abundance 360. The program is ~88% filled. You can apply here.

Share this email with your friends, especially if they are interested in any of the areas outlined above.

We are living toward incredible times where the only constant is change, and the rate of change is increasing.



The Internet of Things may be giving way to the Internet of Everything as more and more uses are dreamed up for the new wave of Smart Cities.

In the Internet of Things, objects have their own IP address, meaning that sensors connected to the web can send data to the cloud on just about anything: how much traffic is rolling through a stoplight, how much water you’re using, or how full a trash dumpster is.

Cities are discovering how they can use these new technologies — and the data they generate — to be more efficient and cost effective in many different ways. And it’s a good thing, too; some estimates suggest that 66 percent of the world’s population will live in urban areas by the year 2050.

These are cutting edge ideas, but here are some of the most fascinating ways Smart Cities are using big data and the Internet of Things to improve quality of life for their residents:

  • The city of Long Beach, California is using smart water meters to detect illegal watering in real time; the meters have helped some homeowners cut their water usage by as much as 80 percent. That’s vital when the state is going through its worst drought in recorded history and the governor has enacted the first-ever state-wide water restrictions.
  • Los Angeles uses data from magnetic road sensors and traffic cameras to control traffic lights and thus the flow (or congestion) of traffic around the city. The computerized system controls 4,500 traffic signals around the city and has reduced traffic congestion by an estimated 16 percent.  
  • Xcel Energy initiated one of the first ever tests of a “smart grid” in Boulder, Colorado, installing smart meters on customers’ homes that would allow them to log into a website and see their energy usage in real time. The smart grid would also theoretically allow power companies to predict usage in order to plan for future infrastructure needs and prevent brown out scenarios.
  • A tech startup called Veniam is testing a new way to create mobile wi-fi hotspots all over the city in Porto, Portugal. More than 600 city buses and taxis have been equipped with wi-fi transmitters, creating the largest free wi-fi hotspot in the world. Veniam sells the routers and service to the city, which in turn provides the wi-fi free to citizens, like a public utility. In exchange, the city gets an enormous amount of data, the idea being that the data can be used to offset the cost of the wi-fi in other areas. For example, in Porto, sensors tell the city’s waste management department when dumpsters are full, so they don’t waste time, man-hours, or fuel emptying containers that are only partly full.
  • New York City is creating the world’s first “quantified community” where nearly everything about the environment and residents will be tracked. The community will be able to monitor pedestrian traffic flow, how much of the solid waste collected is recyclable or food waste, and air quality. The project will even collect data on residents’ health and activity levels through an opt-in mobile app.
  • Songdo, South Korea has been conceived and built as the ultimate Smart City — a city of the future. Trash collection in the city is completely automated, through pipes connected to every building. The solid waste is sorted then recycled, buried, or burned for fuel. The city is partnering with Cisco to test other technologies, including home appliances and utilities controlled by your smartphone, and even a tracking system for children (using microchips implanted in bracelets).

This is just the beginning of the integration of big data and the Internet of Things into daily life, but it is by no means the end. As our cities get smarter and begin collecting and sending more and more data, new uses will emerge that may revolutionize the way we live in urban areas.

Of course, more technology can also mean more opportunities for hackers and terrorists. (Anyone see Die Hard 4, where terrorists hacked the traffic control systems in Washington, D.C.?) The threat that a hacker could shut down a city’s power grid, traffic system, or water supply is real — mostly because the technology is so new that cities and providers are not taking the necessary steps to protect themselves.

Still, it would seem that the benefits will outweigh the risks with these new data-driven technologies for cities, so long as the municipalities are paying attention to security and protecting their assets and their customers.

What’s your opinion? Are you for or against more integrated technologies in cities? I’d love to hear your thoughts in the comments below.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About : Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.



Understanding the nature of IoT data

Guest blog post by ajit jaokar

This post is part of the series Twelve unique characteristics of IoT based Predictive analytics/machine learning.

Here, we discuss IoT devices and the nature of IoT data.

Definitions and terminology

    Business Insider makes some bold predictions for IoT devices:

    The Internet of Things will be the largest device market in the world.  

    By 2019 it will be more than double the size of the smartphone, PC, tablet, connected car, and wearable markets combined.

    The IoT will result in $1.7 trillion in value added to the global economy in 2019.

    Device shipments will reach 6.7 billion in 2019 for a five-year CAGR of 61%.

    The enterprise sector will lead the IoT, accounting for 46% of device shipments this year, but that share will decline as the government and home sectors gain momentum.

    The main benefit of growth in the IoT will be increased efficiency and lower costs.

    The IoT promises increased efficiency within the home, city, and workplace by giving control to the user.

And others say Internet of Things investment will run to 140bn over the next five years.


Also, the term IoT has many definitions, but it's important to remember that IoT is not the same as M2M (machine to machine). M2M is a telecoms term which implies that there is a radio (cellular) at one or both ends of the communication. IoT, on the other hand, simply means connecting to the Internet. When we speak of IoT (billions of devices), we are really referring to smart objects. So, what makes an object smart?

What makes an object smart?

Back in 2010, the then Chinese Premier Wen Jiabao said “Internet + Internet of things = Wisdom of the earth”. Indeed, the Internet of Things revolution promises to transform many domains. As the term Internet of Things (IoT) implies, IoT is about smart objects.


For an object (say a chair) to be ‘smart’ it must have three things:

  • An identity (to be uniquely identifiable, e.g. via IPv6)
  • A communication mechanism (i.e. a radio), and
  • A set of sensors/actuators


For example, the chair may have a pressure sensor indicating that it is occupied.

Now, if it is able to know who is sitting on it, it could correlate more data by connecting to the person’s profile.

If it is in a cafe, whole new data sets can be correlated (about the venue, about who else is there, etc.).

Thus, IoT is all about data.
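A minimal sketch of these three ingredients in code, using a hypothetical chair object (the class and field names are illustrative, and a UUID stands in for an IPv6 address):

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class SmartChair:
    # Identity: unique and addressable (a UUID stands in for IPv6 here).
    identity: str = field(default_factory=lambda: str(uuid4()))
    # Communication mechanism: e.g. a Bluetooth Low Energy radio.
    radio: str = "BLE"
    # Sensor: a pressure reading indicating occupancy.
    occupied: bool = False

    def report(self) -> dict:
        # The payload the chair would transmit over its radio.
        return {"id": self.identity, "occupied": self.occupied}

chair = SmartChair(occupied=True)
payload = chair.report()
```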

How will Smart objects communicate?

How will billions of devices communicate? Primarily through the ISM band and Bluetooth 4.0 / Bluetooth low energy.

Certainly not through the cellular network (hence the above distinction between M2M and IoT is important).

Cellular will play a role in connectivity, and there will be many successful applications and connectivity models (e.g. Jasper Wireless, which primarily requires a SIM card in the device).

A more likely scenario is IoT-specific networks like Sigfox (which could be deployed by anyone, including telecom operators). Sigfox currently uses the most popular European ISM band on 868MHz (as defined by ETSI and CEPT), along with 902MHz in the USA (as defined by the FCC), depending on specific regional regulations.

Also, when 5G networks are deployed (beyond 2020), cellular will provide wide-area connectivity for IoT devices.

In any case, smart objects will generate a lot of data.


Understanding the nature of IoT data

In the ultimate vision of IoT, things are identifiable, autonomous, and self-configurable. Objects communicate among themselves and interact with the environment. Objects can sense, actuate, and react predictively to events.


Billions of devices will create massive volume of streaming and geographically-dispersed data. This data will often need real-time responses.

There are primarily two modes of IoT data: periodic observations/monitoring and abnormal-event reporting.

Periodic observations place heavy demands on systems due to their high volumes and storage overheads. Events, on the other hand, are one-off but need a rapid response.
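The two modes can be sketched as a simple routing rule: every reading is buffered as a periodic observation, while readings beyond a threshold also raise an immediate event (the threshold and values here are made up):

```python
THRESHOLD = 75.0  # hypothetical alarm level, e.g. in dB

def route_reading(value, buffer, events):
    # Abnormal-event reporting: one-off, needs a rapid response.
    if value > THRESHOLD:
        events.append(value)
    # Periodic observation: stored in bulk for later analysis.
    buffer.append(value)

buffer, events = [], []
for v in [51.2, 52.0, 80.5, 51.7]:
    route_reading(v, buffer, events)
```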

In addition, if we consider video data (e.g. from surveillance cameras) as IoT data, we have some additional characteristics.

Thus, our goal is to understand the implications of predictive analytics for IoT data. This ultimately entails using IoT data to make better decisions.

I will be exploring these ideas in the Data Science for IoT course/certification program when it launches.

Comments welcome. In the next part of this series, I will explore time series data.



Guest blog post by Amogh Borkar

After using R extensively for analytics projects, believing it was the best language for data scientists, I recently had the chance to pick up Python. R does seem a bit cumbersome when dealing with interfaces to other languages or to the web, such as OAuth. That was my motivation: to use Python to get text from the web and later process it in R, which I felt was the best tool for the job.

However, Python surprised me not only with its web-interfacing abilities, but also with its analytical features. It got me thinking, at a lot of points, why I was still using R when Python could do it so much more elegantly.

So here are some points where I found Python really useful. In a way, this is my version of an answer to the question Python vs R:

1. Interfaces - Like I mentioned before, the number of interfaces and wrappers in Python is huge when you compare it to R. (E.g.: Apache Spark has a direct Python interface, while with R you'd need to configure a wrapper named SparkR.) In some cases though, R is pretty good, such as Jeff Gentry's twitteR package, which is amazing.

2. Handling Large Data - Now, this is one problem all R programmers face, and everyone seems to talk about RAM at some point. One option is to use H2O; I didn't find it very easy to use, though it's much easier than the typical big data frameworks. With Python, not only do you have more interfaces to big data, but also more options to read data, or even a CSV, line by line. This can be used to build amazing algorithms, such as the one Google built for CTR prediction - Google's Whitepaper
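For illustration, the line-by-line pattern looks like this with the standard library's csv module (the toy data is inline here; a real large file would be opened with open() instead):

```python
import csv
import io

# Stream a CSV row by row instead of loading it all into memory.
data = io.StringIO("clicks,label\n3,0\n7,1\n")
total_clicks = 0
for row in csv.DictReader(data):
    # Update running statistics one row at a time, as an online
    # learner (like the CTR models mentioned above) would.
    total_clicks += int(row["clicks"])
```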

3. The code - I remember people talking about the learning curve in R. With Python, the syntax is so readable it almost feels like you're running algorithm descriptions or pseudocode. Warnings are a bit limited, though, and you can still build infinite loops in Python. The data types in Python are a bit more primitive, and you feel the need for something like a data frame. This is where the Python package "pandas" comes in. Pandas gives you R-like (sometimes better) flexibility with data frames. One thing I didn't like about Python, though, was its interface for installing packages. You can't install them easily through any of the IDEs. There is a package manager named "pip" which you can use from the command line. Also, unlike R versions, Python has a v2.7.x, where most present packages run, and a v3.4, where not much runs yet but which everyone is expected to start using eventually.
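A small pandas sketch of that R-like data frame flexibility (toy data, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", "c"], "score": [10, 25, 7]})

# Row filtering, akin to R's subset().
high = df[df["score"] > 9]

# Derived column: rank users by score, highest first.
df["rank"] = df["score"].rank(ascending=False)
```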

4. The models - Finally, that's what we really praise R for: the models available, such as random forests, gradient boosting, GLM and GAM. While these were built in R earlier, with Python you have the package "scikit-learn", which gives you all of these models (I haven't explored this exhaustively, but all the models I typically use are available in Python). In addition, changing Python source code is much easier in case you want to build custom models on top of those already built. What amazed me were the Python visualizations. These are as good as R's, and you can also choose a custom plotting tool such as Qt.

Overall, I am now using Python in places where I find R cumbersome, and have also started using it where it's simply more convenient. I'm new to Python and have posted most of the good things I found about it. I'll write about the shortcomings in subsequent blogs as I explore further.


Guest blog post

Author: Ross Momtahan, writing on behalf of DCIM supplier Geist.

This year at Data Center World 2015 I was lucky enough to see a presentation from Paul Slater, Lead Architect of the Modern Data Center Initiative at Microsoft.

Paul spoke to us about the world that he works in – the world of hyperscale data centers.

To put into perspective the sort of scale we are talking about, Microsoft adds 3600 cores to their data centers every day.

If you’re struggling to visualise that then don’t worry because you’re in the same boat as me!

Working on this sort of scale has presented Microsoft with some difficult problems and they have had to change their approach over the years – learning a few lessons on the way.

Here are my top three takeaways from Paul’s presentation:

Be flexible and use standardised equipment

Start with the assumption that you are wrong! The fact that things very rarely go to plan is why Paul values flexibility so much.

Microsoft can fill their data centers within a few months, ensuring that all equipment is of the same age, same generation and same standards.

If you want to emulate this but don't have the same resources, you should consider cordoning off your data center so that each part is filled with standardised equipment.

Paul also stated that they have multiple designs that depend on the location of the data center – although they are not looking to move to Scandinavia as Facebook and Google have done.

Automation helps increase predictability

Warren Bennis once said “The factory of the future will have only two employees, a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.”

Whether or not you believe that quote to be true (I personally think you could automate the feeding of the dog too), it's certainly true that machines are more predictable than humans.

Microsoft strips people out of its data centers as much as it (ahem) humanly can. One thing Paul said that struck me was that on their data center tours, the number of people on the tour usually outnumbers the total number of people working in the data center. Their largest data center has just five full-time staff.

Build redundancies at hardware level and resilience at service level

This one is simple maths. If you have a service using two data centers, each with 99.9% availability, then you can estimate the overall availability of the service at 99.9999%. Surely this is much easier than trying to get one data center up to 99.9999% availability? This principle can be applied on many levels.
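The arithmetic behind that estimate, assuming the two data centers fail independently:

```python
# The service is unavailable only when both data centers are down at once.
dc_availability = 0.999                    # 99.9% per data center
both_down = (1 - dc_availability) ** 2     # probability both fail together
service_availability = 1 - both_down       # ~0.999999, i.e. "six nines"
```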

Paul also discussed the choice other businesses might have to make between building a new data center, colocation, managed hosting and public cloud services. All are valid choices, with cloud services growing massively in popularity due to the financial flexibility that they offer.

On the flip-side, cloud services give you lots of financial flexibility but very little solution control.

Unfortunately for those who are more concerned with solution control than financial flexibility, the CFO is generally the person who wins these arguments, and the CFO tends to be more interested in financial control than solution control.

One other downside of using a public cloud solution is its effect on usage patterns. If it's extremely easy to consume cloud services and increase costs incrementally, this is very likely to happen! That makes it harder to understand the total impact on the IT budget, even if the cloud solution seems cheaper at face value.

Overall I found the talk from Paul to be very engaging and insightful – if he's speaking at a data center conference near you, I'd recommend you take the time to check out what he has to say.


Guest blog post by Nixon Patel

Beyond Internet of Things (IOT) or Industrial Internet of Things (IIOT)


Cisco has leapfrogged the Internet of Things (IoT) with an all-encompassing abstraction it calls the Internet of Everything (IoE). According to Dave Evans, Cisco's futurist, IoE technically differs from IoT in that IoE encompasses the networks that must support all the data that IoT objects generate and transmit, while IoT is composed of connected objects only. Software and the objects at the edge do not get the job done by themselves; the entire ecosystem is required to make a meaningful business model.

Cisco further differentiates its definition of the Internet of Everything (IoE) as bringing together people, process, data, and things to make networked connections more relevant and valuable than ever before, turning information into actions that create new capabilities, richer experiences, and unprecedented economic opportunity for businesses, individuals, and countries.

This distinction represents one of the main ways the company has differentiated its software-defined networking strategy from those of its competitors. Connected, location-aware applications require more bandwidth, more intelligence on the edge of the network, new considerations for security and orchestration, and more cohesive, programmable infrastructure.

A common understanding needs to be reached before any real implementation can happen. A first step was taken as recently as this week by the Open Interconnect Consortium (OIC) and the Industrial Internet Consortium (IIC), which reached an agreement focused on interoperability technology; little can really be achieved until universal standards are in place to standardize connections between the disparate devices and systems that permeate the industry today. OIC currently has 50+ members, including Dell, HP, Siemens and Honeywell, while IIC has 141 members, including founding companies such as AT&T, Cisco, IBM, GE and Intel.

A Big Data platform's ability to capture, store and process humongous volumes of data cheaply, combined with new open-source analytics tools and statistical languages like R with built-in machine-learning algorithms, will make analysis available in real or near-real time. These new tech paradigms will get the right data to the right device at the right time, and to the right person or machine, to enable the right decision; even seemingly inane concepts like sensor-equipped garbage cans can produce billions of dollars in efficiency-based savings. In the case of smart trash bins, embedded sensors can reduce calls to waste management by allowing officials to see how full a can is, whether hazardous materials are inside, how pickup efforts affect traffic patterns, and even whether a given garbage can's contents carry a particularly offensive odor. These insights might not seem ground-breaking alone, but together they add up to billions in savings.

The Smart Grid is a project close to my heart: it could be an extension of my already operational dream project, a 5 MW solar photovoltaic plant in rural India built to alleviate the energy deficiency in the Indian subcontinent, and it reflects my strong personal belief in clean energy. Energy efficiency, Smart Grid and smart city projects have been in the works for years, but the pervasiveness and effectiveness touted by IoE seem pulled from science fiction, such as humans who live to be hundreds of years old thanks to, among other things, personalized medicine and better collection of biometric data through wearable technology and sensor-embedded household objects.

In the context of my 5 MW Solar PV project, we have 50,000 panels across the 40-acre farm, each panel made up of 60 x 72 solar PV cells, and each cell generates data points for amperage and voltage (among other parameters) every second. This translates into 216 million data points every second, or roughly 6.2 trillion data points every day (assuming an average of 8 hours of sunlight). With IoE I can now predict power generation for the entire 25-year plant life on any given day of the year at a given geolocation, with a certain ambient temperature, humidity and panel tilt angle. This information can be fed into the national Smart Grid in real time, so an entire state's or country's energy cost and supply can be predicted well in advance from historical data. Energy trading platforms, pricing desks and energy derivatives traders will love this, and so will consumers, as they will benefit from optimized power generation costs. Similarly, IoE can be applied across all verticals in the supply chain management of human consumption and personal health.
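The telemetry volume above follows directly from the figures given in the text, and a quick back-of-the-envelope calculation makes the scale concrete:

```python
# Back-of-the-envelope estimate of the plant's telemetry volume,
# using only the figures stated in the article: 50,000 panels,
# 60 x 72 cells per panel, one reading per cell per second,
# and an average of 8 hours of sunlight per day.
PANELS = 50_000
CELLS_PER_PANEL = 60 * 72          # 4,320 cells per panel
SUN_SECONDS_PER_DAY = 8 * 3600     # 28,800 seconds of sunlight

per_second = PANELS * CELLS_PER_PANEL
per_day = per_second * SUN_SECONDS_PER_DAY

print(f"{per_second:,} data points per second")  # 216,000,000
print(f"{per_day:,} data points per day")        # 6,220,800,000,000
```

That is roughly 6.2 trillion readings a day, before counting the "other parameters" each cell also reports.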

Cisco seems to have figured out the right strategy and believes the Internet of Everything (IoE) will alter the trajectory of virtually every person on the planet, consumer and professional alike. Whether it's smarter power grids, personalized retail experiences, improved industrial efficiency, or the ability to control the infrastructure of an entire building with a smartphone app, IoE will be "bigger than anything that's ever been done in high tech." Cisco believes that by 2020 IoE revenues will touch $19 trillion!

I personally believe IoE is going to revolutionize and disrupt the way we humans have lived on this planet for millennia. In my short span of 25 years as a technocrat and serial entrepreneur I have seen at least four disruptive technology cycles; however, the disruption caused by IoE in conjunction with Big Data, analytics, cognitive computing and deep learning will surpass all the earlier disruptions more than a hundredfold.

Read more…

Big Data in the Oil and Natural Gas Industries

Guest blog post by VINU KIRAN .S

In this article I discuss a complete view of applying big data and analytics techniques to vast geological survey data in the oil and gas industry. There is always a need for optimization in oil and gas exploration and production, which includes the finding, locating, producing and distribution stages, and the gathered data shows how data analytics can provide such optimization. This improves the exploration, development, production and renewal of oil and gas resources.

By taking large-scale geological data and applying statistical and quantitative analysis, exploratory and predictive modeling, and fact-based management approaches to it, we can gather productive information for decision making as well as insights into oil and gas operations.

The three major issues facing the oil and gas industry during the exploration and production stages are:

Data management. This includes storage of large-scale structured and unstructured data that can be used for analysis, and effective retrieval of information using analytical and statistical methods.

Quantification of data. This includes the application of statistical and data analytics methods for making predictions and deriving insights from the predicted values.

Risk assessment. This includes predictive analysis of the gathered data against known risks, realized using mathematical models, to learn how to properly deal with unknown risks.

Oil companies are using sensors distributed throughout the oil field, high-end communication devices, and data-mining techniques to monitor and track field drilling operations remotely. The aim is to use real-time data to make better decisions and predict problems.
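As an illustration of the "predict problems" idea (the detector below is a generic sketch, not any operator's actual system), even a simple rolling-window check can flag a sensor reading that drifts sharply from its recent history:

```python
# Minimal streaming anomaly check: flag a reading when it deviates
# more than `threshold` standard deviations from the recent mean.
from collections import deque
from statistics import mean, stdev

def make_detector(window=20, threshold=3.0):
    recent = deque(maxlen=window)  # sliding window of past readings

    def check(reading):
        alert = False
        # Only judge once we have a full window of history to compare to.
        if len(recent) >= window and stdev(recent) > 0:
            alert = abs(reading - mean(recent)) > threshold * stdev(recent)
        recent.append(reading)
        return alert

    return check

# Example: steady wellhead pressure readings, then a sudden drop.
check = make_detector(window=10, threshold=3.0)
readings = [2000 + i % 3 for i in range(15)] + [1500]
alerts = [check(r) for r in readings]
print(alerts[-1])  # the sudden drop is flagged
```

Real deployments would use far richer models, but the pattern is the same: compare live telemetry against recent behavior and alert on deviations.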

Oil is not found in big, cavernous pools in the ground; it resides in layers of rock, stored in the tiny pores between the grains. Much of the rock containing oil is tighter than the surface on which your computer currently sits. Further, oil is found in areas that have structurally trapped the oil and gas, leaving no way out. Without a structural trap, oil and gas commonly migrate through the rock, resulting in lower pressures and uneconomic deposits. All of the geological components play an important role in drilling wells, and all are technically challenging. These data can be gathered pictographically or by means of sensors, resulting in large-scale unstructured data.


In these kinds of industries, organizations must apply new technologies and processes that capture and transform large-scale unstructured data, such as geographical images and sensor data, as well as structured data, into actionable insight to improve exploration and production value and yield while enhancing safety and protecting the environment. Oil wells and field operations are equipped with sensor instruments that capture readings giving a view of equipment performance and well productivity, including reservoir, well, facilities and export data.

Leading, analytics-driven oil and gas organizations are connecting people with trusted information to predict business outcomes and to make real-time decisions that help them outperform their competitors. Despite the wealth of data and content available today, decision makers are often starved for true insight.


The processes and decisions related to oil and natural gas exploration, development and production generate large amounts of data, and the data volume grows daily with new data acquisition, processing and storage solutions, and with the development of new devices to track a wider group of reservoirs in a field, as well as machinery and employee performance.

The following are three big oil industry problems that consume money and produce data, where big data, data-mining and analytic techniques can deliver insights that reduce the associated risk factors:

1. Oil is hard to find. Reservoirs are generally 5,000 to 35,000 feet below the Earth’s surface. Low-resolution imaging and expensive well logs (available only after the wells are drilled) are the only options for finding and describing reservoirs. The rock is difficult for fluids to move through to the wellbore, and the fluids themselves are complex, with many different physical properties.

2. Oil is expensive to produce. The large amount of science, machinery and manpower required to produce a barrel of oil must be deployed profitably, taking into account cost, quantity and market availability.

3. Drilling for oil presents potential environmental and human safety concerns that must be addressed.

Finding and producing oil involves many specialized scientific domains (i.e., geophysics, geology and engineering), each solving important parts of the equation. When combined, these components describe a localized system containing hydrocarbons. Each localized system (reservoir) has a unique recipe for getting the most out of the ground profitably and safely.

So, we can conclude that the oil and gas industry has an opportunity to capitalize on big data analytics solutions. What the industry needs now is to educate big data practitioners on the types of data it captures, so that existing data can be used in faster, smarter ways that help find and produce more hydrocarbons at lower cost, in economically sound and environmentally friendly ways.

Read more…

Guest blog post by William Vorhies

Summary:  Thanks to the IOT (internet of things) an internet-like experience of recommendations and awareness of your preferences is coming to the brick and mortar store near you.

You’ve probably noticed the huge difference in the tone of the conversation between data scientists and the general public over the issue of privacy and personalization.  The professional community is largely quiet but for the public you’d think we were developing bionic eyeballs tracking their most minute and private habits.

In my house my wife is always complaining that I can’t remember how many sweeteners she takes in her tea; who her favorite actors are, or whether she liked that Indian restaurant we visited last year enough to want to go back.  But if a web site shows her a picture of something she browsed yesterday, or if the recommended books and movies on Amazon are a little too on target she’s the first one to raise the hue and cry that her privacy is being violated.  My failing to remember – bad.  Their being helpful by remembering or recommending – also bad???

This is beginning to look like a real Catch-22.  Behaviors we wish for at home are suddenly evil if a web site does an even better job than your spouse at remembering your likes and dislikes.

Personally I think site personalization is a real blessing.  I don’t really want to see ads for rock climbing walls or baby diapers.  I’m not in that market so not being exposed to a random untargeted bunch of ads (think your Sunday paper – what’s a Sunday paper you say?) is all for the good.

Well, web sites are one thing, but these days, with the emerging IOT, our brick and mortar stores are gearing up to behave more like a web site and less like a random walk up one aisle and down another.  Here’s a brief update on who’s doing what in retail IOT.  I’m sure there are many providers I’ve missed, and I can’t say whether these folks are good or bad at what they do, but my hat’s off to them for trying something new that might make my life better, even if my wife would find it a little spooky.

In retail, Heat Maps (which products get picked up more often than others) and Flow Charts (how customers navigate the store) are all the rage.  Sensors also allow retailers to offer coupons over your smart phone that are tailored to your shopping pattern.  And by moving desirable merchandise with long linger times to better locations, frequently deeper in the store, they can achieve the same ‘stickiness’ we associate with web sites, making us stay a little longer.  Where exactly are the customers going in the store, where do they pause and ponder, and how can the retailer use this information to revise the store layout, the merchandise displays, pricing, or anything else to squeeze out another dollar?
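As a rough illustration of the heat-map idea (all names and numbers below are invented for the example, not any vendor's product), raw in-store position pings can be bucketed into grid cells and counted as dwell time:

```python
# Toy in-store heat map: bucket (x, y) position fixes into grid cells
# and count seconds of dwell time per cell.
from collections import Counter

def build_heat_map(pings, cell_size=2.0):
    """pings: list of (x, y) position fixes, one per second per shopper."""
    heat = Counter()
    for x, y in pings:
        cell = (int(x // cell_size), int(y // cell_size))
        heat[cell] += 1  # one second of dwell time in this cell
    return heat

# Example: a shopper passes the entrance, then lingers near (5, 5).
pings = [(1.0, 1.0), (5.2, 5.1), (5.4, 5.3), (5.1, 5.0)]
heat = build_heat_map(pings)
print(heat.most_common(1))  # the cell with the longest linger time
```

The hottest cells point the merchandiser at the displays shoppers actually pause in front of, which is exactly the linger-time signal the vendors above are selling.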

The specifics of sensors and strategies differ from one vendor to another and in this early stage of adoption it’s fair to say that we’re waiting for the market to tell us which are most successful.  Some use your cell phone to triangulate your position, some use cameras, radio beacons, or even more exotic sensor types.  This is a good thing since all this experimentation will tell us what’s worth the investment and what’s not.  Any number of major retailers are running experiments. To name just a few:

Nordstrom – Euclid Analytics

Macy’s – Shopkick

Timberland and Kenneth Cole – Swirl Networks

Goldman’s Dept. Stores – RetailNext

The Future of Privacy Forum, a Washington, D.C., think tank, estimates that about 1,000 retailers are testing some sort of sensor strategy.

Swarm Solutions says 6,000 retailers have installed its door sensors to compare foot traffic with transactions.

Others working with Wi-Fi triangulation include Ekahau, Wifislam, and Prism Skylabs.  Apple’s iBeacon technology probably belongs in this group as well.
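For a sense of how Wi-Fi triangulation can work in principle (this is the generic textbook method, not any of these vendors' actual algorithms), distances to three access points at known positions, estimated in practice from signal strength, pin down a device's 2-D position via a small linear system:

```python
# Trilateration: each circle equation (x - xi)^2 + (y - yi)^2 = di^2 is
# subtracted from the first one, which cancels the quadratic terms and
# leaves a 2x2 linear system in (x, y), solved here by Cramer's rule.
import math

def trilaterate(anchors, distances):
    """anchors: three (x, y) access-point positions; distances: range to each."""
    (x0, y0), (x1, y1), (x2, y2) = anchors
    d0, d1, d2 = distances
    a1, b1 = 2 * (x1 - x0), 2 * (y1 - y0)
    c1 = d0**2 - d1**2 + x1**2 + y1**2 - x0**2 - y0**2
    a2, b2 = 2 * (x2 - x0), 2 * (y2 - y0)
    c2 = d0**2 - d2**2 + x2**2 + y2**2 - x0**2 - y0**2
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# Example: device at (3, 4), access points at three known corners.
anchors = [(0, 0), (10, 0), (0, 10)]
dists = [math.dist((3, 4), a) for a in anchors]
pos = trilaterate(anchors, dists)
print(pos)  # recovers (3.0, 4.0)
```

Real systems must cope with noisy signal-strength-to-distance conversion and use more than three anchors with least-squares fitting, but the geometry is the same.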

Blinksight and Insiteo are working with radio beacons.

Bytelight, Aisle411, Everyfit, and PointInside are all working with other sensor types including embedded floor sensors and even LED lights.

These 15 innovators are probably only the tip of the iceberg.  This is one of those ‘stay tuned for results’ stories.  The results aren’t in but there are lots of horses in the race.  Meanwhile, I’m still looking for the sensors I can install at home that will make my wife think I am a better husband.

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.


About the author:  Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

The original blog can be viewed at:

Read more…
