In this article we’ll use the information we can derive from the structure of the data coming from Mailchimp to craft new variables, or features, that will hopefully help us optimize our marketing campaigns. We will attempt to achieve the following: figure out a way of identifying the gender of each user; segment the e-mail addresses of our users into different categories, for example personal versus business addresses; and segment the events of our recipients into different periods of the day.
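To make this concrete, here is a minimal sketch of two of the features described above; the free-mail domain list and the period-of-day boundaries are illustrative assumptions, not Mailchimp fields:

```python
# Hypothetical feature-engineering sketch; domain list and hour boundaries
# are assumptions chosen for illustration.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

def email_category(address):
    """Classify an e-mail address as 'personal' or 'business' by its domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    return "personal" if domain in FREE_EMAIL_DOMAINS else "business"

def period_of_day(hour):
    """Bucket an event hour (0-23) into a coarse period of the day."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"
```

Each recipient event can then be tagged with these derived features before segmentation.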

Originally posted on Data Science Central

The Internet of Things (IoT) is a network of physical objects embedded with sensors, electronics, network connectivity, and software, enabling those objects to gather and exchange data. As IoT rises to dominance, sensors are playing a pivotal role in measuring a physical quality of an object and turning it into a value that can be read by another device or user. IoT devices are equipped with sensors capable of registering changes in pressure, temperature, motion, light, and sound. In the physical world, more and more objects can now communicate with each other through embedded IoT sensors, actuators, and tags.

The IoT Sensors Market was valued at US$4.90 bn in 2014 and is expected to reach US$34.75 bn by 2023, growing at a CAGR of 24.5% over the forecast period.

A surge in demand for IoT sensors in the automotive industry and the booming Industrial Internet of Things (IIoT) market are strongly driving the growth of the IoT sensors market. Furthermore, increasing demand for consumer electronics and smart devices is boosting market growth; smart devices account for a considerable portion of consumer electronics.

Based on type, the global market can be segmented into accelerometers, gyroscopes, magnetometers, pressure sensors, temperature sensors, light sensors, and others. Temperature sensors held the largest share of the market in 2014. They are adopted in a diverse range of applications, including wearable devices for fitness and health monitoring, smart homes, and industrial weather monitoring. The growing demand for these sensors from industrial end users is therefore boosting the growth of this market.

By application, the global IoT sensors market is segmented into consumer electronics, healthcare, automotive, industrial, building automation, retail, and others. Consumer electronics was the largest contributor to the market in 2014, as devices such as smart home appliances and smart TVs increasingly adopt IoT approaches and become connected. In entertainment electronics, IoT sensors help users establish flexible media usage. Additionally, rising consumer awareness and growing demand for affordable consumer electronics have created favorable conditions in developing regions such as India, the Middle East, and Africa, which is set to offer promising growth opportunities to the global market in the coming years.

Stringent government regulations and policies across the globe are encouraging the development of “smart cities” and this is offering a potential growth opportunity to the global IoT sensors market. IoT sensors would be used in smart cities in smart meters, smart grids, intelligent traffic management systems, and smart parking among others. Further, technological advancements in the medical industry are set to offer a substantial opportunity for the growth of the IoT sensors market. The deployment of healthcare devices using IoT sensors could transform the healthcare industry by focusing on better patient care, lowering costs, and increasing efficiency.

Some of the major players in the IoT sensors market are: Infineon Technologies (Germany), STMicroelectronics N.V. (Switzerland), IBM (U.S.), Robert Bosch GmbH (Germany), Honeywell International Inc. (U.S.), Ericsson (Sweden), InvenSense Inc. (U.S.), Libelium (Spain), ARM Holdings Plc. (U.K.) and Digi International Inc. (U.S.) among others.  




Guest blog post by Gabriel Lowy

As a central repository and processing engine, data lakes hold great promise for raising return on data assets (RDA).  Bringing analytics directly to different data in its native formats can accelerate time-to-value by providing data scientists and business users with increased flexibility and efficiency. 

But to realize higher RDA, data lakes still need governance life vests.  Without data governance and integration, analytics projects risk drowning in unmanageable data that lacks proper definitions or security provisions. 

Success with a data lake starts with data governance.  The purpose of data governance is to ensure that information accessed by users is consistently valid and accurate to improve performance and reduce risk exposure. 

Data governance is a team sport.  Collaboration among data scientists, IT, and business teams defines the use cases that the architecture and analytics software will support. 

Understanding business and technical requirements to identify the value data provides is the first step to developing a data governance cycle.  Data governance establishes guidelines for consistent data definitions across the enterprise.  It also defines who has access to specific data and the purposes of usage. 

Without data governance, it’s impossible to know whether the information presented is accurate, how and by whom it has been manipulated, with what method, and whether it can be audited, validated, or replicated.  As departments maintain their own data – often in spreadsheets – and increasingly rely on outside data sources, the verifiable audit trail is compromised, exposing the firm to compliance violations. 

Including security teams in data governance is also crucial. By understanding what data will be brought into the data lake and the user access permissions, security teams can better understand potential risks.  They can build stronger protection around critical data assets while becoming more resilient and responsive to incidents.


Gaining Context in a Data Lake


As the data lake becomes the repository for more internal and external data, IT must integrate the data lake with the existing infrastructure.  One of the benefits of a data lake is that it can ingest data without a rigid schema or manipulation.  Integration reduces errors and misunderstandings, resulting in better data management.

Modern data integration technologies automate much of the data quality, cataloguing, indexing and error handling processes that often encumber IT teams.  Metadata management is all the more critical in a data lake.  Companies need to manage diversity of terminology and definitions by maintaining strong metadata while providing users with the flexibility to analyze data using modern tools. 

A semantic database to manage metadata provides context into what’s in the data lake and its interrelationships with other data.  This goes beyond the basic capabilities of the Hadoop Distributed File System by making data query more organized and systematic.

To facilitate integration, companies may also consider partitioning clusters into separate tiers.  These tiers can be based on data type and usage, or based on an aging schedule (i.e. current, recent, archive).  Each tier can be assigned different classifications and governance policies based on the characteristics of the different data sets. 
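As a rough sketch of such an aging schedule, a tier (and its implied governance policy) can be derived from a data set's age; the 90-day and 365-day boundaries below are assumptions for illustration, not a standard:

```python
from datetime import date

# Illustrative sketch: assign a data set to a tier by its age.
# The 90/365-day boundaries are assumed cutoffs, not a standard.
def assign_tier(last_modified, today=None):
    today = today or date.today()
    age_days = (today - last_modified).days
    if age_days <= 90:
        return "current"   # hot tier, strictest quality/governance checks
    if age_days <= 365:
        return "recent"    # warm tier
    return "archive"       # cold tier, cheapest storage, relaxed access
```

Each tier can then carry its own classification and governance policy, as described above.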



The scale, cost and flexibility of Hadoop allows organizations to integrate, catalogue, discover and analyze more types of data than ever before at faster speeds.  But a data lake is not a panacea for data management. 

Data governance is the key to success with data lakes.  Understanding the business use cases facilitates sound technical decisions.  It enables companies to integrate historical data with newer big data formats without the need for traditional ETL (extract, transform, load) tools. 

However, vigilance around data quality, context, and usage is essential.  It is key to eliminating organizational and departmental data silos – one of the primary objectives of a data lake.  It also instills more confidence in users that the data they are working with is trustworthy.  Such confidence results in more reliable and predictable models and decision outcomes. 

A strong data management platform ensures users can access data assets as they need it, share information when required, and have the tools to “see” analytics results without the pre-definitions of restricted data sets inherent in legacy business intelligence platforms. 

The elegance of analytics systems and processes lies not in the gathering, storing and processing of data.  Today’s technology has made this relatively easy and common.  Rather, it lies in the integration and management that provide the highest quality data in the timeliest fashion at the point of decision – regardless of whether the decision maker is an employee or a customer.  

Data lakes hold the promise of realizing higher RDA by making data more valuable.  The more valuable a company’s data, the higher its RDA.  And higher data efficiency strengthens competitiveness, financial performance and company valuation.


Gabriel Lowy Technology Content Writer

Image: SwimUniversity



Big data in agriculture

Guest blog post by Brian Rowe

The Data-Driven Weekly is kicking off 2016 by exploring how big data and analytics is powering data-driven business in different industries. First off is the world of agriculture. While data has always played a prominent role in agriculture and ranching, the explosion of cheap sensors and data storage means that every aspect of agriculture can now be measured and optimized.

Possible Futures

According to AGCO (machinery manufacturer), there are “two separate data ‘pipelines’ for [their] customers’ data to flow through – one for machine data and one for agronomic data.” John Deere has a similar vision that focuses on “sensors added to their equipment to help farmers manage their fleet and to decrease downtime of their tractors as well as to save on fuel.” Apparently they combine the sensor data with real-time weather and agronomic data on their portal. While all this sounds interesting, the vision appears a bit anachronistic, relying on dashboards and human drivers. We can see this in their “imagined future” video, where the farmer sits at his desk sipping coffee instead of checking the crops by hand.


I’m assuming that John Deere and the other big manufacturers don’t actually believe that with all this kit humans will still be at the wheel of tractors and combines, but they don’t want to scare their customers into thinking their jobs will be automated away. So baby steps. Human drivers aside, too much of John Deere’s vision (if we take the video at face value) is predicated on human decision making and intervention. One thing going for John Deere is that they use R for their models.

Monsanto, on the other hand, sees a slightly different future. Their Climate Corp subsidiary focuses on “data prediction models that draw on a range of field and climate variables in order to guide the farmer’s delivery of inputs like nitrogen for optimum crop production.” Judging from the simulation description, they are doing a Monte Carlo analysis to optimize crop performance.

For those who shudder at the thought of Monsanto having an even tighter grip on the food supply [1], fear not. The International Centre for Tropical Agriculture (CIAT) offers an alternative via the WorldClim dataset, which provides an open/free “set of global climate layers (climate grids) with a spatial resolution of about 1 square kilometer”. This enables farmers to “optimize crop yields by adjusting their management practices to subtle variations in growing conditions across sites and over time in a given area.”

Data Ownership

Speaking of all this data, the natural question of ownership arises. AGCO believes that “the farmer owns his or her data, and it is up to us leaders in the industry to help them access, process and utilize it.” For others, it’s not so clear. In November, John Deere announced a partnership with the Climate Corp to automatically collect and share agronomic data with the Climate cloud. Touted as a convenient way to get data-driven insights, it’s unclear who owns the data once it is pushed to Climate Corp’s cloud. At a congressional hearing on big data in agriculture last October, the President of the Missouri Farm Bureau said that “farmers should understand what will become of the data collected from their operation”, including who has access to it and for what purposes it can be used. Speaking for farmers, he added that they “must do everything we can to ensure producers own and control their data, can transparently ascertain what happens to the data, and have the ability to store the data in a safe and secure location.” It will certainly be interesting to see how this plays out, particularly between developed and developing nations.


Those interested in exploring this area can get started with some of the following datasets. In addition to the WorldClim dataset, the SPADE dataset provides soil property data for Europe. For machinery compatibility, there is the AEF Database, provided by the Agricultural Industry Electronics Foundation.

Feel free to add more resources in the comments.


[1] Read a letter from the CEO of Climate Corp for an alternative perspective on Monsanto

Brian Lee Yung Rowe is Founder and Chief Pez Head of Pez.AI // Zato Novo, a conversational AI platform for guided data analysis and Q&A. Learn more at Pez.AI.


Maximizing Data Value with a Data Lake

Contributed by Chuck Currin of Mather Economics:

There’s tremendous value in corporate data, and some companies can maximize their data value through the use of a data lake. This assumes that the adopting company has high volume, unstructured data to contend with. The following article describes ways that a data lake can help companies maximize the value of their data. The term “data lake” has been credited to James Dixon, the CTO of Pentaho. He offered the following analogy: 

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”

Rapid Data Application Development

In practice, a data lake can be a very useful platform for fishing out new data applications. In particular, the functionality that Hadoop’s HDFS provides for storing a mixture of structured and unstructured data, side by side, is a game changer. Data analysts can use Hive or HBase to put queryable metadata on top of the unstructured data, making it possible to join very disparate data sources. Once that structure is in place, analysts can run queries and machine learning algorithms against the data in an iterative fashion to gain further insights.  Additionally, R, Stata or Python can be used for further statistical analysis. This methodology enables data organizations to quickly develop churn models and survival models, run Monte Carlo simulations, and perform other advanced analytics.
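As a toy illustration of that pattern (in plain Python rather than Hive or HBase), raw log lines are given just enough structure to be joined against a structured source; the field names and sample data are hypothetical:

```python
import re

# Impose structure on raw, unstructured log lines, then join them
# against a structured table - a miniature version of metadata-on-read.
raw_logs = [
    "2016-01-04 17:00:01 user=42 action=login",
    "2016-01-04 17:03:22 user=7 action=purchase",
]
customers = {42: "Alice", 7: "Bob"}  # structured source keyed by user id

def parse(line):
    """Extract a minimal schema (user id, action) from one log line."""
    m = re.search(r"user=(\d+) action=(\w+)", line)
    return {"user": int(m.group(1)), "action": m.group(2)}

events = [parse(line) for line in raw_logs]
joined = [(customers[e["user"]], e["action"]) for e in events]
```

The same join, expressed in Hive or Spark SQL over HDFS files, scales the idea to real data lake volumes.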

Not only is the data lake very useful for data discovery, it is also a great platform for rapid data application development. Analysts use this platform to experiment through the use of queries and statistical modelling to develop new data applications iteratively. Due to the low cost of data storage and computing resources, you can scale out a Hadoop cluster on commodity hardware as necessary, and accommodate massive data volumes. Also, due to HDFS’s lack of schema, it can handle any file format or type. Non-relational data formats such as machine logs, images and binary can be stored in your data lake. You can also store structured, relational data as well. This comes with the added benefit of being able to store compressed data on HDFS, and leave it in a compressed state while querying over it. Along with the fiscal advantages of using the pooled computing resources of the data lake, it is also highly advantageous to be able to quickly integrate disparate data sets into one place where advanced analytics can be applied.


Data Lakes Complement Data Warehouses

The data lake isn’t a replacement for a data mart or data warehouse. The functionalities of the data lake and the data warehouse are complementary. In the data warehousing world, Extract, Transform and Load (ETL) processes feed the system. By contrast, on the data lake side, there are typically Extract, Load and Transform (ELT) processes. The juxtaposition of the letters in the data lake’s loading acronym reflects the fact that data is loaded into the lake indiscriminately: Extract and Load come first, and the Transform step happens later. In the data warehouse paradigm, where a relational database is involved, the data is Extracted, Transformed and then Loaded; as part of that ETL, the data is staged and cleansed before loading. Data warehousing provides temporal context around the various aspects of your business. That context cannot be replaced by a data lake.
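The difference in load order can be sketched in a few lines; the function names are purely illustrative:

```python
# ETL (warehouse): transform before loading.
def etl(extract, transform, load):
    load(transform(extract()))

# ELT (data lake): land the raw data first, transform at consumption time.
def elt(extract, transform, load, lake):
    lake.append(extract())      # raw data goes into the lake untouched
    load(transform(lake[-1]))   # transformation is deferred until needed
```

In ELT the lake retains the untransformed original, so the same raw data can be re-transformed later for a different question.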


Maximizing Your Data’s Value

Technology managers are always looking to maximize the ROI of their data projects. By pooling their corporate data indiscriminately into a data lake, companies create more opportunities to recoup their technology investments. Taking advantage of commodity hardware and the ease of integrating disparate data sources, data organizations can maximize the value of their data by continuously improving their business models and rapidly developing new data applications. Statistical models are likely to take multiple iterations to optimize, and a data lake facilitates rapid integration of additional metrics into those models. The data lake can also feed a traditional data warehouse, or load data from the warehouse to do a “mashup” against unstructured, non-relational data. Finally, the data lake’s place in the data organization is potentially huge. With the ability to keep all historical data, do complex statistical modelling, create new data applications and enhance the data warehouse, a company can continuously innovate and maximize the business value generated from its data.

Originally posted on Data Science Central


9 Key Benefits of Data Lake

Guest blog post by Kumar Chinnakali

A data lake has a flexible definition. To capture this, the dataottam team took the initiative and released an eBook called “The Collective Definition of Data Lake by Big Data Community”, which contains many definitions from business and technology practitioners.

In a nutshell, a data lake is a data storage and processing system where an organization can place internal data, external data, partner data, competitor data, business process data, social data, and people data. A data lake is not Hadoop. It leverages the store-all principle of data, and it is the data scientist’s preferred data factory.

  1. Scalability – The capability of a data system, network, or process to handle a growing amount of data, or its potential to be enlarged to accommodate that growth. Hadoop, which leverages HDFS storage, is one tool for horizontal scalability.
  2. Converge all data sources – Hadoop can store multi-structured data from a diverse set of sources. In simple terms, the data lake can store logs, XML, multimedia, sensor data, binary files, social data, chat, and people data.
  3. Accommodate high-speed data – To handle high-speed data, the data lake can use tools such as Chukwa, Scribe, Kafka, and Flume, which acquire and queue data arriving at high velocity. This high-speed data can then be integrated with historical data to extract its fullest insights.
  4. Implant the schema – To derive insight and intelligence from data stored in the data lake, we implant a schema on the data and make the data flow into analytical systems. The data lake can leverage both structured and unstructured data.
  5. As-is data format – In legacy data systems, data is modeled (for example, into cubes) at ingestion time. The data lake removes the need for data modeling at ingestion; it can be done at consumption time instead. This offers unmatched flexibility to ask any business or domain question and seek insightful answers.
  6. The schema – A traditional data warehouse does not support schema-less storage. The data lake leverages Hadoop’s simplicity to store data in a schema-less-on-write, schema-on-read mode, which is very useful at data consumption time.
  7. The favorite, SQL – Once data is ingested, cleansed, and stored in the structured SQL storage of the data lake, existing PL/SQL and DB2 SQL scripts can be reused. Tools such as HAWQ, Impala, Hive, and Cascading give us the flexibility to run massively parallel SQL queries while simultaneously integrating with advanced algorithm libraries such as MLlib and MADlib and applications such as SAS. Performing SQL processing inside the data lake shortens time-to-results and consumes far fewer resources than performing it outside.
  8. Advanced analytics – Unlike a data warehouse, the data lake excels at using large quantities of coherent data together with deep learning algorithms to recognize items of interest and power real-time decision analytics.
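The schema-on-read idea from the list above can be sketched in a few lines: the lake keeps raw records as-is, and a schema is applied only when the data is consumed (the record fields here are illustrative):

```python
import json

# The "lake" stores raw records unchanged; no schema is enforced on write.
lake = [
    '{"sensor": "t1", "temp": 21.5}',
    '{"sensor": "t2", "temp": 19.0, "unit": "C"}',  # extra field is fine
]

def read_with_schema(raw, fields):
    """Apply a schema at read time: keep only the requested fields."""
    record = json.loads(raw)
    return {f: record.get(f) for f in fields}

readings = [read_with_schema(r, ["sensor", "temp"]) for r in lake]
```

A different consumer could read the same raw records with a different field list, which is exactly the flexibility schema-on-read provides.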

Originally posted here.



Guest blog post by Randall V Shane

The figure titled "Data Pipeline" is from an article by Jeffrey T. Leek & Roger D. Peng titled "Statistics: P values are just the tip of the iceberg." Both are well-known scientists in the field of statistics and data science, and for them there is no need to debate the importance of data integrity; it is a fundamental concept. Current terminology uses the term "tidy data", a phrase coined by Hadley Wickham in an article of the same name. Whatever you call it, as scientists, they understand the consequences of bad data. Business decisions today are frequently driven by results from data analysis, and as such, today's executives must also understand those same consequences. Bad data leads to bad decisions. 

Data Management Strategy

Ok, case closed. There is nothing more to discuss, or debate. Right? On the surface, this is an obvious conclusion, and you would think there would be no need to discuss it any further. I have been accused of having a stranglehold on the obvious on more than one occasion. However, if this is so obvious, why, in my 20+ years of working as an information architect and data engineer, do I continue to see bad data? When I am engaged to help a company with their data, the first thing I should be handed is documentation that defines the company's data management strategy. However, this has seldom happened (of course, they probably wouldn't need me if they handed me their data management strategy documentation).

Typically, the first thing I do is obtain access to "at least one" of their major databases and reverse-engineer it using a tool like Erwin to see how they are managing their most important data. Invariably, I see very nicely arranged rows and columns across hundreds of tables without a single relationship or primary key assigned. In addition, you hardly ever find a data dictionary. If you have a question about the data, you generally have to schedule an appointment with a very busy individual who is the keeper of this data and is considered the local subject matter expert.

I quoted "at least one" in the paragraph above to highlight that there are usually several major databases, and numerous lesser databases. Just the fact that there are numerous databases siloed throughout the company is a good indicator that there is a lot of work to do. There are large companies that literally have thousands of data stores of duplicated data, and a massive ETL team that is busy moving data from one database to another.

Relative to Big Data?

This is a "Big Data" forum, so what does this discussion have to do with big data? If your company is anything like the one described above, then you are not ready to manage a big data project. Organizations that successfully implement a big data strategy have a documented corporate data management strategy, and big data is simply one part of the overall strategy to properly manage this valuable asset. We have all heard of failed big data projects, and there are numerous reasons for them. The lack of a corporate data management strategy, and a general lack of understanding of data management, can explain most failed projects.

Data Governance
When data is received from a third party, as was discussed in my previous post, Data Integrity: The Rest of the Story Part II, there needs to be a process in place for managing this data upstream. A huge mistake, in my opinion, is to put data into a data lake, or any other type of data store, without putting it through the learning and cleansing process. It is far too easy to rationalize shortcuts, and it is far too difficult to justify revisiting the same work. Clean the data upstream, before it is allowed into your data stores, and the manipulation and analysis of that data will always serve its purpose.
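A minimal sketch of such an upstream cleansing gate, assuming hypothetical `id` and `value` fields: records that fail validation are routed to a reject pile for review instead of being admitted to the store:

```python
# Hedged sketch of "clean upstream": validate each third-party record
# before admission; the required fields (id, numeric value) are assumptions.
def cleanse(records, store, rejects):
    for r in records:
        if r.get("id") is not None and isinstance(r.get("value"), (int, float)):
            store.append(r)       # passed validation: admitted to the store
        else:
            rejects.append(r)     # failed validation: set aside for review
```

The reject pile is what makes revisiting the same work cheap, since the bad records never contaminate downstream analysis.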

This gets into the subject of Data Governance, and Data Quality management (a component of a Data Management Strategy). We will leave this for another forum discussion, but I didn't want to discuss the data pipeline, and data integrity without at least mentioning the key component of governance.

NOAA Storm Data Analysis

Using the example from my last post regarding the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database: if NOAA had properly maintained its data, it would be of far more value to consumers than it is in its current state. Let me restate that I am not picking on NOAA; these examples are everywhere, and in many cases much worse than NOAA's storm database. Since my last post, I have worked on this data very little, and in this short time I found more errors than were discussed last month. Not only were there 2,013 duplicates in the FATALITIES data and 28 FATALITIES records without a storm, but there are also 28,332 LOCATION records that refer to a storm event that does not exist in the DETAILS table.

As you will recall from the previous post, the problems identified in NOAA's storm database included:

  1. Transitive Dependency 
  2. Lack of integrity constraints 
  3. Lack of referential integrity 
  4. Data not normalized (duplications)
  5. Sloppy data management practices

The last problem area identified in the NOAA data is exactly why you need referential integrity on a normalized data set with properly defined constraints. All of these controls protect your data from sloppy data management practices. The duplicates discovered in the NOAA data were probably introduced during the data cleansing process the data went through in 2014. The file dates where the duplicates occurred all ran from 1997 through 2014, and human error during a batch update probably introduced them. This happens to the best of us, and it is exactly why we need stringent controls on our data -- to protect us from ourselves.

The errors discussed in this data set are not exhaustive. I addressed some of the major issues, and the quality of the data is greatly enhanced at this point. However, as I started developing exploratory plots to demonstrate the errors, I discovered that there are 248,982 records in the LOCATIONS table that contain no values for LATITUDE and LONGITUDE, and of these, 187,367 records contain no value for location (most of those with a LOCATION are very general, like "Countywide", yet no county name is provided). My question here is: why create a LOCATION record with no location?

Trustworthy Data?

The example in the previous paragraph highlights the importance of understanding your data. Someone could mistakenly think that there are valid locations for 1,001,608 storms, when in reality the number is closer to 724,294. Regarding the LOCATION data, one violation of normalization rules is that there is LAT/LONG data in both the DETAILS table and the LOCATIONS table. Which one is correct? Do the LAT/LONG values provided in the DETAILS table match those in the LOCATIONS table? I will leave that for someone like NOAA to fix. Lastly, the existing LAT/LONG values are not all in the correct format, nor are they all within a valid range. All longitudes west of the Prime Meridian should be negative numbers, yet the LON2 values in the LOCATIONS table are positive. The ranges of valid latitude and longitude values for the 48 contiguous United States are:

+48.987386 is the northernmost latitude 
+18.005611 is the southernmost latitude 
-124.626080 is the westernmost longitude 
-62.361014 is the easternmost longitude

In the NOAA LOCATIONS table, the range of values for the beginning latitude of the storms are:

Latitude Range = -14.46 to 97.10
Longitude Range = -170.92 to 171.37

And for the ending location of the storm, the value ranges are:

LAT2 Range = -1427 to 6457000
LON2 Range = -17122 to 16012816

As you can see, these are far outside of the ranges for the United States. Let's quickly take a look at the NOAA DETAILS range of values for the beginning and ending latitudes and longitudes:

Range of values for the beginning latitudes = "" to "REDLANDS" (that's correct -- these are the values in the latitude columns of the DETAILS table)
Range of values for the beginning longitudes = "" to " FL."

Range of values for the ending latitudes = "" to "RALEIGH"
Range of values for the ending longitudes = "" to " APEX"

The values are quite different once the data has been coerced into the correct data type. Looking just at the BEGIN_LAT ranges, instead of a range from "" to "REDLANDS", you get a range of values from -14.4 to 70.5. Still not all within the United States. Values less than 0 for latitude would fall somewhere south of the Equator, and we can assume the 70th parallel is somewhere up in Alaska (the 70th parallel runs through the Arctic and the northern tip of Alaska). Positive values for longitude would be east of the Prime Meridian (somewhere in Europe).
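For illustration, here is how that coercion might look in pandas; the sample values are invented, but they mirror the kinds of junk found in the BEGIN_LAT column discussed above:

```python
import pandas as pd

# Coercing a text-typed latitude column; bad values such as "REDLANDS"
# or empty strings become NaN instead of silently remaining strings.
begin_lat = pd.Series(["35.78", "", "REDLANDS", "70.5", "-14.4"])
coerced = pd.to_numeric(begin_lat, errors="coerce")

# min()/max() skip the NaN entries, giving a meaningful numeric range.
print(coerced.min(), coerced.max())
```

With `errors="coerce"`, the range collapses from the meaningless "" to "REDLANDS" down to an actual numeric span, here -14.4 to 70.5.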

Nevertheless, this once again would be a very simple thing to control with a numeric data type field, and a domain constraint on the acceptable range of values. 
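As a sketch, such a domain constraint amounts to a simple range check against the contiguous-US bounds quoted earlier (the function name here is mine, not NOAA's):

```python
# Bounding box for the 48 contiguous United States, per the values above.
LAT_MIN, LAT_MAX = 18.005611, 48.987386
LON_MIN, LON_MAX = -124.626080, -62.361014

def valid_conus_point(lat, lon):
    """Return True only if (lat, lon) is numeric and inside the CONUS box."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False  # e.g. "REDLANDS" or "" in a latitude column
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

print(valid_conus_point(35.78, -78.64))   # a point in North Carolina
print(valid_conus_point(64.57, 160.12))   # latitude and longitude both out of range
print(valid_conus_point("REDLANDS", ""))  # not numeric at all
```

In a database this same predicate would live in a CHECK constraint on numeric columns, so bad rows are rejected at insert time rather than discovered years later.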

To Normalize or Not, that is the question!

Over the years I have had many debates with colleagues over the value of a rigidly controlled database versus a loosely controlled one. What some fail to understand is that well-defined standards and structures enable flexibility, extensibility, reuse, and resilience in a database. I like to think of it in terms of plug-and-play. The hardware standards that evolved in the '90s revolutionized the computer industry, and it was all because of very well-defined standards. Who recalls trying to find a sound card that would work with your Compaq computer?

The same applies to data management best practices and standards; they enable flexibility. Relational data maintained at the ATOMIC level can easily be presented in multiple views, depending on the requirements. Aggregate, dimensional, and fact tables can be created as views, and then easily modified to accommodate changing requirements. The same applies to Big Data analysis.

In my opinion, there are very few legitimate reasons for removing constraints, dropping indexes, or denormalizing. Complexity and performance are the two arguments I hear most frequently, and neither holds water in my opinion. Yes, there are use cases where exceptions are made, but they are few and far between. I also believe that your data has to get really big, and come at you really fast, to warrant non-standard data practices.

Big Data is simply a lot of small data. That said, there are legitimate use cases for a Hadoop platform (e.g., sensor data, clickstream analysis, and real-time predictive analysis, as at Amazon and Netflix), but my personal opinion is that an integrated platform with a reputable RDBMS like Oracle is the way to go. My next post will be on the topic of using Oracle R Advanced Analytics for Hadoop on Oracle's Big Data platform.

To wrap up this series of posts on data integrity, I hope you have taken something positive away from the discussion, and if nothing else have a greater appreciation for the importance of data integrity, as well as a better understanding of what is involved in maintaining clean data. And if you disagree, find errors, or whatever, please feel free to leave a comment. I would love to hear your opinion on the subject. I will provide a link to the code, exploratory plots, and maps on RPubs later this week.

Read more…

Guest blog post by Vincent Granville

This is an interesting resource for data scientists, especially those contemplating a career move to IoT (Internet of Things). Many of these modern, sensor-based data sets, collected via Internet protocols and various apps and devices, are related to the energy, urban planning, healthcare, engineering, weather, and transportation sectors.

Sensor data sets repositories

For more on IoT and sensor data, read The 10 Best Books to Read Now on IoT. For other free data set repositories, click here or visit the links mentioned below.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Guest blog post by Bernard Marr

The biggest threat to your job might come from an unexpected place. I believe that there is a hidden assassin lurking in the background waiting to finish you off in your job. Let’s face it; job security is high on everyone’s wish list right now, especially at times of economic hardship when it might not be so easy to quickly find another one.

So how secure is your job? Where is this hidden threat coming from and how can you put yourself into the best possible position to keep your job? I believe the biggest threat to your job, indeed most of our jobs, is coming from an unforeseen eliminator. I believe that our improved ability to capture and analyze data will allow us to automate most jobs. And I am not just talking about the manual and un-skilled jobs but any job, including the jobs of knowledge workers, doctors, journalists and even sports coaches.

I don’t blame you if, at this point, you might think ‘what the heck are you talking about?’ So, let me give you some examples that should make it clearer, but be aware, they might send a few cold shivers down your spine.

  • Taxi Drivers: When I was in the back of a car driving from San Francisco airport to Silicon Valley, I noticed Google’s self-driving car on the road. I said to the driver: “Hey, check this out. The car we just passed has no driver in it. It’s Google’s self-driving car and stays on the road safely by analyzing a gigantic amount of data from sensors and cameras in real time”. His reply: “So that means that Google will take away my job soon”. This made me think a little more about the fact that our ability to process data will have an impact on so many jobs, and my thoughts continued on that journey.
  • Border Control Agents: When I went back to the airport to catch my plane to London, I used the electronic passport machines. You put your passport in, it scans it, and then scans your face to see whether they match. Then the doors open and you go through immigration. No human contact, and no need for border control agents any more. The machines do a better and more reliable job.
  • Pilots: We know that autopilots have been assisting pilots in flying planes for many years. However, the latest commercial airliners are now able to fly unaided. They can take off and land safely (and arguably more safely than humans, as most air disasters come down to human error). We just have to look at the military, where unmanned aircraft (so-called drones) are taking over. Fighter jet pilots will be Air Force history soon.
  • Doctors: Robotic tools are already assisting surgeons to perform operations and doctors use large-scale databases of medical information to inform their decisions. However, I can imagine a scenario where a full body scanner takes a complete 3D image of you and where robots will perform an operation completely un-assisted. We now have the technology and computing power to perform surgery without the need for humans. And therefore without the risk of human error. Supercomputers will be able to make a solid diagnosis based on all previous medical knowledge (as well as data from your own medical history, DNA code, etc.), again without the input from human doctors.
  • Nurses: We can now buy diapers that tweet you when your baby needs to be changed. The latest evolution of this is to include diagnostic technology in diapers that analyzes the urine and alerts us to any abnormalities. Another example is a hospital unit that looks after premature and sick babies. The unit is now applying real-time analytics based on a recording of every breath and every heartbeat of all babies in the unit. It then analyzes the data to identify patterns. Based on the analysis, the system can now predict infections 24 hours before the baby shows any visible symptoms. This allows the early intervention and treatment that is so vital for fragile babies. With the advances in wearable technology and smart watches, we will be able to monitor all aspects of our health 24 hours a day. What jobs are left for nurses?
  • Customer Support Agents: We all know about the irritating automated answering systems in call centers that give you options and then route your call to the supposedly right person. What we are now seeing is the rise of natural language systems that are able to have a conversation with humans. IBM has developed Watson – a computer that recently challenged two of the all-time best Jeopardy! players. Without access to the Internet, Watson won the game by interpreting natural language questions and answering back after analyzing its massive data memory (that included a copy of the entire Wikipedia database). This means that when you ring any call center you will always speak to the ‘right person’ – only that the person is a machine instead.
  • Sports Coaches: We can now buy baseballs with sensors in them that send back information to your smart phone. There you can get the analysis and feedback of how to improve your game. Football and baseball teams already use cameras and sensors to track and analyze the performance of every player on the field, at any given point in time. For example, the Olympic cycling team in the UK uses bikes that are fitted with sensors on their pedals and collect data on how much acceleration every push on the pedal generates. This allows the team to analyze the performance of every cyclist in every race and every single training session. In addition, the team has started to integrate data from wearable devices (like smart watches) the athletes wear on their wrists. These devices collect data on calorie intake, sleep quality, air quality, exercise levels, etc. The latest innovation is to integrate an analysis of social media posts to better understand the emotional states of athletes and how this might impact track performance.
  • Journalists: A company called Narrative Science recently launched a software product that can write newspaper stories about sports games directly from the games’ statistics. The same software can now be used to automatically write an overview of a company’s business performance using information available on the web. It uses algorithms to turn the information into attractive articles. Newspapers of the future could be fully automated.

I think you are getting the picture, right? Some of these examples might paint pictures of the future, while others are already here and are redefining our job market as you read this.

I can’t think of many jobs that we can’t automate using big data analytics, artificial intelligence and robots. So where does this leave us and our jobs? Will we all become programmers? No. Could we all simply not work and let the machines do our jobs? Unlikely. I find this all a little scary but at the same time trust in our human ability to adapt. We managed to adapt during the industrial revolution when we moved from farm work to industrial labor. We also adapted when we moved from the industrial era to the knowledge economy.

What's clear, however, is that there is a call to action. You need to ensure you advance your career in a way that positions you at the forefront of these developments and that you stay away from jobs that will be the first to go. Overall, I am excited to see how we adapt to the world of big data robots!

What do you think? Please let me know your thoughts. Are you scared or excited? How do you think our world will change with the emergence of big data robots?

About the author: Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance. His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance.

Read more…

Guest blog post by Bernard Marr

The term ‘Big Data’ is a massive buzzword at the moment and many say big data is all talk and no action. This couldn’t be further from the truth. With this post, I want to show how big data is used today to add real value.

Eventually, every aspect of our lives will be affected by big data. However, there are some areas where big data is already making a real difference today. I have categorized the application of big data into 10 areas where I see the most widespread use as well as the highest benefits [For those of you who would like to take a step back here and understand, in simple terms, what big data is, check out the posts in my Big Data Guru column].

Detection of Earth-like planets uses big data

1. Understanding and Targeting Customers

This is one of the biggest and most publicized areas of big data use today. Here, big data is used to better understand customers and their behaviors and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs, text analytics, and sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. You might remember the example of U.S. retailer Target, which is now able to very accurately predict when one of its customers is expecting a baby. Using big data, telecom companies can now better predict customer churn; Wal-Mart can predict what products will sell; and car insurance companies understand how well their customers actually drive. Even government election campaigns can be optimized using big data analytics. Some believe Obama’s win in the 2012 presidential election was due to his team’s superior ability to use big data analytics.

2. Understanding and Optimizing Business Processes

Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts. One particular business process that is seeing a lot of big data analytics is supply chain or delivery route optimization. Here, geographic positioning and radio frequency identification sensors are used to track goods or delivery vehicles and optimize routes by integrating live traffic data, etc. HR business processes are also being improved using big data analytics. This includes the optimization of talent acquisition – Moneyball style, as well as the measurement of company culture and staff engagement using big data tools.

3. Personal Quantification and Performance Optimization

Big data is not just for companies and governments but also for all of us individually. We can now benefit from the data generated by wearable devices such as smart watches or smart bracelets. Take the Up band from Jawbone as an example: the armband collects data on our calorie consumption, activity levels, and sleep patterns. While it gives individuals rich insights, the real value is in analyzing the collective data. In Jawbone’s case, the company now collects 60 years’ worth of sleep data every night. Analyzing such volumes of data will bring entirely new insights that it can feed back to individual users. The other area where we benefit from big data analytics is finding love -- online, that is. Most online dating sites apply big data tools and algorithms to find us the most appropriate matches.

4. Improving Healthcare and Public Health

The computing power of big data analytics enables us to decode entire DNA strings in minutes and will allow us to find new cures and better understand and predict disease patterns. Just think of what happens when all the individual data from smart watches and wearable devices can be applied to millions of people and their various diseases. The clinical trials of the future won’t be limited by small sample sizes but could potentially include everyone! Big data techniques are already being used to monitor babies in a specialist premature and sick baby unit. By recording and analyzing every heartbeat and breathing pattern of every baby, the unit was able to develop algorithms that can now predict infections 24 hours before any physical symptoms appear. That way, the team can intervene early and save fragile babies in an environment where every hour counts. What’s more, big data analytics allow us to monitor and predict the development of epidemics and disease outbreaks. Integrating data from medical records with social media analytics enables us to monitor flu outbreaks in real time, simply by listening to what people are saying, e.g., “Feeling rubbish today - in bed with a cold”.

5. Improving Sports Performance

Most elite sports have now embraced big data analytics. We have the IBM SlamTracker tool for tennis tournaments; we use video analytics that track the performance of every player in a football or baseball game, and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smart phones and cloud servers) on our game and how to improve it. Many elite sports teams also track athletes outside of the sporting environment – using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing.

6. Improving Science and Research

Science and research is currently being transformed by the new possibilities big data brings. Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the world’s largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe – how it started and works - generate huge amounts of data. The CERN data center has 65,000 processors to analyze its 30 petabytes of data. However, it uses the computing powers of thousands of computers distributed across 150 data centers worldwide to analyze the data. Such computing powers can be leveraged to transform so many other areas of science and research.

7. Optimizing Machine and Device Performance

Big data analytics help machines and devices become smarter and more autonomous. For example, big data tools are used to operate Google’s self-driving car. The Toyota Prius is fitted with cameras, GPS as well as powerful computers and sensors to safely drive on the road without the intervention of human beings. Big data tools are also used to optimize energy grids using data from smart meters. We can even use big data tools to optimize the performance of computers and data warehouses.

8. Improving Security and Law Enforcement

Big data is applied heavily in improving security and enabling law enforcement. I am sure you are aware of the revelations that the National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.

9. Improving and Optimizing Cities and Countries

Big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data. A number of cities are currently piloting big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up -- where a bus waits for a delayed train, and where traffic signals predict traffic volumes and operate to minimize jams.

10. Financial Trading

My final category of big data application comes from financial trading. High-Frequency Trading (HFT) is an area where big data finds a lot of use today. Here, big data algorithms are used to make trading decisions. Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.

For me, the 10 categories I have outlined here represent the areas in which big data is applied the most. Of course there are so many other applications of big data and there will be many new categories as the tools become more widespread.

What do you think? Do you agree or disagree with this data revolution? Are you excited or apprehensive? Can you think of other areas where big data is used? Please share your views and comments.

Bernard Marr is a bestselling business author and is globally recognized as an expert in strategy, performance management, analytics, KPIs and big data. His latest book is 'Big Data - Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance'.

You can read a free sample chapter here.


Read more…

Guest blog post by Bernard Marr

One of my favorite examples of why so many big data projects fail comes from a book that was written decades before “big data” was even conceived. In Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, a race of creatures build a supercomputer to calculate the meaning of “life, the universe, and everything.” After hundreds of years of processing, the computer announces that the answer is “42.” When the beings protest, the computer calmly suggests that now they have the answer, they need to know what the actual question is -- a task that requires a much bigger and more sophisticated computer.

This is a wonderful parable for big data because it illustrates one quintessential fact: data on its own is meaningless. Remember, the value of data is not the data itself; it’s what you do with it. For data to be useful you first need to know what data you need; otherwise you are just tempted to collect everything, and that’s not a strategy, it’s an act of desperation that is doomed to end in failure. Why go to all the time and trouble of collecting data that you won’t or can’t use to deliver business insights? You must focus on the things that matter most, otherwise you’ll drown in data. Data is a strategic asset, but it’s only valuable if it’s used constructively and appropriately to deliver results.


Good questions yield better answers

This is why it’s so important to start with the right questions. If you are clear about what you are trying to achieve then you can think about the questions to which you need answers. For example, if your strategy is to increase your customer base, questions that you will need answers to might include, ‘Who are currently our customers?’, ‘What are the demographics of our most valuable customers?’ and ‘What is the lifetime value of our customers?’. When you know the questions you need answered then it’s much easier to identify the data you need to access in order to answer those key questions. For example, I worked with a small fashion retail company that had no data other than their traditional sales data. They wanted to increase sales but had no smart data to draw on to help them achieve that goal. Together we worked out that the questions they needed answers to included:

  • How many people actually pass our shops?
  • How many stop to look in the window, and for how long?
  • How many of them then come into the shop?
  • How many then buy?

What we did was install a small, discreet device in the shop windows that tracked mobile phone signals as people walked past the shop. Everyone passing these particular stores with a mobile phone on them (which nowadays is almost everyone) would be picked up by the sensor in the device and counted, thereby answering the first question. The sensors also measured how many people stopped to look at the window and for how long, and how many people then walked into the store, while sales data recorded who actually bought something. By combining the data from the sensors placed in the window with transaction data, we were able to measure the conversion ratio and test window displays and various offers to see which ones increased the conversion rate. Not only did this fashion retailer massively increase sales by getting smart about the way they combined small traditional data with untraditional Big Data, but they also used the insights to make a significant saving by closing one of their stores. The sensors were finally able to tell them that the footfall reported by the market research company prior to opening in that location was wrong, and the passing traffic was insufficient to justify keeping the store open.
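The funnel arithmetic behind this is simple. A minimal sketch, with invented counts rather than the retailer's actual figures:

```python
# Invented daily counts for the four questions above.
passed_by = 1200   # phones detected walking past the window
stopped   = 300    # lingered at the window
entered   = 90     # walked into the store
purchased = 27     # transactions recorded at the till

# Each stage's rate is its count divided by the previous stage's count;
# overall conversion is purchases over total passers-by.
funnel = {
    "stop rate": stopped / passed_by,
    "entry rate": entered / stopped,
    "purchase rate": purchased / entered,
    "overall conversion": purchased / passed_by,
}
for stage, rate in funnel.items():
    print(f"{stage}: {rate:.1%}")
```

Comparing these rates across different window displays or offers is exactly the A/B test the retailer ran.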

Too much data obscures the truth

Really successful companies today are making decisions based on facts and data-driven insights. Whether you have access to tons of data or not, if you start with strategy and identify the questions you need answers to in order to deliver your outcomes then you will be on track to improve performance and harness the primary power of data. Every manager now has the opportunity to use data to support their decision-making with actual facts. But without the right questions, all those “facts” can conceal the truth. A lot of data can generate lots of answers to things that don’t really matter; instead companies should be focusing on the big unanswered questions in their business and tackling them with big data.


Read more…

Guest blog post by Bernard Marr

Companies have more data on staff than ever before in history and big data analytics is making its way into HR practices fast. Analyzing staff performance is nothing new, but the extent to which we can now collect and analyze such data is going beyond all norms.

Sociometric Solutions puts sensors into employee name badges that can detect social dynamics in the workplace. The sensors report on how employees move around the workplace, with whom they speak, and even the tone of voice they use when communicating. By analyzing data from this smart badge technology, Bank of America noticed that its top-performing call center employees were those who took breaks together. It instituted group break policies and performance improved 23 percent. Another company, Humanscale, builds sensors into its line of office chairs, standing desks and work stations, and offers companies its OfficeIQ system to monitor workplace activity, such as how much time individuals have spent sitting or standing at their desks as well as how long they have been away from them.

In Ireland, grocery chain Tesco has its warehouse employees wear armbands that track the goods they take from the shelves, distribute tasks, and even forecast completion times for a job. In other sectors, including healthcare and the military, wearables can detect fatigue that could be dangerous to employees and the jobs they perform.

Fujitsu has just released Ubiquitouswear, a business package that can collect and analyze data from devices such as accelerometer sensors, barometers, cameras and microphones to measure and monitor people at work. For example, data such as temperature, humidity, movements and pulse rate can be used to identify when workers are exposed to too much heat stress. The system can also detect locations and even postures and body movements of humans to sense a fall, track someone’s location or estimate the physical load on a body.

An engineer at Japan’s computer giant Fujitsu displays a head-mounted display for factory work, ‘Ubiquitouswear’ (Photo credit: YOSHIKAZU TSUNO/AFP/Getty Images)

The external monitor parallels what’s already being monitored inside the body using health trackers such as Fitbit. Data already show that employees engaged in wellness programs see significantly smaller increases in the cost of their health care than those who aren’t. So far, employers can’t access an individual employee’s health records, but those days may not be far off, when a boss might take you aside to discuss your stress levels or the long hours you’ve been putting in at your desk.

As the world becomes increasingly digital, companies have endless ways to monitor their staff. Most things we do in a typical workday already generate a lot of data: we send and receive emails, we make phone calls, or we operate equipment. But soon there will be so many new data sources, and so many new ways of cutting that data -- using cameras, sensors or crowd-sourced data to measure every aspect of someone’s performance.

Should companies use this data to monitor staff? Is it even ethical to treat us like copiers and routers? One vendor, Cornerstone onDemand, believes it can help companies predict and improve employee performance. Its analytics software is able to take over half a billion employee data points from across the world to identify patterns and make predictions about hiring decisions and employee performance.

This kind of analysis can be used to identify the most successful recruitment channels or key employees that might be at risk of leaving, but my fear is that many companies will spend too much time crunching all the things they can so easily collect data on, including how much time we sat on our office chair or how many people we have interacted with, rather than the more meaningful qualitative measures of what we did when we sat on the chair and the quality of our interactions with others.

Bernard Marr is a best-selling business author, keynote speaker and leading business performance, analytics and data expert. His latest books are ‘Big Data‘ and ‘KPIs for Dummies‘.


Read more…

The seven people you need on your Big Data team

Guest blog post by Ian Thomas

Read the original version of this post on my blog here.

Congratulations! You just got the call – you’ve been asked to start a data team to extract valuable customer insights from your product usage, improve your company’s marketing effectiveness, or make your boss look all “data-savvy” (hopefully not just the last one of these). And even better, you’ve been given carte blanche to go hire the best people! But now the panic sets in – who do you hire? Here’s a handy guide to the seven people you absolutely have to have on your data team. Once you have these seven in place, you can decide whether to style yourself more on John Sturges or Akira Kurosawa.

Before we start, what kind of data team are we talking about here? The one I have in mind is a team that takes raw data from various sources (product telemetry, website data, campaign data, external data) and turns it into valuable insights that can be shared broadly across the organization. This team needs to understand both the technologies used to manage data, and the meaning of the data – a pretty challenging remit, and one that needs a pretty well-balanced team to execute.

1. The Handyman
The Handyman can take a couple of battered, three-year-old servers, a copy of MySQL, a bunch of Excel sheets and a roll of duct tape and whip up a basic BI system in a couple of weeks. His work isn’t always the prettiest, and you should expect to replace it as you build out more production-ready systems, but the Handyman is an invaluable help as you explore datasets and look to deliver value quickly (the key to successful data projects). Just make sure you don’t accidentally end up with a thousand people accessing the database he’s hosting under his desk every month for your month-end financial reporting (ahem).

Really good handymen are pretty hard to find, but you may find them lurking in the corporate IT department (look for the person everybody else mentions when you make random requests for stuff), or in unlikely-seeming places like Finance. He’ll be the person with the really messy cubicle with half a dozen servers stuffed under his desk.

The talents of the Handyman will only take you so far, however. If you want to run a quick and dirty analysis of the relationship between website usage, marketing campaign exposure, and product activations over the last couple of months, he’s your guy. But for the big stuff you’ll need the Open Source Guru.

2. The Open Source Guru
I was tempted to call this person “The Hadoop Guru”. Or “The Storm Guru”, or “The Cassandra Guru”, or “The Spark Guru”, or… well, you get the idea. As you build out infrastructure to manage the large-scale datasets you’re going to need to deliver your insights, you need someone to help you navigate the bewildering array of technologies that has sprung up in this space, and integrate them.

Open Source Gurus share many characteristics in common with that most beloved urban stereotype, the Hipster. They profess to be free of corrupting commercial influence and pride themselves on plowing their own furrow, but in fact they are subject to the whims of fashion just as much as anyone else. Exhibit A: The enormous fuss over the world-changing effects of Hadoop, followed by the enormous fuss over the world-changing effects of Spark. Exhibit B: Beards (on the men, anyway).

So be wary of Gurus who ascribe magical properties to a particular technology one day (“Impala’s, like, totally amazing”), only to drop it like ombre hair the next (“Impala? Don’t even talk to me about Impala. Sooooo embarrassing.”) Tell your Guru that she’ll need to live with her recommendations for at least two years. That’s the blink of an eye in traditional IT project timescales, but a lifetime in Internet/Open Source time, so it will focus her mind on whether she really thinks a technology has legs (vs. just wanting to play around with it to burnish her resumé).

3. The Data Modeler
While your Open Source Guru can identify the right technologies for you to use to manage your data, and hopefully manage a group of developers to build out the systems you need, deciding what to put in those shiny distributed databases is another matter. This is where the Data Modeler comes in.

The Data Modeler can take an understanding of the dynamics of a particular business, product, or process (such as marketing execution) and turn that into a set of data structures that can be used effectively to reflect and understand those dynamics.

Data modeling is one of the core skills of a Data Architect, which is a more identifiable job description (searching for “Data Architect” on LinkedIn generates about 20,000 results; “Data Modeler” only generates around 10,000). And indeed your Data Modeler may have other Data Architecture skills, such as database design or systems development (they may even be a bit of an Open Source Guru). But if you do hire a Data Architect, make sure you don’t get one with just those more technical skills, because you need datasets which are genuinely useful and descriptive more than you need datasets which are beautifully designed and have subsecond query response times (ideally, of course, you’d have both). And in my experience, the data modeling skills are the rarer skills; so when you’re interviewing candidates, be sure to give them a couple of real-world tests to see how they would actually structure the data that you’re working with.

4. The Deep Diver
Between the Handyman, the Open Source Guru, and the Data Modeler, you should have the skills on your team to build out some useful, scalable datasets and systems that you can start to interrogate for insights. But who to generate the insights? Enter the Deep Diver.

Deep Divers (often known as Data Scientists) love to spend time wallowing in data to uncover interesting patterns and relationships. A good one has the technical skills to be able to pull data from source systems, the analytical skills to use something like R to manipulate and transform the data, and the statistical skills to ensure that his conclusions are statistically valid (i.e. he doesn’t mix up correlation with causation, or make pronouncements on tiny sample sizes). As your team becomes more sophisticated, you may also look to your Deep Diver to provide Machine Learning (ML) capabilities, to help you build out predictive models and optimization algorithms.

If your Deep Diver is good at these aspects of his job, then he may not turn out to be terribly good at taking direction, or communicating his findings. For the first of these, you need to find someone that your Deep Diver respects (this could be you), and use them to nudge his work in the right direction without being overly directive (because one of the magical properties of a really good Deep Diver is that he may take his analysis in an unexpected but valuable direction that no one had thought of before).

For the second problem – getting the Deep Diver’s insights out of his head – pair him with a Storyteller (see below).

5. The Storyteller
The Storyteller’s yin is to the Deep Diver’s yang. Storytellers love explaining stuff to people. You could have built a great set of data systems, and be performing some really cutting-edge analysis, but without a Storyteller, you won’t be able to get these insights out to a broad audience.

Finding a good Storyteller is pretty challenging. You do want someone who understands data quite well, so that she can grasp the complexities and limitations of the material she’s working with; but it’s a rare person indeed who can be really deep in data skills and also have good instincts around communications.

The thing your Storyteller should prize above all else is clarity. It takes significant effort and talent to take a complex set of statistical conclusions and distil them into a simple message that people can take action on. Your Storyteller will need to balance the inherent uncertainty of the data with the ability to make concrete recommendations.

Another good skill for a Storyteller to have is data visualization. Some of the most light bulb-lighting moments I have seen with data have been where just the right visualization has been employed to bring the data to life. If your Storyteller can balance this skill (possibly even with some light visualization development capability, like using D3.js; at the very least, being a dab hand with Excel and PowerPoint or equivalent tools) with her narrative capabilities, you’ll have a really valuable player.

There’s no one place you need to go to find Storytellers – they can be lurking in all sorts of fields. You might find that one of your developers is actually really good at putting together presentations, or one of your marketing people is really into data. You may also find that there are people in places like Finance or Market Research who can spin a good yarn about a set of numbers – poach them.

6. The Snoop
These next two people – The Snoop and The Privacy Wonk – come as a pair. Let’s start with the Snoop. Many analysis projects are hampered by a lack of primary data – the product, or website, or marketing campaign isn’t instrumented, or you aren’t capturing certain information about your customers (such as age, or gender), or you don’t know what other products your customers are using, or what they think about them.

The Snoop hates this. He cannot understand why every last piece of data about your customers, their interests, opinions and behaviors, is not available for analysis, and he will push relentlessly to get this data. He doesn’t care about the privacy implications of all this – that’s the Privacy Wonk’s job.

If the Snoop sounds like an exhausting pain in the ass, then you’re right – this person is the one who has the team rolling their eyes as he outlines his latest plan to remotely activate people’s webcams so you can perform facial recognition and get a better Unique User metric. But he performs an invaluable service by constantly challenging the rest of the team (and other parts of the company that might supply data, such as product engineering) to be thinking about instrumentation and data collection, and getting better data to work with.

The good news is that you may not have to hire a dedicated Snoop – you may already have one hanging around. For example, your manager may be the perfect Snoop (though you should probably not tell him or her that this is how you refer to them). Or one of your major stakeholders can act in this capacity; or perhaps one of your Deep Divers. The important thing is not to shut the Snoop down out of hand, because it takes relentless determination to get better quality data, and the Snoop can quarterback that effort. And so long as you have a good Privacy Wonk for him to work with, things shouldn’t get too out of hand.

7. The Privacy Wonk
The Privacy Wonk is unlikely to be the most popular member of your team, either. It’s her job to constantly get on everyone’s nerves by identifying privacy issues related to the work you’re doing.

You need the Privacy Wonk, of course, to keep you out of trouble – with the authorities, but also with your customers. There’s a large gap between what is technically legal (which itself varies by jurisdiction) and what users will find acceptable, so it pays to have someone whose job it is to figure out what the right balance between these two is. But while you may dread the idea of having such a buzz-killing person around, I’ve actually found that people tend to make more conservative decisions around data use when they don’t have access to high-quality advice about what they can do, because they’re afraid of accidentally breaking some law or other. So the Wonk (much like Sadness) turns out to be a pretty essential member of the team, and even regarded with some affection.

Of course, if you do as I suggest, and make sure you have a Privacy Wonk and a Snoop on your team, then you are condemning both to an eternal feud in the style of the Corleones and Tattaglias (though hopefully without the actual bloodshed). But this is, as they euphemistically say, a “healthy tension” – with these two pulling against one another you will end up with the best compromise between maximizing your data-driven capabilities and respecting your users’ privacy.

Bonus eighth member: The Cat Herder (you!)
The one person we haven’t really covered is the person who needs to keep all of the other seven working effectively together: To stop the Open Source Guru from sneering at the Handyman’s handiwork; to ensure the Data Modeler and Deep Diver work together so that the right measures and dimensionality are exposed in the datasets you publish; and to referee the debates between the Snoop and the Privacy Wonk. This is you, of course – The Cat Herder. If you can assemble a team with at least one of the above people, plus probably a few developers for the Open Source Guru to boss about, you’ll be well on the way to unlocking a ton of value from the data in your organization.

Think I’ve missed an essential member of the perfect data team? Tell me in the comments.

Read more…

Guest blog post by ajit jaokar

The Open Cloud – Apps in the Cloud 

Smart Data

Based on my discussions at Messe Hannover, this blog explores the potential of applying Data Science to manufacturing and process control industries. In my new course at Oxford University (Data Science for IoT) and community (Data Science and Internet of Things), I explore the application of predictive algorithms to Internet of Things (IoT) datasets.

The Internet of Things plays a key role here because sensors in machines and process control industries generate a lot of data. This data has real, actionable business value (Smart Data). The objective of Smart Data is to improve productivity through digitization. I had a chance to speak to Siemens management and engineers about how this vision of Smart Data is translated into reality.


When I discussed the idea of Smart Data with Siegfried Russwurm, Prof. Dr.-Ing., Member of the Managing Board of Siemens AG, he spoke of key use cases that involve transforming big data into business value by providing context, increasing efficiency and addressing large, complex problems. These include applications for oil rigs, wind turbines, process control industries and more. In these industries, the smallest productivity increase translates to huge commercial gains.

This blog is my view on how this vision (Smart Data) could translate into reality within the context of Data Science and IoT.

Data: the main driver for Industrie 4.0 ecosystem

At Messe Hannover, it was hard to escape the term ‘Industry 4.0’ (in German, Industrie 4.0). Broadly, Industry 4.0 refers to the use of electronics and IT to automate production and to create intelligent networks along the entire value chain that can control each other autonomously. Machines generate a lot of data. In many cases, such as a large installation like an oil rig, this data is bigger than traditional ‘Big Data’. Its use case is also slightly different: the value does not lie in capturing a lot of data from outside the enterprise, but rather in capturing (and making innovative uses of) the large volume of data generated within the enterprise. The ‘smart’ in Smart Data is predictive and algorithmic. Thus, data is the main driver of Industry 4.0, and it is important to understand the flow of data before it can be optimized.

The flow of Data in the Digital Enterprise

The ‘Digital factory’ is already a reality, built on Industrial Ethernet standards like Profinet, PLM (Product Lifecycle Management) software like Teamcenter, and data models for lifecycle engineering and plant management such as Comos. To extend the Digital factory to achieve end-to-end interconnection and autonomous operation across the value chain (as is the vision of Industry 4.0), we need an additional component in the architecture: the Cloud.

The Open Cloud: Paving the way for Smart Data analytics

In that context, the cooperation of Siemens with SAP to create an open cloud platform is very interesting. The Open Cloud enables ‘apps in the cloud’ based on the intelligent use of large quantities of data. The SAP HANA architecture, based on an in-memory, columnar database, provides analytics services in the Cloud. Examples include "Asset Analytics" (increasing the availability of machines through online monitoring, pattern recognition, simulation and prediction of issues) and "Energy Analytics" (revealing hidden energy-savings potential).


While it is early days, based on the above, the manufacturing domain offers real value and tangible benefits to customers. Even now, customers who harness value from large quantities of data through predictive analytics stand to gain significantly. I will cover this subject in more detail as it evolves.

About the author

Ajit's work spans research, entrepreneurship and academia relating to IoT, predictive analytics and mobility. His current research focus is on applying data science algorithms to IoT applications, including time series, sensor fusion and deep learning. This research underpins his teaching at Oxford University (Big Data and Telecoms) and the City Sciences program at the Technical University of Madrid (UPM). Ajit also runs a community/learning program through his company, futuretext, for Data Science and IoT.

Read more…

Predictive Analytics and Sensor Data

Large scale equipment for power generation, manufacturing, mining, and similarly sized functions are structurally important to the global economy. They turn raw materials into the energy and other products that help keep the economy running.

Consider the gas-turbine electric plant. These installations can involve multiple instances of large scale gas turbines like those manufactured by General Electric and Siemens, and they can supply power for thousands of homes and many jobs. Keeping these turbines running, continuing to turn gas into electricity, requires precise manufacturing. Bearings, blades, and shafts must be perfectly balanced to ensure continued power generation, and strict maintenance schedules for addressing corrosion, fatigue, and wear are also required.

As precisely as these giant machines are made, the extreme conditions of highly compressed gas and the continuous runtime requirements of power generation invariably lead to some components failing to function appropriately. Extreme temperatures, high pressure, corrosive environments, and many other factors lead to costly interruptions in power. Nevertheless, these machines are intended to run for long periods of time, and interruptions in power generation for either maintenance or repair are very expensive. In order to minimize these interruptions, turbine manufacturers and plant operators employ statistical analysis to determine optimal plant maintenance schedules. Much like a personal vehicle, these giant power generators have parts that are designed to be replaced at planned intervals to ensure continued operation. The machines themselves are made to be repaired as quickly as possible, with parts designed to be worn out and replaced. These maintenance schedules help in planning for a steady flow of electricity generation, minimizing equipment failures through rigorous statistical analysis.

In spite of these efforts, and that the typical gas turbine has relatively few moving parts, there are occasional events that interrupt power output. Bearings could fail, a sheared blade could impact performance or even cause additional damage. These events, and many others, are tracked through multitudes of sensors in the turbine tracking vibration and temperature levels, ambient air conditions, exhaust properties, compression levels, and much more. This is where traditional statistical methods fail, giving windows for a predicted event, but not the indicators that will predict failure.

With the creation of so much sensor data however, energy producers are often overwhelmed. The constantly generated unstructured sensor data are likely to contain the many predictors that could lead to a failure, but collecting, storing, and analyzing all that data have proven to be a daunting task. The result has been that the primary approach to predictive maintenance has, to date, been limited to statistical estimations.

Instead of trying to predict failure on the macro scale, these large-scale industrial operations are in need of a solution that will make better use of the sensor data they already have. A more advanced approach involve streaming queries, a process of repeatedly querying data as it is recorded. These ongoing queries have been an advancement in data analysis, but they rely on the analyst to first identify the manner of failure and write queries for events that are well-understood, meaning that these events are identifiable, but not exactly predictive.

Advanced machine data methods that are typically used to analyze streaming network data are ideal for these large scale and sensor intensive applications. With comparable data volumes, approaches like graph analytics automate data analysis, greatly reducing the pitfalls of human bias. The approach uses software derived from graph theory to map and analyze the connections in machine and sensor data, revealing the data points that are associated with failure and allowing analysts to identify the sensor readings that correlate with costly interruptions.

These more predictive methods, coupled with traditional statistical analysis, will help modernize power generation, making the grid more efficient and more cost effective with the data that we already have. 

Originally posted on Analytic Bridge

Read more…

Welcome to Sparkling Land

Guest blog post by Fawad Alam

Note: Opinions expressed are solely my own and do not express the views or opinions of my employer.

As a data scientist who has been munging data and building machine learning models in tools like R, Python and other software(s) (open source and proprietary), I had always longed for a world without technical limitations. A world which would allow me to create data structures (data scientists usually call them vectors, matrices or dataframes) of virtually any size (i.e. big), manipulate them, and use them in machine learning models. A world where I can do all these fancy things without having to worry about whether I can fit them in memory; without having to wait for hours on end for my computations to finish (data scientists are an impatient breed) and without needing to write lots of code to merely compute a dot product between two vectors.

The seeds of my data science utopia were sown many years ago with the advent of Apache Hadoop and the MapReduce programming model. It solved the problem of storing and processing large amounts of data. A few years later, Apache Mahout was developed on top of MapReduce which provided implementations of machine learning algorithms. It all seemed too good to be true.

However IMHO, MapReduce as good as it is for lots of data processing workloads and use-cases, it didn’t seem to be best suited for data scientists. Its lack of interactive data analysis capability, coupled with the need to write very verbose mappers and reducers in Java (or in other languages using Hadoop Streaming) was never going to be liked by the Data Science community. This didn’t mean that the dream was over……..Apache Spark came to the rescue!!

For the uninitiated, Apache Spark is an in-memory, distributed, data processing engine, designed to run on top of distributed storage systems like HDFS. As a data scientist, Spark whets my appetite for a number of reasons.

To me, speed of analysis matters. It’s no good if you have to wait for hours to get results of a correlation matrix just to forget why you ran it in the first place. The ability to do train-of-thought analysis interactively on large volumes of data is one of the most important features that distinguishes Apache Spark from other data processing engines.

The ability to write succinct code to accomplish data science tasks was never the forte of MapReduce jobs. Spark has nailed this with its high-level API in Scala and Python (a widely used scripting language in the data science community). Add to this, its MLlib package which provides implementations for a number of feature extraction and machine learning techniques.

Finally, Spark processes big data. Don’t think I need to say more on this other than, in my view (with a few ifs and buts), more data beats clever algorithms. Maybe more on this in a different post.

As is the case with most things in life, every technology goes through it's peaks and troughs. For now, those data scientists who dream of a similar utopia, will find in Spark a much needed ray of hope…..Welcome to Sparkling Land.

Read more…

Charting the IoT Opportunity

Originally posted on Data Science Central

By Venkat Viswanathan and Ravi Ravishankar


As the Internet of Things (IoT) gains momentum, it’s apparent that it will force change in nearly every industry, much like the Internet did. The trend will also cause a fundamental shift in consumer behavior and expectations, as did the Internet. And just like the Internet, the IoT is going to put a lot of companies out of business.


Despite these similarities, however, the IoT is really nothing like the Internet. It’s far more complex and challenging.


Lack of Standardization

Unlike the Internet, where the increased need for speed and memory was addressed as a by-product of the devices themselves, the sensors and devices connecting to the IoT network have, for the most part, inadequate processing or memory. Furthermore, no standard exists for communication and interoperability between these millions of devices. Samsung, Intel, Dell and other hardware manufacturers have set up a consortium to address this issue. Another equally powerful consortium formed by Haier, Panasonic, Qualcomm and others aims to do the exact same thing. This has raised concerns that each of these groups will engage in a battle to push their standard, resulting in no single solution.


New Communication Frontier

The Internet was designed for machine-to-human interactions. The IoT, on the other hand, is intended for machine-to-machine communications, which is very different in nature. The network must be able to support diverse equipment and sensors that are trying to connect simultaneously, and also manage the flow of large quantities of incredibly diverse data...all at very low costs. To meet these requirements, a completely new ecosystem—independent of the Internet—must evolve.


Data Privacy

The IoT also raises serious challenges for data security and privacy. Justified consumer concerns will call for stricter privacy standards and demand a greater role in determining what data they will share. These aren’t the only security issues likely to arise. In order for a complete IoT ecosystem to emerge, multiple players must use data from connected devices—but who owns the data? Is it the initial device that emits it, or the service provider that transports that information, or the company that uses it to provide the consumer better service offerings?


Geographic Challenges

For multinational organizations with data coming from various regions around the globe, things get even more complicated. Different countries have different data privacy laws. China and many parts of the EU, for example, will not let companies take data about their citizens out of their borders. This will result in the emergence of data lakes. To enable business decisions, companies must be able to access data within various geographies, run their analysis locally and disseminate the insights back to their headquarters…all in real-time and at low costs.   


In spite of all these challenges, the IoT is not something companies can afford to keep at arm’s length. Like the Internet, it will empower consumers with more data and insights than ever before, and they in turn will force companies to change the way they do business. From an analytics perspective, it’s very exciting. Companies will now have access to quality data that, if they combine it with other sources of information, can provide them with immense opportunities to stay relevant.


As an example, let’s look at the medical equipment industry. Typically these companies determine what equipment to sell based on parameters like number of beds and whether the facility is in a developing or developed market. However, these and other metrics are a poor substitute for evaluating need based on actual use. A small hospital in a developing country, for example, will diagnose and treat a much wider range of diseases than a similar facility in a more developed region. By equipping the machines with sensors, these manufacturers can obtain a better understanding of what is occurring within each facility and optimize selling decisions more effectively as a result.


This is just one example to underscore the tremendous potential that the IoT holds for businesses. In order to truly realize these and other opportunities, companies must understand the challenges outlined above and have a framework in place to address them. In the early days of the Internet, few could have predicted its transformative impact on all facets of our lives—personal and professional. As the IoT heads into its next phase of maturity, we can expect to see a similar effect emerge.


Ravi Ravishankar is Global Head of Product Marketing and Management at Equinix's Products, Services and Solutions Group and Venkat Viswanathan is Chairman at LatentView Analytics.



Read more…

Environmental Monitoring using Big Data

Guest blog post by Heinrich von Keler

In this post, I will cover in-depth a Big Data use case: monitoring and forecasting air pollution.

A typical Big Data use case in the modern Enterprise includes the collection and storage of sensor data, executing data analytics at scale, generating forecasts, creating visualization portals, and automatically raising alerts in the case of abnormal deviations or threshold breaches.

This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using Axibase Time-Series Database and R Language.

Steps taken by the data science team to execute the use case:

  • Collect historical data from AirNow into ATSD
  • Stream current data from AirNow into ATSD
  • Use R Language to execute data analytics and generate forecasts for all collected entities and metrics
  • Create Holt-Winters forecasts in ATSD for all collected entities and metrics
  • Build a visualization portal
  • Set up alert and notification rules in the ATSD Rule Engine

The Data

Hourly readings of several key air quality metrics are generated by over 2,000 monitoring sensor stations located in over 300 cities across the United States; the historical and streaming data are retrieved and stored in ATSD.

The data is provided by AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.

The two main collected metrics are PM2.5 and Ozone (o3).

PM2.5 refers to particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, including motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.

o3 (Ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface where it forms a protective layer that shields us from the sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone.

Other collected metrics are: pm10 (particulate matter up to 10 micrometers in size), co (Carbon Monoxide), no2 (nitrogen dioxide) and so2 (sulfur dioxide).

Collecting/Streaming the Data

A total of 5 years of historical data has been collected, stored, analyzed and accurately forecast. In order for the forecasts to have maximum accuracy and to account for trends and seasonal cycles, at least 3 to 5 years of detailed historical data is recommended.

An issue with the accuracy of the data was immediately identified: the data was becoming available with a fluctuating time delay of 1 to 3 hours. An analysis was conducted by collecting all values for each metric and entity, which resulted in several data points being recorded for the same metric, entity and time. This led us to believe that there was both a time delay and a stabilization period. Below are the results:

Once available, the data then took another 3 to 12 hours to stabilize, meaning that the values were fluctuating during that time frame for most data points.

As a result of this analysis, it was decided that all data would be collected with a 12-hour delay in order to increase the accuracy of the data and forecasts.
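The 12-hour rule itself is simple to express: defer any reading younger than the stabilization delay to a later collection run. Below is a hedged Python sketch of that filtering step; the field names are illustrative and this is not the actual Axibase Collector implementation:

```python
from datetime import datetime, timedelta, timezone

STABILIZATION_DELAY = timedelta(hours=12)

def ready_for_collection(readings, now=None):
    """Keep only readings old enough to have stabilized.

    Values were observed to fluctuate for up to 12 hours after first
    publication, so anything younger than the delay is deferred to the
    next collection run rather than stored with a possibly-wrong value.
    """
    now = now or datetime.now(timezone.utc)
    return [r for r in readings
            if now - r["observed_at"] >= STABILIZATION_DELAY]
```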

Axibase Collector was used to collect the data from monitoring sensor stations and stream into Axibase Time-Series Database.

In Axibase Collector, a job was set up to collect data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered one of the most polluted cities in the United States, with air quality warnings often being issued to the public.

The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.

The File Forwarding Configuration is a parser configuration for data incoming from an external source. The path to the external data source is specified, a default entity is assigned to the Fresno monitoring sensor station, and start and end times determine the time frame for retrieving new data (end time syntax is used).

Once these two configurations are saved, the collector starts streaming fresh data into ATSD.

The entities and metrics streamed by the collector into ATSD can be viewed from the UI.

The whole data set currently contains over 87,000,000 records per metric, all stored in ATSD.

Generating Forecasts in R

The next step was to analyze the data and generate accurate forecasts. The built-in Holt-Winters and ARIMA algorithms were used in ATSD, and custom R language forecasting algorithms were used for comparison.
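For intuition, the Holt-Winters approach can be sketched with a minimal additive (triple exponential smoothing) implementation. This plain-Python sketch is not ATSD's internal algorithm; the smoothing constants and the synthetic hourly series are arbitrary assumptions:

```python
import math

def holt_winters_additive(series, period, alpha=0.3, beta=0.05, gamma=0.1, horizon=24):
    """Minimal additive Holt-Winters forecast.
    `series` must contain at least two full seasonal periods."""
    # initialize level, trend, and seasonal components from the first two periods
    level = sum(series[:period]) / period
    trend = (sum(series[period:2 * period]) - sum(series[:period])) / period ** 2
    season = [series[i] - level for i in range(period)]
    for i, value in enumerate(series):
        last_level = level
        s = season[i % period]
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[i % period] = gamma * (value - level) + (1 - gamma) * s
    # project level + trend forward, reusing the learned seasonal pattern
    return [level + (h + 1) * trend + season[(len(series) + h) % period]
            for h in range(horizon)]

# synthetic hourly series with a clear daily cycle: forecast the next 24 hours
history = [20 + 10 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]
forecast = holt_winters_additive(history, period=24, horizon=24)
```

On a series with a stable daily cycle like this one, the forecast simply continues the learned pattern; real pm2.5 data is far noisier, which is where parameter selection matters.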

To analyze the data in R, the R language API client was used to retrieve the data and then save the custom forecasts back into ATSD.

Forecasts were built for all metrics for the period of May 11 until June 1.

The steps taken to forecast the pm2.5 metric will be highlighted.

The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.

Recommendations from the following sources were used to choose parameters for SSA forecasting:

The following steps were executed when building the forecasts:

The pm2.5 series was retrieved from ATSD using the query() function; 72 days of data were loaded.
The SSA decomposition was built with a window of 24 days (L = 24 × 24 hourly values) and 100 eigentriples:

library(Rssa)  # load the Rssa package providing ssa() and rforecast()
dec <- ssa(values, L = 24 * 24, neig = 100)

The eigenvalues, eigenvectors, pairs of sequential eigenvectors, and the w-correlation matrix of the decomposition were graphed:

plot(dec, type = "values")

plot(dec, type = "vectors", idx = 1:20)

plot(dec,type = "paired", idx = 1:20)

plot(wcor(dec), idx = 1:100)

A group of eigentriples was then selected for forecasting. The plots suggest several options.

Three different options were tested: 1, 1:23, and 1:35, because the groups 1, 2:23, and 24:35 are separated from the other eigenvectors, as judged from the w-correlation matrix.

The rforecast() function was used to build the forecast:

# forecast 21 days (3 weeks) of hourly values using eigentriples 1:35
rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")

Tests were also run with vforecast() and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.

Graph of the original series and three resulting forecasts:

The forecast with eigentriples 1:35 was selected as the most accurate and saved into ATSD.

To save the forecasts into ATSD, the save_series() function was used.

Generating Forecasts in ATSD

The next step was to create a competing forecast in ATSD using the built-in forecasting features. The majority of the settings were left in automatic mode, so that the system itself determined the best parameters (based on the historical data) when generating the forecast.

Visualizing the Results

To visualize the data and forecasts, a portal was created using the built-in visualization features.

Thresholds were set for each metric in order to alert the user when either the forecast or the actual data reaches unhealthy levels of air pollution.

When comparing the R forecasts and the ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing the patterns and trends with more certainty. So far, the incoming actual data has followed the ATSD forecast very closely; any deviations are minimal and fall within the confidence interval.

It is clear that the built-in forecasting of ATSD often produces more accurate results than even one of the most advanced R language forecasting algorithms used in this use case. It is entirely possible to rely on ATSD to forecast air pollution for a few days or weeks into the future.

You can keep track of how these forecasts perform in comparison to the actual data in Chart Lab.

Alerts and Notifications

A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breach the set threshold or deviate from the ATSD forecast.

Analytical rules were set in the Rule Engine for the pm2.5 metric; alerts will be raised if the streaming data satisfies either of the following rules:

value > 30 - Raise an alert if the last metric value exceeds the threshold.

forecast_deviation(avg()) > 2 - Raise an alert if the actual value exceeds the forecast by more than 2 standard deviations (see image below). Smart rules capture extreme spikes in air pollution.
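The two rules above can be sketched as a simple check in Python. This is an illustration only, not the Rule Engine's actual evaluation logic; in particular, using the forecast window's own standard deviation as the deviation unit is an assumption made here:

```python
from statistics import mean, stdev

def check_alerts(actual_window, forecast_window, threshold=30.0, max_sigmas=2.0):
    """Return the alerts raised for a window of streaming pm2.5 values:
    a hard threshold on the last value, and a deviation of the window
    average from the forecast average beyond `max_sigmas` sigmas."""
    alerts = []
    if actual_window[-1] > threshold:
        alerts.append("threshold")
    sigma = stdev(forecast_window)
    if sigma > 0 and abs(mean(actual_window) - mean(forecast_window)) > max_sigmas * sigma:
        alerts.append("forecast_deviation")
    return alerts

# a sudden spike trips both rules (hypothetical values)
print(check_alerts([12.0, 14.0, 35.0], [11.0, 12.0, 13.0]))  # ['threshold', 'forecast_deviation']
```

The deviation rule is what makes the alerting "smart": it fires on an abnormal departure from the forecast even when the absolute value is still below the hard threshold.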

At this point the use case is fully implemented and functions autonomously: ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future, and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.

Results and Conclusions

The results of this use case are useful for travelers, for whom it is important to have an accurate forecast of the environmental and pollution-related issues they may face during their visits, and for expats moving to work in a new city or country. Studies have proven that long-term exposure to high levels of pm2.5 can lead to serious health issues.

This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing, and Guangzhou, pm2.5 levels constantly fluctuate from unhealthy to critical levels, and yet accurate forecasting is limited. Pm2.5 forecasting is critical for travelers and tourists who need to plan their trips around periods of lower pollution levels, given the potential health risks associated with exposure to this sort of pollution.

Government agencies can also take advantage of pollution monitoring to plan and issue early warnings to travelers and locals, so that precautions can be taken to prevent exposure to unhealthy levels of pm2.5 pollution. Detecting a trend and raising an alert prior to pm2.5 levels breaching the unhealthy threshold is critical for public safety and health. Having good air quality data and performing data analytics can allow people to adapt and make informed decisions.

Big Data Analytics is an empowerment tool that can put valuable information in the hands of corporations, governments, and individuals, and that knowledge can help motivate or give people tools to stimulate change. Air pollution currently affects the lives of over a billion people across the globe, and with current trends the situation will only get worse. Often the exact source of the air pollution, how it is interacting in the air, and how it is dispersing cannot be determined; the lack of such information makes this a difficult problem to tackle. With advances in modern technologies and new Big Data solutions, it is becoming possible to combine sensor data with meteorological satellite data to perform extensive data analytics and forecasting. Through Big Data analytics, it will become possible to pinpoint pollution sources and dispersion trends days in advance.

I sincerely believe that Big Data has a large role to play in tackling air pollution and that in the coming years advanced data analytics will be a key tool influencing government decisions and regulation change.

You can learn more about Big Data analytics, forecasting and visualization at Axibase.

Read more…

This book is available online from the O'Reilly library.

Fluent Python, 1st Edition. By Luciano Ramalho. Released July 2015.


“Python is an easy to learn, powerful programming language.” Those are the first words of the official Python Tutorial. That is true, but there is a catch: because the language is easy to learn and put to use, many practicing Python programmers leverage only a fraction of its powerful features.

An experienced programmer may start writing useful Python code in a matter of hours. As the first productive hours become weeks and months, a lot of developers go on writing Python code with a very strong accent carried from languages learned before. Even if Python is your first language, often in academia and in introductory books it is presented while carefully avoiding language-specific features.

As a teacher introducing Python to programmers experienced in other languages, I see another problem that this book tries to address: we only miss stuff we know about. Coming from another language, anyone may guess that Python supports regular expressions, and look that up in the docs. But if you’ve never seen tuple unpacking or descriptors before, you will probably not search for them, and may end up not using those features just because they are specific to Python.

This book is not an A-to-Z exhaustive reference of Python. Its emphasis is on the language features that are either unique to Python or not found in many other popular languages. This is also mostly a book about the core language and some of its libraries. I will rarely talk about packages that are not in the standard library, even though the Python package index now lists more than 60,000 libraries and many of them are incredibly useful.


Read more…
