The curse of big data

Originally posted on Data Science Central, by Dr. Vincent Granville

This seminal article highlights the dangers of reckless application and scaling of data science techniques that have worked well for small, medium-size, and large data. We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short, and contains both high-level material for decision makers and deep technical explanations where needed.

In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be

  • entirely caused by chance (just as a lottery winner wins purely by chance),
  • not replicable,
  • devoid of predictive power,
  • yet obscuring weaker patterns that are ignored even though they have strong predictive power.

The question is: how do you discriminate between a real and an accidental signal in vast amounts of data?

Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose. Keep in mind that analyzing cross-correlations across all metrics is one of the very first steps statisticians take at the beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provides a much more robust mechanism for identifying true relationships.

To illustrate the issue, let's say that you have k time series, each with n observations - for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over the same time period consisting of n days. For instance, you want to detect patterns such as "when Google's stock price goes up, Facebook's goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds, depending on whether you do daily trading or extremely fast intraday, high-frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (once your competitors detect the patterns in question, they stop being profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year time period). In some cases n might be much larger, but then the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k is moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.

Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.
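
For readers who prefer simulation, here is a minimal Perl sketch (my own illustration, not code from the article) that generates m independent pairs of Gaussian white-noise series of length n and estimates how often at least one empirical correlation exceeds a:

use strict;
use warnings;

my ($n, $m, $threshold, $trials) = (20, 10_000, 0.80, 20);   # increase $trials for a finer estimate
my $pi = 4 * atan2(1, 1);

# standard normal deviate via the Box-Muller transform
sub gauss {
    return sqrt(-2 * log(1 - rand())) * cos(2 * $pi * rand());
}

# Pearson correlation of two equal-length arrays (passed by reference)
sub corr {
    my ($x, $y) = @_;
    my $len = scalar @$x;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
    for my $i (0 .. $len - 1) {
        $sx  += $x->[$i];       $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;  $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $den = sqrt(($len * $sxx - $sx ** 2) * ($len * $syy - $sy ** 2));
    return $den ? ($len * $sxy - $sx * $sy) / $den : 0;
}

my $hits = 0;
for my $t (1 .. $trials) {
    for my $pair (1 .. $m) {
        my @x = map { gauss() } 1 .. $n;
        my @y = map { gauss() } 1 .. $n;
        if (corr(\@x, \@y) > $threshold) { $hits++; last; }
    }
}
printf "Estimated P(at least one r > %.2f) = %.2f\n", $threshold, $hits / $trials;

The estimate from such a simulation should land in the same ballpark as the figures derived analytically below.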

Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?

Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is, approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p=P(r>a), with P representing a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has a correlation above a=0.80 is thus equal to 1-[(1-p)^m], that is, 1 minus (1-p) raised to the power m.

For instance,

  • If n=20 and m=10,000 (10,000 paired time series, each with 20 observations), then the chance of finding at least one correlation above a=0.80 purely by chance - that is, the chance that your conclusion is wrong - is 90.93% (a rough check of this figure follows the list).
  • If n=20 and m=100,000 (still a relatively small value for m), then the chance of finding at least one correlation above a=0.90 purely by chance - that is, the chance that your conclusion is VERY wrong - is 98.17%.
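
As a sanity check on the first figure (my sketch; the exact value depends on the normal-tail calculator used): with n=20, the standard deviation of r is 1/sqrt(19), so

p = P(r > 0.80) = P(Z > 0.80 * sqrt(19)) = P(Z > 3.49), which is roughly 0.00024,

and with m=10,000,

1 - (1-p)^m = 1 - (1 - 0.00024)^10,000, which is roughly 1 - e^(-2.4) = 0.91,

in line with the 90.93% quoted above.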

Now in practice the way it works is as follows: you have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations, that is, m = k*(k-1)/2. However, the assumption of independence between the m paired time series is now violated, thus concentrating correlations further away from a very high value such as a=0.90. But also, your data is not random numbers; it's not white noise. So the theoretical correlations are much above 0, maybe around 0.15 when n=20. Also, m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say one time series for each stock price). These three factors (non-independence, theoretical r different from 0, very large m) balance out and make my above computations still quite accurate when applied to a typical real big data problem. Note that I computed my probabilities using the online calculator stattrek.

Conclusion: hire the right data scientist before attacking big data problems. He/she does not need to be highly technical, but must be able to think in a way similar to my above argumentation, to identify possible causes of model failures before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need advanced knowledge of stats. Being a computer scientist also helps, to scale your algorithms and make them simple and efficient. Being an MBA analyst also helps, to understand the problem that needs to be solved. Being all three at the same time is even better. And yes, such people do exist and are not that rare.

Exercise:

Let's say you have 3 random variables X, Y, Z with corr(X,Y)=0.70 and corr(X,Z)=0.80. What is the minimum value for corr(Y,Z)? Can this correlation be negative?
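
Hint (my sketch, not part of the original exercise): the 3x3 correlation matrix of (X, Y, Z) must be positive semi-definite, which forces

corr(Y,Z) >= corr(X,Y)*corr(X,Z) - sqrt[(1 - corr(X,Y)^2) * (1 - corr(X,Z)^2)] = 0.56 - sqrt(0.51 * 0.36), which is roughly 0.13.

So, under these constraints, corr(Y,Z) cannot be negative.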

Related articles:

Read more…

Guest blog post by ajit jaokar

Often, Data Science for IoT differs from conventional data science due to the presence of hardware.

Hardware could be involved in integration with the Cloud or in processing at the Edge (which Cisco and others have called Fog Computing).

Alternatively, we see entirely new classes of hardware specifically involved in Data Science for IoT (such as the synapse chip for deep learning).

Hardware will increasingly play an important role in Data Science for IoT.

A good example is from a company called Cognimem, which natively implements classifiers (unfortunately, the company does not seem to be active anymore, as per their Twitter feed).

In IoT, speed and real time response play a key role. Often it makes sense to process the data closer to the sensor.

This allows for a limited / summarized data set to be sent to the server if needed and also allows for localized decision making.  This architecture leads to a flow of information out from the Cloud and the storage of information at nodes which may not reside in the physical premises of the Cloud.

In this post, I try to explore the various hardware touchpoints at which data analytics and IoT work together.

Cloud integration: Making decisions at the Edge

The Intel Wind River edge management system is certified to work with the Intel stack and includes capabilities such as data capture, rules-based data analysis and response, configuration, file transfer, and remote device management.

Integration of Google Analytics into Lantronix hardware allows sensors to send real-time data to any node on the Internet or to a cloud-based application.

Microchip's integration with Amazon Web Services uses an embedded application with the Amazon Elastic Compute Cloud (EC2) service, based on the Wi-Fi Client Module Development Kit. Languages like Python or Ruby can be used for development.

The integration of Freescale and Oracle consolidates data collected from multiple appliances from multiple Internet of Things service providers.

Libraries

Libraries are another avenue for analytics engines to be integrated into products - often at the point of creation of the device. Xively Cloud Services is an example of this strategy, through the Xively libraries.

APIs

In contrast, keen.io provides APIs for IoT devices to create their own analytics engines (for example, the Pebble smartwatch's use of keen.io) without locking equipment providers into a particular data architecture.

Specialized hardware

We see increasing deployment of specialized hardware for analytics - for example, Egburt from Camgian, which uses sensor fusion technologies for IoT.

In the deep learning space, GPUs are widely used, and more specialized hardware is emerging, such as IBM's synapse chip. Even more interesting hardware platforms are emerging, such as Nervana Systems, which creates hardware specifically for neural networks.

Ubuntu Core and IFTTT spark

Two more initiatives on my radar deserve a space of their own, even though neither of them currently has an analytics engine: Ubuntu Core (Docker containers plus a lightweight Linux distribution as an IoT OS) and the IFTTT Spark initiative.

Comments welcome

This post leads into a vision for a Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in Feb.

Image source: cognimem

Read more…

Internet of Things and Bayesian Networks

Guest blog post by Punit Kumar Mishra

As big data becomes more of a cliche with every passing day, do you feel the Internet of Things is the next marketing buzzword to take over our lives?

So what exactly is the Internet of Things (IoT), and why are we going to hear more about it in the coming days?

The Internet of Things (IoT) today denotes advanced connectivity of devices, systems, and services that goes beyond machine-to-machine communication and covers a wide variety of domains and applications, particularly in manufacturing and in the power, oil, and gas utilities.

An IoT application can be an automobile with built-in sensors that alert the driver when the tyre pressure is low. Built-in sensors on equipment in a power plant can transmit real-time data and thereby enable better transmission planning and load balancing. In the oil and gas industry, IoT can help plan better drilling and track cracks in gas pipelines.

IoT will lead to better predictive maintenance in manufacturing and utilities, and this will in turn lead to better control, tracking, monitoring, and back-up of processes. Even a small percentage improvement in machine performance can significantly benefit a company's bottom line.

IoT is, in some ways, going to make our machines more brilliant and responsive.

According to GE, IoT can eliminate 150 billion dollars in waste across major industries.

A natural question is how IoT differs from the SCADA (supervisory control and data acquisition) systems that are used extensively in the manufacturing industries.

IoT can be considered an evolution of the data acquisition part of SCADA systems.

SCADA systems have basically been siloed, with data accessible to only a few people and not leading to long-term benefit.

IoT starts with embedding advanced sensors in machines and collecting the data for advanced analytics.

As we start receiving data from the sensors, one important question that needs attention is whether the transmitted data is correct or erroneous.

How do we validate the data quality?

We are dealing with uncertainty here.

One of the most commonly used methods for modelling uncertainty is Bayesian networks.

A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph.

Bayesian networks can be used extensively in Internet of Things projects to assess the data transmitted by the sensors.
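
As a minimal illustration (the numbers are invented for the example, not taken from this post): suppose a sensor is faulty with prior probability P(F) = 0.01, a faulty sensor produces an anomalous reading with probability P(A|F) = 0.9, and a healthy sensor does so with probability P(A|not F) = 0.05. Bayes' rule then gives

P(F|A) = 0.9*0.01 / (0.9*0.01 + 0.05*0.99) = 0.009 / 0.0585, which is roughly 0.15,

so a single anomalous reading raises the probability that the sensor is faulty from 1% to about 15%. A Bayesian network extends this reasoning to many interdependent sensors and variables at once.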

Read more…

Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let's say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance to be fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, and conversion rate is above 10%; some conversions are hard, for instance a credit card purchase, and conversion rate is below 1%). Here, for now, we just ignore the conversion data and focus on the low-hanging fruit: click data. Other valuable data is impression data (for instance, a click not associated with an impression is very suspicious). But impression data is huge, 20 times bigger than click data. We ignore impression data here.

Here, we work with complete click data collected over a 7-day time period. Let's assume that we have 50 million clicks in the data set. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, involves clusters (small and large) of affiliates, and uses tons of IP addresses with few clicks per IP per day (low frequency).

The data set (ideally, a tab-separated text file, as CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword blended together, argh...), referral (actual referral domain or ad exchange domain, blended together, argh...), user agent (UA, a long string; UA is also known as browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city and a bunch of other parameters.

The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:

  • IP address
  • Day
  • UA (user agent) ID - so we created a look-up table for UA's
  • Partner ID
  • Affiliate ID

These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.

Building a summary table: the Map step

The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:

  • number of clicks
  • number of unique UA's
  • list of UA

The list of UA's, for a specific bucket, looks like ~6723|9~45|1~784|2, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.

In Perl, these computations are easily performed, as you sequentially browse the data. The following updates the click count:

$hash_clicks{"$ip\t$day\t$ua_id\t$partner_id\t$affiliate_id"}++;   # $ip, $day, etc. hold the fields of the current click record

Updating the list of UA's associated with a bucket is a bit less easy, but still almost trivial.
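
One way to do it is sketched below, with a few fake records; the field order, and the choice to key the UA hash on the bucket without the UA ID, are my assumptions:

use strict;
use warnings;

# a few fake click records: IP, Day, UA_ID, Partner_ID, Affiliate_ID
my @clicks = (
    "231.54.86.109\tday1\t6723\tP1\tA7",
    "231.54.86.109\tday1\t45\tP1\tA7",
    "231.54.86.109\tday1\t6723\tP1\tA7",
);

my (%hash_clicks, %hash_ua);
for my $line (@clicks) {
    my ($ip, $day, $ua_id, $partner_id, $affiliate_id) = split /\t/, $line;
    $hash_clicks{"$ip\t$day\t$ua_id\t$partner_id\t$affiliate_id"}++;   # click count per atomic bucket
    $hash_ua{"$ip\t$day\t$partner_id\t$affiliate_id"}{$ua_id}++;       # per-UA click counts, UA_ID left out of this key
}

# serialize each UA list in the ~UA_ID|count format described above
for my $bucket (sort keys %hash_ua) {
    my $ua_list = join '', map { "~$_|$hash_ua{$bucket}{$_}" } sort keys %{ $hash_ua{$bucket} };
    print "$bucket\t$ua_list\n";
}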

The problem is that at some point, the hash table becomes too big and will slow down your Perl script to a crawl. The solution is to split the big data into smaller data sets (called subsets), and perform this operation separately on each subset. This is called the Map step in Map-Reduce. You need to decide which fields to use for the mapping. Here, the IP address is a good choice because it is very granular (good for load balancing), and it is the most important metric. We can split the IP address field into 20 ranges based on the first byte of the IP address. This will result in 20 subsets. The splitting into 20 subsets is easily done by sequentially browsing the big data set with a Perl script, looking at the IP field, and throwing each observation into the right subset based on the IP address.
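
A minimal Perl sketch of this splitting step (the file names and the assumption that the IP address is the first tab-separated field are mine, not the article's):

use strict;
use warnings;

my $n_subsets = 20;
my @out;
for my $i (0 .. $n_subsets - 1) {
    open $out[$i], '>', "subset_$i.txt" or die "subset_$i.txt: $!";
}

open my $in, '<', 'clicks.txt' or die "clicks.txt: $!";
while (my $line = <$in>) {
    my ($ip) = split /\t/, $line;                        # IP address assumed to be the first field
    my ($first_byte) = $ip =~ /^(\d+)/;
    next unless defined $first_byte;
    my $subset = int($first_byte * $n_subsets / 256);    # map first byte 0-255 to subsets 0-19
    $subset = $n_subsets - 1 if $subset > $n_subsets - 1;
    print { $out[$subset] } $line;
}
close $_ for $in, @out;

Each subset can then be processed independently (and in parallel), which is exactly the point of the Map step.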

Building a summary table: the Reduce step

Now, after producing the 20 summary tables (one for each subset), we need to merge them together. We can't simply use a hash table here, because it would grow too large and it won't work - which is the reason why we used the Map step in the first place.

Here's the workaround:

Sort each of the 20 subsets by IP address. Merge the sorted subsets to produce a big summary table T. Merging sorted data is very easy and efficient: loop over the 20 sorted subsets with an inner loop over the observations in each sorted subset; keep 20 pointers, one per sorted subset, to keep track of where you are in your browsing, for each subset, at any given iteration.
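
Here is a compact Perl sketch of that merge (it assumes the 20 subsets have already been sorted, e.g. with Unix sort, and that each line starts with the IP address; file names are mine):

use strict;
use warnings;

my $n_subsets = 20;
my (@fh, @current);
for my $i (0 .. $n_subsets - 1) {
    open $fh[$i], '<', "subset_$i.sorted.txt" or die "subset_$i.sorted.txt: $!";
    $current[$i] = readline $fh[$i];        # one "pointer" (current line) per sorted subset
}

open my $out, '>', 'T_sorted.txt' or die "T_sorted.txt: $!";
while (1) {
    my $min;                                # index of the subset whose current line is smallest
    for my $i (0 .. $n_subsets - 1) {
        next unless defined $current[$i];
        $min = $i if !defined $min or $current[$i] lt $current[$min];
    }
    last unless defined $min;               # all subsets exhausted
    print {$out} $current[$min];
    $current[$min] = readline $fh[$min];    # advance that subset's pointer
}
close $_ for $out, @fh;

Because each subset is already sorted, the merged output T is globally sorted by IP address after a single pass over the data.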

Now you have a big summary table T, with multiple occurrences of the same atomic bucket, for many atomic buckets. Multiple occurrences of the same atomic bucket must be aggregated. To do so, browse table T sequentially (it is stored as a text file). You are going to use hash tables again, but small ones this time. Let's say that you are in the middle of a block of data corresponding to the same IP address, say 231.54.86.109 (remember, T is ordered by IP address). Use

$hash_clicks_small{"$day\t$ua_id\t$partner_id\t$affiliate_id"} += $click_count;   # $click_count is the click count read from the current line of T

to update (that is, aggregate) the click count corresponding to atomic bucket (231.54.86.109, Day, UA ID, Partner ID, Affiliate ID). Note one big difference between $hash_clicks and $hash_clicks_small: the IP address is not part of the key in the latter one, resulting in hash tables millions of times smaller. When you hit a new IP address while browsing T, just save the stats stored in $hash_clicks_small and its satellite small hash tables for the previous IP address, free the memory used by these hash tables, and re-use them for the next IP address found in table T, until you arrive at the end of table T.
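
Putting the pieces together, here is a minimal sketch of that aggregation pass (the file names and the field layout of T - IP, Day, UA ID, Partner ID, Affiliate ID, click count - are assumptions):

use strict;
use warnings;

open my $in,  '<', 'T_sorted.txt'  or die "T_sorted.txt: $!";
open my $out, '>', 'summary_S.txt' or die "summary_S.txt: $!";

my $current_ip = '';
my %hash_clicks_small;

# write out the aggregated buckets for one IP address
sub flush_ip {
    my ($ip, $hash, $fh) = @_;
    print {$fh} "$ip\t$_\t$hash->{$_}\n" for sort keys %$hash;
}

while (my $line = <$in>) {
    chomp $line;
    my ($ip, $day, $ua_id, $partner_id, $affiliate_id, $clicks) = split /\t/, $line;
    if ($ip ne $current_ip) {
        flush_ip($current_ip, \%hash_clicks_small, $out) if $current_ip ne '';
        %hash_clicks_small = ();            # free the small hash and re-use it for the next IP
        $current_ip = $ip;
    }
    # the IP address is not part of the key, so this hash stays small
    $hash_clicks_small{"$day\t$ua_id\t$partner_id\t$affiliate_id"} += $clicks;
}
flush_ip($current_ip, \%hash_clicks_small, $out) if $current_ip ne '';
close $_ for $in, $out;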

Now you have the summary table you wanted to build, let's call it S. The initial data set had 50 million clicks and dozens of fields, some occupying a lot of space. The summary table is much more manageable and compact, although still far too large to fit in Excel.

Creating rules

The rule set for fraud detection will be created based only on data found in the final summary table S (and additional high-level summary tables derived from S alone). An example of a rule is "IP address is active 3+ days over the last 7 days". Computing the number of clicks and analyzing this aggregated click bucket is straightforward using table S. Indeed, the table S can be seen as a "cube" (from a database point of view), and the rules that you create simply narrow down on some of the dimensions of this cube. In many ways, creating a rule set consists of building less granular summary tables on top of S, and testing.
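
For instance, here is a sketch of that example rule computed directly from S (again assuming the IP and Day fields come first in each record of S):

use strict;
use warnings;

my %days_per_ip;
open my $in, '<', 'summary_S.txt' or die "summary_S.txt: $!";
while (my $line = <$in>) {
    my ($ip, $day) = split /\t/, $line;
    $days_per_ip{$ip}{$day} = 1;
}
close $in;

# flag IP addresses active on 3 or more distinct days over the 7-day window
for my $ip (sort keys %days_per_ip) {
    my $active_days = scalar keys %{ $days_per_ip{$ip} };
    print "$ip\tactive on $active_days days\n" if $active_days >= 3;
}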

Improvements

IP addresses can be mapped to an IP category, and IP category should become a fundamental metric in your rule system. You can compute summary statistics by IP category. See details in my article Internet topology mapping. Finally, automated nslookups should be performed on thousands of test IP addresses (both bad and good, both large and small in volume).

Likewise, UA's (user agents) can be categorized - a nice taxonomy problem by itself. At the very least, use three UA categories: mobile, (nice) crawler that identifies itself as a crawler, and other. The purpose of the UA list, such as ~6723|9~45|1~784|2 (see above), for each atomic bucket is to identify schemes based on multiple UA's per IP, as well as the type of IP proxy (good or bad) we are dealing with.

Historical note: Interestingly, the first time I was introduced to a Map-Reduce framework was when I worked at Visa in 2002, processing rather large files (credit card transactions). These files contained 50 million observations. SAS could not sort them; it would make SAS crash because of the many large temporary files SAS creates to do a big sort. Essentially, it would fill the hard disk. Remember, this was 2002, and it was an earlier version of SAS, I think version 6. Version 8 and above are far superior. Anyway, to solve this sort issue - an O(n log n) problem in terms of computational complexity - we used the "split / sort subsets / merge and aggregate" approach described in my article.

Conclusion

I showed you how to extract and summarize data from large log files using Map-Reduce, and then create a hierarchical database with multiple, hierarchical levels of summarization, starting with a granular summary table S containing all the information needed at the atomic level (atomic data buckets), all the way up to high-level summaries corresponding to rules. In the process, only text files are used. You can call this a NoSQL Hierarchical Database (NHD). The granular table S (and the way it is built) is similar to the Hadoop architecture.

Originally posted on DataScienceCentral, by Dr. Granville.

Read more…
BAB - The Ultimate Gaming Workstation Server

What makes a computer blistering fast? The answer really depends on what you want to do with it and can even be quite complex depending on your requirements. Take for instance bitcoin mining: custom bitcoin mining rigs can appear very unusual, since many prefer to use graphics cards for the bulk of their bitcoin processing power.
Read more…

20 Big Data Repositories You Should Check Out

This is an interesting listing created by Bernard Marr. I would add the following great sources:

Bernard's selection:

  1. Data.gov 
  2. US Census Bureau 
  3. European Union Open Data Portal 
  4. Data.gov.uk 
  5. The CIA World Factbook 
  6. Healthdata.gov 
  7. NHS Health and Social Care Information Centre 
  8. Amazon Web Services public datasets 
  9. Facebook Graph 
  10. Gapminder 
  11. Google Trends 
  12. Google Finance 
  13. Google Books Ngrams 
  14. National Climatic Data Center 
  15. DBPedia 
  16. Topsy 
  17. Likebutton 
  18. New York Times 
  19. Freebase 
  20. Million Song Data Set 

Originally posted on DataScienceCentral

Related articles

Read more…

Common Problems with Data

When learning data science a lot of people will use sanitized datasets they downloaded from somewhere on the internet, or the data provided as part of a class or book. This is all well and good, but working with “perfect” datasets that are ideally suited to the task prevents them from getting into the habit of checking data for completeness and accuracy.

Out in the real world, while working with data for an employer or client, you will undoubtedly run into issues with data that you will need to check for and fix before being able to do any useful analysis. Here are some of the more common problems I’ve seen:

  • Apostrophes – I absolutely hate apostrophes, also known as “single quotes”, because they are part of so many company names (or last names, if you’re Irish), yet so many databases and analytics programs choke on them. In a CSV you can just search and destroy, but other cases aren’t so easy. And what if the dataset really does include quotes for some reason? You’ll have to find and replace by column rather than en masse.
  • Misspellings or multiple spellings – God help the data scientist whose dataset includes both “Lowe’s” (the home improvement company) and “Loews” (the hotel company). You’ll have “lowe’s,” “Lowe’s,” “Lows,” “Loew’s,” “loews” and probably some I’m not even listing. Which is which? The best way to fix this is by address, if that’s included in the dataset. If not, good luck.
  • Not converting currency – Ever had a client who assumed that dollars were dollars, whether they came from Singapore or the USA? And if you’re forced to convert after the fact, which exchange rate should you use? The one for the date of the transaction, the one for the date it cleared, or something else?
  • Different currency formats – Some use a comma to signify thousands, some use periods (a minimal normalization sketch follows this list).
  • Different date formats – Is it Month/Date/Year, or is it Date/Month/Year? Depends on who you ask. As with many things, this is different outside the US versus inside.
  • Using zero for null values – Sometimes a problem, sometimes not. But you have to know the difference. Applying the fix is easy enough; knowing when to do it is the key.
  • Assuming a number is really a number – In most analytics software you should treat certain numbers (ZIP codes, for example) as text. Why? Because the number doesn’t represent a count of something; it represents a person, place, or selection. Rule of thumb: if it’s not a quantity, it’s probably not a number.
  • Analytics software that only accepts numbers – In RapidMiner, for example, you have to convert binary options (“yes” and “no,” or “male” and “female”) to 1 and 0.
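
To illustrate the decimal-separator issue, here is a crude Perl heuristic (my own sketch, with the obvious caveat that it only handles the two formats shown):

use strict;
use warnings;

# a comma followed by one or two digits at the end of the value is treated as a
# decimal mark; otherwise commas are assumed to be thousands separators
sub normalize_amount {
    my ($s) = @_;
    if ($s =~ /,\d{1,2}$/) {
        $s =~ tr/.//d;     # "1.234,56" -> "1234,56"
        $s =~ tr/,/./;     # "1234,56"  -> "1234.56"
    } else {
        $s =~ tr/,//d;     # "1,234.56" -> "1234.56"
    }
    return $s + 0;
}

print normalize_amount("1.234,56"), "\n";   # prints 1234.56
print normalize_amount("1,234.56"), "\n";   # prints 1234.56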

These are just a few of the more common issues I’ve seen in the field. What have you come across?

Originally posted on DataScienceCentral by Randal Scott King

Read more…

The Elements of Statistical Learning (Data Mining, Inference, and Prediction)

Hastie, Tibshirani and Friedman. Springer-Verlag.

During the past decade there has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.

This book is available here

 

 

 

Read more…

The Only Skill you Should be Concerned With

Skills, skills, skills!!! Which ones should I learn? Which ones do I need to land the job, to impress the client, to prepare for the future, to stay relevant? What programming languages should I learn? What technologies should I master? What business books should I read? Is there a course I can take, or a certification I can enroll in? Should I focus on being a specialist to ensure I am always the "go-to person" despite commoditization, or should I concentrate on generalist skills so I can always see the forest for the trees? A mixture of both? Is there a Roadmap? A Bible? A Guru? Help!!! Look. Languages change, technologies evolve, and so-called experts come and go. Just when that awesome course ends, something new pops up that the course didn't cover. Just when you became an R ninja, Python came around the corner and became the de facto standard. Just when you finally mastered how to lay out a kick-ass data pipeline using Hadoop, Spark became the new thing.
Read more…

R Tutorial for Beginners: A Quick Start-Up Kit

Learn R: A Statistical Programming Language. Here's my quick start-up kit for you. Install R - on Linux, "sudo apt-get install r-base" should do it; on Windows, go get it here. Open a script window alongside the console window when you run R; it should look something like this (Console and Script windows). Your console allows typing commands directly: hit Enter and R runs the line. If it returns to the prompt (the red ">"), then that command processed. Your script file is for typing in as much as you want; to run whatever is there, highlight what you want to run and hit Ctrl+R or the icon on top, and it will run in the console. This basic setup is enough to begin with. The quickest approach is to go to the appendix of the Intro manual and walk through typing in all the commands to see how it basically works. You'll quickly see that you feed equations, functions, values, objects, etc. from the right to the named variable or object on the left using the " <- " characters.
Read more…
After getting oriented to the research problems of phenology, understanding data collection and storage, and discussing the statistical methods and approaches during the past few days of our expedition to Acadia National Park, we dug into solutions and designs on day four. Fundamentally, more complete and accurate data sets around bird migration, barnacle abundance, weather, duck population, and water resource data all help us understand the impact of climate change. Today’s effort was focused on the questions to seek answers to, the data sources to ingest, the models to build, and the visualizations to share with others, ultimately leading to a solution and approach.
Read more…
In this series, we provided an introduction to the project and cited specific technology improvements that could transform the way phenology is studied by using stationary camera networks and machine-based image processing on big data sets and big data platforms. With days one and two behind us, our team spent the day learning about current data archives, weather station sensors, data processing issues, current models used, and visualizations. Even though this week's trip is only half over, there are already very clear ways that technology can change the way science is practiced today, and I share these concepts below.
Read more…
In the first post of this series, we gave the background on our data science expedition to Acadia National Park, and now we are seeing its transformative potential. As representatives from Pivotal and EMC, our goal is to help a team of phenology scientists improve the way they use big data platforms as well as data science tools and techniques to improve their research and fast-forward our understanding of climate change. In this post, I wanted to share what we experienced in the field for Day 2 - actually collecting data on bird migration and aquatic life in tidal pools, as well as thinking about how to automate and improve the quality of these data collection processes. I'm happy to report that, in just 2 days, we've begun formulating ways to use a network of stationary cameras, image processing technology, data lakes, and mobile apps to help automate the process - ultimately helping scientists spend more time on science and less time on administrative tasks.
Read more…
As data scientists, we get excited about using our talents to solve problems like global climate change and worldwide environmental policy. This week, I have the opportunity to represent Pivotal and team with other experts from EMC, Earthwatch, and Schoodic Institute to spend a week at Acadia National Park. We will be applying data science to the science of phenology—the study of periodic plant and animal life cycle events and how these are influenced by seasonal and inter-annual variations in climate. Ultimately, the work will help scientists and researchers to better collect, store, manage, and monitor data, helping us all understand how and why our climate is changing and what the impact is on plants, animals, and humans.
Read more…
Deep learning is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud detection applications in finance. The computational complexity of a deep learning network dictates the need for a distributed realization. Our intention is to parallelize the training phase of the network and consequently reduce training time. We have built the first prototype of our distributed deep learning network over Spark, which has emerged as a de facto standard for realizing machine learning at scale.
Read more…
