Featured Posts (70)

BAB - The Ultimate Gaming Workstation Server

What makes a computer blisteringly fast? The answer really depends on what you want to do with it, and can even be quite complex depending on your requirements. Take, for instance, bitcoin mining: custom bitcoin mining rigs can look very unusual, since many miners prefer to use graphics cards for the bulk of their processing power.
Read more…
Deep learning is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud detection applications in finance. The computational complexity of a deep learning network dictates the need for a distributed realization. Our intention is to parallelize the training phase of the network and consequently reduce training time. We have built the first prototype of our distributed deep learning network on Spark, which has emerged as a de facto standard for machine learning at scale.
Read more…

Common Problems with Data

When learning data science, a lot of people will use sanitized datasets they downloaded from somewhere on the internet, or the data provided as part of a class or book. This is all well and good, but working with “perfect” datasets that are ideally suited to the task prevents them from getting into the habit of checking data for completeness and accuracy.

Out in the real world, while working with data for an employer or client, you will undoubtedly run into issues with data that you will need to check for and fix before being able to do any useful analysis. Here are some of the more common problems I’ve seen:

  • Apostrophes – I absolutely hate apostrophes, also known as “single quotes”, because they are part of so many company names (or last names, if you’re Irish), yet so many databases and analytics programs choke on them. In a CSV you can just search and destroy, but other cases aren’t so easy. And what if the dataset really does include quotes for some reason? You’ll have to find and replace by column rather than en masse.
  • Misspellings or multiple spellings – God help the data scientist whose dataset includes both “Lowe’s” (the home improvement company) and “Loews” (the hotel company). You’ll have “lowe’s,” “Lowe’s,” “Lows,” “Loew’s,” “loews” and probably some I’m not even listing. Which is which? The best way to fix them is by address, if that’s included in the dataset. If not, good luck.
  • Not converting currency – Ever had a client who assumed that dollars were dollars, whether they came from Singapore or the USA? And if you’re forced to convert after the fact, which exchange rate should you use? The one for the date of the transaction, the one for the date it cleared, or something else?
  • Different currency formats – Some locales use a comma to signify thousands, others use periods.
  • Different date formats – Is it Month/Date/Year, or is it Date/Month/Year? Depends on who you ask. As with many things this is different outside the US versus inside.
  • Using zero for null values – Sometimes a problem, sometimes not. But you have to know the difference. Applying the fix is easy enough, knowing when to do it is the key.
  • Assuming a number is really a number – In most analytics software you should treat certain numbers (ZIP codes, for example) as text. Why? Because the number doesn’t represent a count of something, it represents a person, place, or selection. Rule of thumb: if it’s not a quantity, it’s probably not a number.
  • Analytics software that only accepts numbers – In RapidMiner, for example, you have to convert binary options (“yes” and “no,” or “male” and “female”) to 1 and 0.

These are just a few of the more common issues I’ve seen in the field. What have you come across?

Originally posted on DataScienceCentral by Randal Scott King

Read more…

The Elements of Statistical Learning (Data Mining, Inference, and Prediction)

Hastie, Tibshirani and Friedman. Springer-Verlag.

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.

This book is available here

Read more…

The Only Skill you Should be Concerned With

Skills, skills, skills!!! Which ones should I learn? Which ones do I need to land the job, to impress the client, to prepare for the future, to stay relevant? What programming languages should I learn? What technologies should I master? What business books should I read? Is there a course I can take, or a certification I can enroll in? Should I focus on being a specialist to ensure I am always the "go-to person" despite commoditization, or should I concentrate on generalist skills so I can always see the forest for the trees? A mixture of both? Is there a Roadmap? A Bible? A Guru? Help!!! Look. Languages change, technologies evolve, and so-called experts come and go. Just when that awesome course ends, something new pops up that the course didn't cover. Just when you became an R ninja, Python came around the corner and became the de facto standard. Just when you finally mastered how to lay out a kick-ass data pipeline using Hadoop, Spark became the new thing.
Read more…

R Tutorial for Beginners: A Quick Start-Up Kit

Learn R: A Statistical Programming Language. Here's my quick start-up kit for you. Install R. On Linux, "sudo apt-get install r-base" should do it; on Windows, go get it here. Open a Script window alongside the Console window when you run R. The Console allows typing commands directly: hit Enter and R runs the line. If it returns to the prompt (the red ">"), then that command processed. Your script file is for typing in as much as you want. To run whatever is there, highlight what you want to run and hit Ctrl+R or the Run icon on top; it will run in the console. This basic setup is enough to begin. The quickest approach is to go to the Appendix of the Intro Manual and walk through typing in all the commands to see how it basically works. You'll see quickly that you feed equations, functions, values, objects, etc. from the right to the named variable or object on the left using the " <- " characters.
Read more…
After getting oriented to the research problems of phenology, understanding data collection and storage, and discussing the statistical methods and approaches during the past few days of our expedition to Acadia National Park, we dug into solutions and designs on day four. Fundamentally, more complete and accurate data sets around bird migration, barnacle abundance, weather, duck population, and water resource data all help us understand the impact of climate change. Today’s effort was focused on the questions to seek answers to, the data sources to ingest, the models to build, and the visualizations to share with others, ultimately leading to a solution and approach.
Read more…
In this series, we provided an introduction to the project and cited specific technology improvements that could transform the way phenology is studied, using stationary camera networks and machine-based image processing on big data sets and big data platforms. With days one and two behind us, our team spent the day learning about current data archives, weather station sensors, data processing issues, current models used, and visualizations. Even though this week's trip is only half over, there are already very clear ways that technology can change the way science is practiced today, and I will share these concepts below.
Read more…
In the first post of this series, we gave the background on our data science expedition to Acadia National Park, and now we are seeing its transformative potential. As representatives from Pivotal and EMC, our goal is to help a team of phenology scientists improve the way they use big data platforms as well as data science tools and techniques to improve their research and fast-forward our understanding of climate change. In this post, I wanted to share what we experienced in the field for Day 2—actually collecting data on bird migration and aquatic life in tidal pools, as well as thinking about how to automate and improve the quality of these data collection processes. I’m happy to report, in just 2 days, we’ve begun formulating ways to use a network of stationary cameras, image processing technology, data lakes, and mobile apps to help automate the process—ultimately helping scientists spend more time on science and less time on administrative tasks.
Read more…
As data scientists, we get excited about using our talents to solve problems like global climate change and worldwide environmental policy. This week, I have the opportunity to represent Pivotal and team with other experts from EMC, Earthwatch, and Schoodic Institute to spend a week at Acadia National Park. We will be applying data science to the science of phenology—the study of periodic plant and animal life cycle events and how these are influenced by seasonal and inter-annual variations in climate. Ultimately, the work will help scientists and researchers to better collect, store, manage, and monitor data, helping us all understand how and why our climate is changing and what the impact is on plants, animals, and humans.
Read more…
