Originally posted on Data Science Central
Summary: If you’re making the decision to use NoSQL, how do you quantify the value of the investment?
If you are exploring NoSQL, then once you've learned the basics two questions will rapidly move to the top of your list of considerations.
- What does it cost?
- What’s the dollar payoff?
The cost side is the easier of the two: gather the cost elements for hardware and software, add the direct and indirect manpower costs, and sum them up. Less straightforward is estimating the dollar benefit.
In this article we'll assume that you're looking at storing large quantities of data to supplement your existing transactional files. This could be geographic, RFID, sensor, text, or any of the other types of unstructured and semi-structured data for which NoSQL is ideal. In broad terms we're talking about key-value stores or document-oriented DBs, and less about graph or columnar DBs. Think Hadoop, Mongo, Cloudera, or one of the other competitors in this space. What you've already figured out, or at least strongly suspect, is that this is far more data than you currently have, probably by a factor of 10X to 1,000X, and it's pretty clear that your RDBMS is not the place to put it.
There are two broad categories of benefit in NoSQL, and each needs to be considered separately. One is easy to quantify; the other less so.
You've got to put the data somewhere, and that hardware costs money. Your current RDBMS data warehouse and transactional systems reside on high-reliability (expensive) servers, and the RDBMS software can be equally costly depending on which brand name you're using.
The well-known benefit of NoSQL is that many of these platforms, like Hadoop and its variants, are open source and therefore quite inexpensive compared to brand names like IBM, Oracle, SAP, and the like. Further, because of some unique architecting that we discussed in previous posts, NoSQL can safely run on commodity hardware, which is significantly less expensive than high-reliability servers.
What does all this add up to dollar-wise? Brian Garrett, the Hadoop lead at SAS, offers these approximate comparative numbers:
- $15,000 per terabyte to store data on an appliance.
- $5,000 per terabyte to store data on a SAN (storage area network).
- Less than $1,000 per terabyte to store data on an open source NoSQL DB like Hadoop.
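To see what those per-terabyte rates mean for your own project, a quick back-of-the-envelope comparison helps. The sketch below uses the figures quoted above; the 100 TB volume is an illustrative assumption, not a number from this article.

```python
# Per-terabyte storage rates quoted above (approximate, in USD).
# The open-source figure is quoted as "less than $1,000"; we treat
# $1,000 as an upper bound.
COST_PER_TB = {
    "appliance": 15_000,
    "SAN": 5_000,
    "open-source NoSQL (e.g. Hadoop)": 1_000,
}

def storage_cost(terabytes: float) -> dict:
    """Return the estimated storage cost for each option at a given volume."""
    return {option: rate * terabytes for option, rate in COST_PER_TB.items()}

# Assume 100 TB of supplemental sensor/text/geographic data (hypothetical).
for option, cost in storage_cost(100).items():
    print(f"{option}: ${cost:,.0f}")
```

Even at the upper bound, the open-source option comes in at a fifth of the SAN cost and a fifteenth of the appliance cost at any volume, since the rates scale linearly.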
The dollar savings from distributed NoSQL storage are pretty straightforward. The value of distributed processing, however, depends a lot on your business and is a much tougher question. It's particularly tough because you may need to value types of analysis that your company isn't doing, or can't do, right now.
For example, let's say your company is already using predictive analytics to guide profitable marketing campaigns. You've got your transactional data whipped into pretty good shape in an RDBMS data warehouse, and your team of analysts and data scientists is doing a good job. What's the added value of being able to analyze much more massive data sets?
There are two scenarios at work here. One is that you have a very large number of customers, say several million, and so far it's been too cumbersome to run analytics across all of them at once. The second is that you have a smaller, manageable number of customers but you want to add, say, text or geographic data to make your analytics more accurate and therefore more profitable. Or you may face both at once.
The factor that works against you is time. In some companies with very large customer bases, extracting or modeling against the whole data set can take overnight to many days. Here's an example from a financial services firm whose data scientists work to refine the best models for predicting the success of retail marketing campaigns.
Before adopting NoSQL, this team was able to prepare about one iteration of its predictive model every 5 hours; the I/O from a single source database was the bottleneck. NoSQL platforms like Hadoop, Mongo, or Cloudera, however, let you do a portion of the processing on each of the separate nodes and then combine the results. This parallel processing makes analytics much faster.
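The split-process-combine pattern described above can be sketched in a few lines. This is a sequential toy simulation, not a real cluster: each "node" computes a partial result on its own data shard, and the partials are then combined, which is the step that real NoSQL platforms parallelize across machines. The field names and shard layout are illustrative assumptions.

```python
# Toy illustration of the map/combine pattern behind distributed processing.
def partial_count(shard):
    """'Map' step: one node counts positive responses within its own shard."""
    return sum(1 for record in shard if record["responded"])

# Toy campaign data, partitioned into 4 shards the way a distributed
# store would hold it across 4 nodes.
records = [{"responded": i % 3 == 0} for i in range(12)]
shards = [records[i::4] for i in range(4)]

partials = [partial_count(s) for s in shards]  # one partial result per "node"
total = sum(partials)                          # 'combine' step: merge partials
print(total)
```

Because each shard is processed independently, the expensive step runs on many machines at once; only the small partial results travel over the network to be combined.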
After implementation, they reduced the model iteration time to 6 minutes, meaning they could run 50 experimental models in the time previously required to run just one.
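The arithmetic behind that 50x figure is simply the ratio of the two iteration times:

```python
# Speedup from the financial services example: 5 hours per model
# iteration before, 6 minutes after.
before_minutes = 5 * 60   # one iteration against the single-source RDBMS
after_minutes = 6         # one iteration with distributed processing

speedup = before_minutes / after_minutes
print(f"{speedup:.0f}x as many experimental models in the same time")
```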
The second factor at work here is that combining the newly available unstructured data into their models almost doubled the lift (accuracy) of their models. Between the much higher throughput and the availability of wholly new types of data, the group's productivity skyrocketed, and the average profitability driven by their better-refined and more informed models also increased dramatically.
There's an important caveat here. The analytic platform your data scientists use to access and process the NoSQL data stores must be able to benefit from the distributed processing capabilities of the NoSQL DB. In the old model, data is extracted to a separate data store where the analytics take place; for big data, that I/O will kill efficiency. Many of the major analytic platforms, like SAS, Alteryx, and Pivotal/Greenplum, are specifically designed to move as much processing as possible back into the NoSQL database, which is the only way to really benefit from this speed improvement.
In terms of finally estimating value, the storage cost savings are straightforward to calculate. The benefit from the new data types and distributed processing, however, requires that you have a good idea of how the data will be used, and that you use current use cases as benchmarks for the increase in efficiency and accuracy as it applies to your situation. That's certainly more challenging, but worth it to be able to track the return from your project against your expectations.
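A first-pass estimate can follow the structure laid out above: the storage savings are a simple rate difference, while the modeling benefit has to be an assumption grounded in your own benchmark use cases. Every input number in this sketch is illustrative, not a figure from the article.

```python
# Hedged sketch of a first-pass value estimate: hard storage savings
# plus an assumed modeling benefit. All inputs are hypothetical.
def project_annual_value(tb_stored, current_rate_per_tb, nosql_rate_per_tb,
                         baseline_campaign_profit, expected_uplift_pct):
    """Storage savings are straightforward; the modeling benefit must be
    estimated from benchmark use cases in your own business."""
    storage_savings = tb_stored * (current_rate_per_tb - nosql_rate_per_tb)
    modeling_benefit = baseline_campaign_profit * expected_uplift_pct
    return storage_savings + modeling_benefit

# Example: 100 TB moved off a $5,000/TB SAN onto ~$1,000/TB NoSQL storage,
# plus an assumed 10% profit uplift on $2M of annual campaign profit from
# faster, more accurate models.
estimate = project_annual_value(100, 5_000, 1_000, 2_000_000, 0.10)
print(f"${estimate:,.0f}")
```

The point of writing it down this way is that each assumption becomes a number you can track against actuals once the project is live.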
October 17, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: