If we can gain insights from just a small amount of internal structured data, then how much more could we glean from Big Data? I’m talking about that external mass of structured and unstructured data that is just waiting to be collected and analyzed.
But there’s a balance between not enough data and too much. What’s the right amount of data to work with as a demand planner or data scientist?
There is a debate about how much data is enough and how much data is too much. According to some, the rule of thumb is to think smaller and focus on quality over quantity. On the other hand, Viktor Mayer-Schönberger and Kenneth Cukier explained in their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, that “When data was sparse, every data point was critical, and thus great care was taken to avoid letting any point bias the analysis. However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming.”
Of course, larger datasets are more likely to have errors, and analysts don’t always have time to carefully clean each and every data point. Mayer-Schönberger and Cukier have an intriguing response to this problem, saying that “moving into a world of big data will require us to change our thinking about the merits of exactitude. The obsession with exactness is an artifact of the information-deprived analog era.”
Supporting this idea, some studies in data science have found that even massive, error-prone datasets can be more reliable than simple and smaller samples. The question is, therefore, are we willing to sacrifice some accuracy in return for learning more?
Like so many things in demand planning and predictive analytics, one size does not always fit all. You need to understand your business problem, understand your resources, and understand the trade-offs. There is no rule about how much data you need for your predictive modeling problem.
The amount of data you need ultimately depends on a variety of factors:
The Complexity Of The Business Problem You’re Solving
This does not necessarily mean computational complexity (although that is an important consideration). How important is precision versus information? You should define the business problem and then select the data closest to that goal. For example, if you want to forecast the future sales of a particular item, the historical sales of that item may be the closest to that goal. Next come other drivers that may contribute to future sales or help explain past sales. Attributes that have no correlation to the problem are not needed.
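One rough way to act on this is to screen candidate drivers by how strongly they correlate with the sales history you are trying to forecast. The sketch below is only an illustration: the numbers are made up, the names `promo_spend`, `price`, and `shelf_color_code` are hypothetical, and the 0.3 cutoff is an arbitrary assumption rather than a recommended standard.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / sqrt(var_x * var_y)

def screen_drivers(target, candidates, threshold=0.3):
    """Keep candidates whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Made-up history for one item (illustrative only).
sales = [120, 135, 150, 160, 155, 170, 190, 200]
candidates = {
    "promo_spend": [10, 12, 15, 16, 15, 18, 21, 22],            # hypothetical driver
    "price": [9.9, 9.7, 9.5, 9.2, 9.4, 9.0, 8.6, 8.5],          # hypothetical driver
    "shelf_color_code": [5, 2, 7, 4, 4, 7, 2, 5],               # hypothetical noise attribute
}

kept = screen_drivers(sales, candidates)
for name, r in kept.items():
    print(f"{name}: r = {r:+.2f}")
```

Correlation only picks up linear relationships, so a screen like this is a first pass, not a verdict. An attribute that fails it may still matter through seasonality or interactions with other drivers.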
The Complexity Of The Algorithm
How many samples are needed to demonstrate performance or to train the model? For some linear algorithms, you may find you can achieve good performance with a few dozen to a hundred examples per class. For more complex machine learning algorithms, you may need hundreds or even thousands of examples per class. This is true of nonlinear algorithms such as random forests or artificial neural networks. In fact, some algorithms, like deep learning methods, can continue to improve in skill as you give them more data.
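A common way to check whether you have enough data for a given algorithm is a learning curve: train on progressively larger slices of the data and watch the error on held-out data. The sketch below is a minimal illustration under stated assumptions. It uses synthetic data and a simple least-squares line as a stand-in for whatever model you actually use, and the sample sizes are arbitrary.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a fitted line on (xs, ys)."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train_x, train_y, test_x, test_y, sizes):
    """Fit on the first `size` training rows and score each fit on the same test set."""
    curve = []
    for size in sizes:
        model = fit_line(train_x[:size], train_y[:size])
        curve.append((size, mean_abs_error(model, test_x, test_y)))
    return curve

# Synthetic data (assumption): a noisy linear relationship, seeded for repeatability.
rng = random.Random(42)

def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + 20.0 + rng.gauss(0, 5) for x in xs]
    return xs, ys

train_x, train_y = make_data(200)
test_x, test_y = make_data(100)

curve = learning_curve(train_x, train_y, test_x, test_y, sizes=[5, 10, 25, 50, 100, 200])
for size, err in curve:
    print(f"{size:>4} training rows -> held-out MAE {err:.2f}")
```

If the held-out error has flattened by the time you reach your full dataset, more rows of the same kind are unlikely to help much. If it is still falling, that is a hint the algorithm could use more data.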
How Much Data Is Available
Are the data’s volume, velocity, or variety beyond your company’s ability to store, process, or use the data? A great starting point is working with what is available and manageable. What kind of data do you already have? In business-to-business, most companies already hold customer records or sales transactions. These datasets usually come from CRM and ERP systems. Many companies are also collecting, or beginning to collect, third-party data in the form of POS data. From here, consider other sources, both internal and external, that can add value or insights.
This does not settle the debate, and the right amount of data is still unknowable in advance. Your goal should be to keep thinking big while working with what you have: gather the data you need for the problem and the algorithm at hand.
Gathering data is like planting a tree: the best time to start was ten years ago. Focus on the data available and the insights you have today while building the roadmap and capabilities you want to achieve in the future. Even if you will not use it right away, start collecting now what you may need tomorrow.