In moving from traditional time-series modeling to predictive analytics, one of the key steps is bringing different causal inputs into your forecasting. That means not relying only on internal shipment data or order history, but also considering external factors and a multitude of variables that paint a more complete picture.

Knowing how to integrate this new data, and which inputs to use, can be intimidating and challenging, but done well it is rewarding and profitable.

One of the most significant new models I have seen in a while helps bridge the gap between data and insights, and turns multiple inputs into valuable forecasted outputs. It is a powerful model that even inexperienced forecasters and data scientists can use. Using this new methodology, you are almost certain to get the best-fitting forecast or the highest R-squared with little effort or concern. This is a brief introduction to this new method.

It is called Auto Phantom Regression with Integrated Linear Forecasting Operation and Ordinary Least Squared Estimator.

Although the name of the model is long (I am sure they will eventually come up with a good acronym), the name really highlights exactly what it does. Imagine a scenario where you have many predictor variables, or don’t even know what variables exist, and you’re not sure how to include them. Because there are so many predictor variables, you would like some help in creating a good model automatically. The model does this by trying and testing many phantom variables during the exploratory stages, building a regression forecast based on ordinary least squares to find the best fit.

It accomplishes this with a type of stepwise regression: it selects a model by automatically adding or removing individual real and dummy predictors, one step at a time, based on their statistical significance. The end result of this process is a single regression model, which makes it nice and simple. What makes this model extra special is that every time it adds or removes a predictor based on a statistical test, you also invoke a phantom degree of freedom, because you’re learning something from your data set that does not show up as a degree of freedom.

These phantom degrees of freedom will not impact your number of observations per parameter estimate or affect the predicted R-squared. Instead, they allow the model to perform many statistical tests and try out many models based on actual and dummy variables until it finds a combination that appears to be significant and gives you the highest R-squared.

Now there are some serious concerns and words of caution with this revolutionary new model. First, when you try many different models and variables, you’re bound to find one that best fits the data but doesn’t fit reality. Second, there is no magic bullet when it comes to building a model. And third, while there are components which are real and useful, I do not actually know of any Auto Phantom Regression with Integrated Linear Forecasting Operation and Ordinary Least Squared Estimator model or, as it is known by its acronym, APRILFOOLS!

Lessons To Be Learned

You are training a model, not teaching it to memorize your data

You can learn a lot from trying different variables and bringing multiple data sets into your forecasting process. But be careful. When using regression models, a degree of freedom is a measure of how much you’ve learned. Your model uses up degrees of freedom with every parameter that it estimates. If you use too many for the amount of data you have, you’re overfitting the model. The end result is that the regression coefficients, p-values, and R-squared can all be misleading and, while the model fits the data, it does not serve as a useful forecast.

Before throwing data about every potential predictor under the sun into your regression model, remember that more data may not make the model better. With regression, as with so many things in life, there comes a point where adding more is not better. In fact, sometimes adding more factors to a regression model not only fails to make things clearer, it actually makes things harder to understand!
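To make that point concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the "demand" series and all 20 predictors are random noise, and the helper names `r_squared` and `adjusted_r_squared` are hypothetical. Because each model contains the previous one, in-sample R-squared can only go up as unrelated predictors are added, while adjusted R-squared charges for every predictor you estimate.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R-squared of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of estimated predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


rng = np.random.default_rng(0)
n_obs = 24                                 # assumed: two years of monthly history
y = rng.normal(size=n_obs)                 # the "demand" is pure noise
candidates = rng.normal(size=(n_obs, 20))  # 20 predictors unrelated to y

r2_by_k = []
adj_by_k = []
for k in range(1, 21):
    # Fit using the first k noise predictors, so each model nests the last one.
    r2 = r_squared(candidates[:, :k], y)
    r2_by_k.append(r2)
    adj_by_k.append(adjusted_r_squared(r2, n_obs, k))

for k in (1, 5, 10, 20):
    print(f"{k:2d} predictors: R2={r2_by_k[k - 1]:.2f}  adjusted R2={adj_by_k[k - 1]:.2f}")
```

None of these predictors has anything to do with the target, so any rise in R-squared here is the model memorizing the data rather than learning from it.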

There is No Perfect Model

Yes, we absolutely need to start looking at various inputs to improve our forecast. There is still a place for conducting prior research into the important variables and their relationships to help you specify the best model. When you are using new variables, collect a large enough sample size to support the level of model complexity you need. Avoid data mining for what may work and keep track of how many phantom degrees of freedom you raise before arriving at your final model.
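One way to see why the number of candidates you try matters is the rough sketch below, again in Python with NumPy and again on invented data. The target and all 1,000 candidate variables are random noise, and `abs_correlations` is a hypothetical helper. The sketch records the strongest correlation found after screening 1, 10, 100, and 1,000 candidates; since each set contains the previous one, the best match can only get better the longer you search.

```python
import numpy as np


def abs_correlations(X, y):
    """Absolute Pearson correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


rng = np.random.default_rng(1)
n_obs = 24                                   # assumed: 24 periods of history
y = rng.normal(size=n_obs)                   # target is pure noise
candidates = rng.normal(size=(n_obs, 1000))  # 1,000 unrelated candidate inputs

corr = abs_correlations(candidates, y)

# Best correlation found after screening the first m candidates.
best_by_tried = {m: float(corr[:m].max()) for m in (1, 10, 100, 1000)}

for m, best in best_by_tried.items():
    print(f"tried {m:4d} candidates: strongest |correlation| = {best:.2f}")
```

The final model would show only one predictor, but the search behind it tested many. Writing down how many candidates you screened is a simple way to keep that hidden cost in view.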

I don’t care whether you are using a traditional time-series model or a sophisticated machine learning algorithm: when you hear the words “best pick”, be cautious. If you evaluate your model on the same data you used to train it, the model could very easily be overfit without your noticing. To help avoid this, keep a testing data set or time-series holdout periods. This is a part of the overall data set that you set aside and use to provide an unbiased evaluation of the final model’s fit before putting it into production.
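As a hedged illustration of a time-series holdout, the sketch below (Python with NumPy, invented data, hypothetical helper names `fit_ols`, `predict`, and `mse`) fits two regressions on the first 24 periods and scores both on the last 12. One model uses a single made-up causal driver; the other adds 15 unrelated inputs. The split keeps time order, so the model is always tested on periods that come after the ones it was trained on.

```python
import numpy as np


def fit_ols(X, y):
    """Return ordinary least squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to new predictor values."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def mse(actual, predicted):
    """Mean squared error."""
    return float(np.mean((actual - predicted) ** 2))


rng = np.random.default_rng(2)
n_obs, n_holdout = 36, 12                  # assumed: 3 years monthly, last year held out
driver = rng.normal(size=n_obs)            # one real causal input (made up)
y = 2.0 * driver + rng.normal(size=n_obs)  # demand depends on the driver plus noise
noise = rng.normal(size=(n_obs, 15))       # 15 inputs unrelated to demand

X_small = driver.reshape(-1, 1)
X_big = np.column_stack([driver, noise])

split = n_obs - n_holdout  # train on the past, test on the most recent periods
results = {}
for name, X in (("one driver", X_small), ("kitchen sink", X_big)):
    coef = fit_ols(X[:split], y[:split])
    results[name] = {
        "train_mse": mse(y[:split], predict(coef, X[:split])),
        "holdout_mse": mse(y[split:], predict(coef, X[split:])),
    }

for name, r in results.items():
    print(f"{name:12s} train MSE={r['train_mse']:.2f}  holdout MSE={r['holdout_mse']:.2f}")
```

The larger model will always fit the training periods at least as well as the smaller one. The holdout error is what tells you whether that fit carries over to data the model has not seen.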

Finally, if a model seems too good to be true and over-sophisticated to the point that you do not completely understand it, it may not be the right model. There is no replacement for experience and knowledge, and learning what is right for your forecasting process.

For further information on choosing the right forecasting model, click here. Alternatively, pick up a copy of my new book, Predictive Analytics For Business Forecasting.

For insight into real predictive analytics models that identify meaningful causal relationships in your data, attend IBF’s Predictive Business Analytics, Forecasting & Planning Conference from April 20-22, 2021.