When it comes to forecasting models, there is no one-size-fits-all solution. With so many different inputs, methods, and parameters, how do you choose the right model? And once you have one, how do you evaluate it?
Predictive modeling works on a feedback principle. Planners build a model, get feedback, make improvements, and continue until they achieve the necessary outcome. An important part of any predictive model or process is evaluation: the ability to differentiate models and their results.
There are two ways to evaluate a predictive model:
- Examine its inputs and fit
- Examine its outputs and uncertainty
Evaluating inputs
Evaluating the inputs is a backward-looking assessment of which model fits, or learns, the historical data best. It looks at how the model generates a prediction before one is needed, describing the relative difference between the training (historical) data and a hypothetical forecast generated by the model over that same period. The primary reasons for testing inputs are to learn how to improve a given model and, in the case of causal models, to better assess the effects of policy changes.
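As a minimal sketch of what an in-sample (input) check can look like, the snippet below fits a simple trend model and scores it on the same history it was trained on. The demand series, the linear-trend choice, and the use of scikit-learn are illustrative assumptions, not part of any specific planning process.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical monthly demand history (made-up numbers for illustration)
history = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118], dtype=float)
t = np.arange(len(history)).reshape(-1, 1)  # time index as the only input

# Fit a simple trend model and score it on the data it was trained on
model = LinearRegression().fit(t, history)
fitted = model.predict(t)

# In-sample fit: how closely the model reproduces its own training data
print("In-sample R^2:", r2_score(history, fitted))
print("Largest residual:", np.max(np.abs(history - fitted)))
```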
Evaluating outputs
Evaluating the outputs means assessing the model output against the actual results that occur and the amount of uncertainty or error: waiting for a future observation and then measuring the prediction against it. The type of error measurement you use depends on the type of model you’re evaluating. The major reasons for testing outputs are to determine the precision of a model and to assess uncertainty.
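For concreteness, here is a hedged sketch of three common error measures applied to a forecast once the actuals arrive. The actuals and forecasts are made up; which measure is appropriate depends on the model and the business context.

```python
import numpy as np

# Hypothetical actuals and the forecasts the model produced for the same periods
actuals  = np.array([100.0, 120.0, 140.0, 110.0, 150.0])
forecast = np.array([ 95.0, 130.0, 135.0, 120.0, 140.0])

errors = actuals - forecast
mae  = np.mean(np.abs(errors))                  # average size of the miss
rmse = np.sqrt(np.mean(errors ** 2))            # penalizes large misses more heavily
mape = np.mean(np.abs(errors / actuals)) * 100  # scale-free, in percentage terms

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  MAPE={mape:.1f}%")
```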
The model should achieve your desired outputs and deliver good prediction performance, which means balancing the trade-off between the models you choose and the way you configure them. A general rule is that, as a model tries to match the data points more closely, or as a more flexible method is used, bias decreases but variance increases.
Unfortunately, the model with the best fit does not always give you the best results. A model can be very flexible and perform well on the dataset it was trained on, yet do poorly on actual observations or on data it was not trained on. This is overfitting. On the other hand, if the model is too simple to capture the complexity of the data, it is underfitting.
The Goldilocks Zone
The Goldilocks zone is the sweet spot between simplicity, which increases bias, and flexibility, which increases variance. Generalization is the balancing act between models with high bias and those with high variance. The zone sits just before the point where error on the test dataset starts to increase, where the model has good skill on both the training dataset and the unseen test dataset.
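One way to see this zone is to turn up model flexibility step by step and watch training error versus test error. The sketch below uses polynomial degree as the flexibility knob on made-up data with scikit-learn; both choices are assumptions for illustration only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 80)   # noisy signal (illustrative)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Increase flexibility (polynomial degree) and watch where test error turns up
for degree in (1, 3, 5, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err  = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Training error keeps falling as the degree rises; the degree where test error bottoms out is the Goldilocks zone for this particular data.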
The skill of the model at making predictions determines the quality of the generalization and can help as a guide during the model selection process. It’s always a good idea to try as many models as time and resources permit. The problem then becomes one of selecting the best model for the desired prediction task. Out of the millions of possible models, you should prefer simpler models over complex ones.
Splitting your data with a hold-out process is undoubtedly the simplest model evaluation technique. Take your dataset and split it into two parts: a training set and a test set. After generating a prediction from the training set, test the model on test-set data it has never been exposed to and see whether you get similar results. The typical procedure is to set aside 10% to 30% of the data, either randomly chosen or the most recent observations, and leave it untouched until a model is built and ready to be deployed. Be careful, though: if you keep tweaking your model based on the same hold-out data, you are effectively using the test data to train the model, and you can overfit it there without realizing it.
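A minimal hold-out sketch follows, assuming scikit-learn and a stand-in dataset; the 20% split size and the linear model are illustrative choices, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Stand-in feature matrix and target for your planning data (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 200)

# Set aside 20% as an untouched hold-out; shuffle=False keeps the most recent
# rows for the test set, which is usually what you want for time-ordered data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
print("Hold-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```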
5 Tips To Avoid Under & Over Fitting Forecast Models
In addition, remember these 5 tips to help minimize bias and variance and reduce over- and underfitting.
1. Use a resampling technique to estimate model accuracy
In machine learning, the most popular resampling technique is k-fold cross-validation. The approach is to split the historical data into training, validation, and test sets, develop a model, and validate its performance on k folds (splits) of the training/validation data. This lets you train and test your model k times on different subsets of the training data and build up an estimate of how the model will perform on unseen data.
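Here is a hedged sketch of 5-fold cross-validation with scikit-learn on stand-in data, plus an expanding-window variant that keeps each validation fold in the "future" of its training folds, which is often the safer choice for forecasting. The data, the linear model, and the MAE scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, TimeSeriesSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.8]) + rng.normal(0, 1.0, 120)

model = LinearRegression()

# Standard 5-fold CV: train on 4 folds, validate on the 5th, repeat 5 times
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="neg_mean_absolute_error")
print("k-fold MAE:", -scores.mean())

# Expanding-window split for time-ordered data: validation always lies in the future
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5),
                            scoring="neg_mean_absolute_error")
print("time-series CV MAE:", -ts_scores.mean())
```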
2. Regularization
Regularization refers to a broad range of techniques for artificially forcing your model to be simpler. The method depends on the type of learner you’re using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression. Oftentimes the regularization strength is a hyperparameter as well, which means it can be tuned through cross-validation.
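As one concrete example of the regression case, the sketch below adds a ridge (L2) penalty and tunes its strength by cross-validation with scikit-learn. The data and the grid of penalty values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 10))
# Only the first three features truly matter; the rest are noise (illustrative)
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(0, 1.0, 150)

# The penalty strength (alpha) is itself a hyperparameter, so tune it with CV
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
print("coefficients shrink toward zero:", np.round(model.coef_, 2))
```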
3. Use more data
Many times, with time series or machine learning algorithms in general, training with more data can help the algorithm detect the signal better. Some caution is needed here, though. Adding extra observations (rows) typically reduces overfitting, but adding extra dimensions (columns) is likely to leave you with overfitting problems even if the model itself stays unchanged.
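The sketch below illustrates that second point under stated assumptions: with the number of observations held fixed, stacking on pure-noise features lets a plain linear regression fit the training data ever more closely while its test performance degrades. The data and dimensions are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
n = 60                                   # small, fixed number of observations
signal = rng.normal(size=(n, 2))
y = signal @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, n)

# Keep the rows fixed and keep adding pure-noise columns
for extra_dims in (0, 10, 30, 50):
    X = np.hstack([signal, rng.normal(size=(n, extra_dims))]) if extra_dims else signal
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"extra noise features={extra_dims:2d}  "
          f"train R^2={r2_score(y_tr, model.predict(X_tr)):.2f}  "
          f"test R^2={r2_score(y_te, model.predict(X_te)):.2f}")
```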
4. Focus on adding and removing features
Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model, or may in fact decrease it. For models that do not have built-in feature selection, you can manually improve generalizability by removing irrelevant input features. An interesting way to do so is to tell a story about how each feature fits into the model; if something doesn’t make sense, or it’s hard to justify a feature, that’s a good way to identify candidates for removal.
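For an automated complement to that storytelling exercise, here is a hedged sketch of recursive feature elimination (RFE) in scikit-learn, which repeatedly drops the weakest feature. The data, the target number of features, and the base model are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
# Only features 0, 2 and 5 actually drive the target (illustrative)
y = 2 * X[:, 0] - 3 * X[:, 2] + X[:, 5] + rng.normal(0, 0.5, 200)

# Recursively drop the weakest feature until three remain
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("keep feature?", selector.support_)   # boolean mask over the 8 columns
print("ranking:", selector.ranking_)        # 1 = selected, higher = dropped earlier
```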
5. Know when enough is enough and early stopping
When you train a learning algorithm iteratively, you can measure how well each iteration of the model performs. Up to a certain number of iterations, new iterations improve the model; after that point, however, the model’s ability to generalize can weaken. Early stopping refers to stopping the training process before the learner passes that point. In the simplest case, training is stopped as soon as performance on the validation dataset is worse than it was at the prior training period.
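A minimal sketch of that simplest case follows, assuming scikit-learn's SGDRegressor trained one pass at a time on stand-in data; the model choice and stopping rule are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1.0, 300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
prev_err = np.inf

# One partial_fit call = one training pass (iteration) over the training data
for epoch in range(200):
    model.partial_fit(X_tr, y_tr)
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if val_err > prev_err:   # worse than the prior pass: stop training here
        print(f"stopping early at epoch {epoch}, validation MSE={val_err:.3f}")
        break
    prev_err = val_err
```

In practice you would usually allow some patience (several bad passes in a row) before stopping; many libraries, including scikit-learn's iterative learners, also expose early stopping as a built-in option.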
Eric will reveal how to update your S&OP process to incorporate predictive analytics to adapt to the changing retail landscape at IBF’s Business Planning, Forecasting & S&OP Conferences in Orlando (Oct 20-23) and Amsterdam (Nov 20-22). Join Eric and a host of forecasting, planning and analytics leaders for unparalleled learning and networking.