This website contains supplementary material for the manuscript:
Christoph Bergmeir and José M. Benítez. On the Use of Cross-validation for Time Series Predictor Evaluation

__________________________________________________________________

Abstract

In time series predictor evaluation, we observe that, with respect to the model selection procedure, there is a gap between the evaluation of traditional forecasting procedures on the one hand and the evaluation of machine learning techniques on the other hand. In traditional forecasting, it is common practice to reserve a part from the end of each time series for testing and to use the rest of the series for training. Thus, the data is not used to its full extent, but theoretical problems with respect to temporal evolutionary effects and dependencies within the data, as well as practical problems regarding missing values, are eliminated. On the other hand, when evaluating machine learning and other regression methods used for time series forecasting, cross-validation is often used for evaluation, paying little attention to the fact that those theoretical problems invalidate the fundamental assumptions of cross-validation. To close this gap and examine the consequences of different model selection procedures in practice, we have developed a rigorous and extensive empirical study. Six different model selection procedures, based on (i) cross-validation and (ii) evaluation using the series’ last part, are used to assess the performance of four machine learning and other regression techniques on synthetic and real-world time series. No practical consequences of the theoretical flaws were found during our study, but the use of cross-validation techniques led to a more robust model selection. To make use of the “best of both worlds”, we suggest that a blocked form of cross-validation become the standard procedure for time series predictor evaluation, thus using all available information and circumventing the theoretical problems.
_______________________________________________________________________________________________________

Contents

1 Design of the Experiments
 1.1 Applied Models and Algorithms
 1.2 Benchmarking Data
 1.3 Data Preparation and Partitioning
 1.4 Compared Model Selection Procedures
2 Statistical Evaluation
3 Plots of the Results
 3.1 Plots for scenario (AS1)
  3.1.1 (AS1) Box Plots
  3.1.2 (AS1) Point Plots Combined
  3.1.3 (AS1) Point Plots per Error Measure
 3.2 Plots for scenario (AS2)
  3.2.1 (AS2) Box Plots
  3.2.2 (AS2) Point Plots Combined
  3.2.3 (AS2) Point Plots per Error Measure
 3.3 Plots for scenario (AS3)
  3.3.1 (AS3) Box Plots
  3.3.2 (AS3) Point Plots Combined
  3.3.3 (AS3) Point Plots per Error Measure

1. Design of the Experiments

Our study examines the practical consequences of using different model selection procedures for time series predictor evaluation.

In order to cover a broad range of application situations, the experiments on the different model selection procedures are carried out using machine learning and general regression methods, synthetic and real-world datasets, and various error measures.

1.1. Applied Models and Algorithms

All methods used are available in packages within the statistical computing language R. We use the implementation of epsilon support vector regression with a radial kernel from LIBSVM (wrapped in R by the e1071 package), and a multi-layer perceptron from the nnet package. Furthermore, we use lasso regression from the lars package, and the linear model from base R. In the following, the methods are called svmRadial, nnet, lasso, and lm.

The methods are applied using the following parameter grids (a sketch of how such a grid search could be set up in R is given after the tables):

     size    decay
 1      3  0.00316
 2      5  0.00316
 3      9  0.00316
 4      3  0.01470
 5      5  0.01470
 6      9  0.01470
 7      3  0.10000
 8      5  0.10000
 9      9  0.10000

Table 1: Parameter grid for the neural network method (nnet).

        cost    gamma
 1      0.10     0.00
 2      1.00     0.00
 3     10.00     0.00
 4    100.00     0.00
 5   1000.00     0.00
 6      0.10     0.00
 7      1.00     0.00
 8     10.00     0.00
 9    100.00     0.00
10   1000.00     0.00
11      0.10     0.01
12      1.00     0.01
13     10.00     0.01
14    100.00     0.01
15   1000.00     0.01
16      0.10     0.20
17      1.00     0.20
18     10.00     0.20
19    100.00     0.20
20   1000.00     0.20

Table 2: Parameter grid for the support vector regression (svmRadial). Note that rows 1-5 and rows 6-10 show identical values at the displayed precision.

     fraction
 1       0.10
 2       0.36
 3       0.63
 4       0.90

Table 3: Parameter grid for the lasso regression (lasso).
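
The following sketch illustrates how such a grid search could be set up in R. The helper function fit_and_score, the variable names (x_train, y_train, x_test, y_test), and the use of the mean squared error as selection criterion are illustrative assumptions, not the authors' actual code; only the package interfaces (e1071::svm, nnet::nnet, lars::lars, and lm from base R) are the ones named above.

    library(e1071)   # support vector regression (LIBSVM wrapper)
    library(nnet)    # multi-layer perceptron
    library(lars)    # lasso regression

    # Illustrative parameter grids (cf. Tables 1-3; the gamma grid is approximated,
    # as two of the gamma values in Table 2 display as 0.00).
    nnet_grid  <- expand.grid(size = c(3, 5, 9), decay = c(0.00316, 0.0147, 0.1))
    svm_grid   <- expand.grid(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.00, 0.01, 0.20))
    lasso_grid <- data.frame(fraction = c(0.10, 0.36, 0.63, 0.90))

    # Fit one parameter combination on a training split and return the test MSE.
    # 'x_*' are matrices of lagged values, 'y_*' the corresponding targets.
    fit_and_score <- function(method, pars, x_train, y_train, x_test, y_test) {
      pred <- switch(method,
        svmRadial = predict(svm(x_train, y_train, kernel = "radial",
                                cost = pars$cost, gamma = pars$gamma), x_test),
        nnet      = predict(nnet(x_train, y_train, size = pars$size,
                                 decay = pars$decay, linout = TRUE, trace = FALSE), x_test),
        lasso     = predict(lars(x_train, y_train, type = "lasso"), x_test,
                            s = pars$fraction, mode = "fraction")$fit,
        lm        = predict(lm(y ~ ., data = data.frame(y = y_train, x_train)),
                            data.frame(x_test)))
      mean((y_test - pred)^2)
    }

Looping fit_and_score over the rows of, e.g., svm_grid for every training/test split produced by a model selection procedure, and averaging the resulting errors, would then give the in-set error estimate for each parameter combination.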

1.2. Benchmarking Data

We used synthetic and real-world data. All data is available in the KEEL-dataset repository (http://sci2s.ugr.es/keel/timeseries.php). The synthetic data comprises both linear and non-linear series. The real-world data is taken from the Santa Fe forecasting competition (http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html) and the NNGC1 competition (http://www.neural-forecasting-competition.com/datasets.htm).

Three scenarios, referred to as (AS1), (AS2), and (AS3) in the following, were analyzed within the study: (AS1) and (AS2) are based on the synthetic data, while (AS3) uses the real-world data.

1.3. Data Preparation and Partitioning

We withhold from the end of every series a fraction of values as “unknown future” for validation. In the following, we call this dataset the validation set, the out-of-sample set, or shortly the out-set, as it is completely withheld from all other model building and model selection processes. The remaining data is accordingly called the in-set. Throughout our experiments, we use pds = 0.8, i.e., 80 percent of the data forms the in-set and the remaining 20 percent is used as out-set.

The in-set data is used for model building and model selection in the following way: first, the lags lds to be used for forecasting are chosen. For the synthetic series, the lags are known, as they were specified during data generation; for the real-world data, they have to be estimated. To make it possible to use all model selection procedures (especially the procedure that removes dependent values), four lags are used for the real-world series. This seems reasonable, as the focus of this study is not the actual performance of the methods, but the performance of the model selection procedures.
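
The following minimal sketch illustrates this preparation in R for a single series, assuming a numeric vector series; the variable names and the use of stats::embed are illustrative choices, not the authors' preprocessing code.

    # Split a series into in-set (first 80 percent) and out-set (last 20 percent).
    pds  <- 0.8
    n    <- length(series)                 # 'series' is a hypothetical input vector
    n_in <- floor(pds * n)
    in_set  <- series[1:n_in]
    out_set <- series[(n_in + 1):n]

    # Embed the in-set with a fixed number of lags (four for the real-world series):
    # each row then contains the target value followed by its lagged predictors.
    lags <- 4
    emb  <- embed(in_set, lags + 1)
    y    <- emb[, 1]     # value to be forecast
    x    <- emb[, -1]    # lagged values used as predictors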

1.4. Compared Model Selection Procedures

We consider six different model selection strategies (a sketch of how the corresponding fold partitions could be constructed in R follows the list):

CV
Standard 5-fold cross-validation is applied, i.e., the embedded data is partitioned randomly into five sets, and in five turns each set is used as test set while the remaining sets are used for training.
blockedCV
5-fold cross-validation is applied on data that is partitioned not randomly, but sequentially into five sets.
noDepCV
5-fold cross-validation without the dependent values is applied: the sets generated for CV are used, but, according to the lags used for embedding, dependent values are removed from the training set. As stated earlier, depending on the lags, many values may have to be removed, so that this model selection procedure can only be applied if the number of lags is not large compared to the number of cross-validation subsets, i.e., in AS1 and AS3.
lastBlock
Only the last block of blockedCV is used for evaluation.
secondBlock
The second (rather than the last) block of blockedCV is used for evaluation.
secondCV
Only the second subset of CV is used for evaluation.
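
The sketch below illustrates how the test and training indices for these strategies could be constructed in R, under the same assumptions as the embedding sketch above (an embedded matrix emb and a lag order lags); it illustrates the idea only and is not the authors' implementation.

    k <- 5
    N <- nrow(emb)                       # number of embedded rows (see sketch above)

    # CV: random partition of the rows into k folds.
    cv_folds <- split(sample(N), rep(1:k, length.out = N))

    # blockedCV: sequential partition into k contiguous blocks.
    blocked_folds <- split(1:N, rep(1:k, each = ceiling(N / k), length.out = N))

    # noDepCV: the random CV folds are used as test sets, but all rows whose lag
    # windows overlap with a test row (i.e., rows within 'lags' positions of it)
    # are removed from the corresponding training set.
    no_dep_train <- lapply(cv_folds, function(test_idx) {
      dependent <- unique(unlist(lapply(test_idx, function(i) (i - lags):(i + lags))))
      setdiff(1:N, union(test_idx, dependent))
    })

    # lastBlock, secondBlock, secondCV: a single subset is used for evaluation.
    last_block   <- blocked_folds[[k]]
    second_block <- blocked_folds[[2]]
    second_cv    <- cv_folds[[2]]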

2. Statistical Evaluation

We perform an analysis of the medians and their differences in Tab. 4. The table shows that, with respect to under- or overestimation of the error, no difference between the model selection procedures can be found. The differences in the accuracy with which the in-set error predicts the out-set error are small, and vary with the characteristics of the data and the error measures. Thus, for example, the choice of the error measure seems more relevant than the choice of the model selection procedure. Tab. 5 shows the results of the Fligner-Killeen test, which is used to determine whether the distributions differ in their dispersion. Though the difference between lastBlock and the cross-validation procedures is not always statistically significant, the table clearly shows the trend that the difference between the last-block evaluation and the cross-validation methods is bigger than the differences among the cross-validation methods themselves.
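
As a sketch of how such a dispersion test can be run in R with stats::fligner.test, assume a hypothetical data frame results that holds one error ratio per series and procedure, with the ratio Eout-set∕Ein-set in a column ratio and the name of the model selection procedure in a column procedure (this layout is an assumption for illustration, not the format used in the study):

    # Fligner-Killeen test over all model selection procedures.
    fligner.test(ratio ~ procedure, data = results)

    # Pairwise comparison, e.g., blockedCV vs. lastBlock.
    pair <- subset(results, procedure %in% c("blockedCV", "lastBlock"))
    pair$procedure <- droplevels(factor(pair$procedure))
    fligner.test(ratio ~ procedure, data = pair)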










                  CV      bCV       lB    CV-lB   CV-bCV   bCV-lB

MSE      AS1   0.023    0.030    0.011    0.012   -0.007    0.019
         AS2   0.042    0.055    0.035    0.007   -0.013    0.020
         AS3   0.122    0.084    0.123   -0.001    0.038   -0.039

RMSE     AS1   0.012    0.015    0.006    0.006   -0.003    0.010
         AS2   0.022    0.027    0.017    0.005   -0.005    0.010
         AS3   0.061    0.046    0.060    0.002    0.015   -0.013

SSE      AS1   0.284    0.291    0.274    0.010   -0.007    0.017
         AS2   0.249    0.255    0.237    0.012   -0.006    0.018
         AS3   0.410    0.380    0.459   -0.048    0.030   -0.079

MAE      AS1   0.011    0.014    0.012   -0.001   -0.003    0.002
         AS2   0.028    0.033    0.034   -0.006   -0.005   -0.001
         AS3   0.065    0.043    0.045    0.020    0.022   -0.002

MDAE     AS1  -0.023   -0.013    0.015    0.008    0.010   -0.002
         AS2   0.053    0.052    0.085   -0.032    0.002   -0.033
         AS3   0.055    0.048    0.044    0.011    0.007    0.004

MAPE     AS1  -0.044   -0.055    0.082   -0.039   -0.011   -0.027
         AS2  -0.245   -0.268   -0.125    0.121   -0.023    0.144
         AS3       -        -        -        -        -        -

MDAPE    AS1  -0.012   -0.002    0.020   -0.007    0.011   -0.018
         AS2  -0.071   -0.070   -0.039    0.033    0.001    0.031
         AS3  -0.028   -0.033   -0.012    0.017   -0.005    0.022

SMAPE    AS1  -0.309   -0.346   -0.279    0.030   -0.037    0.066
         AS2  -0.292   -0.215   -0.719   -0.427    0.077   -0.504
         AS3  -0.755   -1.578   -0.013    0.741   -0.823    1.564

SMDAPE   AS1  -0.013   -0.005    0.048   -0.035    0.008   -0.044
         AS2   0.407    0.705    0.126    0.281   -0.298    0.579
         AS3  -0.723   -0.431   -0.003    0.720    0.292    0.428

MRAE     AS1  -0.326   -0.326   -0.209    0.117    0.000    0.117
         AS2  -0.289   -0.288   -0.094    0.195    0.000    0.194
         AS3       -        -        -        -        -        -

MDRAE    AS1  -0.060   -0.060   -0.059    0.001    0.000    0.001
         AS2  -0.074   -0.081   -0.059    0.014   -0.007    0.021
         AS3   0.065    0.041    0.060    0.005    0.025   -0.020

GMRAE    AS1  -0.089   -0.085   -0.062    0.027    0.004    0.023
         AS2  -0.073   -0.071   -0.059    0.015    0.002    0.013
         AS3       -        -        -        -        -        -

RELMAE   AS1  -0.067   -0.062   -0.069   -0.002    0.005   -0.007
         AS2  -0.072   -0.075   -0.098   -0.026   -0.003   -0.023
         AS3   0.022    0.003    0.005    0.017    0.019   -0.002

RELMSE   AS1  -0.096   -0.090   -0.118   -0.022    0.006   -0.028
         AS2  -0.137   -0.160   -0.181   -0.044   -0.023   -0.021
         AS3  -0.006   -0.013    0.009   -0.004   -0.007    0.003

Table 4: Medians and differences in the median. The columns are: CV, bCV, and lB: Median of (Eout-set∕Ein-set) values for the procedures CV, blockedCV, and lastBlock, diminished by one. The optimal ratio of the errors is one (which would result in a zero in the table), as then the in-set error equals the out-set error, and hence is a good estimate. Negative values in the table indicate a greater in-set error, i.e., the out-set error is overestimated. A positive value, on the contrary, indicates underestimation. CV-lB, CV-bCV, and bCV-lB: differences of the absolute values of CV, bCV, and lB. A negative value indicates that the minuend in the difference leads to a value nearer to one, that is, to a better estimate of the error.
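
For example, the CV entry for the MSE in scenario AS1 is 0.023, i.e., the median of (Eout-set∕Ein-set) is 1 + 0.023 = 1.023: in the median, the in-set error estimated by CV underestimates the out-set MSE by about 2.3 percent in this scenario.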









                 all  CV,lB,bCV    CV,lB   bCV,lB   CV,bCV

MSE      AS1   0.000      0.716    0.479    0.566    0.666
         AS2   0.000      0.154    0.243    0.058    0.469
         AS3   0.077      0.015    0.038    0.007    0.469

RMSE     AS1   0.000      0.675    0.448    0.508    0.707
         AS2   0.000      0.129    0.230    0.047    0.433
         AS3   0.078      0.015    0.028    0.008    0.644

SSE      AS1   0.000      0.880    0.662    0.758    0.760
         AS2   0.000      0.469    0.683    0.230    0.413
         AS3   0.082      0.018    0.039    0.009    0.508

MAE      AS1   0.000      0.000    0.000    0.000    0.264
         AS2   0.000      0.000    0.003    0.000    0.430
         AS3   0.037      0.044    0.057    0.026    0.661

MDAE     AS1   0.000      0.000    0.000    0.000    0.041
         AS2   0.000      0.000    0.000    0.000    0.902
         AS3   0.003      0.067    0.055    0.043    0.923

MAPE     AS1   0.003      0.021    0.033    0.011    0.650
         AS2   0.004      0.060    0.080    0.027    0.709
         AS3   0.133      0.673    0.414    0.681    0.950

MDAPE    AS1   0.000      0.005    0.018    0.004    0.444
         AS2   0.108      0.716    0.943    0.424    0.529
         AS3   0.005      0.008    0.007    0.013    0.784

SMAPE    AS1   0.394      0.601    0.769    0.410    0.376
         AS2   0.338      0.453    0.571    0.561    0.204
         AS3   0.006      0.003    0.005    0.001    0.913

SMDAPE   AS1   0.194      0.843    0.737    0.543    0.935
         AS2   0.450      0.235    0.121    0.159    0.795
         AS3   0.000      0.000    0.000    0.000    0.351

MRAE     AS1   0.012      0.007    0.007    0.008    0.849
         AS2   0.000      0.000    0.000    0.000    0.623
         AS3   0.455      0.273    0.258    0.136    0.703

MDRAE    AS1   0.000      0.009    0.020    0.007    0.701
         AS2   0.001      0.548    0.346    0.369    0.983
         AS3   0.054      0.302    0.256    0.163    0.737

GMRAE    AS1   0.000      0.000    0.001    0.000    0.324
         AS2   0.000      0.002    0.006    0.002    0.732
         AS3   0.001      0.164    0.138    0.101    0.723

RELMAE   AS1   0.000      0.000    0.002    0.000    0.307
         AS2   0.010      0.149    0.119    0.087    0.849
         AS3   0.005      0.097    0.115    0.050    0.618

RELMSE   AS1   0.000      0.000    0.000    0.000    0.650
         AS2   0.001      0.330    0.191    0.234    0.828
         AS3   0.001      0.022    0.018    0.020    0.934

Table 5: p-values of the Fligner test. First column (all): Fligner test for differences in variance, applied to the group of all model selection procedures (6 procedures for AS1 and AS3, and 5 procedures for AS2). Second column (CV,lB,bCV): test for the three procedures CV, lastBlock, and blockedCV. Columns 3-5: tests of interesting pairs of procedures (without application of a post-hoc procedure).
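
For example, for the MAE in scenario AS1 the pairwise tests CV,lB and bCV,lB both yield p-values that display as 0.000, whereas CV,bCV yields 0.264: the dispersion of the last-block evaluation differs significantly from that of both cross-validation variants, while the two cross-validation variants do not differ significantly from each other.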

3. Plots of the Results

The in-set error, as estimated by the model selection procedure, is compared to the error on the out-set. If the model selection procedure produces a good estimate of the error, the two errors should be very similar. Therefore, we analyze plots of the points (Ein-set,Eout-set). If the errors are equal, these points all lie on the line through the origin with slope one. In the following, we call this type of evaluation point plots.

In addition to the point plots, we analyze box-and-whisker plots of the quotient (Eout-set∕Ein-set). This is especially interesting for scale-dependent measures such as the RMSE, as the quotient provides a normalization that makes the results comparable across series.
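
A minimal base-R sketch of both plot types, again using the hypothetical results data frame introduced above (here with columns e_in, e_out, and procedure):

    # Point plot: in-set error vs. out-set error; points on the identity line
    # correspond to a perfect estimate of the out-set error.
    plot(results$e_in, results$e_out,
         col = as.integer(factor(results$procedure)),
         xlab = "in-set error", ylab = "out-set error")
    abline(0, 1, lty = 2)

    # Box plot of the quotient E_out-set / E_in-set per model selection procedure.
    boxplot(e_out / e_in ~ procedure, data = results,
            ylab = "out-set error / in-set error")
    abline(h = 1, lty = 2)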

Section 3.1.2 shows point plots for scenario AS1, using different error measures. It can be observed that the RELMAE yields a less scattered distribution than MDAPE and MDRAE. The in-set error tends to overestimate the out-set error, especially when using relative measures, i.e., RELMAE or MDRAE. No systematic difference between the model selection procedures can be determined in this plot. To examine the model selection procedures further, Section 3.1.3 shows the results of Section 3.1.2 in more detail, with the results of every model selection procedure in a separate plot. Section 3.1.1 shows the results of Section 3.1.3 as box plots. Within scenario AS2, noDepCV is not applicable any more. Point plots and box plots analogous to those of scenario AS1 are shown in Section 3.2. Section 3.3 shows the results of scenario AS3, where real-world data is used.

3.1. Plots for scenario (AS1)

3.1.1. (AS1) Box Plots




3.1.2. (AS1) Point Plots Combined




3.1.3. (AS1) Point Plots per Error Measure




3.2. Plots for scenario (AS2)

3.2.1. (AS2) Box Plots




3.2.2. (AS2) Point Plots Combined




3.2.3. (AS2) Point Plots per Error Measure




3.3. Plots for scenario (AS3)

3.3.1. (AS3) Box Plots




3.3.2. (AS3) Point Plots Combined




3.3.3. (AS3) Point Plots per Error Measure

