This Website contains complementary material to the
manuscript:
Christoph Bergmeir and José M. Benítez. On the Use
of Cross-validation for Time Series Predictor Evaluation
__________________________________________________________________
Abstract
In time series predictor evaluation, we observe that, with respect to the model
selection procedure, there is a gap between the evaluation of traditional forecasting
procedures on the one hand and the evaluation of machine learning techniques on the
other. In traditional forecasting, it is common practice to reserve a part from
the end of each time series for testing, and to use the rest of the series for training.
Thus, full use is not made of the data, but theoretical problems with respect to
temporal evolutionary effects and dependencies within the data, as well as
practical problems regarding missing values, are eliminated. On the other hand,
when evaluating machine learning and other regression methods used for
time series forecasting, cross-validation is often used for evaluation, paying
little attention to the fact that those theoretical problems invalidate the
fundamental assumptions of cross-validation. To close this gap and examine
the consequences of the different model selection procedures in practice, we
have carried out a rigorous and extensive empirical study. Six different model
selection procedures, based on (i) cross-validation and (ii) evaluation using the
series' last part, are used to assess the performance of four machine learning
and other regression techniques on synthetic and real-world time series.
No practical consequences of the theoretical flaws were found during our
study, but the use of cross-validation techniques led to a more robust model
selection. To make use of the "best of both worlds", we suggest that a blocked
form of cross-validation become the standard procedure for time series evaluation,
thus using all available information and circumventing the theoretical
problems.
_______________________________________________________________________________________________________
Contents
1. Design of the Experiments
Our study has the following objectives:
- To determine whether the dependency within the data has effects on the
cross-validation, e.g., in the way that the cross-validation procedure
systematically underestimates the error. This can be done by comparing
randomly chosen evaluation sets to blocked sets.
- To determine whether effects of temporal evolution can be found, by comparing
evaluations that use data from the end of the series to evaluations that use
data from somewhere in between. It has to be noted that in this study we
consider only (second-order) stationary series.
- To determine whether cross-validation yields a more robust error measure, by
making full use of the data in the sense that all data is used both for training
and testing.
In order to cover a broad range of application situations, the experiments on the
different model selection procedures are carried out using machine learning and
general regression methods, synthetic and real-world datasets, and various error
measures.
1.1. Applied Models and Algorithms
All methods used are available in packages within the statistical computing language
R. We use the implementation of an epsilon support vector regression algorithm with
a radial kernel from LIBSVM (wrapped in R by the e1071 package), and employ a
multi-layer perceptron from the nnet package. Furthermore, we use lasso regression
from the lars package, and the linear model from the R base package. In the
following, the methods are called svmRadial, nnet, lasso, and lm.
The methods are applied using the following parameter grids:
   | size | decay
 1 |  3   | 0.00316
 2 |  5   | 0.00316
 3 |  9   | 0.00316
 4 |  3   | 0.01470
 5 |  5   | 0.01470
 6 |  9   | 0.01470
 7 |  3   | 0.10000
 8 |  5   | 0.10000
 9 |  9   | 0.10000

Table 1: Parameter grid for the neural network method.
    | cost    | gamma
  1 | 0.10    | 0.00
  2 | 1.00    | 0.00
  3 | 10.00   | 0.00
  4 | 100.00  | 0.00
  5 | 1000.00 | 0.00
  6 | 0.10    | 0.00
  7 | 1.00    | 0.00
  8 | 10.00   | 0.00
  9 | 100.00  | 0.00
 10 | 1000.00 | 0.00
 11 | 0.10    | 0.01
 12 | 1.00    | 0.01
 13 | 10.00   | 0.01
 14 | 100.00  | 0.01
 15 | 1000.00 | 0.01
 16 | 0.10    | 0.20
 17 | 1.00    | 0.20
 18 | 10.00   | 0.20
 19 | 100.00  | 0.20
 20 | 1000.00 | 0.20

Table 2: Parameter grid for the support vector regression.
   | fraction
 1 | 0.10
 2 | 0.36
 3 | 0.63
 4 | 0.90

Table 3: Parameter grid for the lasso regression.
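As an illustration of how these methods can be invoked in R, the following minimal, self-contained sketch fits each of the four models on a toy lag-embedded data set; the series, the chosen grid points, and the variable names are assumptions for demonstration only and are not taken from the original experiments.

```r
## Minimal, self-contained sketch (not the original experiment code):
## fitting the four methods on a toy lag-embedded data frame.
library(e1071)   # svm(): epsilon-SVR, wraps LIBSVM
library(nnet)    # nnet(): single-hidden-layer perceptron
library(lars)    # lars(): lasso regression

set.seed(1)
x   <- arima.sim(list(ar = 0.7), n = 200)   # toy AR(1) series (illustrative)
m   <- embed(as.numeric(x), 5)              # target plus 4 lags
dat <- data.frame(y = m[, 1], m[, -1])

## the parameter values below are single points from the grids in Tab. 1-3
fit_svm   <- svm(y ~ ., data = dat, type = "eps-regression",
                 kernel = "radial", cost = 10, gamma = 0.01)    # svmRadial
fit_nnet  <- nnet(y ~ ., data = dat, size = 5, decay = 0.1,
                  linout = TRUE, trace = FALSE)                 # nnet
fit_lasso <- lars(as.matrix(dat[, -1]), dat$y, type = "lasso")  # lasso
fit_lm    <- lm(y ~ ., data = dat)                              # lm

## for lars, the `fraction` parameter of Tab. 3 enters at prediction time
pred_lasso <- predict(fit_lasso, as.matrix(dat[, -1]),
                      s = 0.63, mode = "fraction")$fit
```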
1.2. Benchmarking Data
We used synthetic and real-world data. All data is available in the KEEL-dataset
repository (http://sci2s.ugr.es/keel/timeseries.php). The synthetic data comprises
both linear and non-linear series. The real-world data is taken from the Santa Fe
forecasting competition (http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html)
and from the NNGC1 competition
(http://www.neural-forecasting-competition.com/datasets.htm).
The following three scenarios were analyzed within the study:
- For application scenario one (AS1), synthetic data with significant
autocorrelations in only a few, small lags (between one and five) were
considered.
- To analyze the behavior of the methods on time series that have
autocorrelations in more and larger lags, application scenario two (AS2)
uses synthetic data with autocorrelations in the last 10 to 30 lags.
- In application scenario three (AS3), the real-world data was considered
(using four lags).
1.3. Data Preparation and Partitioning
We withhold from every series a part of the values at the end as "unknown future" for
validation. In the following, we call this dataset the validation set, the
out-of-sample set, or shortly the out-set, as it is completely withheld from all other
model building and model selection processes. The remaining data is accordingly
called the in-set. Throughout our experiments, we use pds = 0.8, i.e., 80 percent of
the data forms the in-set and the remaining 20 percent is used as out-set.
The in-set data is used for model building and model selection in the following
way: the lags lds to be used for forecasting are chosen. For the synthetic series, the
lags are known, as they were specified during data generation; for the real-world
data, they have to be estimated. To make it possible to use all model selection
procedures (especially the procedure that removes dependent values), four lags are
used for the real-world series. This seems feasible, as the focus of this study lies not
in the actual performance of the methods, but in the performance of the model selection
procedures.
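The following sketch illustrates, under stated assumptions, how a series could be split into in-set and out-set with pds = 0.8 and how the in-set could be embedded with four lags; the sunspot.year series and the helper embed_series are illustrative choices, not part of the original study.

```r
## Illustrative sketch: in-set/out-set split (pds = 0.8) and 4-lag embedding.
embed_series <- function(x, lags = 4) {
  m <- embed(x, lags + 1)            # columns: x_t, x_{t-1}, ..., x_{t-lags}
  data.frame(y = m[, 1], m[, -1])
}

series  <- as.numeric(sunspot.year)  # any univariate series (illustrative)
n       <- length(series)
n_in    <- floor(0.8 * n)            # pds = 0.8: 80% in-set, 20% out-set
in_set  <- series[1:n_in]
out_set <- series[(n_in + 1):n]
dat     <- embed_series(in_set, lags = 4)
```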
1.4. Compared Model Selection Procedures
We consider six different model selection strategies:
- CV: standard 5-fold cross-validation is applied, i.e., the embedded data is
partitioned randomly into five sets, and in five turns every set is used once as
test set, while the other sets are used for training (a sketch of the fold
construction is given after this list).
- blockedCV: 5-fold cross-validation is applied on data that is partitioned not
randomly, but sequentially into five sets.
- noDepCV: 5-fold cross-validation without the dependent values is applied: the sets
generated for CV are used, but, according to the lags used for embedding,
dependent values are removed from the training set. As stated earlier,
depending on the lags used, many values have to be removed, so that this
model selection procedure can only be applied if the number of lags is not
large compared to the number of cross-validation subsets, i.e., in AS1 and
AS3.
- lastBlock: only the last set of blockedCV is used for evaluation.
- secondBlock: not the last but the second block of blockedCV is used for
evaluation.
- secondCV: only the second subset of CV is used.
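As a rough illustration (not the authors' original code), the sketch below shows how the five test-set index sets for CV and blockedCV could be constructed in R; the in-set size n_in and the helper make_folds are hypothetical.

```r
## Rough illustration: test-set index sets for CV (random) and
## blockedCV (sequential) 5-fold cross-validation.
make_folds <- function(n, k = 5, blocked = FALSE) {
  if (blocked) {
    split(seq_len(n), cut(seq_len(n), breaks = k, labels = FALSE))
  } else {
    split(sample(seq_len(n)), rep_len(seq_len(k), n))
  }
}

set.seed(1)
n_in          <- 100                               # hypothetical in-set size
folds_cv      <- make_folds(n_in)                  # CV
folds_blocked <- make_folds(n_in, blocked = TRUE)  # blockedCV
## lastBlock evaluates only on folds_blocked[[5]], secondBlock on
## folds_blocked[[2]], and secondCV on folds_cv[[2]].
```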
2. Statistical Evaluation
We perform an analysis of the medians and their differences in Tab. 4. The table
shows that, with respect to under- or overestimation of the error, no difference
between the model selection procedures can be found. The differences in the accuracy
with which the in-set error predicts the out-set error are small, and vary with the
characteristics of the data and the error measures. The choice of the error measure,
for example, seems more relevant than the choice of the model selection procedure.
Tab. 5 shows the results of the Fligner-Killeen test, which is used to determine
whether the distributions differ in their dispersion. Though the difference between
lastBlock and the cross-validation procedures is not always statistically significant,
the table clearly shows the trend that the difference between the last-block evaluation
and the cross-validation methods is bigger than the differences among the
cross-validation methods.
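A minimal sketch of how a Fligner-Killeen test such as the ones reported in Tab. 5 can be run in R is given below; the vectors ratios and procedure are hypothetical placeholders for the (Eout-set/Ein-set) values and the corresponding model selection procedures.

```r
## Minimal sketch of a Fligner-Killeen test on hypothetical error ratios.
ratios    <- c(1.02, 0.97, 1.10, 0.95, 1.01, 1.20, 0.80, 1.30, 0.99)
procedure <- factor(c("CV", "CV", "CV", "bCV", "bCV", "bCV", "lB", "lB", "lB"))
fligner.test(ratios ~ procedure)   # tests homogeneity of variances across groups
```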
        |     |     CV |    bCV |     lB |  CV-lB | CV-bCV | bCV-lB
MSE     | AS1 |  0.023 |  0.030 |  0.011 |  0.012 | -0.007 |  0.019
        | AS2 |  0.042 |  0.055 |  0.035 |  0.007 | -0.013 |  0.020
        | AS3 |  0.122 |  0.084 |  0.123 | -0.001 |  0.038 | -0.039
RMSE    | AS1 |  0.012 |  0.015 |  0.006 |  0.006 | -0.003 |  0.010
        | AS2 |  0.022 |  0.027 |  0.017 |  0.005 | -0.005 |  0.010
        | AS3 |  0.061 |  0.046 |  0.060 |  0.002 |  0.015 | -0.013
SSE     | AS1 |  0.284 |  0.291 |  0.274 |  0.010 | -0.007 |  0.017
        | AS2 |  0.249 |  0.255 |  0.237 |  0.012 | -0.006 |  0.018
        | AS3 |  0.410 |  0.380 |  0.459 | -0.048 |  0.030 | -0.079
MAE     | AS1 |  0.011 |  0.014 |  0.012 | -0.001 | -0.003 |  0.002
        | AS2 |  0.028 |  0.033 |  0.034 | -0.006 | -0.005 | -0.001
        | AS3 |  0.065 |  0.043 |  0.045 |  0.020 |  0.022 | -0.002
MDAE    | AS1 | -0.023 | -0.013 |  0.015 |  0.008 |  0.010 | -0.002
        | AS2 |  0.053 |  0.052 |  0.085 | -0.032 |  0.002 | -0.033
        | AS3 |  0.055 |  0.048 |  0.044 |  0.011 |  0.007 |  0.004
MAPE    | AS1 | -0.044 | -0.055 |  0.082 | -0.039 | -0.011 | -0.027
        | AS2 | -0.245 | -0.268 | -0.125 |  0.121 | -0.023 |  0.144
        | AS3 |      – |      – |      – |      – |      – |      –
MDAPE   | AS1 | -0.012 | -0.002 |  0.020 | -0.007 |  0.011 | -0.018
        | AS2 | -0.071 | -0.070 | -0.039 |  0.033 |  0.001 |  0.031
        | AS3 | -0.028 | -0.033 | -0.012 |  0.017 | -0.005 |  0.022
SMAPE   | AS1 | -0.309 | -0.346 | -0.279 |  0.030 | -0.037 |  0.066
        | AS2 | -0.292 | -0.215 | -0.719 | -0.427 |  0.077 | -0.504
        | AS3 | -0.755 | -1.578 | -0.013 |  0.741 | -0.823 |  1.564
SMDAPE  | AS1 | -0.013 | -0.005 |  0.048 | -0.035 |  0.008 | -0.044
        | AS2 |  0.407 |  0.705 |  0.126 |  0.281 | -0.298 |  0.579
        | AS3 | -0.723 | -0.431 | -0.003 |  0.720 |  0.292 |  0.428
MRAE    | AS1 | -0.326 | -0.326 | -0.209 |  0.117 |  0.000 |  0.117
        | AS2 | -0.289 | -0.288 | -0.094 |  0.195 |  0.000 |  0.194
        | AS3 |      – |      – |      – |      – |      – |      –
MDRAE   | AS1 | -0.060 | -0.060 | -0.059 |  0.001 |  0.000 |  0.001
        | AS2 | -0.074 | -0.081 | -0.059 |  0.014 | -0.007 |  0.021
        | AS3 |  0.065 |  0.041 |  0.060 |  0.005 |  0.025 | -0.020
GMRAE   | AS1 | -0.089 | -0.085 | -0.062 |  0.027 |  0.004 |  0.023
        | AS2 | -0.073 | -0.071 | -0.059 |  0.015 |  0.002 |  0.013
        | AS3 |      – |      – |      – |      – |      – |      –
RELMAE  | AS1 | -0.067 | -0.062 | -0.069 | -0.002 |  0.005 | -0.007
        | AS2 | -0.072 | -0.075 | -0.098 | -0.026 | -0.003 | -0.023
        | AS3 |  0.022 |  0.003 |  0.005 |  0.017 |  0.019 | -0.002
RELMSE  | AS1 | -0.096 | -0.090 | -0.118 | -0.022 |  0.006 | -0.028
        | AS2 | -0.137 | -0.160 | -0.181 | -0.044 | -0.023 | -0.021
        | AS3 | -0.006 | -0.013 |  0.009 | -0.004 | -0.007 |  0.003

Table 4: Medians and differences in the median. The columns CV, bCV, and lB give
the median of the (Eout-set/Ein-set) values for the procedures CV, blockedCV, and
lastBlock, diminished by one. The optimal ratio of the errors is one (which would
result in a zero in the table), as then the in-set error equals the out-set error, and
hence is a good estimate. Negative values in the table indicate a greater in-set error,
i.e., the out-set error is overestimated. A positive value, on the contrary, indicates
underestimation. CV-lB, CV-bCV, and bCV-lB are the differences of the absolute
values of CV, bCV, and lB. A negative value indicates that the minuend in the
difference leads to a value nearer to one, that is, to a better estimate of the error.
        |     |   all | CV,lB,bCV | CV,lB | bCV,lB | CV,bCV
MSE     | AS1 | 0.000 |     0.716 | 0.479 |  0.566 |  0.666
        | AS2 | 0.000 |     0.154 | 0.243 |  0.058 |  0.469
        | AS3 | 0.077 |     0.015 | 0.038 |  0.007 |  0.469
RMSE    | AS1 | 0.000 |     0.675 | 0.448 |  0.508 |  0.707
        | AS2 | 0.000 |     0.129 | 0.230 |  0.047 |  0.433
        | AS3 | 0.078 |     0.015 | 0.028 |  0.008 |  0.644
SSE     | AS1 | 0.000 |     0.880 | 0.662 |  0.758 |  0.760
        | AS2 | 0.000 |     0.469 | 0.683 |  0.230 |  0.413
        | AS3 | 0.082 |     0.018 | 0.039 |  0.009 |  0.508
MAE     | AS1 | 0.000 |     0.000 | 0.000 |  0.000 |  0.264
        | AS2 | 0.000 |     0.000 | 0.003 |  0.000 |  0.430
        | AS3 | 0.037 |     0.044 | 0.057 |  0.026 |  0.661
MDAE    | AS1 | 0.000 |     0.000 | 0.000 |  0.000 |  0.041
        | AS2 | 0.000 |     0.000 | 0.000 |  0.000 |  0.902
        | AS3 | 0.003 |     0.067 | 0.055 |  0.043 |  0.923
MAPE    | AS1 | 0.003 |     0.021 | 0.033 |  0.011 |  0.650
        | AS2 | 0.004 |     0.060 | 0.080 |  0.027 |  0.709
        | AS3 | 0.133 |     0.673 | 0.414 |  0.681 |  0.950
MDAPE   | AS1 | 0.000 |     0.005 | 0.018 |  0.004 |  0.444
        | AS2 | 0.108 |     0.716 | 0.943 |  0.424 |  0.529
        | AS3 | 0.005 |     0.008 | 0.007 |  0.013 |  0.784
SMAPE   | AS1 | 0.394 |     0.601 | 0.769 |  0.410 |  0.376
        | AS2 | 0.338 |     0.453 | 0.571 |  0.561 |  0.204
        | AS3 | 0.006 |     0.003 | 0.005 |  0.001 |  0.913
SMDAPE  | AS1 | 0.194 |     0.843 | 0.737 |  0.543 |  0.935
        | AS2 | 0.450 |     0.235 | 0.121 |  0.159 |  0.795
        | AS3 | 0.000 |     0.000 | 0.000 |  0.000 |  0.351
MRAE    | AS1 | 0.012 |     0.007 | 0.007 |  0.008 |  0.849
        | AS2 | 0.000 |     0.000 | 0.000 |  0.000 |  0.623
        | AS3 | 0.455 |     0.273 | 0.258 |  0.136 |  0.703
MDRAE   | AS1 | 0.000 |     0.009 | 0.020 |  0.007 |  0.701
        | AS2 | 0.001 |     0.548 | 0.346 |  0.369 |  0.983
        | AS3 | 0.054 |     0.302 | 0.256 |  0.163 |  0.737
GMRAE   | AS1 | 0.000 |     0.000 | 0.001 |  0.000 |  0.324
        | AS2 | 0.000 |     0.002 | 0.006 |  0.002 |  0.732
        | AS3 | 0.001 |     0.164 | 0.138 |  0.101 |  0.723
RELMAE  | AS1 | 0.000 |     0.000 | 0.002 |  0.000 |  0.307
        | AS2 | 0.010 |     0.149 | 0.119 |  0.087 |  0.849
        | AS3 | 0.005 |     0.097 | 0.115 |  0.050 |  0.618
RELMSE  | AS1 | 0.000 |     0.000 | 0.000 |  0.000 |  0.650
        | AS2 | 0.001 |     0.330 | 0.191 |  0.234 |  0.828
        | AS3 | 0.001 |     0.022 | 0.018 |  0.020 |  0.934

Table 5: p-values of the Fligner test. First column: Fligner test for differences in
variance, applied to the group of all model selection procedures (6 procedures for
AS1 and AS3, and 5 procedures for AS2). Second column: test for the three
mentioned methods. Columns 3-5: tests of interesting pairs of methods (without
application of a post-hoc procedure).
3. Plots of the Results
The in-set error, estimated by the model selection procedure, is compared to the
error on the out-set. If the model selection procedure produces a good estimate of
the error, the two errors should be very similar. Therefore, we analyze plots of the
points (Ein-set, Eout-set). If the errors are equal, these points all lie on a line
through the origin with gradient one. In the following, we call this type of evaluation
point plots.
In addition to the point plots, we analyze box-and-whisker plots that directly show
the value of the quotient (Eout-set/Ein-set). This is especially interesting when
scale-dependent measures like the RMSE are used, as the quotient acts as a
normalization, so that the results become comparable.
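The following sketch illustrates, with hypothetical error values, how such a point plot and the corresponding box plot of the quotients could be produced in R; it is not the code used to generate the plots in the following sections.

```r
## Illustrative sketch with hypothetical in-set and out-set errors.
set.seed(1)
e_in  <- runif(50, 0.5, 1.5)              # hypothetical in-set errors
e_out <- e_in + rnorm(50, sd = 0.1)       # hypothetical out-set errors

## point plot: (E_in-set, E_out-set) pairs and the identity line
plot(e_in, e_out, xlab = "E_in-set", ylab = "E_out-set")
abline(a = 0, b = 1, lty = 2)             # perfect estimate: gradient one

## box plot of the quotients (E_out-set / E_in-set)
boxplot(e_out / e_in, ylab = "E_out-set / E_in-set")
```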
Section 3.1.2 shows point plots for scenario AS1, using different error measures. It
can be observed that the RELMAE yields a less scattered distribution than MDAPE
and MDRAE. The in-set error tends to overestimate the out-set error, especially
when relative measures, i.e., RELMAE or MDRAE, are used. No systematic difference
between the different model selection procedures can be determined in this plot. To
further examine the different model selection procedures, Section 3.1.3 shows the
results of Section 3.1.2 in more detail, with the results of every model selection
procedure in a separate plot. Section 3.1.1 shows the results of Section 3.1.3 as box
plots. Within scenario AS2, noDepCV is not applicable any more. Point plots and
box plots analogous to those of scenario AS1 are shown in Section 3.2. Section 3.3
shows the results of scenario AS3, where real-world data is used.
3.1. Plots for scenario (AS1)
3.1.1. (AS1) Box Plots
3.1.2. (AS1) Point Plots Combined
3.1.3. (AS1) Point Plots per Error Measure
3.2. Plots for scenario (AS2)
3.2.1. (AS2) Box Plots
3.2.2. (AS2) Point Plots Combined
3.2.3. (AS2) Point Plots per Error Measure
3.3. Plots for scenario (AS3)
3.3.1. (AS3) Box Plots
3.3.2. (AS3) Point Plots Combined
3.3.3. (AS3) Point Plots per Error Measure