Abstract: | The 2018 M4 Forecasting Competition was the first M Competition to elicit prediction intervals in addition to point estimates. We take a closer look at the twenty valid interval submissions by examining the calibration and accuracy of their prediction intervals and evaluating their performance over different time horizons. Overall, the submissions fail to estimate the uncertainty properly. Importantly, we investigate the benefits of interval combination using six recently proposed heuristics that can be applied before the realizations of the quantities are known. Our results suggest that interval aggregation improves both calibration and accuracy. While averaging interval endpoints remains appealing in practice because it is simple to implement and performs quite well when data sets are large, the median and the interior trimmed average are found to be robust aggregators for the prediction interval submissions across all 100,000 time series. |
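
To make the three aggregators named in the abstract concrete, the following is a minimal Python sketch of how a set of submitted intervals for one series and horizon might be combined. The function names, the sample intervals, and the trimming parameter `k` are hypothetical and are not taken from the paper. In particular, the interior trimmed average is written here under an assumed definition (discard the `k` largest lower bounds and the `k` smallest upper bounds, then average what remains); the paper's exact specification may differ.

```python
from statistics import mean, median


def average_endpoints(intervals):
    """Average the lower bounds and the upper bounds separately."""
    lowers, uppers = zip(*intervals)
    return mean(lowers), mean(uppers)


def median_endpoints(intervals):
    """Take the median lower bound and the median upper bound."""
    lowers, uppers = zip(*intervals)
    return median(lowers), median(uppers)


def interior_trimmed_average(intervals, k=1):
    """Assumed definition: drop the k largest lower bounds and the
    k smallest upper bounds, then average the remaining endpoints.
    Compared with the plain average, this can only widen the interval."""
    if not 0 <= k < len(intervals):
        raise ValueError("k must satisfy 0 <= k < number of intervals")
    lowers = sorted(lo for lo, _ in intervals)
    uppers = sorted(up for _, up in intervals)
    kept_lowers = lowers[: len(lowers) - k]  # discard the k largest lower bounds
    kept_uppers = uppers[k:]                 # discard the k smallest upper bounds
    return mean(kept_lowers), mean(kept_uppers)


# Hypothetical intervals from four forecasters for a single quantity.
submissions = [(90.0, 110.0), (95.0, 105.0), (80.0, 130.0), (98.0, 102.0)]

avg = average_endpoints(submissions)
med = median_endpoints(submissions)
trimmed = interior_trimmed_average(submissions, k=1)
```

None of the three functions uses the realized value, which matches the abstract's requirement that the heuristics be applicable before the outcomes are known.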