Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors.
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
The simplest networks contain no hidden layers and are equivalent to linear regressions. The figure below shows the neural network version of a linear regression with four predictors. The coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimizes a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression which is a much more efficient method of training the model.
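This equivalence is easy to verify numerically. The sketch below (base R, with simulated data since no dataset has been introduced yet) fits the zero-hidden-layer "network" by ordinary least squares: lm() is exactly the MSE-minimizing learning algorithm in this special case.

```r
# A network with no hidden layer and a linear output is just linear
# regression: minimizing the MSE over the weights gives the OLS fit.
set.seed(123)
X <- matrix(rnorm(100 * 4), ncol = 4)                  # four predictors (inputs)
y <- drop(2 + X %*% c(1, -0.5, 0.3, 0)) + rnorm(100)   # known "weights" plus noise

fit <- lm(y ~ X)   # least squares = the MSE-minimizing "learning algorithm"
coef(fit)          # intercept (bias) followed by the four estimated weights
```

The estimated coefficients should land close to the true weights (2, 1, -0.5, 0.3, 0) used to simulate the data.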
Once we add an intermediate layer with hidden neurons, the neural network becomes nonlinear. A simple example is shown in the figure below.
This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output. For example, in the diagram above, the inputs into hidden neuron \(j\) are combined linearly to give,
\[z_j=b_j + \sum_{k=1}^4 w_{k,j} x_k, \qquad j=1,2,3\] When using a feed-forward neural network to forecast a time series, the inputs can be previously observed values of the time series \(y_t\). This is known as a neural network autoregression. For example, we might use the previous four values to forecast the next value of the time series, i.e. the inputs would be \(y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}\). The hidden nodes would then be given by,
\[z_j=b_j + \sum_{k=1}^4 w_{k,j} y_{t-k}\]
In the hidden layer, an activation function is applied to the hidden nodes. The activation function is a nonlinear function such as a sigmoid/logistic function,
\[\phi(z) = \frac{1}{1+e^{-z}}\] that is applied to give the input to the next layer of the neural network. Application of this function tends to reduce the effect of extreme input values, thus making the entire network more robust to the effect of outliers.
The output layer from the neural network is another linear combination of the hidden nodes after the activation function \(\phi(z)\) has been applied.
\[y_t = a + \sum_{j=1}^3 h_j \phi(z_j)\] or in terms of the inputs \(y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}\) we have,
\[y_t = a + \sum_{j=1}^3 h_j\, \phi\!\left(b_j + \sum_{k=1}^4 w_{k,j} y_{t-k}\right)\] The parameters or weights \(a, h_1, h_2, h_3, b_1, b_2, b_3, w_{1,1}, w_{1,2}, \dots, w_{3,3}, w_{4,3}\) are “learned” from the data. The weights to be estimated are generally restricted to prevent them from becoming too large. The tuning parameter that restricts the weights is known as the “decay parameter”, and is often set to be equal to \(0.1\).
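The forward pass in this equation can be written directly in a few lines of base R. The weight values below are arbitrary illustrative numbers, not weights learned from any data:

```r
# Forward pass of a 4-input, 3-hidden-node, 1-output network, i.e. the
# NNAR(4,3) structure above. All weights are arbitrary illustrative
# values, not learned from data.
phi <- function(z) 1 / (1 + exp(-z))      # logistic activation

set.seed(1)
W <- matrix(runif(12, -1, 1), 4, 3)       # w_{k,j}: input k -> hidden node j
b <- c(0.1, -0.2, 0.3)                    # hidden-node biases b_j
h <- c(0.5, 1.0, -0.5)                    # output-layer weights h_j
a <- 2                                    # output intercept

nn_forecast <- function(x, W, b, h, a) {
  z <- b + drop(crossprod(W, x))          # z_j = b_j + sum_k w_{k,j} x_k
  a + sum(h * phi(z))                     # y_t = a + sum_j h_j * phi(z_j)
}

nn_forecast(c(269, 321, 585, 871), W, b, h, a)  # one-step "forecast"
```

Note how the large inputs saturate the logistic function, illustrating how the activation bounds the influence of extreme values.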
The weights are “learned” by first choosing random initial values, which are then updated using the observed data. Because the weights are chosen randomly at the start, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are then averaged. The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance. Cross-validation methods can be used to help choose “optimal” values for the model complexity.
With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in a linear autoregression model, \(AR(p)\). We call this a neural network autoregression or NNAR model.
We will only consider feed-forward neural networks with one hidden layer, and we use the notation \(NNAR(p,k)\) to indicate there are \(p\) lagged inputs (\(y_{t-1},\dots,y_{t-p}\)) and \(k\) nodes in the one hidden layer. For example, the \(NNAR(4,3)\) model shown above is a neural network with the last four observations (\(y_{t-1},y_{t-2},y_{t-3},y_{t-4}\)) as inputs for forecasting the output \(y_t\), and with three neurons in the hidden layer. An \(NNAR(p,0)\) model is equivalent to an \(ARIMA(p,0,0)\) model, but without the restrictions on the parameters that ensure stationarity, i.e.
\[y_t = a + \sum_{k=1}^{p} w_k y_{t-k} + \epsilon_t\]
With seasonal data, it is useful to also add the last observed values from the same season as inputs. For example, if \(y_t\) is the value of the time series in July, we would include \(y_{t-12}\), the value of the time series the previous July, as an input. We could also include the July value from two years prior, \(y_{t-2\times 12}\). In general, we could use up to \(P\) of the previous seasonal values as inputs, i.e. \(y_{t-m},y_{t-2m},\dots ,y_{t-Pm}\). The notation for a neural network model with seasonal inputs is \(NNAR(p,P,k)_m\). This model has inputs \((y_{t-1},\dots,y_{t-p},y_{t-m},\dots,y_{t-Pm})\) and \(k\) neurons in the hidden layer. An \(NNAR(p,P,0)_m\) model is equivalent to an \(ARIMA(p,0,0)(P,0,0)_m\) model but without parameter constraints to ensure stationarity.
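Assembling the input matrix for such a model is just a matter of aligning lagged copies of the series. A base-R sketch (the function name and the choices \(p=2\), \(P=1\), \(m=12\) are mine, purely for illustration):

```r
# Sketch: build the input matrix for an NNAR(p, P, k)_m model from a
# series y using base R. Each row t holds the non-seasonal lags
# y_{t-1}, ..., y_{t-p} and the seasonal lags y_{t-m}, ..., y_{t-Pm}.
make_nnar_inputs <- function(y, p = 2, P = 1, m = 12) {
  n      <- length(y)
  lags   <- c(seq_len(p), m * seq_len(P))      # 1, 2, ..., p, m, 2m, ...
  maxlag <- max(lags)
  t_idx  <- (maxlag + 1):n                     # rows with a full history
  X <- sapply(lags, function(L) y[t_idx - L])  # one column per lag
  colnames(X) <- paste0("lag", lags)
  list(X = X, y = y[t_idx])                    # inputs and matching targets
}

out <- make_nnar_inputs(sin(1:60), p = 2, P = 1, m = 12)
dim(out$X)   # 48 rows (60 - 12 usable time points), 3 columns: lag1, lag2, lag12
```

Packages like forecast do this alignment internally; the sketch just makes the input structure explicit.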
The R function nnetar in the fpp2 library will fit an \(NNAR(p,P,k)_m\) model to a time series and can be used to forecast future values of the series. Generally we need to specify the values of \(p\), \(P\), and \(k\). If we do not specify the number of hidden nodes \(k\), the nnetar function will use \(k=(p+P+1)/2\), rounded to the nearest integer. When it comes to forecasting, the network is applied iteratively. For forecasting one step ahead, we simply use the historical inputs. For forecasting two steps ahead, however, we first have to obtain the one-step-ahead forecast \(\hat{y}_{T+1}\) and then treat this forecast as a historical value to obtain the next forecast \(\hat{y}_{T+2}\). This process is repeated iteratively to obtain an \(h\)-step-ahead forecast \(\hat{y}_{T+h}\).
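The iterative forecasting scheme can be sketched in a few lines of base R. Here one_step stands in for the fitted network \(f\); the simple mean of the last \(p\) values is used purely as a placeholder so the mechanics are visible:

```r
# Iterative h-step forecasting: each forecast is appended to the history
# and fed back in as an input for the next step. `one_step` stands in
# for the fitted network; here it is just the mean of the last p values.
iterate_forecast <- function(y, h, p = 4,
                             one_step = function(x) mean(x)) {
  hist <- as.numeric(y)
  fc <- numeric(h)
  for (i in seq_len(h)) {
    inputs <- tail(hist, p)        # y_{T+i-1}, ..., y_{T+i-p}
    fc[i]  <- one_step(inputs)     # hat{y}_{T+i}
    hist   <- c(hist, fc[i])       # treat the forecast as history
  }
  fc
}

iterate_forecast(c(1, 2, 3, 4), h = 3)  # 2.5, 2.875, 3.09375
```

The second and third forecasts depend on the first, which is exactly why multi-step NNAR forecast uncertainty grows with the horizon.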
These data are contained in the time series lynx in the fpp2 library and represent the annual number of lynx trapped in the McKenzie River district of northwest Canada from 1821-1934.
require(fpp2)
## Loading required package: fpp2
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
lynx
## Time Series:
## Start = 1821
## End = 1934
## Frequency = 1
## [1] 269 321 585 871 1475 2821 3928 5943 4950 2577 523 98 184 279
## [15] 409 2285 2685 3409 1824 409 151 45 68 213 546 1033 2129 2536
## [29] 957 361 377 225 360 731 1638 2725 2871 2119 684 299 236 245
## [43] 552 1623 3311 6721 4254 687 255 473 358 784 1594 1676 2251 1426
## [57] 756 299 201 229 469 736 2042 2811 4431 2511 389 73 39 49
## [71] 59 188 377 1292 4031 3495 587 105 153 387 758 1307 3465 6991
## [85] 6313 3794 1836 345 382 808 1388 2713 3800 3091 2985 3790 674 81
## [99] 80 108 229 399 1132 2432 3574 2935 1537 529 485 662 1000 1590
## [113] 2657 3396
autoplot(lynx) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Annual Canadian Lynx Trappings (1821-1934)")
ggtsdisplay(lynx)
These data appear to be seasonal; however, the peaks and troughs do not occur with a constant frequency \(m\). Furthermore, these are annual counts, so the behavior we are seeing would be characterized as cyclic, but NOT seasonal. Previous methods used for forecasting these data have provided disappointing results. We will now consider using a neural network autoregression to make forecasts for these data.
nn1 = nnetar(lynx,p=4,P=0,size=2)
nn1
## Series: lynx
## Model: NNAR(4,2)
## Call: nnetar(y = lynx, p = 4, P = 0, size = 2)
##
## Average of 20 networks, each of which is
## a 4-2-1 network with 13 weights
## options were - linear output units
##
## sigma^2 estimated as 400454
checkresiduals(nn1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc1 = forecast(nn1,h=10)
autoplot(fc1) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI = forecast(nn1,h=10,PI=T)
autoplot(fc1.PI) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1935 3326.1703 2427.7158 4152.533 2007.8655 4649.509
## 1936 2761.3666 952.8359 3842.836 -101.0709 4483.687
## 1937 1491.1505 -113.5716 2911.833 -618.0791 3756.263
## 1938 338.4665 -489.7661 1466.347 -993.2523 2031.027
## 1939 239.9978 -510.6163 1416.165 -902.9644 1998.414
## 1940 347.1073 -358.6480 2082.007 -848.7747 3029.568
## 1941 452.7462 -139.5469 3065.831 -659.0079 4060.611
## 1942 778.8452 136.2547 4061.362 -409.3429 4934.204
## 1943 1535.9309 421.2485 4478.042 -320.3123 5224.197
## 1944 2612.7034 463.5767 4479.047 -283.8842 5208.211
autoplot(lynx) + autolayer(fitted(nn1))
## Warning: Removed 4 rows containing missing values (geom_path).
This fit seems good, but is it overfitting? We can assess this by performing cross-validation using a training/test set approach.
length(lynx)
## [1] 114
lynx.test = tail(lynx,14)
lynx.train = head(lynx,100)
nn1 = nnetar(lynx.train,p=4,P=0,size=2)
fc1 = forecast(nn1,h=14)
accuracy(fc1,lynx.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.128223 641.5337 435.9073 -51.60446 70.65446 0.5111859
## Test set -1.810680 758.9525 617.3723 -16.45466 64.93984 0.7239888
## ACF1 Theil's U
## Training set 0.04928738 NA
## Test set 0.75553545 2.390399
autoplot(fc1) + autolayer(lynx.test,series="Actual")
autoplot(lynx.test) + autolayer(fc1,series="Forecast")
The forecast does not look too bad for these choices of \(p\) and \(k\). We should note that there was a fair amount of trial-and-error in arriving at these choices.
This time series contains monthly retail sales in millions of dollars from 1992 through 2018. We will examine the performance of neural networks for this strongly seasonal time series.
Cloth = read.csv("http://course1.winona.edu/bdeppa/FIN%20335/Datasets/US%20Clothing%20Sales%20(millions%20of%20dollars%20-%201992%20to%20present).csv",header=T)
head(Cloth)
## DATE Clothing
## 1 1/1/1992 6938
## 2 2/1/1992 7524
## 3 3/1/1992 8475
## 4 4/1/1992 9401
## 5 5/1/1992 9558
## 6 6/1/1992 9182
ClothSales = ts(Cloth$Clothing,start=1992,frequency=12)
ClothSales = window(ClothSales,start=2010)
autoplot(ClothSales) + xlab("Year") + ylab("Millions $") + ggtitle("Monthly Retail Clothing Sales (January 2010-January 2018)")
nn1 = nnetar(ClothSales,p=3,P=1,size=3)
nn1
## Series: ClothSales
## Model: NNAR(3,1,3)[12]
## Call: nnetar(y = ClothSales, p = 3, P = 1, size = 3)
##
## Average of 20 networks, each of which is
## a 4-3-1 network with 19 weights
## options were - linear output units
##
## sigma^2 estimated as 152303
checkresiduals(nn1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc = forecast(nn1,h=24,PI=T)
autoplot(fc) + xlab("Year") + ylab("Millions $") + ggtitle("Clothing Sales with Forecast from NNAR(3,1,3)")
Again we will consider a training/test set approach for cross-validating this model.
length(ClothSales)
## [1] 97
Cloth.test = tail(ClothSales,12)
Cloth.train = head(ClothSales,length(ClothSales)-12)
nn1 = nnetar(Cloth.train,p=4,P=2,size=3,repeats=30)
fc = forecast(nn1,h=12)
autoplot(fc) + autolayer(Cloth.test,series="Actual")
accuracy(fc,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.7849936 293.3413 240.6506 -0.03408947 1.212148 0.341408
## Test set 70.7959700 515.3685 373.8511 0.36457895 1.642776 0.530378
## ACF1 Theil's U
## Training set 0.01792396 NA
## Test set -0.53207435 0.11077
Cloth.hw = hw(Cloth.train,h=12)
accuracy(Cloth.hw,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -72.59406 486.1052 382.5669 -0.4345672 1.993428 0.5427429
## Test set -257.59198 626.3150 507.7408 -1.5735583 2.430074 0.7203257
## ACF1 Theil's U
## Training set -0.1410691 NA
## Test set 0.1022929 0.1073411
Cloth.ets = ets(Cloth.train)
fc.ets = forecast(Cloth.ets,h=12)
accuracy(fc.ets,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -30.99871 361.8720 297.2967 -0.21897668 1.509087 0.4217712
## Test set 40.83965 456.6476 361.3441 0.03397949 1.688414 0.5126344
## ACF1 Theil's U
## Training set -0.08154105 NA
## Test set 0.11982844 0.08476344
Cloth.ARIMA = auto.arima(Cloth.train,approximation=F,stepwise=F)
fc.ARIMA = forecast(Cloth.ARIMA,h=12)
accuracy(fc.ARIMA,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -9.430964 465.1664 335.4993 -0.1801416 1.669091 0.4759687
## Test set -352.250883 623.3325 499.7388 -1.7601474 2.391499 0.7089734
## ACF1 Theil's U
## Training set -0.02641904 NA
## Test set -0.03493485 0.1122988
Here we can see that the NNAR(4,2,3) model outperforms both the Holt-Winters and ARIMA forecasts for these data, and is competitive with the ETS forecasts (better on MAPE, though slightly worse on test-set RMSE and MAE). As with the first example, a bit of trial-and-error led to these choices for \(p\), \(P\), and \(k\). Undoubtedly there are several other reasonable choices for these parameters, some of which would probably outperform these.
Unlike most of the methods considered in this book, neural networks are not based on a well-defined stochastic model, and so it is not straightforward to derive prediction intervals for the resultant forecasts. However, we can still compute prediction intervals using simulation where future sample paths are generated using bootstrapped residuals.
The neural network fit to the clothing sales time series can be written as
\[y_t = f({\bf{y_{t-1}}}) + \epsilon_t\] where \({\bf{y_{t-1}}} = (y_{t-1},y_{t-2},y_{t-3},y_{t-4},y_{t-12},y_{t-24})\) is a vector containing lagged values of the series used as inputs to the neural network, and \(f\) is a neural network with a single hidden layer containing three nodes. The error series \(\epsilon_t\) is assumed to have mean \(0\) and constant variance \(\sigma^2\), and is possibly normally distributed.
We can simulate future sample paths of this model iteratively, by randomly generating a value for \(\epsilon_t\), either from a normal distribution, or by resampling from the historical forecast errors/residuals \(\hat{e}_t\). So if \(\epsilon^*_{T+1}\) is a random draw from the distribution of errors observed up to time \(T\), then we can generate a future value of the time series at time \(T+1\) by using our neural network model with this random error added.
\[y^*_{T+1} = f({\bf y}_T) + \epsilon^*_{T+1}\]
Setting
\[{\bf y}^*_{T+1} = (y^*_{T+1},\,y_{T},\,y_{T-1},\,y_{T-2},\,y_{{(T+2)}-12},\,y_{{(T+2)}-24})\]
we can repeat the process to obtain \[y^*_{T+2} = f({\bf y}^*_{T+1}) + \epsilon^*_{T+2}\]
In this way, we can iteratively simulate a future sample path for the time series. By doing this a large number of times we build up knowledge of the distribution for all future values of the time series based upon our neural network.
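The scheme just described can be sketched in base R. Here f is a stand-in one-step model (an arbitrary AR(1)-type rule) and the residuals are simulated, since the point is only to show the bootstrap iteration, not a fitted network:

```r
# Simulate future sample paths by iterating a one-step model and adding
# a residual resampled (bootstrapped) from the historical residuals.
# `f` stands in for the fitted network; `res` for its residuals.
simulate_path <- function(y, res, h, f = function(x) 0.8 * tail(x, 1)) {
  hist <- as.numeric(y)
  path <- numeric(h)
  for (i in seq_len(h)) {
    eps <- sample(res, 1)           # bootstrap draw: epsilon*_{T+i}
    path[i] <- f(hist) + eps        # y*_{T+i} = f(history) + epsilon*_{T+i}
    hist <- c(hist, path[i])        # feed the simulated value back in
  }
  path
}

set.seed(1)
res  <- rnorm(50, 0, 0.5)                                # stand-in residuals
sims <- replicate(1000, simulate_path(c(5, 4, 6), res, h = 12))
dim(sims)   # 12 horizons x 1000 simulated sample paths
```

Each column of sims is one possible future trajectory; the spread across columns at each horizon approximates the forecast distribution.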
Below is the code to simulate the next two years of monthly retail clothing sales using our neural network developed for these data above.
nn1 = nnetar(ClothSales,p=4,P=2,size=3)
sim <- ts(matrix(0, nrow=24L, ncol=10L),start=end(ClothSales)[1L],frequency=12)
for(i in seq(10)) {sim[,i]=simulate(nn1,nsim=24L)}
autoplot(ClothSales)+autolayer(sim) + ylab("Millions $") + xlab("Year") + ggtitle("US Monthly Retail Clothing Sales with Simulations")
## For a multivariate timeseries, specify a seriesname for each timeseries. Defaulting to column names.
If we repeated this say 1,000 times, rather than 10, we could obtain an approximate 80% prediction interval at each future time point by taking the 100th smallest and 100th largest values at that time point (i.e. the 10th and 90th percentiles) as the interval endpoints. This is what the forecast function does when no closed form for the standard error exists, which is the case with neural network time series models. To obtain these intervals you need to specify PI=T inside the forecast function.
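Given a matrix of simulated paths (one row per horizon, one column per simulation), the percentile intervals are just row-wise quantiles. A base-R sketch with a made-up simulation matrix standing in for 1,000 simulated two-year paths:

```r
# Approximate 80% prediction intervals from simulated sample paths:
# at each forecast horizon, take the 10th and 90th percentiles across
# the simulations. `sims` here is a made-up 24 x 1000 matrix standing
# in for 1,000 simulated sample paths.
set.seed(42)
sims <- matrix(rnorm(24 * 1000, mean = 20000, sd = 500), nrow = 24)

point <- rowMeans(sims)                          # point forecast per horizon
lower <- apply(sims, 1, quantile, probs = 0.10)  # Lo 80
upper <- apply(sims, 1, quantile, probs = 0.90)  # Hi 80
```

Using probs = 0.025 and 0.975 instead would give the 95% limits shown in the forecast output below.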
fc = forecast(nn1,PI=T,h=24)
autoplot(fc) + xlab("Year") + ylab("Millions $") + ggtitle("NNAR(4,2,3) Forecasts for US Monthly Clothing Sales")
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2018 18165.74 17772.80 18609.45 17517.03 18802.78
## Mar 2018 21214.96 20799.92 21649.75 20601.86 21895.62
## Apr 2018 21206.73 20742.39 21622.54 20483.35 21888.21
## May 2018 21692.46 21263.54 22141.85 20987.55 22347.92
## Jun 2018 20565.66 20130.79 21036.01 19897.60 21238.52
## Jul 2018 20873.02 20421.25 21335.07 20178.51 21557.67
## Aug 2018 22966.92 22515.38 23376.79 22250.81 23581.97
## Sep 2018 20066.48 19605.87 20521.83 19400.74 20776.54
## Oct 2018 20468.28 20032.54 20908.19 19764.62 21150.05
## Nov 2018 24827.27 24396.59 25228.41 24191.90 25453.39
## Dec 2018 34814.23 34361.94 35217.05 34131.23 35453.91
## Jan 2019 16379.75 15998.11 16823.45 15778.72 17055.80
## Feb 2019 18160.86 17731.26 18588.14 17456.87 18804.66
## Mar 2019 21294.42 20873.45 21762.91 20584.61 21979.11
## Apr 2019 21593.96 20984.16 22147.02 20650.69 22507.20
## May 2019 21595.46 21054.63 22124.39 20710.43 22444.53
## Jun 2019 20693.00 20168.91 21251.60 19795.02 21545.40
## Jul 2019 21110.54 20594.39 21611.98 20304.75 21868.18
## Aug 2019 23207.49 22610.85 23740.83 22361.91 23969.23
## Sep 2019 20155.81 19610.56 20679.02 19332.34 20953.88
## Oct 2019 20413.41 19904.27 20979.16 19602.13 21302.59
## Nov 2019 25148.49 24575.07 25645.81 24293.43 25926.07
## Dec 2019 34920.22 34413.74 35352.47 34178.78 35594.56
## Jan 2020 16570.24 16089.98 17113.84 15791.76 17423.16