Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors.
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
The simplest networks contain no hidden layers and are equivalent to linear regressions. The figure below shows the neural network version of a linear regression with four predictors. The coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimizes a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression which is a much more efficient method of training the model.
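This equivalence is easy to verify numerically. The sketch below (base R, with simulated data since no dataset has been introduced yet) fits the zero-hidden-layer "network" by ordinary least squares: lm() is exactly the MSE-minimizing learning algorithm in this special case.

```r
# A network with no hidden layer and a linear output is just linear
# regression: minimizing the MSE over the weights gives the OLS fit.
set.seed(123)
X <- matrix(rnorm(100 * 4), ncol = 4)                  # four predictors (inputs)
y <- drop(2 + X %*% c(1, -0.5, 0.3, 0)) + rnorm(100)   # known "weights" plus noise

fit <- lm(y ~ X)   # least squares = the MSE-minimizing "learning algorithm"
coef(fit)          # intercept (bias) followed by the four estimated weights
```

The estimated coefficients should land close to the true weights (2, 1, -0.5, 0.3, 0) used to simulate the data.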
Once we add an intermediate layer with hidden neurons, the neural network becomes nonlinear. A simple example is shown in the figure below.
This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output. For example, in the diagram above, the inputs into hidden neuron \(j\) are combined linearly to give,
\[z_j=b_j + \sum_{k=1}^4 w_{k,j} x_k, \qquad j=1,2,3\] When using a feed-forward neural network to forecast a time series, the inputs can be previously observed values of the time series \(y_t\). This is known as a neural network autoregression. For example, we might use the previous four values to forecast the next value of the time series, i.e. the inputs would be \(y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}\). The hidden nodes would then be given by,
\[z_j=b_j + \sum_{k=1}^4 w_{k,j} y_{t-k}\]
In the hidden layer, an activation function is applied to the hidden nodes. The activation function is a nonlinear function such as a sigmoid/logistic function,
\[\phi(z) = \frac{1}{1+e^{-z}}\] that is applied to give the input to the next layer of the neural network. Application of this function tends to reduce the effect of extreme input values, thus making the entire network more robust to the effect of outliers.
The output layer from the neural network is another linear combination of the hidden nodes after the activation function \(\phi(z)\) has been applied.
\[y_t = a + \sum_{j=1}^3 h_j \phi(z_j)\] or in terms of the inputs \(y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}\) we have,
\[y_t = a + \sum_{j=1}^3 h_j\, \phi\!\left(b_j + \sum_{k=1}^4 w_{k,j} y_{t-k}\right)\] The parameters or weights \(a, h_1, h_2, h_3, b_1, b_2, b_3, w_{1,1}, w_{1,2}, \dots, w_{3,3}, w_{4,3}\) are “learned” from the data. The weights to be estimated are generally restricted to prevent them from becoming too large. The tuning parameter that restricts the weights is known as the “decay parameter”, and is often set to be equal to \(0.1\).
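The forward pass in this equation can be written directly in a few lines of base R. The weight values below are arbitrary illustrative numbers, not weights learned from any data:

```r
# Forward pass of a 4-input, 3-hidden-node, 1-output network, i.e. the
# NNAR(4,3) structure above. All weights are arbitrary illustrative
# values, not learned from data.
phi <- function(z) 1 / (1 + exp(-z))      # logistic activation

set.seed(1)
W <- matrix(runif(12, -1, 1), 4, 3)       # w_{k,j}: input k -> hidden node j
b <- c(0.1, -0.2, 0.3)                    # hidden-node biases b_j
h <- c(0.5, 1.0, -0.5)                    # output-layer weights h_j
a <- 2                                    # output intercept

nn_forecast <- function(x, W, b, h, a) {
  z <- b + drop(crossprod(W, x))          # z_j = b_j + sum_k w_{k,j} x_k
  a + sum(h * phi(z))                     # y_t = a + sum_j h_j * phi(z_j)
}

nn_forecast(c(269, 321, 585, 871), W, b, h, a)  # one-step "forecast"
```

Note how the large inputs saturate the logistic function, illustrating how the activation bounds the influence of extreme values.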
The weights are “learned” by first choosing random initial values, which are then updated using the observed data. Because the weights are chosen randomly at the start, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are then averaged. The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance. Cross-validation methods can be used to help choose “optimal” values for the model complexity.
With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in a linear autoregression model, \(AR(p)\). We call this a neural network autoregression or NNAR model.
We will only consider feed-forward neural networks with one hidden layer, and we use the notation \(NNAR(p,k)\) to indicate there are \(p\) lagged inputs (\(y_{t-1},\dots,y_{t-p}\)) and \(k\) nodes in the one hidden layer. For example, the \(NNAR(4,3)\) model shown above is a neural network with the last four observations (\(y_{t-1},y_{t-2},y_{t-3},y_{t-4}\)) as inputs for forecasting the output \(y_t\), and with three neurons in the hidden layer. An \(NNAR(p,0)\) model is equivalent to an \(ARIMA(p,0,0)\) model, but without the restrictions on the parameters that ensure stationarity, i.e.
\[y_t = a + \sum_{k=1}^{p} w_k y_{t-k} + \epsilon_t\]
With seasonal data, it is useful to also add the last observed values from the same season as inputs. For example, if \(y_t\) is the value of the time series in July, we would include \(y_{t-12}\), the value of the time series the previous July, as an input. We could also include the July value from two years prior, \(y_{t-2\times 12}\). In general, we could use up to \(P\) of the previous seasonal values as inputs, i.e. \(y_{t-m},y_{t-2m},\dots ,y_{t-Pm}\). The notation for a neural network model with seasonal inputs is \(NNAR(p,P,k)_m\). This model has inputs \((y_{t-1},\dots,y_{t-p},y_{t-m},\dots,y_{t-Pm})\) and \(k\) neurons in the hidden layer. An \(NNAR(p,P,0)_m\) model is equivalent to an \(ARIMA(p,0,0)(P,0,0)_m\) model but without parameter constraints to ensure stationarity.
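Assembling the input matrix for such a model is just a matter of aligning lagged copies of the series. A base-R sketch (the function name and the choices \(p=2\), \(P=1\), \(m=12\) are mine, purely for illustration):

```r
# Sketch: build the input matrix for an NNAR(p, P, k)_m model from a
# series y using base R. Each row t holds the non-seasonal lags
# y_{t-1}, ..., y_{t-p} and the seasonal lags y_{t-m}, ..., y_{t-Pm}.
make_nnar_inputs <- function(y, p = 2, P = 1, m = 12) {
  n      <- length(y)
  lags   <- c(seq_len(p), m * seq_len(P))      # 1, 2, ..., p, m, 2m, ...
  maxlag <- max(lags)
  t_idx  <- (maxlag + 1):n                     # rows with a full history
  X <- sapply(lags, function(L) y[t_idx - L])  # one column per lag
  colnames(X) <- paste0("lag", lags)
  list(X = X, y = y[t_idx])                    # inputs and matching targets
}

out <- make_nnar_inputs(sin(1:60), p = 2, P = 1, m = 12)
dim(out$X)   # 48 rows (60 - 12 usable time points), 3 columns: lag1, lag2, lag12
```

Packages like forecast do this alignment internally; the sketch just makes the input structure explicit.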
The R function nnetar in the fpp2 library will fit an \(NNAR(p,P,k)_m\) model to a time series and can be used to forecast future values of the series. Generally we need to specify the values of \(p\), \(P\), and \(k\). If we do not specify the number of hidden nodes \(k\), the nnetar function will use \(k=(p+P+1)/2\), rounded to the nearest integer. When it comes to forecasting, the network is applied iteratively. For forecasting one step ahead, we simply use the historical inputs. For forecasting two steps ahead, however, we first have to obtain the one-step-ahead forecast \(\hat{y}_{T+1}\) and then treat this forecast as a historical value to obtain the next forecast \(\hat{y}_{T+2}\). This process is repeated iteratively to obtain an \(h\)-step-ahead forecast \(\hat{y}_{T+h}\).
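The iterative forecasting scheme can be sketched in a few lines of base R. Here one_step stands in for the fitted network \(f\); the simple mean of the last \(p\) values is used purely as a placeholder so the mechanics are visible:

```r
# Iterative h-step forecasting: each forecast is appended to the history
# and fed back in as an input for the next step. `one_step` stands in
# for the fitted network; here it is just the mean of the last p values.
iterate_forecast <- function(y, h, p = 4,
                             one_step = function(x) mean(x)) {
  hist <- as.numeric(y)
  fc <- numeric(h)
  for (i in seq_len(h)) {
    inputs <- tail(hist, p)        # y_{T+i-1}, ..., y_{T+i-p}
    fc[i]  <- one_step(inputs)     # hat{y}_{T+i}
    hist   <- c(hist, fc[i])       # treat the forecast as history
  }
  fc
}

iterate_forecast(c(1, 2, 3, 4), h = 3)  # 2.5, 2.875, 3.09375
```

The second and third forecasts depend on the first, which is exactly why multi-step NNAR forecast uncertainty grows with the horizon.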
These data are contained in the time series lynx in the fpp2 library and represent the annual number of lynx trapped in the McKenzie River district of northwest Canada from 1821-1934.
require(fpp2)
## Loading required package: fpp2
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
lynx
## Time Series:
## Start = 1821
## End = 1934
## Frequency = 1
## [1] 269 321 585 871 1475 2821 3928 5943 4950 2577 523 98 184 279
## [15] 409 2285 2685 3409 1824 409 151 45 68 213 546 1033 2129 2536
## [29] 957 361 377 225 360 731 1638 2725 2871 2119 684 299 236 245
## [43] 552 1623 3311 6721 4254 687 255 473 358 784 1594 1676 2251 1426
## [57] 756 299 201 229 469 736 2042 2811 4431 2511 389 73 39 49
## [71] 59 188 377 1292 4031 3495 587 105 153 387 758 1307 3465 6991
## [85] 6313 3794 1836 345 382 808 1388 2713 3800 3091 2985 3790 674 81
## [99] 80 108 229 399 1132 2432 3574 2935 1537 529 485 662 1000 1590
## [113] 2657 3396
autoplot(lynx) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Annual Canadian Lynx Trappings (1821-1934)")
ggtsdisplay(lynx)
These data appear to be seasonal; however, the peaks and troughs do not occur with a constant frequency \(m\). Furthermore, these are annual counts, so the behavior we are seeing would be characterized as cyclic, but NOT seasonal. Previous methods used for forecasting these data have provided disappointing results. We will now consider using a neural network autoregression to make forecasts for these data.
nn1 = nnetar(lynx,p=4,P=0,size=2)
nn1
## Series: lynx
## Model: NNAR(4,2)
## Call: nnetar(y = lynx, p = 4, P = 0, size = 2)
##
## Average of 20 networks, each of which is
## a 4-2-1 network with 13 weights
## options were - linear output units
##
## sigma^2 estimated as 400454
checkresiduals(nn1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc1 = forecast(nn1,h=10)
autoplot(fc1) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI = forecast(nn1,h=10,PI=T)
autoplot(fc1.PI) + xlab("Year") + ylab("# of Lynx Trapped") + ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1935 3326.1703 2427.7158 4152.533 2007.8655 4649.509
## 1936 2761.3666 952.8359 3842.836 -101.0709 4483.687
## 1937 1491.1505 -113.5716 2911.833 -618.0791 3756.263
## 1938 338.4665 -489.7661 1466.347 -993.2523 2031.027
## 1939 239.9978 -510.6163 1416.165 -902.9644 1998.414
## 1940 347.1073 -358.6480 2082.007 -848.7747 3029.568
## 1941 452.7462 -139.5469 3065.831 -659.0079 4060.611
## 1942 778.8452 136.2547 4061.362 -409.3429 4934.204
## 1943 1535.9309 421.2485 4478.042 -320.3123 5224.197
## 1944 2612.7034 463.5767 4479.047 -283.8842 5208.211
autoplot(lynx) + autolayer(fitted(nn1))
## Warning: Removed 4 rows containing missing values (geom_path).
This fit seems good, but is it overfitting? We can assess this by performing cross-validation using a training/test set approach.
length(lynx)
## [1] 114
lynx.test = tail(lynx,14)
lynx.train = head(lynx,100)
nn1 = nnetar(lynx.train,p=4,P=0,size=2)
fc1 = forecast(nn1,h=14)
accuracy(fc1,lynx.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.128223 641.5337 435.9073 -51.60446 70.65446 0.5111859
## Test set -1.810680 758.9525 617.3723 -16.45466 64.93984 0.7239888
## ACF1 Theil's U
## Training set 0.04928738 NA
## Test set 0.75553545 2.390399
autoplot(fc1) + autolayer(lynx.test,series="Actual")
autoplot(lynx.test) + autolayer(fc1,series="Forecast")
The forecast does not look too bad for these choices of \(p\) and \(k\). We should note that there was a fair amount of trial-and-error in arriving at these choices.
This time series contains monthly retail sales in millions of dollars from 1992 through 2018. We will examine the performance of neural networks for this strongly seasonal time series.
Cloth = read.csv("http://course1.winona.edu/bdeppa/FIN%20335/Datasets/US%20Clothing%20Sales%20(millions%20of%20dollars%20-%201992%20to%20present).csv",header=T)
head(Cloth)
## DATE Clothing
## 1 1/1/1992 6938
## 2 2/1/1992 7524
## 3 3/1/1992 8475
## 4 4/1/1992 9401
## 5 5/1/1992 9558
## 6 6/1/1992 9182
ClothSales = ts(Cloth$Clothing,start=1992,frequency=12)
ClothSales = window(ClothSales,start=2010)
autoplot(ClothSales) + xlab("Year") + ylab("Millions $") + ggtitle("Monthly Retail Clothing Sales (January 2010-January 2018)")
nn1 = nnetar(ClothSales,p=3,P=1,size=3)
nn1
## Series: ClothSales
## Model: NNAR(3,1,3)[12]
## Call: nnetar(y = ClothSales, p = 3, P = 1, size = 3)
##
## Average of 20 networks, each of which is
## a 4-3-1 network with 19 weights
## options were - linear output units
##
## sigma^2 estimated as 152303
checkresiduals(nn1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc = forecast(nn1,h=24,PI=T)
autoplot(fc) + xlab("Year") + ylab("Millions $") + ggtitle("Clothing Sales with Forecast from NNAR(3,1,3)")
Again we will consider a training/test set approach for cross-validating this model.
length(ClothSales)
## [1] 97
Cloth.test = tail(ClothSales,12)
Cloth.train = head(ClothSales,length(ClothSales)-12)
nn1 = nnetar(Cloth.train,p=4,P=2,size=3,repeats=30)
fc = forecast(nn1,h=12)
autoplot(fc) + autolayer(Cloth.test,series="Actual")
accuracy(fc,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.7849936 293.3413 240.6506 -0.03408947 1.212148 0.341408
## Test set 70.7959700 515.3685 373.8511 0.36457895 1.642776 0.530378
## ACF1 Theil's U
## Training set 0.01792396 NA
## Test set -0.53207435 0.11077
Cloth.hw = hw(Cloth.train,h=12)
accuracy(Cloth.hw,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -72.59406 486.1052 382.5669 -0.4345672 1.993428 0.5427429
## Test set -257.59198 626.3150 507.7408 -1.5735583 2.430074 0.7203257
## ACF1 Theil's U
## Training set -0.1410691 NA
## Test set 0.1022929 0.1073411
Cloth.ets = ets(Cloth.train)
fc.ets = forecast(Cloth.ets,h=12)
accuracy(fc.ets,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -30.99871 361.8720 297.2967 -0.21897668 1.509087 0.4217712
## Test set 40.83965 456.6476 361.3441 0.03397949 1.688414 0.5126344
## ACF1 Theil's U
## Training set -0.08154105 NA
## Test set 0.11982844 0.08476344
Cloth.ARIMA = auto.arima(Cloth.train,approximation=F,stepwise=F)
fc.ARIMA = forecast(Cloth.ARIMA,h=12)
accuracy(fc.ARIMA,Cloth.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -9.430964 465.1664 335.4993 -0.1801416 1.669091 0.4759687
## Test set -352.250883 623.3325 499.7388 -1.7601474 2.391499 0.7089734
## ACF1 Theil's U
## Training set -0.02641904 NA
## Test set -0.03493485 0.1122988
Here we can see that the NNAR(4,2,3) model outperforms both the Holt-Winters and ARIMA forecasts for these data, and is competitive with the ETS forecasts (better on MAPE, though slightly worse on test-set RMSE and MAE). As with the first example, a bit of trial-and-error led to these choices for \(p\), \(P\), and \(k\). Undoubtedly there are several other reasonable choices for these parameters, some of which would probably outperform these.
Unlike most of the methods considered in this book, neural networks are not based on a well-defined stochastic model, and so it is not straightforward to derive prediction intervals for the resultant forecasts. However, we can still compute prediction intervals using simulation where future sample paths are generated using bootstrapped residuals.
The neural network fit to the clothing sales time series can be written as
\[y_t = f({\bf{y_{t-1}}}) + \epsilon_t\] where \({\bf{y_{t-1}}} = (y_{t-1},y_{t-2},y_{t-3},y_{t-4},y_{t-12},y_{t-24})\) is a vector containing lagged values of the series used as inputs to the neural network, and \(f\) is a neural network with a single hidden layer containing three nodes. The error series \(\epsilon_t\) is assumed to have mean \(0\) and constant variance \(\sigma^2\), and is possibly normally distributed.
We can simulate future sample paths of this model iteratively, by randomly generating a value for \(\epsilon_t\), either from a normal distribution, or by resampling from the historical forecast errors/residuals \(\hat{e}_t\). So if \(\epsilon^*_{T+1}\) is a random draw from the distribution of errors observed up to time \(T\), then we can generate a future value of the time series at time \(T+1\) by using our neural network model with this random error added.
\[y^*_{T+1} = f({\bf y}_T) + \epsilon^*_{T+1}\]
Setting
\[{\bf y}^*_{T+1} = (y^*_{T+1},\,y_{T},\,y_{T-1},\,y_{T-2},\,y_{{(T+2)}-12},\,y_{{(T+2)}-24})\]
we can repeat the process to obtain \[y^*_{T+2} = f({\bf y}^*_{T+1}) + \epsilon^*_{T+2}\]
In this way, we can iteratively simulate a future sample path for the time series. By doing this a large number of times we build up knowledge of the distribution for all future values of the time series based upon our neural network.
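The scheme just described can be sketched in base R. Here f is a stand-in one-step model (an arbitrary AR(1)-type rule) and the residuals are simulated, since the point is only to show the bootstrap iteration, not a fitted network:

```r
# Simulate future sample paths by iterating a one-step model and adding
# a residual resampled (bootstrapped) from the historical residuals.
# `f` stands in for the fitted network; `res` for its residuals.
simulate_path <- function(y, res, h, f = function(x) 0.8 * tail(x, 1)) {
  hist <- as.numeric(y)
  path <- numeric(h)
  for (i in seq_len(h)) {
    eps <- sample(res, 1)           # bootstrap draw: epsilon*_{T+i}
    path[i] <- f(hist) + eps        # y*_{T+i} = f(history) + epsilon*_{T+i}
    hist <- c(hist, path[i])        # feed the simulated value back in
  }
  path
}

set.seed(1)
res  <- rnorm(50, 0, 0.5)                                # stand-in residuals
sims <- replicate(1000, simulate_path(c(5, 4, 6), res, h = 12))
dim(sims)   # 12 horizons x 1000 simulated sample paths
```

Each column of sims is one possible future trajectory; the spread across columns at each horizon approximates the forecast distribution.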
Below is the code to simulate the next two years of monthly retail clothing sales using our neural network developed for these data above.
nn1 = nnetar(ClothSales,p=4,P=2,size=3)
sim <- ts(matrix(0, nrow=24L, ncol=10L),start=end(ClothSales)[1L],frequency=12)
for(i in seq(10)) {sim[,i]=simulate(nn1,nsim=24L)}
autoplot(ClothSales)+autolayer(sim) + ylab("Millions $") + xlab("Year") + ggtitle("US Monthly Retail Clothing Sales with Simulations")
## For a multivariate timeseries, specify a seriesname for each timeseries. Defaulting to column names.
If we repeated this say 1,000 times, rather than 10, we could obtain an approximate 80% prediction interval at each future time point by taking the 100th smallest and 100th largest values at that time point (i.e. the 10th and 90th percentiles) as the interval endpoints. This is what the forecast function does when no closed form for the standard error exists, which is the case with neural network time series models. To obtain these intervals you need to specify PI=T inside the forecast function.
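Given a matrix of simulated paths (one row per horizon, one column per simulation), the percentile intervals are just row-wise quantiles. A base-R sketch with a made-up simulation matrix standing in for 1,000 simulated two-year paths:

```r
# Approximate 80% prediction intervals from simulated sample paths:
# at each forecast horizon, take the 10th and 90th percentiles across
# the simulations. `sims` here is a made-up 24 x 1000 matrix standing
# in for 1,000 simulated sample paths.
set.seed(42)
sims <- matrix(rnorm(24 * 1000, mean = 20000, sd = 500), nrow = 24)

point <- rowMeans(sims)                          # point forecast per horizon
lower <- apply(sims, 1, quantile, probs = 0.10)  # Lo 80
upper <- apply(sims, 1, quantile, probs = 0.90)  # Hi 80
```

Using probs = 0.025 and 0.975 instead would give the 95% limits shown in the forecast output below.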
fc = forecast(nn1,PI=T,h=24)
autoplot(fc) + xlab("Year") + ylab("Millions $") + ggtitle("NNAR(4,2,3) Forecasts for US Monthly Clothing Sales")
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2018 18165.74 17772.80 18609.45 17517.03 18802.78
## Mar 2018 21214.96 20799.92 21649.75 20601.86 21895.62
## Apr 2018 21206.73 20742.39 21622.54 20483.35 21888.21
## May 2018 21692.46 21263.54 22141.85 20987.55 22347.92
## Jun 2018 20565.66 20130.79 21036.01 19897.60 21238.52
## Jul 2018 20873.02 20421.25 21335.07 20178.51 21557.67
## Aug 2018 22966.92 22515.38 23376.79 22250.81 23581.97
## Sep 2018 20066.48 19605.87 20521.83 19400.74 20776.54
## Oct 2018 20468.28 20032.54 20908.19 19764.62 21150.05
## Nov 2018 24827.27 24396.59 25228.41 24191.90 25453.39
## Dec 2018 34814.23 34361.94 35217.05 34131.23 35453.91
## Jan 2019 16379.75 15998.11 16823.45 15778.72 17055.80
## Feb 2019 18160.86 17731.26 18588.14 17456.87 18804.66
## Mar 2019 21294.42 20873.45 21762.91 20584.61 21979.11
## Apr 2019 21593.96 20984.16 22147.02 20650.69 22507.20
## May 2019 21595.46 21054.63 22124.39 20710.43 22444.53
## Jun 2019 20693.00 20168.91 21251.60 19795.02 21545.40
## Jul 2019 21110.54 20594.39 21611.98 20304.75 21868.18
## Aug 2019 23207.49 22610.85 23740.83 22361.91 23969.23
## Sep 2019 20155.81 19610.56 20679.02 19332.34 20953.88
## Oct 2019 20413.41 19904.27 20979.16 19602.13 21302.59
## Nov 2019 25148.49 24575.07 25645.81 24293.43 25926.07
## Dec 2019 34920.22 34413.74 35352.47 34178.78 35594.56
## Jan 2020 16570.24 16089.98 17113.84 15791.76 17423.16