5 Model methods

5.1 Interpolation

Models that can be estimated in the presence of missing values can often be used to interpolate the unknown values. The interpolated values can often be taken from the model’s fitted values, and some models may support more sophisticated interpolation methods.

The forecast package provides the na.interp function for interpolating time series data, which uses linear interpolation for non-seasonal data, and STL decomposition for seasonal data.
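As a rough illustration of the non-seasonal case, linear interpolation can be sketched in base R with stats::approx. This is only a minimal sketch on made-up values; it is not the implementation used by na.interp, and the variable names are illustrative.

```r
# Toy series with gaps (made-up values, not from any dataset)
y <- c(10, NA, 14, 15, NA, NA, 21)
idx <- seq_along(y)
obs <- !is.na(y)

# Linearly interpolate at every index using only the observed points
y_filled <- approx(x = idx[obs], y = y[obs], xout = idx)$y
y_filled
# 10 12 14 15 17 19 21
```

Observed values are left unchanged; only the gaps are filled along the straight line joining their neighbours.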

Tidy time series tools should allow users to interpolate missing values using any appropriate model.

For example, the tsibbledata::olympic_running dataset contains the winning times of Olympic track finals, such as the men’s 400m. The winning times for the 1916, 1940 and 1944 Olympics are missing from the dataset because the Games were not held during the World Wars.

We could then interpolate these missing values using the fitted values from a linear model with a trend:

olympic_running %>% 
  model(lm = TSLM(Time ~ trend())) %>% 
  interpolate(olympic_running)
## # A tsibble: 312 x 4 [4Y]
## # Key:       Length, Sex [14]
##    Length Sex    Year  Time
##    <fct>  <chr> <dbl> <dbl>
##  1 100m   men    1896  12  
##  2 100m   men    1900  11  
##  3 100m   men    1904  11  
##  4 100m   men    1908  10.8
##  5 100m   men    1912  10.8
##  6 100m   men    1916  10.8
##  7 100m   men    1920  10.8
##  8 100m   men    1924  10.6
##  9 100m   men    1928  10.8
## 10 100m   men    1932  10.3
## # … with 302 more rows
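The idea of interpolating from fitted values can be sketched in base R with lm and predict. This is a minimal sketch on a made-up series (the values below are illustrative, not the Olympic data), and it is not how interpolate is implemented in fable.

```r
# Toy four-yearly series with one missing observation (made-up values)
toy <- data.frame(
  Year = seq(1896, 1936, by = 4),
  Time = c(54.2, 53.7, 53.9, 53.3, 53.2, NA, 52.9, 52.5, 52.6, 52.1, 52.0)
)

# Fit a linear trend; lm() drops the incomplete row by default
fit <- lm(Time ~ Year, data = toy)

# Replace only the missing values with the model's predictions
missing <- is.na(toy$Time)
interpolated <- toy
interpolated$Time[missing] <- predict(fit, newdata = toy[missing, , drop = FALSE])
interpolated
```

Only the missing entries are replaced, so the observed values pass through unchanged.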

5.2 Re-estimation

https://github.com/tidyverts/fable/issues/43

5.2.1 refit()

Refitting a model allows the same model to be applied to a new dataset. This is similar to the model argument available in most modelling functions from the forecast package.

The refitted model should maintain the same structure and coefficients as the original model, with fitted information updated to reflect the model’s behaviour on the new dataset. It should also be possible to re-estimate the parameters using the reestimate argument, which keeps the selected model terms but updates the model coefficients/parameters.

It is expected that a refit method uses a fitted model and replacement data to return a mable.

For the ETS model for mdeaths estimated above:

library(fable)
ets_fit <- as_tsibble(mdeaths) %>% 
  model(ETS(value))

We may be interested in using the same model with the same coefficients to estimate the fdeaths series:

refit(ets_fit, as_tsibble(fdeaths))
## # A mable: 1 x 1
##   `ETS(value)`
##   <model>     
## 1 <ETS(M,A,A)>
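The distinction between refitting and re-estimating can be sketched in base R, using simple exponential smoothing from stats::HoltWinters as a simpler stand-in for the ETS model. This is an illustration of the concept only, not fable's refit method; the object names are illustrative.

```r
# Estimate the smoothing parameter on the original series
fit_m <- HoltWinters(mdeaths, beta = FALSE, gamma = FALSE)

# "Refit": same structure and same alpha, applied to a new series
refit_f <- HoltWinters(fdeaths, alpha = as.numeric(fit_m$alpha),
                       beta = FALSE, gamma = FALSE)

# "Re-estimate": same structure, but alpha is estimated on the new series
reest_f <- HoltWinters(fdeaths, beta = FALSE, gamma = FALSE)

c(original = as.numeric(fit_m$alpha),
  refit = as.numeric(refit_f$alpha),
  reestimated = as.numeric(reest_f$alpha))
```

In both cases the fitted values describe the new series; only the re-estimated fit is free to change the parameter.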

5.2.2 stream()

Streaming data into a model allows a model to be extended to accommodate new, future data. Like refit, stream should allow re-estimation of the model parameters. As this can be a costly operation for some models, in most cases the parameters should not be updated when new data is streamed. However, it is recommended that the model parameters are re-estimated on a regular basis.

Suppose we are estimating electricity demand data (tsibbledata::aus_elec), and after fitting a model to the existing data, a new set of data from the next month becomes available.


A (minimal) model for the electricity demand above can be estimated using fasster.

fit <- elec_tr %>% 
  model(fasster = fasster(Demand ~ Holiday %S% (poly(1) + trig(10))))

To extend these fitted values to include December’s electricity data, we can use the stream functionality:

fit <- fit %>% 
  stream(elec_stream)
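The core idea of streaming (extending a model's state with new observations while keeping its parameters fixed) can be sketched in base R for simple exponential smoothing. The stream_ses helper and the state list are hypothetical, written only for this illustration; they are not part of fable or fasster.

```r
# Hypothetical helper: update an exponential smoothing state with new data,
# keeping the smoothing parameter alpha fixed (no re-estimation).
stream_ses <- function(state, new_obs) {
  level <- state$level
  fitted <- numeric(length(new_obs))
  for (i in seq_along(new_obs)) {
    fitted[i] <- level  # one-step-ahead forecast made before seeing new_obs[i]
    level <- state$alpha * new_obs[i] + (1 - state$alpha) * level
  }
  list(alpha = state$alpha, level = level, fitted = c(state$fitted, fitted))
}

state <- list(alpha = 0.5, level = 10, fitted = numeric(0))
state <- stream_ses(state, c(12, 14))  # first batch of new data
state <- stream_ses(state, 13)         # a later batch
state$level
# 12.75
```

Because only the state is updated, streaming the data in several batches gives the same result as processing it all at once.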

5.3 Simulation

Much like the tidymodels opinion toward predict, generate should not default to an archived version of the training set. This allows models to be used for simulating new data sets, which is especially relevant for time series as often future paths beyond the training set are simulated.

The generate method for a fable model should accept these arguments (names chosen for consistency with tidymodels):

  • object: The model itself
  • new_data: The data used for simulation
  • times: The number of simulated series (handled by fablelite)
  • seed: Random generator initialisation (handled by fablelite)

The new_data dataset extends existing stats::simulate functionality by allowing the simulation to accept a new time index for simulating beyond the sample (.idx), and allows the simulation to work with a new set of exogenous regressors (say x1 and x2).

It is expected that the innovations (.innov) for the simulation are randomly generated for each repetition (.rep), which can be achieved using the times argument. However, users should also be able to provide a set of pre-generated innovations (.innov) for each repetition (.rep). If these columns are provided in new_data, then this data will be passed directly to the simulation method (without generating new numbers over times replications).

## # A tsibble: 9 x 5 [1M]
## # Key:       .rep [3]
##    .rep     .idx .innov      x1     x2
##   <int>    <mth>  <dbl>   <dbl>  <dbl>
## 1     1 2017 Jan  0.398  2.93   -1.08 
## 2     1 2017 Feb -1.40   1.42    0.347
## 3     1 2017 Mar -0.513  1.27   -1.02 
## 4     2 2017 Jan  0.584  6.88   -2.96 
## 5     2 2017 Feb -0.655  0.627  -1.65 
## 6     2 2017 Mar  0.870  4.56   -1.23 
## 7     3 2017 Jan -0.889 -0.0771 -2.42 
## 8     3 2017 Feb -1.56   1.78   -2.48 
## 9     3 2017 Mar  0.508  0.556  -3.23
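The behaviour described above can be sketched in base R for a linear model. The simulate_lm helper is hypothetical and only mirrors the argument names listed earlier; it is not the generate method from fable, and a plain data frame stands in for a tsibble.

```r
# Hypothetical helper: simulate responses from a fitted lm on new_data.
# If new_data already has an .innov column, it is used as supplied;
# otherwise new_data is repeated `times` times and innovations are drawn.
simulate_lm <- function(object, new_data, times = 1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  if (!(".innov" %in% names(new_data))) {
    n <- nrow(new_data)
    new_data <- new_data[rep(seq_len(n), times), , drop = FALSE]
    new_data$.rep <- rep(seq_len(times), each = n)
    new_data$.innov <- rnorm(nrow(new_data), sd = sigma(object))
  }
  new_data$.sim <- as.numeric(predict(object, newdata = new_data)) + new_data$.innov
  new_data
}

lungs <- data.frame(mdeaths = as.numeric(mdeaths), fdeaths = as.numeric(fdeaths))
fit <- lm(mdeaths ~ fdeaths, data = lungs[1:60, ])
future <- lungs[61:72, "fdeaths", drop = FALSE]

# Randomly generated innovations over 5 replications
sims <- simulate_lm(fit, future, times = 5, seed = 2019)

# Pre-generated innovations (all zero here) are passed through unchanged
fixed <- future
fixed$.rep <- 1L
fixed$.innov <- 0
no_noise <- simulate_lm(fit, fixed)
```

With zero innovations the simulated path equals the model's predictions, which makes the role of .innov easy to check.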

For the end user, creating simulations would work like this:

library(fable)
library(tsibbledata)
UKLungDeaths %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(UKLungDeaths, times = 5)
## # A tsibble: 360 x 4 [1M]
## # Key:       .rep, .model [5]
##    .model  .rep    index  .sim
##    <chr>  <int>    <mth> <dbl>
##  1 lm         1 1974 Jan 2260.
##  2 lm         1 1974 Feb 1655.
##  3 lm         1 1974 Mar 2091.
##  4 lm         1 1974 Apr 1762.
##  5 lm         1 1974 May 1360.
##  6 lm         1 1974 Jun  950.
##  7 lm         1 1974 Jul 1153.
##  8 lm         1 1974 Aug 1031.
##  9 lm         1 1974 Sep 1133.
## 10 lm         1 1974 Oct 1506.
## # … with 350 more rows

Or, to generate data beyond the sample:

library(lubridate)
UKLungDeaths %>% 
  filter(year(index) <= 1978) %>% 
  model(lm = TSLM(mdeaths ~ fourier("year", K = 4) + fdeaths)) %>% 
  generate(
    UKLungDeaths %>% filter(year(index) > 1978),
    times = 5
  )
## # A tsibble: 60 x 4 [1M]
## # Key:       .rep, .model [5]
##    .model  .rep    index  .sim
##    <chr>  <int>    <mth> <dbl>
##  1 lm         1 1979 Jan 2047.
##  2 lm         1 1979 Feb 1875.
##  3 lm         1 1979 Mar 1723.
##  4 lm         1 1979 Apr 1685.
##  5 lm         1 1979 May 1374.
##  6 lm         1 1979 Jun 1105.
##  7 lm         1 1979 Jul 1260.
##  8 lm         1 1979 Aug 1115.
##  9 lm         1 1979 Sep 1020.
## 10 lm         1 1979 Oct 1173.
## # … with 50 more rows

5.4 Visualisation

Different plots are appropriate for visualising each type of model. For example, a plot of an ARIMA model may show the AR and/or MA roots from the model on a unit circle. A linear model has several common plots, including plots showing “Residuals vs Fitted” values, normality via a Q-Q plot, and measures of leverage. These model plots are further extended by the visreg package to show the effects of terms on the model’s response. Some models currently have no model-specific plots, such as ETS, which defaults to showing a components plot using the estimated states.
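As an example of a model-specific plot, the inverse AR roots of an ARIMA model can be computed and drawn against the unit circle in base R. This is a minimal sketch using stats::arima on the built-in lh series; the choice of an AR(2) model is an assumption made purely for illustration.

```r
# Fit an AR(2) model to the built-in lh series
fit <- arima(lh, order = c(2, 0, 0))

# Roots of the AR polynomial 1 - phi1 * z - phi2 * z^2, then their inverses
phi <- coef(fit)[c("ar1", "ar2")]
ar_roots <- polyroot(c(1, -phi))
inv_roots <- 1 / ar_roots

# Draw the unit circle and overlay the inverse roots
theta <- seq(0, 2 * pi, length.out = 200)
plot(cos(theta), sin(theta), type = "l", asp = 1,
     xlab = "Real", ylab = "Imaginary", main = "Inverse AR roots")
points(Re(inv_roots), Im(inv_roots), pch = 19)
```

For a stationary fit, all inverse roots lie inside the circle.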

Visualising these models poses a substantial challenge for consistency across models, and is made more difficult as batch modelling becomes commonplace.