Multiple Time Series Forecast & Demand Pattern Classification using R — Part 3

Gouthaman Tharmathasan
7 min read · Feb 16, 2022

Seasonal features, calendar event features, lag features, moving average features, pricing features


This is a continuation of my previous blog, which looked at how to perform traditional statistical forecasts for multiple time series and the drawbacks of fitting traditional time series models. This series has the following 5 parts:

Part 1: Data Cleaning & Demand categorization.

Part 2: Fitting statistical Time Series models (ARIMA, ETS, CROSTON etc.) using fpp3 (tidy forecasting) R Package.

Part 3: Time Series Feature Engineering using timetk R Package.

Part 4: Fitting Machine Learning models (XGBoost, Random Forest, etc.) & Hyperparameter tuning using modeltime & tidymodels R packages.

Part 5: Fitting Deep Learning models (N-BEATS & DeepAR) & Hyperparameter tuning using the modeltime & modeltime.gluonts R packages.

In this blog, I will explain the common features used in time series forecasting & how to create them in R. Having appropriate, well-chosen features is crucial for getting the best accuracy from Machine Learning (ML) models.

Let’s get started!

01. Seasonal Features

Creating these features in R is simple: we just decompose the date variable into several time-related features such as year, month, week number, ISO week number, quarter, etc.

Why do we need to decompose the date?

Some ML models cannot use the date feature as it is, yet we still need to capture seasonality and periodicity. So, instead of the raw date, we use these transformed features (year, month, etc.). Now we will see how to implement this in R.

master_data_tbl %>%
  # Transform the date variable into multiple time features
  timetk::tk_augment_timeseries_signature(.date_var = week_date) %>%

  # Remove unwanted features
  dplyr::select(
    -dplyr::matches("(.iso$)|(.xts$)|(day)|(hour)|(minute)|(second)|(am.pm)|(diff)")
  )

Here we have used the function tk_augment_timeseries_signature from the timetk R package, which decomposes the date variable into various time-related features. You can find explanations of these features at the following link. tk_augment_timeseries_signature creates 25+ features, and from these we select only the relevant ones. For example, since our data is weekly, we need week-, month- & year-related features, so we remove the hour-, minute-, second- & day-related features using the dplyr::matches helper.
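As a quick base R illustration of the idea (independent of timetk), a single date can be decomposed by hand into the same kind of numeric seasonal features:

```r
d <- as.Date("2022-02-16")

# Decompose one date into numeric seasonal features
features <- data.frame(
  year     = as.integer(format(d, "%Y")),                 # 2022
  month    = as.integer(format(d, "%m")),                 # 2
  iso_week = as.integer(format(d, "%V")),                 # 7 (ISO week number)
  quarter  = (as.integer(format(d, "%m")) - 1) %/% 3 + 1  # 1
)
features
```

These hand-rolled columns are exactly what the signature function automates, at scale and with many more variants.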

Seasonality features

02. Calendar Events

Calendar events are special occasions, e.g. bank holidays, promotion dates, Amazon Prime Day, New Year, etc. Creating these features differs from one domain or industry to another, so we will also need some domain knowledge.

For example, if an eCommerce business sells in China and wants to forecast future sales, it should consider calendar events such as Chinese New Year, National Day and the Mid-Autumn Festival.

Likewise, if the airline industry wants to forecast the number of passengers, it should consider summer holidays and bank holidays as calendar events.

Why do we need to add calendar events?

These calendar events have a strong seasonal impact on the data.

In our data, we do not know which holidays or calendar events to include. Nevertheless, here is how we can build holiday features in R.

timetk::tk_make_timeseries(
  start_date = as.Date("2021-12-24"),
  end_date   = as.Date("2022-01-02"),
  by         = "day"
) %>%
  dplyr::as_tibble(.name_repair = ~"date") %>%
  timetk::tk_augment_holiday_signature(
    .date_var        = date,
    .holiday_pattern = "world_christmas|world_new",
    .locale_set      = "none",
    .exchange_set    = "none"
  )

First, we create daily time series data using the function tk_make_timeseries from the timetk package. Then, we create our calendar events using tk_augment_holiday_signature from the same package. This is a great function that creates the following sets of holiday features: individual holidays, locale-based summary sets and stock exchange calendar summary sets.

For example, here I have chosen the individual holidays for Christmas and New Year by specifying the holiday pattern world_christmas|world_new . This gives us the following features. You can find an explanation of the function tk_augment_holiday_signature at the following link.
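Since our demand data is weekly while holidays fall on specific days, the daily 0/1 flags have to be rolled up to the week before they can be joined on. Below is a sketch of one way to do this; the toy master_data_tbl, the Monday-based week_date column and the date range are assumptions for illustration, not from the original post:

```r
library(dplyr)

# Toy weekly demand table standing in for master_data_tbl (hypothetical values)
master_data_tbl <- tibble::tibble(
  week_date  = seq(as.Date("2021-12-20"), as.Date("2022-01-10"), by = "7 days"),
  num_orders = c(120, 95, 143, 110)
)

holiday_weekly_tbl <- timetk::tk_make_timeseries(
  start_date = as.Date("2021-01-01"),
  end_date   = as.Date("2022-12-31"),
  by         = "day"
) %>%
  as_tibble(.name_repair = ~"date") %>%
  timetk::tk_augment_holiday_signature(
    .date_var        = date,
    .holiday_pattern = "world_christmas|world_new",
    .locale_set      = "none",
    .exchange_set    = "none"
  ) %>%
  # Roll the daily 0/1 flags up to weekly level: a week is flagged
  # if any of its days is a holiday
  mutate(week_date = lubridate::floor_date(date, unit = "week", week_start = 1)) %>%
  group_by(week_date) %>%
  summarise(across(-date, max), .groups = "drop")

master_data_tbl <- master_data_tbl %>%
  left_join(holiday_weekly_tbl, by = "week_date")
```

Taking the max within each week is one simple aggregation choice; counting holiday days per week would be another.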

Calendar Events

03. Lag Features (Trends)

Lag features are among the most important features and play a pivotal role in a forecast model.

What are lag features & why are they important?

A lag is the value of the series shifted back by a certain number of periods. Lag features capture the autocorrelation in the data as well as more global trends.

master_data_tbl %>% 
dplyr::group_by(center_id, meal_id) %>%
timetk::tk_augment_lags(num_orders, .lags = 1:3)

The above code shows how to create lag features. We group by centre & meal because the lags must be created within each individual series. We then create lags 1 to 3 using the function tk_augment_lags . Generally, for weekly data, it is worth considering lags 1, 4 & 12. This is because:

lag 1 — captures the trend in the previous week.

lag 4 — captures the trend in the previous month.

lag 12 — captures the trend in the previous quarter (roughly 12–13 weeks back).

When creating lag features, beware of data leakage: compute them using only information available at training time, i.e. create the lag features on the train data.
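The subtitle also mentions moving-average features. One leakage-safe way to build them (a sketch, not from the original post) is to slide a window over the *lagged* target with timetk::tk_augment_slidify, so each rolling mean only ever sees past values. The toy table, window sizes and resulting column name are assumptions:

```r
library(dplyr)

# Toy single-series data standing in for master_data_tbl (hypothetical values)
set.seed(42)
toy_tbl <- tibble::tibble(
  center_id  = 1L,
  meal_id    = 1L,
  week_date  = seq(as.Date("2022-01-03"), by = "7 days", length.out = 16),
  num_orders = rpois(16, lambda = 100)
)

feature_tbl <- toy_tbl %>%
  group_by(center_id, meal_id) %>%
  # Lag first so the rolling windows only ever see past values
  timetk::tk_augment_lags(num_orders, .lags = 1) %>%
  # Rolling means over ~monthly (4) and ~quarterly (13) windows
  timetk::tk_augment_slidify(
    num_orders_lag1,
    .period  = c(4, 13),
    .f       = mean,
    .align   = "right",
    .partial = TRUE
  ) %>%
  ungroup()
```

The `.align = "right"` choice anchors each window at the current row, which is what keeps the feature free of future information.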

04. Pricing Features

A meal’s price can vary from one centre to another, and even from one week to another within the same centre. These price changes can influence sales. Rather than using absolute prices, here we will use relative price differences. Based on this logic, we can create the following features:

Promotional impact

The relative price difference between the current price and historical average price.

full_tbl <- full_tbl %>% 
group_by(center_id, meal_id) %>%
mutate(cum_mean_base_price = cummean(base_price)) %>%
# Relative difference between the current price of an item and
# its historical average price
mutate(
promotion_impact = (base_price - cum_mean_base_price) / cum_mean_base_price
) %>%
select(-c(cum_mean_base_price, checkout_price)) %>%
ungroup()
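To see what promotion_impact measures, here is a toy series in base R (dplyr::cummean is equivalent to cumsum(x) / seq_along(x); the prices are hypothetical):

```r
base_price <- c(100, 100, 120, 90)

# Expanding (historical) mean price up to each week
cum_mean_base_price <- cumsum(base_price) / seq_along(base_price)
# 100.000 100.000 106.667 102.500

# Positive = priced above its own history; negative = effectively on promotion
promotion_impact <- (base_price - cum_mean_base_price) / cum_mean_base_price
round(promotion_impact, 3)
# 0.000  0.000  0.125 -0.122
```

Because the mean is expanding rather than global, each week’s feature only uses prices observed up to that week, which avoids leaking future prices into the past.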

The price difference between centres

The relative price difference of a meal between one centre and another, to capture whether a centre has an attractive price.

store_impact_tbl <- full_tbl %>%

  # Create a list of data frames, one per meal
  group_split(meal_id) %>%

  # Map over each meal's data frame
  map_df(~{

    # Calculate the average price of the meal at each centre
    data <- .x %>%
      group_by(center_id, meal_id) %>%
      summarise(base_price = mean(base_price, na.rm = TRUE),
                .groups = "drop")

    # Get the list of centre ids
    center <- unique(data$center_id)

    # Map over each centre id
    map_df(center, .f = ~{
      data %>%
        # Dummy variable that groups all other centres together
        mutate(
          center_cat = ifelse(center_id == .x, "center", "other")
        ) %>%
        group_by(meal_id, center_cat) %>%

        # Average price at this centre vs all other centres
        summarise(
          base_price = mean(base_price, na.rm = TRUE),
          .groups = "drop"
        ) %>%

        # Calculate the relative price difference
        pivot_wider(
          id_cols = meal_id,
          names_from = center_cat,
          values_from = base_price
        ) %>%
        mutate(
          store_impact = (center - other) / other,
          center_id = .x
        ) %>%
        filter(!is.na(store_impact)) %>%
        select(-c(center, other))
    })
  })

The above code shows how to calculate the relative price difference of a meal between centres. First, split the data by meal (done by group_split above). Then, for each meal, get the average price at the corresponding centre and at all other centres (done inside map_df). Finally, calculate the relative price difference.
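As a quick sanity check of the formula with hypothetical average prices:

```r
# Hypothetical average prices for one meal
center_price <- 110  # at the centre of interest
other_price  <- 100  # averaged over all other centres

store_impact <- (center_price - other_price) / other_price
store_impact
# 0.1 -> this centre is 10% more expensive than the others
```

A negative value would mean the centre undercuts the others, i.e. it has the attractive price.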

The price difference between meals

The relative price difference of a meal compared with other meals sold at the same centre and in the same meal category. This captures the cannibalization effect. For example, if someone likes to order Thai cuisine but has no specific meal preference within it, they may decide based on the lowest price.

cannibalization_tbl <- full_tbl %>%

  # Create a list of data frames, one per centre
  group_split(center_id) %>%

  # Map over each centre's data frame
  map_df(~{

    # Calculate the average price of each meal at the centre
    data <- .x %>%
      group_by(meal_id) %>%
      summarise(base_price = mean(base_price, na.rm = TRUE),
                .groups = "drop")

    # Get the list of meal ids
    meal <- unique(data$meal_id)

    # Map over each meal id
    map_df(meal, .f = ~{
      data %>%
        # Dummy variable that groups all other meals together
        mutate(
          meal_cat = ifelse(meal_id == .x, "meal", "other_meal")
        ) %>%
        group_by(meal_cat) %>%
        # Average price of this meal vs all other meals
        summarise(
          base_price = mean(base_price, na.rm = TRUE),
          meal_id = .x,
          .groups = "drop"
        ) %>%
        # Calculate the relative price difference
        pivot_wider(
          id_cols = meal_id,
          names_from = meal_cat,
          values_from = base_price
        ) %>%
        mutate(
          cannibalization = (meal - other_meal) / other_meal
        ) %>%
        select(-c(meal, other_meal))
    }) %>%
      mutate(center_id = unique(.x$center_id))
  })

In my next blog, I will explain how we can fit Machine Learning & Deep Learning models using tidymodels & modeltime R packages.

References

Dancho M, Vaughan D (2022). timetk: Calendar Features. https://business-science.github.io/timetk/articles/TK01_Working_With_Time_Series_Index.html [Accessed 16 February 2022].

Dancho M, Vaughan D (2022). timetk: A Tool Kit for Working with Time Series in R. https://github.com/business-science/timetk, https://business-science.github.io/timetk/ [Accessed 16 February 2022].
