Local vs Global Forecasting: What You Need to Know | by Davide Burba


A comparison of Local and Global approaches to time series forecasting, with a Python demonstration using LightGBM and the Australian Tourism dataset.


What is Local forecasting?

Local forecasting is the traditional approach where we train one predictive model for each time series independently. The classical statistical models (like exponential smoothing, ARIMA, TBATS, etc.) typically use this approach, but it can also be used by standard machine learning models via a feature engineering step.

Local forecasting has advantages:

  • It’s intuitive to understand and implement.
  • Each model can be tweaked separately.

But it also has some limitations:

  • It suffers from the “cold-start” problem: it requires a relatively large amount of historical data for each time series to estimate the model parameters reliably. It also makes it impossible to predict new targets, like the demand for a new product.
  • It can’t capture the commonalities and dependencies among related time series, like cross-sectional or hierarchical relationships.
  • It’s hard to scale to large datasets with many time series, as it requires fitting and maintaining a separate model for each target.
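
To make the one-model-per-series idea concrete, here is a minimal sketch that fits an independent simple exponential smoothing model to each series in a dictionary. The helper names (`fit_ses_level`, `fit_local_models`), the smoothing factor and the toy data are assumptions made for illustration; they are not part of the example later in this article.

```python
from typing import Dict, List


def fit_ses_level(series: List[float], alpha: float = 0.5) -> float:
    """Return the final smoothed level of simple exponential smoothing."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def fit_local_models(data: Dict[str, List[float]], alpha: float = 0.5) -> Dict[str, float]:
    """Fit one independent model (here: one smoothed level) per series."""
    return {name: fit_ses_level(values, alpha) for name, values in data.items()}


# Toy data: two unrelated series on very different scales.
toy_series = {
    "series_a": [10.0, 12.0, 11.0, 13.0],
    "series_b": [1000.0, 1000.0, 1000.0, 1000.0],
}
local_levels = fit_local_models(toy_series)
# The one-step-ahead forecast of each series is its own smoothed level.
print(local_levels)
```

Each series only ever sees its own history, which is why nothing learned on `series_a` can help `series_b`, and why a brand-new series has nothing to fit on.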

What is Global forecasting?

Global forecasting is a more modern approach, where multiple time series are used to train a single “global” predictive model. This gives the model a larger training set and lets it leverage structure shared across the targets to learn complex relationships, which can lead to better predictions.

Building a global forecasting model typically involves a feature engineering step to build features like:

  • Lagged values of the target
  • Statistics of the target over time-windows (e.g. “mean in the past week”, “minimum in the past month”, etc.)
  • Categorical features to distinguish groups of time series
  • Exogenous features to model external/interaction/seasonal factors
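
The feature types listed above can be sketched with pandas on a toy long-format table. The column names (`series_id`, `y`, `lag_1`, `mean_prev_2`) and the values are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Toy long-format data: two hypothetical series, "a" and "b".
toy = pd.DataFrame({
    "series_id": ["a"] * 4 + ["b"] * 4,
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Lagged value of the target, computed within each series.
lag_1 = toy.groupby("series_id")["y"].shift(1)
# Rolling mean of the two previous observations (shifted so the current value is excluded).
mean_prev_2 = toy.groupby("series_id")["y"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

toy["lag_1"] = lag_1
toy["mean_prev_2"] = mean_prev_2
# Categorical feature that lets a single model tell the series apart.
toy["series_id"] = toy["series_id"].astype("category")
print(toy)
```

Grouping by series before shifting keeps the lags of one series from leaking into the first rows of the next one.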

Global forecasting has considerable advantages:

  • It leverages the information from other time series to improve accuracy and robustness.
  • It can make predictions for time series with little to no data.
  • It scales to datasets with many time series because it requires fitting and maintaining only a single model.
  • By using feature engineering, it can handle problems such as multiple data frequencies and missing data, which are more difficult to solve with classical statistical models.

But global forecasting also has some limitations:

  • It requires extra effort to use more complex models and perform feature engineering.
  • It might need full re-training when new time-series appear.
  • If performance for one specific time-series starts to degrade, it’s hard to update it without impacting the predictions on the other targets.
  • It may require more computational resources and sophisticated methods to estimate and optimize the model parameters.

How to choose between Local and Global forecasting?

There is no definitive answer to whether local or global forecasting is better for a given problem.

In general, local forecasting may be more suitable for problems with:

  • Few time series with long histories
  • High variability and specificity among the time series
  • Limited forecasting and programming expertise

On the other hand, global forecasting may be more suitable for problems with:

  • Many time series with short histories
  • Low variability and high similarity among the targets
  • Noisy data

In this section we showcase the differences between the two approaches with a practical example in Python using LightGBM and the Australian Tourism dataset, which is available on Darts under the Apache 2.0 License.

Let’s start by importing the necessary libraries.

import pandas as pd
import plotly.graph_objects as go
from lightgbm import LGBMRegressor
from sklearn.preprocessing import MinMaxScaler

Data Preparation

The Australian Tourism dataset is made of quarterly time series starting in 1998. In this notebook we consider the tourism numbers at the region level.

# Load data.
data = pd.read_csv('https://raw.githubusercontent.com/unit8co/darts/master/datasets/australian_tourism.csv')
# Add time information: quarterly data starting in 1998.
data.index = pd.date_range("1998-01-01", periods = len(data), freq = "3MS")
data.index.name = "time"
# Consider only region-level data.
data = data[['NSW','VIC', 'QLD', 'SA', 'WA', 'TAS', 'NT']]
# Let's give it nicer names.
data = data.rename(columns = {
    'NSW': "New South Wales",
    'VIC': "Victoria",
    'QLD': "Queensland",
    'SA': "South Australia",
    'WA': "Western Australia",
    'TAS': "Tasmania",
    'NT': "Northern Territory",
})

Let’s have a quick look at the data:

# Let's visualize the data.
def show_data(data,title=""):
    trace = [go.Scatter(x=data.index,y=data[c],name=c) for c in data.columns]
    go.Figure(trace,layout=dict(title=title)).show()

show_data(data,"Australian Tourism data by Region")

Which produces the following plot:

Image by author

We can see that:

  • Data exhibits a strong yearly seasonality.
  • The scale of the time-series is quite different across different regions.
  • The length of the time-series is always the same.
  • There’s no missing data.
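
The last three observations can also be checked programmatically rather than read off the plot. The sketch below uses a hypothetical helper, `summarize_series`, on a made-up wide-format frame standing in for the tourism data.

```python
import pandas as pd


def summarize_series(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: length, missing values and scale of each series."""
    return pd.DataFrame({
        "n_obs": df.count(),
        "n_missing": df.isnull().sum(),
        "mean": df.mean(),
    })


# Toy stand-in for the wide-format tourism data (values are made up).
toy_wide = pd.DataFrame(
    {"region_a": [100.0, 120.0, 90.0, 110.0], "region_b": [10.0, 12.0, 9.0, 11.0]},
    index=pd.date_range("1998-01-01", periods=4, freq="QS"),
)
summary = summarize_series(toy_wide)
print(summary)
```

Identical `n_obs` values and zero `n_missing` confirm equal lengths and complete data, while the `mean` column shows how far apart the scales are.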

Data engineering

Let’s predict the value of the next quarter based on:

  • The lagged values of the previous 2 years
  • The current quarter (as a categorical feature)

def build_targets_features(data,lags=range(8),horizon=1):
    features = {}
    targets = {}
    for c in data.columns:

        # Build lagged features.
        feat = pd.concat([data[[c]].shift(lag).rename(columns = {c: f"lag_{lag}"}) for lag in lags],axis=1)
        # Build quarter feature.
        feat["quarter"] = [f"Q{int((m-1) / 3 + 1)}" for m in data.index.month]
        feat["quarter"] = feat["quarter"].astype("category")
        # Build target at horizon.
        targ = data[c].shift(-horizon).rename(f"horizon_{horizon}")

        # Drop missing values generated by lags/horizon.
        idx = ~(feat.isnull().any(axis=1) | targ.isnull())
        features[c] = feat.loc[idx]
        targets[c] = targ.loc[idx]

    return targets,features

# Build targets and features.
targets,features = build_targets_features(data)
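
To see how the shifts line up, here is a small sketch on a toy series with recognisable values. It mirrors the lag and horizon logic of the function above with only two lags; the series and its dates are made up.

```python
import pandas as pd

# Toy quarterly series with easily recognisable values.
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2000-01-01", periods=5, freq="QS"),
    name="y",
)

aligned = pd.DataFrame({
    "lag_0": toy.shift(0),       # value observed in the current quarter
    "lag_1": toy.shift(1),       # value observed one quarter earlier
    "horizon_1": toy.shift(-1),  # value to predict: the next quarter
})
# Rows with missing lags or a missing target cannot be used for training.
aligned = aligned.dropna()
print(aligned)
```

Note that `lag_0` is the current observation, so with `lags=range(8)` the features cover the current quarter and the seven before it, which is two years of quarterly data.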

Train/Test split

For simplicity, in this example we backtest our model with a single train/test split (you can check this article for more information about backtesting). Let’s use the last 2 years as the test set, and the period before as the training set.

def train_test_split(targets,features,test_size=8):
    targ_train = {k: v.iloc[:-test_size] for k,v in targets.items()}
    feat_train = {k: v.iloc[:-test_size] for k,v in features.items()}
    targ_test = {k: v.iloc[-test_size:] for k,v in targets.items()}
    feat_test = {k: v.iloc[-test_size:] for k,v in features.items()}
    return targ_train,feat_train,targ_test,feat_test

targ_train,feat_train,targ_test,feat_test = train_test_split(targets,features)
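
A single split is the simplest form of backtesting. As a rough sketch of what a multi-split version could look like, the hypothetical generator below yields expanding-window train/test index ranges; it is not used in the rest of this example, and its name and parameters are assumptions.

```python
from typing import Iterator, Tuple


def expanding_window_splits(
    n_obs: int, test_size: int, n_splits: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs where each test block follows its train block."""
    for i in range(n_splits, 0, -1):
        test_end = n_obs - (i - 1) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("Not enough observations for the requested splits.")
        yield range(0, test_start), range(test_start, test_end)


splits = list(expanding_window_splits(n_obs=20, test_size=4, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```

With `n_splits=1` this reduces to the single split used here: everything before the last `test_size` rows is training data.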

Model training

Now we estimate the forecasting models using the two different approaches. In both cases we use a LightGBM model with default parameters.

Local approach

As said before, with the local approach we estimate multiple models: one for each target.

# Instantiate one LightGBM model with default parameters for each target.
local_models = {k: LGBMRegressor() for k in data.columns}
# Fit the models on the training set.
for k in data.columns:
    local_models[k].fit(feat_train[k],targ_train[k])

Global Approach

On the other hand, with the Global Approach we estimate one model for all the targets. To do this we need to perform two extra steps:

  1. First, since the targets have different scales, we perform a normalization step.
  2. Then to allow the model to distinguish across different targets, we add a categorical feature for each target.

These steps are described in the next two sections.

Step 1: Normalization
We scale all the data (targets and features) between 0 and 1, separately for each target. This is important because it makes the data comparable, which in turn makes model training easier. The scaling parameters are estimated on the training set.

def fit_scalers(feat_train,targ_train):
    feat_scalers = {k: MinMaxScaler().set_output(transform="pandas") for k in feat_train}
    targ_scalers = {k: MinMaxScaler().set_output(transform="pandas") for k in feat_train}
    for k in feat_train:
        feat_scalers[k].fit(feat_train[k].drop(columns="quarter"))
        targ_scalers[k].fit(targ_train[k].to_frame())
    return feat_scalers,targ_scalers

def scale_features(feat,feat_scalers):
    scaled_feat = {}
    for k in feat:
        df = feat[k].copy()
        cols = [c for c in df.columns if c not in {"quarter"}]
        df[cols] = feat_scalers[k].transform(df[cols])
        scaled_feat[k] = df
    return scaled_feat

def scale_targets(targ,targ_scalers):
    return {k: targ_scalers[k].transform(v.to_frame()) for k,v in targ.items()}

# Fit scalers on numerical features and target on the training period.
feat_scalers,targ_scalers = fit_scalers(feat_train,targ_train)
# Scale train data.
scaled_feat_train = scale_features(feat_train,feat_scalers)
scaled_targ_train = scale_targets(targ_train,targ_scalers)
# Scale test data.
scaled_feat_test = scale_features(feat_test,feat_scalers)
scaled_targ_test = scale_targets(targ_test,targ_scalers)
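
The arithmetic behind min-max scaling is short enough to write out by hand. The sketch below uses plain NumPy and made-up numbers to show the point of fitting on the training period only: test values outside the training range fall outside [0, 1]. The helper names are hypothetical and this is not how `MinMaxScaler` is implemented internally.

```python
import numpy as np


def fit_min_max(train: np.ndarray) -> tuple:
    """Estimate the scaling parameters on the training period only."""
    return float(train.min()), float(train.max())


def scale(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map values so that `lo` becomes 0 and `hi` becomes 1."""
    return (x - lo) / (hi - lo)


def unscale(x_scaled: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Invert `scale` to recover values in the original units."""
    return x_scaled * (hi - lo) + lo


train = np.array([10.0, 20.0, 30.0])
test = np.array([25.0, 40.0])  # 40 is above the training maximum

lo, hi = fit_min_max(train)
scaled_test = scale(test, lo, hi)
print(scaled_test)                   # the second value is greater than 1
print(unscale(scaled_test, lo, hi))  # recovers the original test values
```

Estimating the minimum and maximum on the test period as well would leak future information into the features.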

Step 2: Add “target name” as a categorical feature
To allow the model to distinguish between targets, we add the target name as a categorical feature. This step is not mandatory, and in some cases it can lead to overfitting, especially when the number of time series is high. An alternative is to encode features that are target-specific but more generic, like “region_area_in_squared_km”, “is_the_region_on_the_coast”, etc.

# Add a `target_name` feature.
def add_target_name_feature(feat):
    for k,df in feat.items():
        df["target_name"] = k

add_target_name_feature(scaled_feat_train)
add_target_name_feature(scaled_feat_test)

For simplicity we make target_name categorical after concatenating the data together. We use the “category” dtype because LightGBM automatically detects it as a categorical feature.

# Concatenate the data.
global_feat_train = pd.concat(scaled_feat_train.values())
global_targ_train = pd.concat(scaled_targ_train.values())
global_feat_test = pd.concat(scaled_feat_test.values())
global_targ_test = pd.concat(scaled_targ_test.values())
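# Make `target_name` categorical after concatenation.
global_feat_train.target_name = global_feat_train.target_name.astype("category")
global_feat_test.target_name = global_feat_test.target_name.astype("category")

# Instantiate a single LightGBM model with default parameters for all the targets.
global_model = LGBMRegressor()
# Fit it on the concatenated, scaled training set (the target is a one-column frame).
global_model.fit(global_feat_train, global_targ_train.iloc[:, 0])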
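
One practical reason to cast `target_name` after concatenation, rather than before, is how pandas combines categorical columns. The toy sketch below (hypothetical frames and values) shows that concatenating categoricals with different category sets does not keep the categorical dtype, while casting afterwards gives one categorical covering all targets.

```python
import pandas as pd

# Two toy feature frames, each tagged with its own (hypothetical) target name.
df_a = pd.DataFrame({"lag_0": [0.1, 0.2], "target_name": "a"})
df_b = pd.DataFrame({"lag_0": [0.3, 0.4], "target_name": "b"})

# Casting before concatenation gives each frame its own category set,
# and pandas does not keep the categorical dtype when the sets differ.
early = pd.concat(
    [
        df_a.astype({"target_name": "category"}),
        df_b.astype({"target_name": "category"}),
    ],
    ignore_index=True,
)
print(early["target_name"].dtype)

# Casting after concatenation yields one categorical covering all targets.
late = pd.concat([df_a, df_b], ignore_index=True)
late["target_name"] = late["target_name"].astype("category")
print(late["target_name"].dtype)
```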

Predictions on the test set

To analyze the performance of the two approaches, we make predictions on the test set.

First with the local approach:

# Make predictions with the local models.
pred_local = {
    k: model.predict(feat_test[k]) for k, model in local_models.items()
}

Then with the global approach (note that we apply the inverse normalization):

def predict_global_model(global_model, global_feat_test, targ_scalers):
    # Predict.
    pred_global_scaled = global_model.predict(global_feat_test)
    # Re-arrange the predictions
    pred_df_global = global_feat_test[["target_name"]].copy()
    pred_df_global["predictions"] = pred_global_scaled
    pred_df_global = pred_df_global.pivot(
        columns="target_name", values="predictions"
    )
    # Un-scale the predictions
    return {
        k: targ_scalers[k]
        .inverse_transform(
            pred_df_global[[k]].rename(
                columns={k: global_targ_train.columns[0]}
            )
        )
        .reshape(-1)
        for k in pred_df_global.columns
    }

# Make predictions with the global model.
pred_global = predict_global_model(global_model, global_feat_test, targ_scalers)
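
The re-arranging step inside `predict_global_model` relies on `DataFrame.pivot` to go from one long column of predictions back to one column per target. A toy sketch with two hypothetical targets and made-up values:

```python
import pandas as pd

# Toy long-format predictions: two hypothetical targets over two quarters.
idx = pd.date_range("2015-01-01", periods=2, freq="QS", name="time")
long_df = pd.DataFrame(
    {
        "target_name": ["a", "a", "b", "b"],
        "predictions": [0.1, 0.2, 0.8, 0.9],
    },
    index=idx.append(idx),
)

# One column per target, one row per timestamp.
wide_df = long_df.pivot(columns="target_name", values="predictions")
print(wide_df)
```

Once the predictions are in this wide layout, each column can be passed to the scaler fitted for that target to return to the original units.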

Error analysis

To evaluate the performance of the two approaches, we perform an error analysis.

First, let’s compute the Mean Absolute Error (MAE) overall and by region:

# Save predictions from both approaches in a convenient format.
output = {}
for k in targ_test:
    df = targ_test[k].rename("target").to_frame()
    df["prediction_local"] = pred_local[k]
    df["prediction_global"] = pred_global[k]
    output[k] = df

def print_stats(output):
    output_all = pd.concat(output.values())
    mae_local = (output_all.target - output_all.prediction_local).abs().mean()
    mae_global = (output_all.target - output_all.prediction_global).abs().mean()
    print(f"{'':25}  {'LOCAL':>8} {'GLOBAL':>8}")
    print(f"{'MAE overall':25}: {mae_local:8.1f} {mae_global:8.1f}\n")
    for k,df in output.items():
        mae_local = (df.target - df.prediction_local).abs().mean()
        mae_global = (df.target - df.prediction_global).abs().mean()
        print(f"MAE - {k:19}: {mae_local:8.1f} {mae_global:8.1f}")

# Let's show some statistics.
print_stats(output)

which gives:

Mean Absolute Error on the Test Set — Image by author

We can see that the global approach leads to a lower error overall, as well as for every region except for Western Australia.
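
For reference, the metric itself is simple to compute by hand. The sketch below defines the MAE and a relative-reduction helper in plain Python; the function names are hypothetical and the numbers are made up for illustration, not the results of this article.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which the `candidate` error is lower than the `baseline` error."""
    return (baseline - candidate) / baseline


# Made-up numbers for illustration only (not the article's results).
actual = [100.0, 200.0, 300.0, 400.0]
pred_one = [110.0, 180.0, 330.0, 360.0]
pred_two = [105.0, 190.0, 315.0, 380.0]

mae_one = mean_absolute_error(actual, pred_one)
mae_two = mean_absolute_error(actual, pred_two)
print(mae_one, mae_two, relative_reduction(mae_one, mae_two))
```

Because the MAE is expressed in the units of the target, the overall figure is dominated by the regions with the largest tourism numbers, which is one reason to also look at the error by region.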

Let’s have a look at some predictions:

# Display the predictions.
for k,df in output.items():
    show_data(df,k)

Here are some of the outputs:

Image by author
Image by author
Image by author

We can see that the local models predict a constant, while the global model captured the seasonal behaviour of the targets.

Conclusion

In this example we showcased the local and global approaches to time-series forecasting, using:

  • Quarterly Australian tourism data
  • Simple feature engineering
  • LightGBM models with default hyper-parameters

We saw that the global approach produced better predictions, leading to a 43% lower mean absolute error than the local one. In particular, the global approach had a lower MAE on all the targets except for Western Australia.

The superiority of the global approach in this setting was somewhat expected, since:

  • We are predicting multiple correlated time-series.
  • The historical data is very shallow.
  • With the local approach, we are using a somewhat complex model on shallow univariate time-series. A classical statistical model might be more appropriate in this setting.

The code used in this article is available here.


