Visualizing the Effect of Multicollinearity on Multiple Regression Model


We start by preparing multiple regression models for plotting. First, we will create two models from the same dataset: one with moderately correlated predictors and another with highly correlated predictors.

Then we will slightly modify the data to see which model is affected more by the small change. We begin by importing the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Getting data

As an example, this article works with the Cars (mpg) dataset, which has a public domain license. It can be loaded directly through the Seaborn library with the seaborn.load_dataset() function.

The dataset contains the fuel economy and features of 392 cars built between 1970 and 1982 in the USA, Europe, and Japan. More information about the dataset can be found here: link.

data = sns.load_dataset('mpg')
data.dropna(inplace=True) #remove null rows
data.head()

Doing exploratory data analysis to understand the dataset is always a good idea.

data.describe()

The ranges (max value – min value) of the columns are quite different. Standardizing the data will therefore make the coefficients easier to interpret later.

data = data.iloc[:,0:-2]    # keep the numeric columns; drop 'origin' and 'name'

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_s = scaler.fit_transform(data)
df = pd.DataFrame(df_s, columns=data.columns)
df.head()

Continue with plotting a heat map to show the correlation among the variables.

plt.figure(figsize=(9, 6))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
plt.show()

The result shows that some variables are highly correlated with others. If every predictor were put into a multiple regression model to predict the ‘mpg’ value, multicollinearity would affect the model.

Calculating the VIF values

This can be quickly confirmed by calculating the VIF (variance inflation factor) values with the statsmodels library.

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

vif_data = pd.DataFrame()
vif_data["feature"] = df.columns
# calculating VIF for each feature
vif_data["VIF"] = [vif(df.values, i) for i in range(len(df.columns))]

print(vif_data)

Some VIF values are quite high (> 5), which indicates that those predictors are highly correlated with others. If we put every predictor directly into a multiple regression model, the model can suffer from the multicollinearity problem.
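
For intuition, a VIF can also be computed by hand: regress one predictor on all the others and take 1 / (1 - R²). Below is a minimal sketch of this idea. Note that it uses sklearn’s LinearRegression, which fits an intercept, while the statsmodels function does not add one; because our columns are standardized to zero mean, the results should closely match the values above.

from sklearn.linear_model import LinearRegression

def vif_manual(X, i):
    # regress column i on the remaining columns and take 1 / (1 - R^2)
    target = X.iloc[:, i]
    others = X.drop(columns=X.columns[i])
    r2 = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r2)

# should closely match the statsmodels VIF values printed above
print([round(vif_manual(df, i), 2) for i in range(len(df.columns))])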

Next, only some predictors will be selected. This article will work with two models: one with moderately correlated predictors and another with highly correlated ones.

The first multiple regression model uses ‘cylinders’ and ‘acceleration’ to predict ‘mpg,’ while the second uses ‘cylinders’ and ‘displacement.’ Let’s calculate the VIF values again.

df1 = df[['mpg', 'cylinders', 'acceleration']]
df2 = df[['mpg', 'cylinders', 'displacement']]

# VIF dataframe1
vif_data1 = pd.DataFrame()
vif_data1['feature'] = df1.columns
vif_data1['VIF'] = [vif(df1.values, i) for i in range(len(df1.columns))]
print('model 1')
print(vif_data1)

# VIF dataframe2
vif_data2 = pd.DataFrame()
vif_data2['feature'] = df2.columns
vif_data2['VIF'] = [vif(df2.values, i) for i in range(len(df2.columns))]
print('model 2')
print(vif_data2)

VIF values of both models. Image by Author.

The only difference between the two models is the swap of the variable ‘acceleration’ for ‘displacement.’ However, none of the VIF values in the first model is higher than 5, while the second model has relatively high VIF values, above 10.
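
The inflated VIF values in the second model follow directly from the pairwise correlations. As a quick check (the exact values come from the heat map above; the figures in the comments are approximate):

# 'acceleration' is only moderately correlated with 'cylinders',
# while 'displacement' is almost collinear with it
print(df['cylinders'].corr(df['acceleration']))   # roughly -0.5
print(df['cylinders'].corr(df['displacement']))   # roughly 0.95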

Creating a multiple regression model

We will use the OLS function from the statsmodels library to create a multiple regression model. Note that no constant term is added to the design matrix here; since the variables were standardized, the intercept is effectively zero.

import statsmodels.api as sm
y1 = df1[['mpg']]
X1 = df1[['cylinders', 'acceleration']]

lm1 = sm.OLS(y1, X1)
model1 = lm1.fit()
model1.summary()

Summary table of the multiple regression model with moderately correlated variables. Image by Author.

Plotting the model

From the obtained model, we will define a helper function that builds a mesh grid over the two predictors; the fitted model will be evaluated on this grid in the next step.

def run_model(v1, v2, pd_):
    # build a mesh grid spanning the ranges of the two predictors
    mesh_size = 0.02
    x_min, x_max = pd_[v1].min(), pd_[v1].max()
    y_min, y_max = pd_[v2].min(), pd_[v2].max()
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    return xx, yy, xrange, yrange

Here comes the fun part. Let’s plot the model with the Plotly library, which makes it easy to create an interactive visualization that we can zoom and rotate.

import plotly.express as px
import plotly.graph_objects as go

# evaluate the fitted model over the mesh grid
xx1, yy1, xr1, yr1 = run_model('cylinders', 'acceleration', X1)
pred1 = model1.predict(np.c_[xx1.ravel(), yy1.ravel()])
pred1 = pred1.reshape(xx1.shape)

# plot the result
fig = px.scatter_3d(df1, x='cylinders', y='acceleration', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr1, y=yr1, z=pred1, name='pred1'))
fig.show()

Multiple regression model with moderately correlated variables. Image by Author.

We can perform the same process with the second model, which has the multicollinearity problem.

y2 = df2[['mpg']]
X2 = df2[['cylinders', 'displacement']]

lm2 = sm.OLS(y2, X2)
model2 = lm2.fit()
model2.summary()

Summary table of the multiple regression model with multicollinearity. Image by Author.
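
Before plotting, it is worth comparing the coefficient standard errors of the two fitted models directly, since inflated standard errors are the classic symptom of multicollinearity. A short sketch using the models created above:

# standard errors of the coefficients (the bse attribute in statsmodels)
print('model 1 (moderate correlation):')
print(model1.bse)
print('model 2 (high correlation):')
print(model2.bse)
# expect a noticeably larger standard error for 'cylinders' in model 2,
# since 'displacement' carries nearly the same information
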
# evaluate the second model over its mesh grid
xx2, yy2, xr2, yr2 = run_model('cylinders', 'displacement', X2)
pred2 = model2.predict(np.c_[xx2.ravel(), yy2.ravel()])
pred2 = pred2.reshape(xx2.shape)

# plot the result
fig = px.scatter_3d(df2, x='cylinders', y='displacement', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr2, y=yr2, z=pred2, name='pred2'))
fig.show()

Multiple regression model with highly correlated variables. Image by Author.
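
Finally, to close the loop on the experiment mentioned at the beginning, here is one way to probe each model’s sensitivity to a small change in the data: refit on a slightly perturbed copy of the predictors and compare how much the coefficients move. This is a sketch of the idea rather than a definitive procedure; the noise scale of 0.05 is an arbitrary choice.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def coef_shift(y, X):
    # fit OLS on the original and on a slightly noised copy of X,
    # and return the absolute change in each coefficient
    base = sm.OLS(y, X).fit().params
    X_noisy = X + rng.normal(0, 0.05, size=X.shape)
    noisy = sm.OLS(y, X_noisy).fit().params
    return (noisy - base).abs()

print('model 1 coefficient shift:')
print(coef_shift(y1, X1))
print('model 2 coefficient shift:')
print(coef_shift(y2, X2))
# the multicollinear model typically shows much larger coefficient shifts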


