# Visualizing the Effect of Multicollinearity on Multiple Regression Model | by Boriharn K | Jun, 2023

We start by preparing two multiple regression models from one dataset: one with moderately correlated predictors and another with highly correlated predictors.

Then we will slightly modify the data to see which model is affected more by the small change. Start by importing the libraries.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

## Getting data

This article works with the Cars Data dataset, which has a public domain license. It can be downloaded directly through the Seaborn library with the `seaborn.load_dataset()` function.

The dataset contains the fuel efficiency (mpg) and features of 392 cars built between 1970 and 1982 in the USA, Europe, and Japan. More information about the dataset can be found here: link.

```python
data = sns.load_dataset('mpg')
data.dropna(inplace=True)       # remove null rows
data.head()
```

Doing exploratory data analysis to understand the dataset is always a good idea.

`data.describe()`

The ranges (max value – min value) of the columns are quite different. Thus, standardizing the data will make the coefficients easier to interpret later.

```python
data = data.iloc[:, 0:-2]    # select columns

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_s = scaler.fit_transform(data)
df = pd.DataFrame(df_s, columns=data.columns)
df.head()
```

Continue with plotting a heat map to show the correlation among the variables.

```python
plt.figure(figsize=(9, 6))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
plt.show()
```

The result shows that some variables are highly correlated with others. If every predictor is put into a multiple regression model to predict the ‘mpg’ value, multicollinearity will affect the model.

## Calculating the VIF values

This can be quickly confirmed by calculating the VIF values with the Statsmodels library.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

vif_data = pd.DataFrame()
vif_data["feature"] = df.columns

# calculate the VIF for each feature
vif_data["VIF"] = [vif(df.values, i) for i in range(len(df.columns))]
print(vif_data)
```

Some VIF values are pretty high (> 5), which can be interpreted as highly correlated. If we directly put every predictor in a multiple regression model, the model can suffer from the multicollinearity problem.
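To make the VIF interpretation concrete, here is a minimal sketch of what the statistic measures: each predictor is regressed on all the other predictors, and VIF = 1 / (1 − R²). The snippet uses a small synthetic dataset (not the cars data), where `x1` and `x2` are deliberately built to be highly correlated and `x3` is independent.

```python
# Manual VIF computation: VIF_i = 1 / (1 - R^2), where R^2 comes from
# regressing predictor i on all the other predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=200)  # highly correlated with x1
x3 = rng.normal(size=200)                        # independent
X = np.column_stack([x1, x2, x3])

def manual_vif(X, i):
    # regress column i on the remaining columns
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

for i in range(X.shape[1]):
    print(f"feature {i}: VIF = {manual_vif(X, i):.2f}")
```

Because `x1` and `x2` track each other closely, their VIF values come out far above 5, while the independent `x3` stays near 1.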

Next, only some predictors will be selected. This article will work with two models: one with moderately correlated and another with highly correlated predictors.

The first multiple regression model uses ‘cylinders’ and ‘acceleration’ to predict the ‘mpg,’ while the second uses ‘cylinders’ and ‘displacement.’ Let’s calculate the VIF values again.

```python
df1 = df[['mpg', 'cylinders', 'acceleration']]
df2 = df[['mpg', 'cylinders', 'displacement']]

# VIF DataFrame for model 1
vif_data1 = pd.DataFrame()
vif_data1['feature'] = df1.columns
vif_data1['VIF'] = [vif(df1.values, i) for i in range(len(df1.columns))]
print('model 1')
print(vif_data1)

# VIF DataFrame for model 2
vif_data2 = pd.DataFrame()
vif_data2['feature'] = df2.columns
vif_data2['VIF'] = [vif(df2.values, i) for i in range(len(df2.columns))]
print('model 2')
print(vif_data2)
```

The only difference between the two models is swapping the variable ‘acceleration’ for ‘displacement.’ Yet none of the VIF values in the first model exceed 5, while the second model has VIF values above 10.

## Creating a multiple regression model

We will use the OLS function from the statsmodels library to create a multiple regression model.

```python
import statsmodels.api as sm

y1 = df1[['mpg']]
X1 = df1[['cylinders', 'acceleration']]
lm1 = sm.OLS(y1, X1)
model1 = lm1.fit()
model1.summary()
```

## Plotting the model

Next, we define a helper function that builds a mesh grid over the two predictors, which we will use in the next step to evaluate the model for plotting.

```python
def run_model(v1, v2, pd_):
    mesh_size = 0.02
    x_min, x_max = pd_[[v1]].min()[0], pd_[[v1]].max()[0]
    y_min, y_max = pd_[[v2]].min()[0], pd_[[v2]].max()[0]
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    return xx, yy, xrange, yrange
```

Here comes the fun part. Let’s plot the model with the Plotly library, which helps create an interactive plot easily. Thus, we can interact with the visualization, such as zooming or rotating.

```python
import plotly.express as px
import plotly.graph_objects as go

# run and apply the model
xx1, yy1, xr1, yr1 = run_model('cylinders', 'acceleration', X1)
pred1 = model1.predict(np.c_[xx1.ravel(), yy1.ravel()])
pred1 = pred1.reshape(xx1.shape)

# plot the result
fig = px.scatter_3d(df1, x='cylinders', y='acceleration', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr1, y=yr1, z=pred1, name='pred1'))
fig.show()
```

We can perform the same process with the second model, which has the multicollinearity problem.

```python
y2 = df2[['mpg']]
X2 = df2[['cylinders', 'displacement']]
lm2 = sm.OLS(y2, X2)
model2 = lm2.fit()
model2.summary()
```

```python
# run and apply the model
xx2, yy2, xr2, yr2 = run_model('cylinders', 'displacement', X2)
pred2 = model2.predict(np.c_[xx2.ravel(), yy2.ravel()])
pred2 = pred2.reshape(xx2.shape)

# plot the result
fig = px.scatter_3d(df2, x='cylinders', y='displacement', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr2, y=yr2, z=pred2, name='pred2'))
fig.show()
```