

Subsequently, the corresponding change in the significance of the remaining variables on the outcome should be quantified as a percentage. This procedure helps expose IVs that are largely explained by other IVs and only appear to be important. A minimal sketch of this step follows.
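As a rough sketch of that step (assuming the features sit in a DataFrame X, the target MEDV in y, and that statsmodels is available; the helper name significance_change is illustrative, not from the article), we can refit the model without the suspect collinear IV and report the percentage change in the remaining p-values:

import pandas as pd
import statsmodels.api as sm

def significance_change(X, y, drop_var):
    # Fit the full model and a reduced model without the suspect collinear IV
    full = sm.OLS(y, sm.add_constant(X)).fit()
    reduced = sm.OLS(y, sm.add_constant(X.drop(columns=[drop_var]))).fit()

    # Percentage change in each remaining IV's p-value after dropping `drop_var`
    remaining = reduced.pvalues.drop('const')
    return ((remaining - full.pvalues[remaining.index])
            / full.pvalues[remaining.index] * 100).sort_values()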

There are three main effects of concern that collinear variables above the red line may have on the relationship between other IVs and the dependent variable: they may mediate (suppress), confound (exaggerate) or moderate (change) it.

The concepts of moderators, mediators and confounders are rarely discussed in Machine Learning. They are often left to the ‘social scientists’; after all, they are the ones who actually need to ‘interpret’ their coefficients. However, these concepts explain how collinearity can introduce bias into ML models.

Note that these effects cannot be truly established without deeper causal analysis, but for a bias removal pre-processing step, we can use simple definitions of these concepts to filter these relationships.

Figure 8: Mediating Relationships in the Boston Housing Dataset

A mediator explains ‘how’ the IV and DV are related, i.e. the process by which they are related. A mediator must meet three criteria:

a) be significantly predictive of the first IV; b) be significantly predictive of the dependent variable; and c) be significantly predictive of the dependent variable in the presence of the first IV.

It ‘mediates’ because its inclusion does not change the direction of the relationship between the first IV and the dependent variable. If a mediator is removed from a model, the relationship between the first IV and the dependent variable should become stronger, because the mediator was truly accounting for part of that effect.

## Finding mediators
import pandas as pd

cc = pd.DataFrame(conf)  # `conf` holds the pairwise significance results computed earlier
co_sig = (cc['CO_sig'] < 0.01)        # C is independently predictive of Y
io_sig = (cc['IO_sig'] < 0.01)        # I is independently predictive of Y
icoi_sig = (cc['ICO_I_sig'] < 0.01)   # I is still predictive of Y in the presence of C
icoc_sig = (cc['ICO_C_sig'] < 0.05)   # C is predictive of Y in the presence of I
icoci_sig = (cc['IO_sig'] < cc['ICO_I_sig'])  # the direct I -> Y relationship is stronger (smaller p-value) without C

# Candidate mediating relationships satisfy all criteria
mediators = cc[co_sig & io_sig & icoi_sig & icoc_sig & icoci_sig]

For example, in the relationship between (RM), (TAX) and (MEDV), the number of rooms potentially explains how property tax is related to the property value. A quick check of these criteria for this triple is sketched below.
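This is only a sketch, assuming a DataFrame df holding the Boston Housing columns RM, TAX and MEDV and using the statsmodels formula API:

import statsmodels.formula.api as smf

io = smf.ols('MEDV ~ TAX', data=df).fit()        # I -> Y: TAX alone
co = smf.ols('MEDV ~ RM', data=df).fit()         # C -> Y: RM alone
ico = smf.ols('MEDV ~ TAX + RM', data=df).fit()  # I + C -> Y: both together

# RM behaves like a mediator if it stays significant alongside TAX while
# TAX's direct effect weakens (larger p-value) once RM enters the model
print(co.pvalues['RM'], io.pvalues['TAX'], ico.pvalues['RM'], ico.pvalues['TAX'])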

Confounders are elusive, as it is difficult to define them in terms of correlations and significance. A confounding variable is an external variable that correlates with both the dependent and independent variables, thus potentially distorting the perceived relationship between them. As opposed to mediators, the apparent relationship between the first IV and the dependent variable may be spurious. There is also no guarantee that removing the confounder will weaken or strengthen the relationship between the first IV and the dependent variable.

The number of rooms in a house can either mediate or confound the relationship between the proportion of the black population and property value. According to this paper, it depends on the relationship between (B) and (RM). If the relationships (RM <-> MEDV) and (RM <-> B) are in the same direction, removing (RM) should weaken the effect of (B) on (MEDV). However, if the relationships (RM <-> MEDV) and (RM <-> B) are in opposite directions, removing (RM) should strengthen the effect of (B).

Here, (RM <-> MEDV) and (RM <-> B) are in the same direction (subplot 3 of Figure 1); however, removing (RM) strengthens the effect of (B).
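The same comparison can be inspected directly; again a sketch, assuming df holds RM, B and MEDV: check whether the two pairwise correlations share a sign, and whether B’s coefficient shrinks or grows once RM is dropped.

import numpy as np
import statsmodels.formula.api as smf

same_direction = np.sign(df['RM'].corr(df['MEDV'])) == np.sign(df['RM'].corr(df['B']))

b_with_rm = smf.ols('MEDV ~ B + RM', data=df).fit().params['B']
b_without_rm = smf.ols('MEDV ~ B', data=df).fit().params['B']

# If the correlations share a sign, removing RM should weaken B's effect on MEDV;
# here we simply compare the magnitude of B's coefficient with and without RM
print(same_direction, abs(b_without_rm) > abs(b_with_rm))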

But see the figure below, where there is a clear decision boundary for a third IV in the relationship between the first IV and the DV. This indicates a different type of relationship between (RM) and (TAX) based on the value of (B).

Figure 9: Moderating Regression

With moderators, the relationship between the first IV and the dependent variable differs based on the value of the moderator. What property tax can you expect to pay on a house that costs $100,000? Well, it depends on the proportion of the black population in the town and the number of rooms in that house. In fact, there is a set of towns whose property tax stays consistent, regardless of the number of rooms, provided (B) remains below a certain threshold.

Figure 10: Moderating Relationship in the Boston Housing Dataset

Moderators are usually categorical features or groups in the data. Conventional pre-processing steps for groups create dummy variables for each group label, which potentially addresses any moderating effect from that group on the dependent variable. However, ranked variables or continuous variables with low variance, such as (B), can also be moderators; one way to test for such an effect is sketched below.
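One rough way to probe such a moderating effect (again a sketch, assuming df with RM, TAX and B) is to add an interaction term between the first IV and the candidate moderator and check its significance:

import statsmodels.formula.api as smf

# 'RM * B' expands to RM + B + RM:B, where RM:B is the interaction term
mod = smf.ols('TAX ~ RM * B', data=df).fit()

# A significant interaction suggests that B moderates the RM -> TAX relationship
print(mod.pvalues['RM:B'])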

In conclusion, while collinearity is a challenging issue in regression modelling, its careful evaluation and management can enhance the predictive power and reliability of machine learning models. The ability to account for information loss provides an effective framework for feature selection, enabling a balance between explainability and predictive accuracy.


