Logarithmic Transformation for Beginners | by Jae Kim | May, 2023


Consider, for simplicity, Y = 1 + 2X, where Y is the response variable and X is the input variable. We are often interested in how much Y changes in response to a change in X. Let Δ denote the change operator. That is,

ΔY = Y1 − Y0: change of Y from Y0 to Y1; and

ΔX = X1 − X0: change of X from X0 to X1.

Suppose, with our example (Y = 1 + 2 X), X changes from 1 to 3. Then, in response to this, Y changes from 3 to 7. That is, ΔY = 4 and ΔX = 2.

The slope (or derivative) measures how much Y changes in response to a one-unit change in X. It is defined as

β ≡ ΔY/ΔX,

and β = 2 in our example. A slope coefficient in a linear regression or a machine learning model has the same interpretation. The slope is standardized in that it is expressed per unit of X, but it is unit-dependent: interpreting a slope coefficient requires careful consideration of the units in which Y and X are measured.
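
As a minimal sketch, the slope calculation for the example Y = 1 + 2X can be checked in a few lines of Python. The function names `line` and `slope` are hypothetical, chosen only for illustration:

```python
def line(x):
    """The example function Y = 1 + 2X."""
    return 1 + 2 * x


def slope(f, x0, x1):
    """Slope between two points: beta = (Y1 - Y0) / (X1 - X0)."""
    return (f(x1) - f(x0)) / (x1 - x0)


delta_y = line(3) - line(1)  # Y moves from 3 to 7
delta_x = 3 - 1              # X moves from 1 to 3
beta = slope(line, 1, 3)

print(delta_y, delta_x, beta)  # 4 2 2.0
```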

Consider the function Y = log(X), where log() denotes the natural logarithm.

Image Created by the Author using R function “curve”

As plotted above, the function provides a monotonic transformation of X into a smaller scale, applicable for X > 0.

The function has a special property: its slope at a point X is 1/X. That is,

dY/dX = 1/X.

This means that, for a small change in X, the change in Y is approximately ΔX/X, which is the proportional change in X.

As an example, suppose X has increased from 2000 to 2010 (0.5% increase).

As the above table shows, this means

Δlog(X) = log(2010) − log(2000) = 7.606 − 7.601 = 0.005,

which is approximately equal to (X1 − X0)/X0 = (2010 − 2000)/2000 = 0.005.

That is, 100Δlog(X) ≈ 100ΔX/X, which measures the % change of X at a given point of X. The approximation is close when the change is small.

In general, for any variable Z, 100Δlog(Z) ≈ 100ΔZ/Z, and it measures the % change of Z at a given point of Z.
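
The following sketch compares the log difference with the proportional change, first for the 2000 to 2010 example and then for a much larger change (2000 to 3000, a value I have picked only for contrast). The helper names are hypothetical:

```python
import math


def log_change(x0, x1):
    """Change in log(X) when X moves from x0 to x1."""
    return math.log(x1) - math.log(x0)


def proportional_change(x0, x1):
    """Proportional change (X1 - X0) / X0."""
    return (x1 - x0) / x0


# Small change: the two measures nearly coincide
small = (log_change(2000, 2010), proportional_change(2000, 2010))
# Large change (illustrative value): the two measures drift apart
large = (log_change(2000, 3000), proportional_change(2000, 3000))

print(round(100 * small[0], 3), round(100 * small[1], 3))  # 0.499 0.5
print(round(100 * large[0], 1), round(100 * large[1], 1))  # 40.5 50.0
```

The log difference is therefore best read as a percentage change when the change is small.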

As a result of the above-mentioned property of the logarithmic function, the log-transformed regressions can be used for a unit-free interpretation of a relationship, as the following table shows:

Table 1 (Image Created by the Author)
  • Case 1: neither Y nor X is transformed to natural logarithm. In this case, the slope coefficient β measures how much Y changes in response to a one-unit change in X. That is, its interpretation depends on the units of Y and X.
  • Case 2: both Y and X are transformed to natural logarithm. The slope coefficient in this case measures the percentage change of Y in response to a 1% change in X. This measure is called the elasticity of Y with respect to X, a unit-free measure of association widely used in economics.
  • Case 3: only X is transformed to natural logarithm. In this case, the slope coefficient is interpreted as a (β/100)-unit change of Y in response to a 1% change in X.
  • Case 4: only Y is transformed to natural logarithm. In this case, the slope coefficient is interpreted as a 100β% change of Y in response to a one-unit change in X.

Case 2 is useful when both Y and X are continuous variables in different units. Case 3 may be useful when Y takes a negative value or when Y is already expressed in percentage. Case 4 may be used when X is an indicator variable or a discrete variable. Hence, which case to take is up to the researcher, depending on the context of the research.
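
To illustrate why the Case 2 slope is unit-free, here is a minimal sketch in plain Python. The data are hypothetical, generated without noise from Y = 3·X^0.5, and `ols_slope` is an illustrative helper rather than a library function. Rescaling X by a factor of 100 changes the Case 1 slope by the same factor, while the log-log slope stays at 0.5:

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (single regressor with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def logs(values):
    return [math.log(v) for v in values]


# Hypothetical, noise-free data with a true elasticity of 0.5
xs = [10.0, 20.0, 50.0, 100.0, 200.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Same X expressed in a unit that is 100 times smaller
xs_rescaled = [100.0 * x for x in xs]

slope_levels = ols_slope(xs, ys)                    # Case 1, original units
slope_levels_rescaled = ols_slope(xs_rescaled, ys)  # Case 1, rescaled units
elasticity = ols_slope(logs(xs), logs(ys))          # Case 2, original units
elasticity_rescaled = ols_slope(logs(xs_rescaled), logs(ys))  # Case 2, rescaled

print(slope_levels, slope_levels_rescaled)
print(elasticity, elasticity_rescaled)
```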

The scale-down effect of the transformation can bring other benefits, which can deliver a more accurate or reliable estimation of the relationship.

  • When Y and X are in large numbers, the variability of estimation can be excessive. The log transformation monotonically transforms the data into a smaller scale, with a much smaller variability, which in turn can reduce the variability of estimation.
  • In this process, the effect or influence of outliers can be substantially mitigated.
  • As a result, the intrinsic relationship can be better revealed with improved linearity than otherwise.
  • The transformed data can be closer to a normal distribution.
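
The scale-down effect can be sketched with a small, made-up sample containing one very large value. The numbers below are hypothetical and are not taken from the data set used later. The sample skewness, a simple measure of asymmetry, falls markedly after the log transformation:

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample with one outlier (400)
prices = [20, 25, 30, 35, 40, 50, 60, 80, 120, 400]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(raw_skew, log_skew)  # the second value is much closer to zero
```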

I have selected a Chicago house price data set from Kaggle (CC0 license). The variables include

  • Price: price of house
  • Bedroom: number of bedrooms
  • Space: size of house (in square feet)
  • Lot: width of a lot
  • Tax: amount of annual tax
  • Bathroom: number of bathrooms
  • Garage: number of garages
  • Condition: condition of house (1 if good, 0 otherwise)

Note that the units of Price, Lot and Tax variables are not provided in the data source.

Figure 1 below presents Q-Q plots of the Price variable and log(Price).

Figure 1 (Image Created by the Author)

The blue straight line is the reference line on which the sample quantiles would exactly match those of a normal distribution. The blue band indicates a 95% confidence band for the sample quantiles. If the sample follows a normal distribution, its quantiles should lie close to the reference line. A deviation from the reference line is statistically negligible at the 5% level of significance if the sample quantile lies within the 95% confidence band.

As is clear from Figure 1, the Price variable shows a degree of departure from normality, with a number of sample quantiles outside the 95% confidence band. However, log(Price) has nearly all of its sample quantiles within this band, indicating that the variable becomes closer to normality as a result of the logarithmic transformation.
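
The idea behind a Q-Q plot can be sketched without any plotting library: sort the sample, pair each value with the matching quantile of a standard normal distribution, and measure how close the pairs are to a straight line. The sketch below uses the correlation between the two sets of quantiles as a rough summary. It does not reproduce the confidence band in Figure 1, and the data are hypothetical, constructed to be log-normal rather than taken from the house price data:

```python
import math
from statistics import NormalDist


def normal_quantiles(n):
    """Standard normal quantiles at plotting positions (i + 0.5) / n."""
    return [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]


def qq_correlation(sample):
    """Correlation between sorted sample values and normal quantiles."""
    n = len(sample)
    ys = sorted(sample)
    xs = normal_quantiles(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical right-skewed sample: log-normal by construction
prices = [math.exp(4 + 0.8 * z) for z in normal_quantiles(50)]

r_raw = qq_correlation(prices)
r_log = qq_correlation([math.log(p) for p in prices])

print(r_raw, r_log)  # r_log is essentially 1; r_raw is visibly lower
```

A correlation near 1 corresponds to sample quantiles lying along the reference line.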

Figure 2 (Image Created by the Author)

Figure 2 above shows the scatter plots of Price against Tax; and log(Price) against log(Tax). From the former, one may argue that the relationship is non-linear, with the presence of several outliers. With the log transformation, the effect of these outliers looks substantially diminished, and the relationship may well be considered to be linear.

Now I run the regression of Price against all other variables as explanatory variables.

  • Model 1: all variables are included as they are; and
  • Model 2: all continuous variables (Price, Space, Lot, Tax) are transformed to natural logarithm, while other (discrete) variables are included as they are.

The regression results are tabulated in Table 2 below:

Table 2 (Image Created by the Author)
  • Both models show sufficiently large R² values of more than 0.70. However, the two values are not comparable because the dependent variables are in different scales.
  • Model 2 has all coefficients statistically significant at the 5% level. In contrast, Model 1 has two coefficients (those of Tax and Condition) that are statistically insignificant at a conventional level of significance, although the associated variables are economically important.
  • In Model 1, the coefficient of Tax is small and statistically insignificant; but that of log(Tax) in Model 2 is large and statistically significant. This may be closely related to the observation made in the scatter plots in Figure 2, in relation to linearization by logarithmic transformation.

To interpret the Space coefficient,

  • from Model 1, a house with an extra 100 square feet of space is expected to have a price that is 1.3 units higher (other factors being held constant): see Case 1 in Table 1; and
  • from Model 2, a house with 10% more space is expected to have a price that is 1.63% higher (other factors being held constant): see Case 2 in Table 1.

To interpret the Bathroom coefficient,

  • from Model 1, a house with an extra bathroom is expected to have a higher price by 7.251 units (other factors being held constant);
  • from Model 2, a house with an extra bathroom is expected to have a higher price by 13.3% (=100 × 0.133): see Case 4 in Table 1.

Other coefficients can also be interpreted in a similar manner.
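
The percentage interpretations above rely on the approximation Δlog(Y) ≈ ΔY/Y, which is accurate for small changes. As a hedged sketch, the exact percentage changes implied by a log model can be computed as follows. The Bathroom coefficient of 0.133 is quoted above; the Space elasticity of 0.163 is my assumption, inferred from the 1.63% figure for a 10% change. The function names are hypothetical:

```python
import math


def pct_change_log_y(beta, delta_x=1.0):
    """Exact % change in Y when log(Y) rises by beta * delta_x (Case 4)."""
    return 100 * (math.exp(beta * delta_x) - 1)


def pct_change_log_log(beta, pct_x):
    """Exact % change in Y when X rises by pct_x percent (Case 2)."""
    return 100 * ((1 + pct_x / 100) ** beta - 1)


# Bathroom coefficient in Model 2 (quoted in the text)
bathroom_approx = 100 * 0.133
bathroom_exact = pct_change_log_y(0.133)

# Space elasticity in Model 2 (assumed: 1.63% / 10%)
space_approx = 0.163 * 10
space_exact = pct_change_log_log(0.163, 10)

print(round(bathroom_approx, 1), round(bathroom_exact, 1))  # 13.3 14.2
print(round(space_approx, 2), round(space_exact, 2))        # 1.63 1.57
```

The approximate and exact figures are close here; the gap widens as the coefficient or the change in X grows.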

With a logarithmic transformation, researchers can obtain a unit-free interpretation of an association, which is much easier to understand and communicate.


