Modeling EEG Signals using Polynomial Regression in R


Data set

We will be using the dataset provided by the Swartz Center for Computational Neuroscience, which contains time, four input signals (x1, x2, x3, and x4), and one output signal (y).

Below is a sample of the data set:

Snapshot of the data set provided by the Swartz Center for Computational Neuroscience. Image by the author.

Challenge

Here, we are going to explain the relationship between the input EEG signals and the output EEG signal, based on the assumption that this relationship can be expressed as a polynomial regression model.

We are given five different nonlinear polynomial regression models, out of which we need to find the one that is most suitable.

The five candidate nonlinear polynomial regression models. Image by the author.

To solve the problem, the following steps would be taken:

  1. Using least squares, estimate the model parameters
  2. Calculating model residual errors (RSS)
  3. Calculating log-likelihood functions
  4. Calculating the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
  5. Checking the distribution of model prediction errors
  6. Selecting the best regression model

Step 1. Using least squares, estimate the model parameters

When the true value of a distribution is unknown, we use the concept of an estimator (a random variable) [3]. In other words, we use an estimator to approximate the true value of the EEG data distribution that relates the input and output variables.

Here, the model parameters are represented by “θ” and can take on multiple values such as θ1, θ2, …, θbias. The least squares method (LSM) is used to calculate the estimated model parameters for the different candidate models of the EEG data: the least squares estimate θ̂ approximates the true value of the distribution by minimizing the sum of squared residuals between the predicted and actual values of the output variable [4], which is expressed by the following formula:

θ̂ = (XᵀX)⁻¹Xᵀy

Formula for the least squares estimator. Image by the author.

Now, to calculate the least squares estimate, we first need to format the input data by binding the appropriate columns or values from the EEG data set with the function cbind(). Once the input data is correctly formatted, we can apply the least squares formula above and find θ̂ with the built-in linear equation solver solve(). We use solve() because it's more efficient and less error-prone [5].

Similarly, as we have 5 models, we will follow the same process for each: building the design matrix with cbind(), calculating the estimated model parameters θ̂, and printing the value of theta_hat, as sketched below.
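As an illustration, here is a minimal sketch of this step for a single hypothetical model, assuming the data has been read into a data frame df with columns x1–x4 and y; the column combination shown is illustrative, not one of the article's five models.

```r
# Minimal sketch: least squares for one hypothetical candidate model,
# assuming a data frame `df` with columns x1..x4 (inputs) and y (output).
y <- as.matrix(df$y)

# Design matrix built with cbind(); the chosen terms are illustrative only,
# with a final column of ones for the bias term theta_bias.
X <- cbind(df$x4, df$x1^2, 1)

# Least squares estimate: theta_hat = (X'X)^(-1) X'y, computed with solve()
theta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)
print(theta_hat)
```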

θ̂ for each of the 5 models. Image by the author.

Step 2. Calculating model residual errors (RSS)

The residual sum of squares (RSS), also termed the sum of squared errors of prediction (SSE) or the sum of squared residuals (SSR), is a measure of the discrepancy between the data and an estimation model. It is calculated by summing the squared differences between the actual values of the dependent variable and the values estimated from the model parameters [6].

We usually want to minimize the error, so the smaller the error, the better the estimation power of the regression and the better the fit of the model. On the other hand, a larger RSS indicates a worse fit of the model to the data [7]. It is worth mentioning that the RSS is never negative because the squares of the residuals are always non-negative [8].

To calculate the RSS, we first compute the error of each of models 1–5 using the θ̂ values calculated in the step above. The RSS is mathematically presented as:

RSS = Σᵢ (yᵢ − ŷᵢ)²

Formula for the model residual errors (RSS). Image by the author.

For calculating the RSS of model 1, a minimal sketch (reusing the hypothetical X and theta_hat from the least squares step above) looks like this:
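```r
# Prediction errors and RSS for model 1, reusing the hypothetical `X` and
# `theta_hat` from the least squares sketch above.
y_hat <- X %*% theta_hat      # model predictions
RSS_1 <- sum((y - y_hat)^2)   # residual sum of squares
print(RSS_1)
```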

Similarly, we will calculate the RSS value for each model.

So, we have an RSS value for each model.

Table showing the RSS value of each model. Image by the author.

Here, the lowest RSS value in the table is 2.1355, which is associated with model 5, followed by the second lowest RSS value of 2.1398, which is associated with model 2.

Step 3. Calculating log-likelihood functions

Now, our objective is to identify how well each model fits the sample data when the parameters are unknown. To meet this objective, we calculate the log-likelihood function for each linear regression model using the RSS obtained in step 2.

Log-likelihood is a way to measure the goodness of fit of a model [9]; working on the log scale also simplifies the optimization problem and avoids numerical underflow or overflow. It is mathematically presented as:

ln L = −(n/2)·ln(2π) − (n/2)·ln(σ̂²) − RSS/(2σ̂²)

Formula for the log-likelihood. Image by the author.

In this task, our goal is to find the set of parameters that maximizes the probability of the observations. Because the logarithm increases monotonically (it is non-decreasing), taking the log does not move the location of the maximum, which makes the log-likelihood suitable for identifying how well the model fits [10]. In layman's terms, monotonically increasing means that as the value of the independent variable (say x) increases, so does the value of the function (say y); as x increases, y can only increase and never decrease.

So, as the value of the log-likelihood increases, the likelihood of the data given the model parameters also increases. Therefore, finding the maximum of the log-likelihood function is the same as finding the maximum of the likelihood function, and the concavity of the log-likelihood for this model gives us a single global maximum to find [11].

Therefore, using the above formula, we first calculate the variance of the model errors from the RSS and the length of the y signal, and then calculate the log-likelihood function.

Using the same formula, we calculate the rest of the models.
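A minimal sketch for model 1, assuming y and RSS_1 from the previous steps; the variance estimator RSS/(n − 1) shown here is one common choice:

```r
# Log-likelihood for model 1, assuming `y` and `RSS_1` from earlier steps.
n <- length(y)               # number of observations
sigma2_1 <- RSS_1 / (n - 1)  # estimated error variance (one common choice)
loglik_1 <- -(n / 2) * log(2 * pi) -
  (n / 2) * log(sigma2_1) -
  RSS_1 / (2 * sigma2_1)
print(loglik_1)
```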

Step 4. Calculating the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

Now that we have the RSS and log-likelihood values, we need a model selection method, for which we will work with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). According to [12], model selection involves estimating the performance of various candidate models with the sole objective of choosing the best one.

Both can be used to compare different models and choose the best one, and both are based on the likelihood of the model given the data and the number of parameters in the model. The main difference between the two model selection methods is that AIC penalizes models with more parameters less heavily than BIC does [7].

4.1 Calculating AIC for each model

Akaike information criterion (AIC) is a statistical method that aims to determine the model that truly explains the variance in the dependent variable with the fewest number of independent variables (parameters) [13]. With this, it helps to select a simpler model containing fewer parameters over a complex model.

Using the maximum likelihood estimate (step 3), the relative information value of the model and the number of parameters are determined. The formula for AIC is expressed as:

AIC = 2k − 2·ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood.

Formula for the Akaike information criterion (AIC). Image by the author.

The major goal of applying AIC in this situation is to eliminate the problem of overfitting because AIC penalizes models with more parameters and balances the trade-off between a model’s goodness of fit and complexity [14] [7].

According to [15], it is worth mentioning that the model fits the data better when the AIC value is lower, and that the absolute value of the AIC is not meaningful on its own; only differences between candidate models matter.

Before going into the code for AIC, let's look at BIC.

4.2 Calculating BIC for each model

As mentioned earlier, BIC is similar to AIC, but BIC gives a greater penalty to models with more parameters [16]. As with AIC, lower BIC values indicate a better model fit. The formula for BIC is expressed as follows:

BIC = k·ln(n) − 2·ln(L̂), where n is the number of observations.

Formula for the Bayesian information criterion (BIC). Image by the author.

Using the above formula, we calculated the BIC for each model in the same way as the AIC.
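A minimal sketch for model 1, assuming loglik_1, theta_hat, and n from the previous steps:

```r
# AIC and BIC for model 1; `loglik_1`, `theta_hat`, and `n` come from
# the sketches in the earlier steps.
k <- length(theta_hat)             # number of estimated parameters
AIC_1 <- 2 * k - 2 * loglik_1
BIC_1 <- k * log(n) - 2 * loglik_1
print(c(AIC = AIC_1, BIC = BIC_1))
```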

The resulting AIC and BIC values for each model are:

AIC and BIC values of each model. Image by the author.

As we recall, lower AIC and BIC values indicate a better fit, so model 2, with an AIC of -334.6489 and a BIC of -321.4357, is the best fit among the models listed.

Step 5. Checking the distribution of model prediction errors

Now that we have obtained the AIC and BIC values, we are interested in viewing the error distributions; after all, our goal is to pick the model that shows the least error. Before graphing the distributions, we need to calculate the error of each model. We then use a Q-Q plot (quantile-quantile plot), drawn with the qqnorm() function, to compare each error distribution against a theoretical normal distribution, since our assumption is that the errors are independent and identically distributed.

According to [17], a Q-Q plot is formed by plotting two sets of quantiles against each other. If both samples come from the same distribution, the points form a roughly straight line; in practice, however, this is not a hard and fast rule. On the same plot, we add a reference line called the Q-Q line, which is the line of perfect fit for a normal distribution.

The qqline() function takes the model's prediction errors as its first argument, followed by graphical parameters: the color (col), the line width (lwd), and the line type (lty) for a dashed line.
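A minimal sketch for model 5, assuming error_5 holds its prediction errors (y minus the model 5 predictions, computed as in the RSS step):

```r
# Q-Q plot of model 5's prediction errors against the normal distribution;
# `error_5` is assumed to be y minus model 5's predictions.
qqnorm(error_5, main = "Q-Q plot of model 5 prediction errors")
qqline(error_5, col = "red", lwd = 2, lty = 2)  # dashed red reference line
```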

The Q-Q plot for model 5 shows that most of the data follows the Q-Q line (red color), so we can say the data follows a normal distribution.

Here, with the obtained Q-Q plot, we simply visually check if a data set follows a theoretical distribution or not. To formally test whether or not a data set follows a particular distribution, we need to go one step further.
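For reference, one common formal test, not used in the original analysis, is the Shapiro-Wilk test, available in base R:

```r
# Formal normality check (an illustrative addition, not part of the
# original article): Shapiro-Wilk test on model 5's prediction errors.
shapiro.test(as.vector(error_5))  # a large p-value fails to reject normality
```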

Step 6. Selecting the best regression model

By completing steps 1–5, we have gathered all the necessary information to select the best candidate model: the RSS values, the log-likelihood functions, the error distribution plots, and the AIC and BIC values. The best model fit based on AIC and BIC is model 2, as it has the lowest values. To verify that model 2 is a good candidate, we will look at the Q-Q plots.

Looking at the Q-Q plots, all models except model 3 seem to behave similarly; however, judging by how closely the points follow the Q-Q line, model 2 seems to be the most suitable one.

To gain more confidence in the decision to pick model 2, we plot histograms showing the distribution of the residuals. For easy viewing, we arrange all the histograms into 3 rows and 2 columns using par(mfrow = c(3, 2)), as sketched below.
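A minimal sketch, assuming error_1 through error_5 hold each model's prediction errors:

```r
# Residual histograms for all five models in a 3x2 grid (the sixth
# panel stays empty); error_1 .. error_5 come from the earlier steps.
par(mfrow = c(3, 2))
hist(error_1, main = "Model 1", xlab = "Prediction error")
hist(error_2, main = "Model 2", xlab = "Prediction error")
hist(error_3, main = "Model 3", xlab = "Prediction error")
hist(error_4, main = "Model 4", xlab = "Prediction error")
hist(error_5, main = "Model 5", xlab = "Prediction error")
par(mfrow = c(1, 1))  # reset the plotting layout
```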

Distribution of prediction errors shown with histograms. Image by the author.

Looking at the distribution of each model, models 2 and 5 seem to follow a normal distribution.

Additionally, we go one step further and check whether model 2 is more suitable than the rest. Considering additional factors will help us determine the best model in this scenario: we compare the number of parameters in each model on the grounds of interpretability, i.e., a simpler model with fewer parameters is easier to interpret and understand [18].

Looking at the number of parameters in each model, the lowest is model 3, with 3 parameters, but its residuals do not follow a normal distribution and are skewed; next is model 4, with 4 parameters, but its AIC and BIC are greater than those of model 2.
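As a quick check, assuming the five estimated parameter vectors (hypothetical names theta_hat_1 through theta_hat_5) are collected in a list:

```r
# Parameter count per model; theta_hat_1 .. theta_hat_5 are the estimated
# parameter vectors for the five candidate models (illustrative names).
theta_hats <- list(theta_hat_1, theta_hat_2, theta_hat_3,
                   theta_hat_4, theta_hat_5)
sapply(theta_hats, length)  # number of parameters in each model
```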

So, concluding from the AIC, the BIC, the Q-Q plots, and the extra interpretability check, we have picked model 2 as the best-fit model.

Unveiling the champion: model 2 emerges as the optimal fit for EEG signal modeling using polynomial regression in R. Image by the author.

So, by using least squares to estimate the model parameters, calculating the model residual errors (RSS) and log-likelihood functions, comparing the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), and visualizing the prediction errors with Q-Q plots, we unveil the champion: model 2 emerges as the optimal fit for EEG signal modeling using polynomial regression in R.

This completes our polynomial regression in R.

My GitHub repository has the entire working code available, or you can access RPubs to see it online.


