## Data set

We will be using the data set provided by the Swartz Center for Computational Neuroscience, which contains a time column, four input signals (x1, x2, x3, and x4), and one output signal (y).

Below is a sample of the data set:
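Since the sample itself is not reproduced here, the following R sketch builds a small synthetic data frame with the same column layout (time, x1–x4, y) as a stand-in; the values are random placeholders, not real EEG, and the sampling rate is an assumption:

```r
# Synthetic stand-in for the EEG data set: same columns as described above
# (time, four inputs x1..x4, one output y); values are random, not real EEG.
set.seed(42)
n <- 200
eeg <- data.frame(
  time = seq(0, by = 0.002, length.out = n),  # 500 Hz sampling (assumed)
  x1 = rnorm(n), x2 = rnorm(n),
  x3 = rnorm(n), x4 = rnorm(n)
)
eeg$y <- 0.5 * eeg$x1 + rnorm(n, sd = 0.1)    # arbitrary output for illustration

head(eeg)   # first six rows
str(eeg)    # column types and dimensions
```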

## Challenge

Here, we want to explain the relationship between the input EEG signals and the output EEG signal, under the assumption that this relationship can be expressed as a polynomial regression model.

We are given five candidate nonlinear polynomial regression models, and we need to find the one that fits the data best.

To solve the problem, we will take the following steps:

- Estimate the model parameters using least squares
- Calculate the residual sum of squares (RSS) of each model
- Calculate the log-likelihood function of each model
- Calculate the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
- Check the distribution of the model prediction errors
- Select the best regression model
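As a preview of how these steps connect, here is a minimal R sketch, on synthetic data and for one hypothetical candidate model, of computing RSS, the Gaussian log-likelihood, AIC, and BIC from a least squares fit (this uses the common convention of counting only the regression coefficients in the parameter count k; some texts also count the noise variance):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n, sd = 0.3)    # synthetic data

X <- cbind(1, x, x^2)                       # hypothetical model: y ~ 1 + x + x^2
theta_hat <- solve(t(X) %*% X, t(X) %*% y)  # least squares estimate
rss <- sum((y - X %*% theta_hat)^2)         # residual sum of squares

k <- ncol(X)                                       # number of parameters
sigma2 <- rss / n                                  # ML estimate of noise variance
loglik <- -n / 2 * (log(2 * pi) + log(sigma2) + 1) # Gaussian log-likelihood
aic <- 2 * k - 2 * loglik
bic <- k * log(n) - 2 * loglik
```

Lower AIC or BIC indicates a better trade-off between fit and model complexity; BIC penalizes extra parameters more heavily once n > 7 or so, since log(n) > 2.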

## Step 1. Estimate the model parameters using least squares

When the true parameters of a distribution are unknown, we use an estimator, a random variable computed from the data [3]. In other words, we use an estimator to approximate the true parameters of the EEG data distribution that relates the input and output variables.

Here, the parameter vector is denoted by “θ” and contains the coefficients θ1, θ2, …, θbias. The least squares method (LSM) is used to estimate these parameters for each candidate model of the EEG data: the estimate θ̂ is the value that minimizes the sum of squared residuals between the predicted and actual values of the output variable [4], and it is given by the normal equations:

θ̂ = (XᵀX)⁻¹ Xᵀ y

where X is the design matrix of the candidate model and y is the vector of observed outputs.

Now, to calculate the least squares estimate, we first build the design matrix by binding the appropriate columns of the EEG data set with the function cbind(). Once the input data is in this form, we apply the least squares formula mentioned above. Rather than inverting XᵀX explicitly, we use R’s built-in linear-equation solver solve() to find θ̂, because it is more efficient and less error-prone [5].
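The steps above can be sketched in R as follows. Synthetic signals and a hypothetical candidate model stand in for the real data (whose column names are not fixed here), and the result is cross-checked against lm(), R’s standard least squares routine:

```r
set.seed(7)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 0.8 + 1.5 * x1 - 0.7 * x2^2 + rnorm(n, sd = 0.2)   # synthetic signals

# Build the design matrix for a hypothetical candidate model y ~ 1 + x1 + x2^2
X <- cbind(1, x1, x2^2)

# theta_hat = (X'X)^{-1} X'y, computed by solving the normal equations:
# passing both arguments to solve() avoids forming the inverse explicitly.
theta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Cross-check against R's built-in least squares fit
fit <- lm(y ~ x1 + I(x2^2))
all.equal(as.numeric(theta_hat), as.numeric(coef(fit)))  # TRUE
```

The two-argument form solve(A, b) solves the system Aθ = b directly, which is both faster and numerically safer than computing solve(A) %*% b.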