Concepts of Linear Regression, Bias-Variance Tradeoff, and Regularisation
What is Regression?
Before learning about linear regression, let us get accustomed to regression itself. Regression is a method of modeling a target value based on independent predictors. It is a statistical tool used to find the relationship between the outcome variable, also known as the dependent variable, and one or more variables often called independent variables.
When and why do you use Regression?
Regression is used when the dependent variable is continuous. The predictors, or independent variables, can be of any data type, such as continuous or nominal/categorical. The regression method tries to find the best fit line, which is the line that describes the relationship between the dependent variable and the predictors with the least error.
What is Linear Regression?
Linear Regression is the basic form of regression analysis. It assumes that there is a linear relationship between the dependent variable and the predictors/independent variables. In regression, we try to calculate the best fit line, which describes the relationship between the predictors and the predicted/dependent variable.
There are four assumptions associated with a linear regression model:
Linearity: The relationship between independent variables and the mean of the dependent variable is linear.
Homoscedasticity: The variance of the residuals is the same for every value of the independent variables.
Independence: Observations are independent of each other.
Normality: For any fixed value of an independent variable, the dependent variable is normally distributed.
For example, in a simple regression problem (a single x and a single y), the form of the model would be:
Y = β0 + β1x
In higher dimensions, when we have more than one input (x), the line becomes a plane or a hyper-plane. The model is therefore represented by the form of the equation together with the specific values of its coefficients (like β0 and β1 in the above example).
Let’s take a subset of house prices data with the Living area and Price columns.
We assume that the relationship between the price of the house and the living area is linear.
Here the dependent variable, or target variable, is Price. The independent variable is the Living area.
It is represented with a linear equation as follows:
hθ(x) = θ0 + θ1x1
θi are the parameters
θ0 is the intercept, i.e. the value of hθ(x) when x1 is zero
θ1 is the gradient (slope)
θ is the vector of all the parameters
hθ(x) is the hypothesis function, which is a function of the independent variables
x1 is the independent variable, which corresponds to the Living area
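As a minimal sketch of how θ0 and θ1 can be estimated, the snippet below fits hθ(x) = θ0 + θ1x1 by ordinary least squares using the closed-form solution for a single predictor. The living areas and prices are made-up illustrative numbers, not the house prices data referred to above, and the function names are hypothetical.

```python
# Hypothetical living areas (sq ft) and prices (in thousands); illustrative only.
living_area = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [250.0, 350.0, 450.0, 550.0, 650.0]


def fit_simple_linear(x, y):
    """Return (theta0, theta1) that minimise the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def h(theta0, theta1, x1):
    """Hypothesis function: predicted price for living area x1."""
    return theta0 + theta1 * x1


theta0, theta1 = fit_simple_linear(living_area, price)
predicted = h(theta0, theta1, 1800.0)
print(theta0, theta1, predicted)
```

Because the made-up prices lie exactly on a line, the fit recovers that line; with real data the points scatter around the fitted line and the residuals are non-zero.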
Performance of Regression
The performance of the regression model can be evaluated by using various metrics like MSE, RMSE, MAE, MAPE, R-squared, Adjusted R-squared, etc.
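Mean absolute error: It is the mean of the absolute values of the errors, formulated as MAE = (1/n) Σ |yᵢ − ŷᵢ|, where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of observations.
Mean squared error: It is the mean of the squares of the errors, MSE = (1/n) Σ (yᵢ − ŷᵢ)².
Root mean squared error: It is just the square root of the mean squared error, RMSE = √MSE.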
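R²-score: The R-squared statistic, or coefficient of determination, is a scale-invariant statistic that gives the proportion of variation in the target variable explained by the linear regression model. It is computed as R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)², where yᵢ is the actual value, ŷᵢ is the predicted value, and ȳ is the mean of the actual values.
Adjusted R-squared statistic:
The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable: Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of independent variables. Unlike R², it increases only when a new variable improves the fit by more than would be expected by chance, so we can use it to determine whether adding new variables to the model actually improves the model fit.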
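The metrics above can be computed directly from their definitions. The following is a small sketch in plain Python; the function name `regression_metrics` and the sample values are illustrative assumptions, and in practice a library such as scikit-learn offers equivalents for most of them.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Compute MAE, MSE, RMSE, R-squared and Adjusted R-squared."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # unexplained variation
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Penalise R-squared for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up actual and predicted values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.0, 5.0, 8.0, 9.0, 12.0]
metrics = regression_metrics(y_true, y_pred, n_predictors=1)
print(metrics)
```

Note that the Adjusted R-squared is lower than R-squared here, because the adjustment charges the model for the one predictor it uses.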
Bias-Variance Tradeoff
What is Bias?
In the simplest terms, bias is the difference between the model's expected (average) prediction and the true value it is trying to predict. To explain further, the model makes certain assumptions when it trains on the data provided. When the model is introduced to the testing/validation data, these assumptions may not always be correct.
Consider, for example, a k-nearest-neighbors model that predicts whether a patient has diabetes. If we use a large number of nearest neighbors, the model can effectively decide that some parameters are not important at all. For example, it may consider only the Glucose level and the Blood Pressure when deciding whether the patient has diabetes. This model would make very strong assumptions about the other parameters not affecting the outcome. You can also think of it as a model predicting a simple relationship when the data points clearly indicate a more complex relationship.
Mathematically, let the input variables be X and a target variable Y. We map the relationship between the two using a function f.
Y = f(X) + e
Here ‘e’ is the error, which is assumed to be normally distributed with mean zero. The aim of our model f’(X) is to predict values as close to f(X) as possible. Here, the bias of the model is:
Bias[f’(X)] = E[f’(X)] − f(X)
What is Variance?
In contrast to bias, variance measures how much the model's predictions change when it is trained on a different sample of data. A high-variance model takes into account the fluctuations in the training data, i.e. the noise, as well as the underlying pattern. So, what happens when our model has a high variance?
The model treats the noise as something to learn from. That is, the model learns too much from the training data, so much so that when confronted with new (testing) data, it is unable to predict accurately.
Mathematically, the variance error in the model is:
Var[f’(X)] = E[(f’(X) − E[f’(X)])²]
Because a high-variance model learns too much from the training data, this behaviour is called overfitting.
Put simply, the model fits a very complex relationship between the outcome and the input features when a simpler one, such as a quadratic equation, would have sufficed. In a classification model, high variance (overfitting) shows up as a decision boundary that twists around individual training points.
- A model with a high bias error underfits the data and makes very simplistic assumptions about it
- A model with a high variance error overfits the data and learns too much from it
- A good model is where both Bias and Variance errors are balanced
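The balance described in the last point can be stated more precisely. As a sketch, assuming the error e has mean zero and variance σ² and is independent of the training data (assumptions added here for the derivation), the expected squared prediction error at a point X splits into three parts, where the expectation is taken over training sets and noise:

```latex
\mathbb{E}\big[(Y - f'(X))^2\big]
  = \underbrace{\big(\mathbb{E}[f'(X)] - f(X)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(f'(X) - \mathbb{E}[f'(X)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the model, which is why reducing one at the expense of the other is called a tradeoff.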
These cases are often pictured as shots at a target. The center, i.e. the bull’s eye, is the model result we want to achieve, one that predicts all the values correctly. As we move away from the bull’s eye, our model starts to make more and more wrong predictions.
A model with low bias and high variance predicts points that are around the center generally, but pretty far away from each other. A model with high bias and low variance is pretty far away from the bull’s eye, but since the variance is low, the predicted points are closer to each other.
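In terms of model complexity, bias falls and variance rises as the model becomes more complex. The optimal complexity is the point where the total error, the sum of squared bias, variance, and irreducible noise, is lowest.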
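The effect of model complexity is easiest to see numerically. The sketch below, which assumes NumPy is available, repeatedly draws noisy training sets from a hypothetical true function (a sine curve chosen only for illustration), fits polynomials of different degrees, and estimates squared bias and variance from the spread of the predictions. The function names, noise level, and degrees are all illustrative choices.

```python
import numpy as np


def true_f(x):
    # Hypothetical "true" relationship f(X); chosen only for illustration.
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_trials=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # New noisy training set on every trial: Y = f(X) + e.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the true function.
    bias_sq = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    # Variance: spread of predictions across training sets.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

In this setup the straight line (degree 1) has the largest squared bias, while the degree-9 polynomial has the largest variance, matching the underfitting and overfitting cases described above.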
Ridge, Lasso, and Elastic Net Regression
Ridge Regression
Ridge regression is a small extension of the OLS cost function that adds a penalty as the complexity of the model increases. The more predictors you have in your data set, the higher the R² value and the higher the chance your model will overfit your data. Ridge regression is often referred to as L2 norm regularization. Its cost function is:
Cost = Σ (yᵢ − ŷᵢ)² + λ Σ mⱼ²
where mⱼ is the coefficient of the j-th predictor and λ ≥ 0 controls the strength of the penalty.
Keep in mind that the goal is to minimize the cost function, so the larger the penalty term (λ * sum(mⱼ²)), the higher the cost. The penalty therefore discourages the model from having too many or too large coefficients and shrinks them towards zero.
The most common use of Ridge regression is to be preemptive in addressing overfitting concerns. Ridge regression is a good tool for handling multicollinearity when you must keep all your predictors.
Ridge regression works well if there are many predictors whose coefficients are of about the same magnitude. This means all predictors have similar power to predict the target value.
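To illustrate how the penalty term λ * sum(mⱼ²) shrinks the coefficients, here is a minimal sketch of ridge regression using its closed-form solution, assuming NumPy is available. The data are synthetic, with one column made nearly a copy of another to mimic multicollinearity, and `ridge_fit` is a hypothetical helper rather than a library function.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge: minimise sum((y - Xm)^2) + lam * sum(m^2)."""
    # Centre the data so that the intercept is left out of the penalty.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    m = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ m
    return intercept, m


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Make the third column nearly a copy of the first (multicollinearity).
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

lambdas = (0.0, 1.0, 10.0, 100.0)
norms = []
for lam in lambdas:
    _, m = ridge_fit(X, y, lam)
    norms.append(float(np.sum(m ** 2)))
    print(f"lambda={lam}: coefficients={np.round(m, 3)}, sum(m^2)={norms[-1]:.3f}")
```

With λ = 0 the result is the plain OLS fit, whose coefficients on the two nearly identical columns can be large and unstable. As λ grows, sum(mⱼ²) falls and the weight is spread more evenly across the correlated columns, but none of the coefficients is forced to exactly zero.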
Lasso Regression
If you look at the Lasso cost function and think to yourself “that looks almost identical to Ridge regression,” you’re right for the most part. Lasso differs from Ridge regression by summing the absolute values of the coefficients (mⱼ) instead of summing their squared values, which is why it is often referred to as L1 norm regularization:
Cost = Σ (yᵢ − ŷᵢ)² + λ Σ |mⱼ|
Lasso is an acronym that stands for “Least Absolute Shrinkage and Selection Operator.” Because the penalty term is not squared, some coefficients can reach exactly 0. When a predictor's coefficient (mⱼ) reaches 0, that predictor no longer affects the model, so Lasso also performs feature selection.
Lasso tends to do well if there are few significant predictors and the magnitudes of the others are close to zero. Put another way, a few variables are much better predictors of the target value than the other predictors.
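The way Lasso drives coefficients to exactly zero can be seen in a small coordinate-descent sketch, assuming NumPy is available. Coordinate descent with soft-thresholding is one common way to fit Lasso, not necessarily the one any particular library uses. The data are synthetic, with only two of five predictors truly relevant, the model has no intercept for brevity, and `soft_threshold` and `lasso_fit` are hypothetical helper names.

```python
import numpy as np


def soft_threshold(rho, t):
    """Shrink rho towards zero by t; return exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def lasso_fit(X, y, lam, n_sweeps=200):
    """Coordinate descent for sum((y - Xm)^2) + lam * sum(|m|), no intercept."""
    n_features = X.shape[1]
    m = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_resid = y - X @ m + X[:, j] * m[j]
            rho = X[:, j] @ partial_resid
            # Minimising over m[j] alone gives a soft-thresholded update.
            m[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return m


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_m = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only two relevant predictors
y = X @ true_m + rng.normal(scale=0.5, size=100)

m_lasso = lasso_fit(X, y, lam=100.0)
print(np.round(m_lasso, 3))
```

The coefficients of the three irrelevant predictors end up at exactly zero, while the two relevant ones survive in shrunken form. This is the few-significant-predictors situation where Lasso tends to do well.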
Elastic Net Regression
So, what if I don’t want to choose? What if I don’t know what I want or need? Elastic Net regression was created as a critique of Lasso regression: while Lasso helps with feature selection, sometimes you don’t want to remove features aggressively. As you may have guessed, Elastic Net is a combination of both Lasso and Ridge regressions.
Its cost function has two λ terms:
Cost = Σ (yᵢ − ŷᵢ)² + λ₁ Σ |mⱼ| + λ₂ Σ mⱼ²
λ₁ is the penalty strength for the Lasso part of the regression and λ₂ is the penalty strength for the Ridge part. scikit-learn’s ElasticNet expresses the same idea with two parameters: alpha sets the overall penalty strength, and l1_ratio sets the share of the penalty given to the Lasso part. With l1_ratio = 0 it acts as a Ridge regression, and with l1_ratio = 1 it acts as a Lasso regression. Any value between 0 and 1 is a combination of Ridge and Lasso regression.
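The two limiting cases can be checked with a short sketch, assuming NumPy is available. It minimises sum((y − Xm)²) + λ₁ * sum(|mⱼ|) + λ₂ * sum(mⱼ²) by coordinate descent on synthetic data with no intercept. The function `elastic_net_fit` and its `lam1`/`lam2` arguments are hypothetical names for this illustration and do not correspond one-to-one to scikit-learn's parameters.

```python
import numpy as np


def elastic_net_fit(X, y, lam1, lam2, n_sweeps=500):
    """Coordinate descent for sum((y-Xm)^2) + lam1*sum(|m|) + lam2*sum(m^2)."""
    n_features = X.shape[1]
    m = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(n_features):
            partial_resid = y - X @ m + X[:, j] * m[j]
            rho = X[:, j] @ partial_resid
            # L1 part: soft-threshold; L2 part: extra term in the denominator.
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            m[j] = shrunk / (col_sq[j] + lam2)
    return m


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_m = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_m + rng.normal(scale=0.5, size=100)

m_ridge_like = elastic_net_fit(X, y, lam1=0.0, lam2=50.0)   # Ridge only
m_lasso_like = elastic_net_fit(X, y, lam1=100.0, lam2=0.0)  # Lasso only
m_mixed = elastic_net_fit(X, y, lam1=50.0, lam2=25.0)       # a mix of both

# Closed-form ridge solution for comparison with the lam1 = 0 case.
ridge_closed_form = np.linalg.solve(X.T @ X + 50.0 * np.eye(5), X.T @ y)
print(np.round(m_ridge_like, 3))
print(np.round(m_lasso_like, 3))
print(np.round(m_mixed, 3))
```

With the Lasso term switched off, the result matches the closed-form Ridge solution and keeps every coefficient non-zero. With the Ridge term switched off, the irrelevant coefficients are set to exactly zero, as in Lasso.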