The Linear Regression Model

The linear regression model is arguably the most widely used statistical and machine learning model. But what is it exactly? And how does it work?

Abstract
In statistics, the term linear model is used in different ways according to the context. The most common occurrence is in connection with regression models and the term is often taken as synonymous with linear regression model.

The General Linear Model

Linear regression models are used in everything from biological, behavioral, environmental and social sciences to business. Because linear regression is a long-established statistical procedure, the properties of linear regression models are well understood. This document aims at summarizing these properties.

The most straightforward case of a linear model is the simple linear regression model, where we have a single dependent variable $\mathbf{y} \in \mathbb{R}^n$ and a single independent variable $\mathbf{x} \in \mathbb{R}^n$.

$$\mathbf{y} = \beta_0 + \beta_1 \mathbf{x} + \boldsymbol{\varepsilon} \tag{13.1}$$

Note that we denote scalars as small letters ($a$), vectors as small bold letters ($\mathbf{a}$), and matrices as large bold letters ($\mathbf{A}$). Thus, we are dealing with a single coefficient $\beta_1$ in (13.1) and two vectors representing the data. Additionally, we take into account an intercept term $\beta_0$. We can express multiple linear regression models (with one outcome variable $\mathbf{y} \in \mathbb{R}^n$ and multiple predictors $\mathbf{X} \in \mathbb{R}^{n \times p}$) simply as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{13.2}$$

whereby $\boldsymbol{\beta} \in \mathbb{R}^p$ is a vector of weights and $\boldsymbol{\varepsilon} \in \mathbb{R}^n$ is a vector of error terms. The matrix product $\mathbf{X}\boldsymbol{\beta}$ means that we calculate the following for every observation $y_i$:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} + \varepsilon_i \tag{13.3}$$

The intercept $\beta_0$ is a constant value that we add to every single prediction of the model. For the calculation, we simply add a constant column of ones to the data. This makes it redundant to write the intercept as a separate term in the matrix notation, but we need to remember to add that constant column to $\mathbf{X}$.
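As a minimal sketch of this last point (using the built-in swiss data purely for illustration; the variable choice is not part of the text above), the constant column is simply prepended to the predictor matrix:

X <- cbind(Intercept   = 1,                      # constant column for the intercept
           Agriculture = swiss$Agriculture,
           Education   = swiss$Education)
head(X)                                          # first rows of the design matrix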

If we deal with multiple (possibly correlated) dependent variables, we might use the multivariate linear regression model. In that case, every part of the equation becomes a matrix.

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \tag{13.4}$$

whereby $\mathbf{Y} \in \mathbb{R}^{n \times l}$ is a matrix with the outcome variables, $\mathbf{X} \in \mathbb{R}^{n \times p}$ a matrix with the predictors (the design matrix), $\mathbf{B} \in \mathbb{R}^{p \times l}$ the matrix with the coefficients, and $\mathbf{E} \in \mathbb{R}^{n \times l}$ the matrix containing all error terms. The expression in (13.4) contains all linear models, such as simple linear regression (13.1), multiple linear regression (13.2), and multivariate regression (13.4), but also the (multivariate) ANOVA and ANCOVA. That is why it is also called the general linear model.

Different Loss Functions

For the following discussion, we assume that we are dealing with a multiple linear regression model as represented in (13.2). We are therefore dealing with a coefficient vector $\boldsymbol{\beta}$, a label vector $\mathbf{y}$, and a design matrix $\mathbf{X}$.

In order to fit a hyperplane to our data, which is essentially what linear regression models do, we need to define what fitting well means. For that, we introduce the loss function $L$.

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\; L(\boldsymbol{\beta}) \tag{13.5}$$

The task of finding the best-fitting hyperplane now translates to the task of minimizing this loss function. One of the first loss functions we could probably think of is to just minimize the distance between every prediction and the corresponding observation. This is called the absolute loss:

$$L(\boldsymbol{\beta}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_1 = \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{13.6}$$

Note that we have to take the absolute value of the difference between predicted and observed values (denoted with $\lVert \cdot \rVert_1$), because we want to punish any deviation from the model's prediction, both positive and negative!

There is a dilemma, though: the absolute loss is not differentiable. Historically in particular, this posed a big problem: being able to differentiate the loss function $L(\boldsymbol{\beta})$ means that we can arrive at an analytical solution for its minimum, which is computationally very efficient. An alternative loss function which is differentiable is the squared loss.

$$L(\boldsymbol{\beta}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{13.7}$$

The squared loss is by far the most commonly used loss function for linear regression. However, one of its biggest downsides is that it is very strongly influenced by the largest deviations from the model, so-called outliers. If we are dealing with too many outliers, we might consider running a robust regression, which is less sensitive to deviations from any modelling assumptions. Consider for example the following loss function, called the Huber loss:

$$L(y_i - \hat{y}_i) = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \le \delta, \\ \delta\left(|y_i - \hat{y}_i| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases} \tag{13.8}$$

It resembles the squared loss within the tuning parameter $\delta$'s range and the absolute loss outside of it. This, however, imposes an immediate downside: we have to pick $\delta$.

1 Besides the historical roots in computational efficiency, there are also other reasons for using the squared loss, such as the optimality result provided by the Gauss-Markov theorem.
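The following minimal sketch plots the three loss functions side by side as plain R functions; it illustrates the definitions above, and the choice of delta is arbitrary.

absolute_loss <- function(r) abs(r)                      # (13.6), per residual
squared_loss  <- function(r) r^2                         # (13.7), per residual
huber_loss    <- function(r, delta = 1) {                # (13.8)
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - 0.5 * delta))
}
r <- seq(-3, 3, by = 0.1)
matplot(r, cbind(absolute_loss(r), squared_loss(r), huber_loss(r)),
        type = "l", xlab = "residual", ylab = "loss")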

Ordinary Least Squares

The most commonly used loss function for linear regression is the squared loss. We have seen before that this loss function comes with the nice advantage of an analytical solution to the minimization problem, where we can calculate $\hat{\boldsymbol{\beta}}$ in one go (this is called the closed-form solution to the minimization problem).

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \tag{13.9}$$

$$\begin{aligned}
L &= \lVert \mathbf{r} \rVert_2^2 = \mathbf{r}^\top\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \\
\frac{\partial L}{\partial \boldsymbol{\beta}} &= \frac{\partial}{\partial \boldsymbol{\beta}}\left(\mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\right) = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = 0 \\
2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} &= 2\mathbf{X}^\top\mathbf{y} \\
\hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}
\end{aligned}$$

In fact, this way of computing the coefficients $\boldsymbol{\beta}$ is so common that we call it the ordinary least squares (OLS) method. The Gauss-Markov theorem states that the ordinary least squares estimator is the best linear unbiased estimator (BLUE) for $\boldsymbol{\beta}$ if the errors $\boldsymbol{\varepsilon}$ in the linear regression model:

  1. are uncorrelated ($\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$),
  2. have equal variances ($\operatorname{Var}(\varepsilon_i) = \sigma^2$),
  3. and have an expected value of zero ($\mathbb{E}(\varepsilon_i) = 0$).

This means that if the data generating function — which is responsible for the data we are seeing — is indeed linear, and its errors follow the three Gauss-Markov assumptions, the ordinary least squares estimator will have no bias and the lowest possible variance among all linear unbiased estimators. For Gaussian errors, the OLS estimator is identical to the maximum likelihood estimate, which we obtain by maximizing the likelihood function.

$$\max_{\boldsymbol{\beta}} \left\{ \prod_{i=1}^{n} \frac{1}{\sigma_y \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \boldsymbol{\beta}^\top\mathbf{x}_i}{\sigma_y}\right)^2\right] \right\} \tag{13.10}$$

In this case, $\mathbf{x}_i$ denotes the $i$th row of the matrix $\mathbf{X}$. However, the Gauss-Markov theorem does not require us to assume this Gaussian distribution of the errors $\boldsymbol{\varepsilon}$: as long as the three assumptions above hold, ordinary least squares still yields the best linear unbiased estimates.
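As a minimal sketch (not taken from the text above), we can maximize the likelihood in (13.10) numerically, i.e. minimize the negative log-likelihood, and check that it recovers the OLS coefficients; the swiss data, the starting values, and the helper name negloglik are arbitrary choices for illustration.

X <- cbind(1, swiss$Agriculture)             # design matrix with intercept column
y <- swiss$Fertility
negloglik <- function(theta) {
  beta  <- theta[1:2]
  sigma <- exp(theta[3])                     # parameterize on the log scale so sigma > 0
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
fit <- optim(c(mean(y), 0, log(sd(y))), negloglik)
fit$par[1:2]                                 # close to coefficients(lm(Fertility ~ Agriculture, swiss))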

This figure shows an example of the Gauss-Markov assumptions being valid (left), violated due to heteroskedasticity (center), and violated due to autocorrelation (right). The latter can be detected using (partial) autocorrelation functions.

However, even though the Gauss-Markov theorem does not make any assumption about the distribution of the error term $\boldsymbol{\varepsilon}$ and we can still obtain best linear unbiased estimates for the coefficients $\hat{\boldsymbol{\beta}}$, we often also look at the statistical test results (especially when we call the lm function in R), in particular the p-values. Those are not per se a part of ordinary least squares, and they very much do assume that the errors are normally distributed!

Goodness of Fit

One of our core interests when looking at a model is its goodness of fit: how well does our model fit the underlying data? Or, conversely, how much variance in the data is left unexplained by the model?

The differences between our model's predictions $\hat{\mathbf{y}}$ and the observed values $\mathbf{y}$ are called the residuals $\mathbf{r}$.

$$\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} \tag{13.11}$$

Note how the coefficient vector is marked as the estimated coefficient vector $\hat{\boldsymbol{\beta}}$. The residuals refer to the deviations of the data from our model. This is the core difference between them and the error term $\boldsymbol{\varepsilon}$, which is the deviation from the true function:

$$\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \tag{13.12}$$

Closely related to the residuals, there are three metrics often calculated with a regression analysis: the sum of squared residuals ($SS_\text{resid.}$), the regression sum of squares ($SS_\text{reg.}$), and the total sum of squares ($SS_\text{tot.}$).

$$\begin{aligned}
SS_\text{resid.} &= \mathbf{r}^\top\mathbf{r} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
SS_\text{reg.} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
SS_\text{tot.} &= SS_\text{resid.} + SS_\text{reg.} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\end{aligned} \tag{13.13}$$

Another value that can be computed for every linear regression model is the residual standard error (RSE). It is essentially an estimate of the standard deviation of the residuals $\mathbf{r}$. The RSE is derived from the sum of squared residuals, and it becomes larger for (unnecessarily) added degrees of freedom.

$$RSE = \sqrt{\frac{\sum_{i=1}^{n} r_i^2}{n - p}} = \sqrt{\frac{SS_\text{resid.}}{n - p}} \tag{13.14}$$

All these previously defined values give an idea of the model's goodness of fit, but they are hard to interpret because their unit depends on the particular model. Much easier to interpret is the coefficient of determination $R^2$, which is the squared Pearson correlation coefficient $\rho$ between the predicted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ and the observed values $\mathbf{y}$ and thus always lies between 0 and 1.

$$R^2 = \rho^2 = \operatorname{cor}(\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{y})^2 = 1 - \frac{SS_\text{resid.}}{SS_\text{tot.}} \tag{13.15}$$

This metric can only increase with an increasing number of predictors, and even for completely random input variables it will approach 1 as $p$ approaches $n$. Thus, it is not the best metric for the goodness of fit of a linear model! If anything, we should rather look at the adjusted $R^2$.

$$R^2_\text{adj.} = 1 - \frac{n - 1}{n - p}(1 - R^2)$$

2 Actually, the equation $SS_\text{tot.} = SS_\text{resid.} + SS_\text{reg.}$ does not hold true for every linear regression model — it requires that the model is fit by ordinary least squares and includes an intercept term.

3 Except if tested on independent data, as is often done following a machine-learning paradigm. However, in the classical use of linear regression we are often dealing with limited data sets, so this approach might not always be feasible.
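The quantities above map directly onto a few lines of R. The following minimal sketch (the swiss data and model are arbitrary illustrations) computes them for a fitted model; the results can be checked against summary(model):

model <- lm(Fertility ~ Agriculture, data = swiss)
y     <- swiss$Fertility
y_hat <- fitted(model)
n <- length(y); p <- length(coefficients(model))

ss_resid <- sum((y - y_hat)^2)               # (13.13)
ss_reg   <- sum((y_hat - mean(y))^2)
ss_tot   <- ss_resid + ss_reg

rse    <- sqrt(ss_resid / (n - p))           # (13.14), cf. summary(model)$sigma
r2     <- 1 - ss_resid / ss_tot              # (13.15), cf. summary(model)$r.squared
r2_adj <- 1 - (n - 1) / (n - p) * (1 - r2)   # cf. summary(model)$adj.r.squared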

Test Statistics for OLS

Normally, two tests are applied to a linear model: an F-test and a t-test. The F-test looks at the model as a whole, while the t-test evaluates the individual coefficients. The F-test assesses multiple coefficients simultaneously and gives the p-value of the model overall. It can be used to compare different models (although only models with the same $\mathbf{y}$). We can compute the F-statistic as follows:

$$F = \frac{n - p}{p - 1} \cdot \frac{R^2}{1 - R^2}$$

We can compute the p-value of the F-statistic by comparing it to the F-distribution, using the R function pf. The F-distribution depends on two degrees of freedom: $df_1 = p - 1$ and $df_2 = n - p$.
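A minimal sketch of this computation (the model and data are arbitrary, chosen to match the examples elsewhere in this document); the result can be compared with the F statistic reported by summary(model):

model <- lm(Fertility ~ ., data = swiss)
r2 <- summary(model)$r.squared
n  <- nrow(swiss)
p  <- length(coefficients(model))            # columns of X, including the intercept

f_stat <- (n - p) / (p - 1) * r2 / (1 - r2)
p_val  <- pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)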

The test that we are probably most interested in is the t-test. The test is based on the assumption that the errors $\boldsymbol{\varepsilon}$ are normally distributed. Based on this assumption, we calculate the probability of having estimated a coefficient $\hat{\beta}_i$ as large as or larger than we did under the unknown but true scenario of the null hypothesis $\mu_0\colon \beta_i = 0$.

$$t_0 = \frac{\hat{\beta} - \mu_0}{\operatorname{se}_{\hat{\beta}}} = \frac{\hat{\beta}}{\operatorname{se}_{\hat{\beta}}} \tag{13.16}$$

where $\operatorname{se}_{\hat{\beta}}$ are the standard errors of the coefficients. The null hypothesis $\mu_0$ is that any given coefficient is zero, so that term can be dropped. We can compute the standard errors directly from the data and the residuals in the following way:

$$\operatorname{se}_{\hat{\boldsymbol{\beta}}} = \sqrt{\operatorname{diag}\left[\frac{\lVert \mathbf{r} \rVert_2^2}{n - p}\,(\mathbf{X}^\top\mathbf{X})^{-1}\right]} \tag{13.17}$$

where $\mathbf{r}$ are the residuals, $n$ is the number of observations (number of rows in $\mathbf{X}$) and $p$ the number of parameters (number of columns in $\mathbf{X}$).

4 Note that the t-test does not return probability estimates of the coefficients themselves, which is a common misconception! To do this, we would need to use Bayesian inference.

5 However, R calculates the standard errors via the QR decomposition of the design matrix rather than by explicitly inverting $\mathbf{X}^\top\mathbf{X}$, mainly for numerical stability and speed.

The t-distribution is defined as follows:

$$f(t) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \tag{13.18}$$

where $\Gamma$ is the gamma function and $\nu$ the degrees of freedom.

Given that, we can compute the p-value from the t-distribution using the pt function in R. Formally, the p-value is defined as:

$$p = P(|t| > |t_0| \mid H_0) \tag{13.19}$$

The direction from which we look at the t-distribution varies depending on the sign of the coefficient. To remedy that fact, we look at the absolute value of our t-statistic ($|t_0|$).

As discussed above, the p-value is effectively the probability of obtaining an estimate of a coefficient equal to or more extreme than the one we got, assuming the data generating function is a Gaussian with its mean given by the null hypothesis $\mu_0\colon \beta_i = 0$ and its variance given by the variance of the residuals. Since we use the residual variance, the result of the t-test depends on the overall model specification. This is why specifying a model with different co-variables leads to varying p-values and significance levels for the same variable.
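The following minimal sketch reproduces the coefficient table of summary(lm(...)) directly from (13.16), (13.17) and (13.19); the model and data are, again, arbitrary illustrations.

model <- lm(Fertility ~ Agriculture, data = swiss)
X <- model.matrix(model)                     # design matrix, including intercept column
r <- residuals(model)
n <- nrow(X); p <- ncol(X)

se    <- sqrt(diag(sum(r^2) / (n - p) * solve(t(X) %*% X)))   # (13.17)
t0    <- coefficients(model) / se                             # (13.16), with mu_0 = 0
p_val <- 2 * pt(abs(t0), df = n - p, lower.tail = FALSE)      # two-sided, cf. (13.19)

cbind(estimate = coefficients(model), se, t0, p_val)          # compare with summary(model)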

Weighted Least Squares

In our list of standard assumptions about the error term in the multiple linear regression model, we included one that incorporates both homoskedasticity and the absence of autocorrelation. That is, the individual values of the errors are assumed to be generated by a random process whose variance $\sigma_\varepsilon^2$ is constant, and all possible distinct pairs of these values are uncorrelated. This implies that the full error vector $\boldsymbol{\varepsilon}$ has a diagonal covariance matrix, $\boldsymbol{\Sigma} = \sigma_\varepsilon^2 \mathbf{I}$.

6 The error covariance matrix is sometimes denoted $\boldsymbol{\Sigma}$ and sometimes $\boldsymbol{\Omega}$; both notations are common in the literature, and both appear below.

To assess whether we should apply weighted least squares, we can perform a Breusch-Pagan test, which tests for heteroskedasticity of the errors.
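A minimal sketch, assuming the lmtest package is installed (it provides bptest(); the model is an arbitrary illustration):

ols <- lm(Fertility ~ Agriculture, data = swiss)
lmtest::bptest(ols)      # a small p-value indicates evidence of heteroskedasticity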

After an initial OLS step, we can perform weighted least squares as follows: we obtain the weights by predicting the model's residuals from the predicted values with a second OLS estimation.

$$\mathbf{W} = \boldsymbol{\Omega}^{-1} = \operatorname{diag}\left(\frac{1}{\left[(\mathbf{r}^\top\mathbf{r})^{-1}\mathbf{r}^\top\hat{\mathbf{y}}\right]^2}\right)$$

$$\hat{\boldsymbol{\beta}}_\text{WLS} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$$

-> Apply when the covariance matrix $\boldsymbol{\Omega}$ of the errors $\boldsymbol{\varepsilon}$ is a diagonal matrix, but not a scaled version of the identity matrix.

We can estimate the covariance matrix of the residuals $\mathbf{r}$ as follows:

$$\hat{\sigma}_r^2 = \frac{\mathbf{r}^\top\mathbf{r}}{n - p}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top, \qquad \operatorname{Cov}(\mathbf{r}) = \hat{\sigma}_r^2(\mathbf{I} - \mathbf{H}) = \boldsymbol{\Omega}$$
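A minimal sketch of such a two-step (feasible) weighted least squares fit, assuming the swiss data; regressing the absolute residuals on the fitted values is one simple way to implement the idea described above, not the only one:

ols <- lm(Fertility ~ Agriculture, data = swiss)
aux <- lm(abs(residuals(ols)) ~ fitted(ols))       # model the error scale
w   <- 1 / fitted(aux)^2                           # weights ~ 1 / estimated variance
wls <- lm(Fertility ~ Agriculture, data = swiss, weights = w)
summary(wls)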

Generalized Least Squares

-> Apply when the covariance matrix $\boldsymbol{\Omega}$ of the errors $\boldsymbol{\varepsilon}$ is not a diagonal matrix.

Non-spherical errors…
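For reference (a standard result added here, not taken from the text above), the GLS estimator replaces the implicit identity weighting of OLS with the inverse error covariance matrix:

$$\hat{\boldsymbol{\beta}}_\text{GLS} = (\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{y}$$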

Linear Mixed Effects Models

A mixed model, mixed-effects model or mixed error-component model is a statistical model containing both fixed effects and random effects. In matrix notation, a linear mixed model can be represented as follows.

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\zeta} + \boldsymbol{\varepsilon}$$

In this notation, $\mathbf{y}$ is a known vector of observations, where $\mathbb{E}(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$. Unlike in the OLS model, the deviations from $\mathbf{X}\boldsymbol{\beta}$ in this mixed effects model come from two sources: on one hand from the error vector $\boldsymbol{\varepsilon}$, and on the other hand from the random effects $\mathbf{Z}\boldsymbol{\zeta}$, where we have both $\mathbb{E}(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\mathbb{E}(\boldsymbol{\zeta}) = \mathbf{0}$. So, overall, the expected error in the model is still zero. However, through the term $\mathbf{Z}\boldsymbol{\zeta}$, errors can be clustered into groups, where $\mathbf{Z}$ is a design matrix containing information about these groups.

Estimation

This is how the linear mixed effects model is estimated according to Henderson (1954).

$$\begin{pmatrix} \mathbf{X}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{X} & \mathbf{X}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{Z} \\ \mathbf{Z}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{X} & \mathbf{Z}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{Z} + \boldsymbol{\Sigma}_\zeta^{-1} \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\zeta}} \end{pmatrix} = \begin{pmatrix} \mathbf{X}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{y} \\ \mathbf{Z}^\top\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{y} \end{pmatrix} \tag{13.20}$$
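In practice, such models are rarely solved by hand. A minimal sketch with the lme4 package (assumed installed; the sleepstudy data ships with lme4 and is used here purely as an illustration), where a random intercept per subject plays the role of the $\mathbf{Z}\boldsymbol{\zeta}$ term:

library(lme4)
mixed <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(mixed)           # fixed effects (beta) and random effect variances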

Regularization

Lasso regression

Laplacian prior over coefficients.

$$J(\boldsymbol{\beta}, \mathbf{X}, \mathbf{y}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_1 \tag{13.21}$$

Ridge regression

Gaussian priors over coefficients.

$$J(\boldsymbol{\beta}, \mathbf{X}, \mathbf{y}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_2^2 \tag{13.22}$$

Elastic net

$$\hat{\boldsymbol{\beta}}_\text{elastic net} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{ \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda\left[\tfrac{1}{2}(1 - \alpha)\lVert \boldsymbol{\beta} \rVert_2^2 + \alpha\lVert \boldsymbol{\beta} \rVert_1\right]\right\} \tag{13.23}$$
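A minimal sketch with the glmnet package (assumed installed); its alpha argument corresponds to the $\alpha$ in (13.23), so alpha = 1 gives the lasso penalty (13.21) and alpha = 0 the ridge penalty (13.22). The swiss data is an arbitrary illustration.

library(glmnet)
X <- as.matrix(swiss[, -1])                  # predictors (all columns except Fertility)
y <- swiss$Fertility

lasso <- glmnet(X, y, alpha = 1)
ridge <- glmnet(X, y, alpha = 0)
enet  <- cv.glmnet(X, y, alpha = 0.5)        # cross-validate lambda for the elastic net
coef(enet, s = "lambda.min")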

Notation

Table 13.1: Table of notation
Notation Description
$a \in \mathbb{R}$ A real scalar value.
$\mathbf{a} \in \mathbb{R}^n$ A real vector with $n$ elements.
$\mathbf{A} \in \mathbb{R}^{n \times p}$ A real matrix with $n$ rows and $p$ columns.
$\mathbf{0} = [0, 0, \ldots, 0]^\top$ The zero vector.
$\lVert \mathbf{a} \rVert_1 := \sum_{i=1}^{n} |a_i|$ The absolute value or $\ell_1$ vector norm.
$\lVert \mathbf{a} \rVert_2 := \sqrt{\sum_{i=1}^{n} a_i^2}$ The Euclidean or $\ell_2$ vector norm.
$\langle \mathbf{a}, \mathbf{a} \rangle = \mathbf{a}^\top\mathbf{a} = \lVert \mathbf{a} \rVert_2^2 = \sum_{i=1}^{n} a_i^2$ The inner product of a vector $\mathbf{a}$ with itself.
$\mathbf{X} \in \mathbb{R}^{n \times p}$ The design matrix containing the predictors.
$\mathbf{y} \in \mathbb{R}^n$ The response vector.
$\mathbf{r} \in \mathbb{R}^n$ The residual vector.
$\boldsymbol{\varepsilon} \in \mathbb{R}^n$ The error vector.
$\boldsymbol{\beta} \in \mathbb{R}^p$ The coefficient vector.
$\mathbf{W} \in \mathbb{R}^{n \times n}$ The (diagonal) weight matrix.
$\boldsymbol{\Omega} \in \mathbb{R}^{n \times n}$ The variance-covariance matrix of the errors.

7 This norm is also known as the Manhattan or taxicab norm due to its geometric interpretation.

Introduction

Regression analysis

The linear model is fundamentally used for regression analyses. Formally, we can define a (multivariate) regression model as a function $f$ as follows.

$$f\colon \mathbf{X} \in \mathbb{R}^{n \times p} \mapsto \mathbf{Y} \in \mathbb{R}^{n \times q} \tag{13.24}$$

We call it a regression model whenever the response variable(s) $\mathbf{Y}$ is continuous. The predictors $\mathbf{X}$ might be continuous, binary, nominal or ordinal. If $\mathbf{Y}$ were not continuous, we would speak of (multi-class) classification rather than regression.

There are a lot of regression models

Curve fitting methods and the messages they send. Illustration by Todd Gureckis.

House of Cards is not a real method, but a common consequence of mis-application of polynomial regression — the model becomes extremely unstable at its fringes.

What linear models are used for

  • Inference: Establish whether a relationship exists – this combines the linear model with statistical tests.
  • Descriptive statistics: Develop an appropriate model to describe data which cannot be fully visualized anymore due to its dimensionality.
  • Prediction or Forecasting: Linear models are used to predict or forecast future values based on existing data. This could be anything from predicting future sales based on past data to forecasting weather patterns.

Functional form

Linear models describe a continuous response variable Y as a function of one or more predictor variables X.

$$y_i = \beta_0 x_{i,0} + \beta_1 x_{i,1} + \ldots + \beta_p x_{i,p} + \varepsilon_i = \sum_{j=0}^{p} \beta_j x_{i,j} + \varepsilon_i$$

where $x_{i,0} = 1$ for all $i$ (the constant intercept column).

We may write this more compactly in linear algebra notation.

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

X and y are known, while β and ε are unknown.

Components

Design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$
A matrix containing the independent variable values, a.k.a. the predictors, a.k.a. the features.
Response vector $\mathbf{y} \in \mathbb{R}^n$
A vector containing the dependent variable values, a.k.a. the response, a.k.a. the labels.
Coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^p$
A vector containing the parameters of the linear model — which are called the coefficients.
Error vector $\boldsymbol{\varepsilon} \in \mathbb{R}^n$
A vector containing the errors of the model.

Finding the parameters

Loss minimization

The coefficient vector $\boldsymbol{\beta}$ could theoretically take any value. However, we deem some values better than others – we want to minimize the size of the residuals $\mathbf{r}$.

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \lVert \mathbf{r} \rVert_2^2 = \underset{\boldsymbol{\beta}}{\arg\min}\; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2$$

We take the squared Euclidean norm ($\lVert \mathbf{a} \rVert_2^2 = \mathbf{a}^\top\mathbf{a}$) of the residual vector $\mathbf{r}$, which means we calculate the loss ($L$) as:

$$L = \mathbf{r}^\top\mathbf{r} = \lVert \mathbf{r} \rVert_2^2 = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 = \sum_{i=1}^{n} r_i^2$$

Sum of squared error

Solving the equation

$$\begin{aligned}
L &= \lVert \mathbf{r} \rVert_2^2 = \mathbf{r}^\top\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \\
\frac{\partial L}{\partial \boldsymbol{\beta}} &= \frac{\partial}{\partial \boldsymbol{\beta}}\left(\mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\right) = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = 0 \\
2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} &= 2\mathbf{X}^\top\mathbf{y} \\
\hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}
\end{aligned}$$

Extremely easy steps

Starting out with the equation for the linear model, we want to isolate $\boldsymbol{\beta}$.

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$$

We want to separate $\mathbf{X}$ and $\boldsymbol{\beta}$. We know that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$, but only square matrices can be inverted. So, we multiply both sides by $\mathbf{X}^\top$.

$$\mathbf{X}^\top\mathbf{y} = \mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$

Since $\mathbf{X}^\top\mathbf{X}$ is square, this allows us to quickly find the solution for $\boldsymbol{\beta}$.

$$\boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

Model statistics

Two statistical tests are commonly applied

  • F-test: This test essentially compares two models: one model that includes the predictors of interest ($H_A\colon \boldsymbol{\beta} \neq \mathbf{0}$) and a reduced model that excludes these predictors ($H_0\colon \boldsymbol{\beta} = \mathbf{0}$).
  • t-test: This test is used to determine whether a single predictor variable has a statistically significant relationship with the response variable ($H_0\colon \beta_i = 0$ and $H_A\colon \beta_i \neq 0$).

The F-test

In the linear regression model, we can calculate the F-statistic as follows.

$$F = \frac{\nu_\varepsilon\, SS_M}{\nu_m\, SS_E} = \frac{(n - p)\,\lVert \mathbf{X}\hat{\boldsymbol{\beta}} - \bar{y}\mathbf{1} \rVert_2^2}{(p - 1)\,\lVert \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} \rVert_2^2}$$

Here, $\nu_\varepsilon = n - p$ refers to the degrees of freedom of the error, while $\nu_m = p - 1$ refers to the degrees of freedom of the model. $SS_M$ is the sum of squares of the model, and $SS_E$ is the sum of squares of the errors.

The t-test

$$t_i = \frac{\hat{\beta}_i}{\operatorname{se}_{\hat{\beta}_i}}$$

It’s all the same

How to apply this in R

Now that we know that $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, we can get the coefficients of any linear model using matrix multiplication (%*%), transposition (t()), and inversion (solve()).

y <- swiss$Fertility                                        # response vector
X <- cbind(Intercept = 1, Agriculture = swiss$Agriculture)  # design matrix with intercept column
beta <- solve(t(X) %*% X) %*% t(X) %*% y                    # (X'X)^{-1} X'y

We can do this even faster with the lm() function.

model <- lm(Fertility ~ Agriculture, data = swiss)
coefficients(model)
(Intercept) Agriculture 
 60.3043752   0.1942017 

Extensions

  • Mixed effects models
  • Polynomial regression
  • Regularization

Linearisation

Polynomial regression

Interactions

Weighted Least Squares

Weighted least squares (WLS) is a generalization of ordinary least squares (OLS) that can be particularly useful in several scenarios:

  1. Heteroskedasticity: If the error terms in a regression model have non-constant variance across observations, it can lead to inefficient OLS estimates. WLS allows for the modeling of this non-constant variance by assigning weights that are inversely proportional to the variance of the error terms.

  2. Incorporating Different Levels of Precision: In some datasets, different observations may be measured with varying degrees of precision. WLS can be used to give more weight to the observations that are measured with higher precision, reflecting the additional information they contain.

  3. Modeling Correlation between Observations: If observations are correlated (as in time series or panel data), WLS can be used to model this correlation, leading to more efficient estimates.

  4. Analyzing Survey Data: Often in survey data, different respondents or groups might be oversampled to ensure that the sample is representative. WLS can be used to weight the observations in a way that reflects the sampling strategy, ensuring that the estimates are representative of the population of interest.

  5. Combining Information from Different Sources: If you are combining data from different sources that are measured on different scales or have different units, WLS can be used to appropriately weight the information from each source.

  6. Fitting Models to Non-Linear Relationships: When fitting a linear regression model to a relationship that is inherently nonlinear, transforming the dependent and/or independent variables might create heteroskedasticity in the errors. WLS can be used to model this heteroskedasticity.

  7. Robust Estimation: WLS can also be part of an iterative procedure to achieve robust estimates of regression parameters, by assigning lower weights to potential outliers or observations that have high leverage.

  8. Generalized Estimating Equations (GEE): In longitudinal data, WLS is used as a part of GEE to account for the within-subject correlation, providing a more flexible correlation structure.

Weighted Least Squares is a versatile technique that can be applied in many different contexts to achieve more efficient, consistent, or representative estimates. It requires careful selection and justification of the weighting scheme, as an incorrect choice can lead to biased or inefficient estimates. The weights often need to be known (or estimated) up to a multiplicative constant, which means that it’s the relative weights that typically matter, not their absolute values.

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$$

Iteratively reweighted least squares

We can use iterative reweighting to find the coefficients of a model for which no closed-form solution exists. For example, we can apply the absolute loss function to a linear model, i.e. find the $\hat{\boldsymbol{\beta}}$ that satisfies

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_1$$

To find such a $\hat{\boldsymbol{\beta}}$, we can run the following algorithm:

  1. Estimate $\hat{\boldsymbol{\beta}}^{t=0}$ via OLS.
  2. Calculate the weights $w_i$ as $w_i = |r_i|^{p-2}$, where $p$ is the desired degree of the $\ell_p$ norm, i.e. $p = 1$ in this case.
  3. Estimate $\hat{\boldsymbol{\beta}}^{t+1}$ via weighted least squares using the weights $\mathbf{w}$ from before.
  4. Go back to step 2, using $\hat{\boldsymbol{\beta}}^{t+1}$. Repeat until convergence.
p <- 1                                   # target norm; p = 1 corresponds to the absolute loss
model <- lm(Fertility ~ ., swiss)        # step 1: initial OLS fit

B <- matrix(NA, nrow = 500, ncol = length(coefficients(model)))
B[1, ] <- coefficients(model)

for (i in 2:nrow(B)) {
  # step 2: weights w_i = |r_i|^(p - 2); pmax() guards against zero residuals
  w <- pmax(abs(residuals(model)), 1e-8)^(p - 2)
  # step 3: refit via weighted least squares, then store the coefficients
  model <- lm(Fertility ~ ., swiss, weights = w)
  B[i, ] <- coefficients(model)
}
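A quick way to check convergence (an illustrative addition, not part of the algorithm above) is to look at how much the coefficients still change between the last iterations, or to plot their paths:

max(abs(B[nrow(B), ] - B[nrow(B) - 1, ]))            # ~0 once the algorithm has converged
matplot(B, type = "l", xlab = "iteration", ylab = "coefficient value")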

Generalized least squares

Generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in the regression model.

  • OLS: $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$, i.e. the observations are uncorrelated, and the covariance matrix of the errors $\boldsymbol{\Omega}$ is diagonal and a scaled identity matrix $\mathbf{I}$. OLS is useful if the Gauss-Markov assumptions are satisfied.
  • WLS: The covariance matrix of the errors $\boldsymbol{\Omega}$ is diagonal, but the variance is not constant. WLS is useful in the presence of heteroskedasticity.
  • GLS: $\boldsymbol{\Omega}$ is not diagonal, i.e. the errors are correlated. GLS is useful in the presence of autocorrelation.
Figure 13.1: Example of relationships where the Gauss-Markov assumptions are fulfilled (left), heteroskedasticity is present (center), and autocorrelation is present (right).
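A minimal sketch of a GLS fit, assuming the nlme package; an AR(1) correlation structure is used purely to illustrate the call (the swiss data has no real time ordering):

library(nlme)
gls_fit <- gls(Fertility ~ Agriculture, data = swiss,
               correlation = corAR1(form = ~ 1))    # errors correlated across successive rows
summary(gls_fit)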


Regularization

  • Shrinkage estimator
  • Variable selection
  • Small n, large p problem