Analysis of Variance
Comparing population means by looking at their variance.
This document discusses the use of Analysis of Variance (ANOVA) to compare multiple population means in datasets characterized by factor variables. It outlines ANOVA’s key assumptions—such as independence, normality, and homogeneity of variances—and demonstrates methods for verifying these assumptions through statistical tests like Shapiro-Wilk and Levene’s tests. Practical applications are illustrated with R code for one-way ANOVA and the F-test, highlighting their significance in statistical analysis. The document aims to provide a foundational understanding of ANOVA’s application in research methodologies.
An analysis of variance (ANOVA) allows us – contrary to what the name might imply – to compare multiple population means. A population in this case is a subset of a dataset, which should be marked by a factor variable. The null hypothesis of an ANOVA is always that all population means are equal: H₀: μ₁ = μ₂ = … = μₖ.
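The cabbages dataset from the MASS package, used in the examples below, illustrates this structure: Cult and Date are factor variables, and each of their levels marks one population of VitC scores.

```r
# Cult and Date are factors; their levels mark the populations
# whose means an ANOVA compares (cabbages is from the MASS package)
library(MASS)

str(cabbages)
levels(cabbages$Cult)  # the populations in a one-way ANOVA on Cult
```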
Assumptions for an ANOVA
To conduct an analysis of variance, the underlying data needs to fulfil some preconditions.
- Independence of observations
- Normality
- Homogeneity of variance
The assumption of independence can be determined solely based on the study design. It requires that all cases be obtained independently and randomly from each population. To check whether the other assumptions are fulfilled, we can inspect the population distributions visually or run statistical tests.
Testing Normality
The assumption of normality states that all populations should follow a normal distribution. Normality of the distribution of the scores can be tested using tests such as Shapiro-Wilk or Kolmogorov-Smirnov, as described in the previous chapter. To run a Shapiro-Wilk test for many populations at the same time, we can make use of the tapply() function.
with(cabbages, tapply(VitC, Cult, shapiro.test))
Testing Homogeneity of Variance
Homogeneity of variance means that all populations show approximately the same variance. We can test the homogeneity of variances with three different tests, depending on how well the assumption of normality is fulfilled. Bartlett’s test is very sensitive to departures from normality, so it should only be applied if the data closely follow a normal distribution.
bartlett.test(VitC ~ Cult, cabbages) # one factor
bartlett.test(VitC ~ interaction(Cult, Date), cabbages) # multiple factors; interactions
Levene’s test is less sensitive to departures from normality, so it can be applied more generally than Bartlett’s test.
car::leveneTest(VitC ~ Cult, cabbages)
car::leveneTest(VitC ~ Cult * Date, cabbages)
Finally, the Fligner-Killeen test is the most robust against any potential departures from normality.
fligner.test(VitC ~ Cult, cabbages)
fligner.test(VitC ~ interaction(Cult, Date), cabbages) # good: no differences detectable
One-Way Analysis of Variance
In the one-way analysis of variance, we investigate whether there are significant differences among the populations of one factor variable, or, put more strongly: whether the factor variable has a significant effect on another variable. We can fit an ANOVA using the aov() function.
model <- aov(VitC ~ Date, cabbages)
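Calling summary() on the fitted model prints the ANOVA table, including the F statistic and its p-value:

```r
# Fit the one-way ANOVA and inspect the resulting table
# (cabbages comes from the MASS package)
library(MASS)

model <- aov(VitC ~ Date, cabbages)
summary(model)  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
```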
Checking the Preconditions Visually
Whether the assumptions of an ANOVA are fulfilled in our case can also be checked visually after fitting the model, by calling the generic plot() function on the model.
plot(model, which = 1:6) # ENTER to skip to next
How to interpret these plots is described in the chapter on the key assumptions of linear models.
The F-test
The F statistic in the context of an ANOVA (Analysis of Variance) or regression analysis is a way to test if there are any statistically significant differences between the means of three or more groups or if the explanatory variables in a linear regression model significantly predict the outcome variable, respectively. It is calculated based on the ratio of two types of variances (mean squares): the variance explained by the model (Mean Square Model, MSM) and the variance unexplained by the model (Mean Square Error, MSE). Here’s how it’s calculated in terms of the residual and model sum of squares:
- Model Sum of Squares (SSM): This represents the variance explained by the model. It’s the sum of squared differences between the predicted values and the overall mean of the dependent variable. It shows how much of the total variation in the dependent variable is explained by the independent variable(s).
- Residual Sum of Squares (SSE): This represents the unexplained variance by the model. It’s the sum of squared differences between the observed values and the predicted values. It indicates the variation in the dependent variable that the model does not explain.
- Total Sum of Squares (SST): This is the total variance in the dependent variable. It can be calculated as the sum of SSM and SSE, or directly as the sum of squared differences between the observed values and the overall mean of the dependent variable.
- Degrees of Freedom for Model (dfM): This is the number of independent variables in the model. For a simple linear regression, dfM would be 1 (since there’s only one independent variable), but it can be more for multiple regression or ANOVA models.
- Degrees of Freedom for Error (dfE): This is calculated as the total number of observations minus the number of parameters being estimated (the intercept and each predictor’s coefficient). In simpler terms, for a sample size n and p parameters (including the intercept), it would be dfE = n − p.
- Mean Square Model (MSM): This is calculated by dividing the Model Sum of Squares (SSM) by its respective degrees of freedom: MSM = SSM / dfM.
- Mean Square Error (MSE): This is calculated by dividing the Residual Sum of Squares (SSE) by its respective degrees of freedom: MSE = SSE / dfE.
The F statistic is then calculated as the ratio of the two mean squares:

F = MSM / MSE
This ratio follows an F-distribution under the null hypothesis that all group means are equal (in ANOVA) or that all regression coefficients except the intercept are zero (in regression analysis). A higher F value indicates that the model explains a significant portion of the variance in the dependent variable, leading to the rejection of the null hypothesis in favor of the alternative hypothesis that at least one of the group means is different (in ANOVA) or that at least one of the predictors is significantly related to the dependent variable (in regression analysis).
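The calculation above can be verified by hand against the output of aov(). A small sketch using the cabbages model from earlier (the intermediate variable names are ours):

```r
# Recompute the F statistic manually from the sums of squares
# reported by aov() (cabbages comes from the MASS package)
library(MASS)

model <- aov(VitC ~ Date, cabbages)
tab <- summary(model)[[1]]   # the ANOVA table as a data frame

SS <- tab[["Sum Sq"]]        # SSM (model row), SSE (residuals row)
df <- tab[["Df"]]            # dfM, dfE

MSM <- SS[1] / df[1]         # mean square of the model
MSE <- SS[2] / df[2]         # mean square error
F_manual <- MSM / MSE

F_manual                     # matches tab[["F value"]][1]
```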