Analysis of Variance
Comparing population means by looking at their variance.
This document discusses the use of Analysis of Variance (ANOVA) to compare multiple population means in datasets characterized by factor variables. It outlines ANOVA’s key assumptions—such as independence, normality, and homogeneity of variances—and demonstrates methods for verifying these assumptions through statistical tests like Shapiro-Wilk and Levene’s tests. Practical applications are illustrated with R code for one-way ANOVA and the F-test, highlighting their significance in statistical analysis. The document aims to provide a foundational understanding of ANOVA’s application in research methodologies.
An analysis of variance (ANOVA) allows us – contrary to what the name might imply – to compare multiple population means. A population in this case is a subset of a dataset, which should be marked by a factor variable. The null hypothesis of an ANOVA is always that all population means are equal: H₀: μ₁ = μ₂ = … = μₖ.
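The cabbages dataset from the MASS package, used in the examples below, illustrates this structure: Cult and Date are factor variables, and each of their levels marks one population of VitC scores.

```r
# Cult and Date are factors; their levels mark the populations
# whose means an ANOVA compares (cabbages is from the MASS package)
library(MASS)

str(cabbages)
levels(cabbages$Cult)  # the populations in a one-way ANOVA on Cult
```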
Assumptions for an ANOVA
To conduct an analysis of variance, the underlying data needs to fulfil some preconditions.
- Independence of observations
- Normality
- Homogeneity of variance
The assumption of independence can be determined solely based on the study design. It requires that all cases be obtained independently and randomly from each population. To check whether the other assumptions are fulfilled, we can inspect the population distributions visually or run statistical tests.
Testing Normality
The assumption of normality states that all populations should follow a normal distribution. Normality of the distribution of the scores can be tested using tests such as Shapiro-Wilk or Kolmogorov-Smirnov, as described in the previous chapter. To run a Shapiro-Wilk test for many populations at the same time, we can make use of the tapply() function.
with(cabbages, tapply(VitC, Cult, shapiro.test))
Testing Homogeneity of Variance
Homogeneity of variance means that all populations show approximately the same variance. We can test the homogeneity of variances with three different tests, depending on how well the assumption of normality is fulfilled. Bartlett’s test is very sensitive to departures from normality, so it should only be applied if the data closely follow a normal distribution.
bartlett.test(VitC ~ Cult, cabbages) # one factor
bartlett.test(VitC ~ interaction(Cult, Date), cabbages) # multiple factors; interactions
Levene’s test is less sensitive to departures from normality, so it can be applied more generally than Bartlett’s test.
car::leveneTest(VitC ~ Cult, cabbages)
car::leveneTest(VitC ~ Cult * Date, cabbages)
Finally, the Fligner-Killeen test is the most robust against any potential departures from normality.
fligner.test(VitC ~ Cult, cabbages)
fligner.test(VitC ~ interaction(Cult, Date), cabbages) # good: no differences detectable
One-Way Analysis of Variance
In the one-way analysis of variance, we investigate whether there are significant differences among the populations of one factor variable, or, put more strongly: whether the factor variable has a significant effect on another variable. We can fit an ANOVA using the aov() function.
model <- aov(VitC ~ Date, cabbages)
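Calling summary() on the fitted model prints the ANOVA table, including the F statistic and its p-value:

```r
# Fit the one-way ANOVA and inspect the resulting table
# (cabbages comes from the MASS package)
library(MASS)

model <- aov(VitC ~ Date, cabbages)
summary(model)  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
```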
Checking the Preconditions Visually
Whether the assumptions of an ANOVA are fulfilled in our case can also be checked visually after fitting the model, by calling the generic plot() function on the model.
plot(model, which = 1:6) # ENTER to skip to next
How to interpret these plots is described in the chapter on the key assumptions of linear models.
The F-test
The F statistic in the context of an ANOVA (Analysis of Variance) or regression analysis is a way to test if there are any statistically significant differences between the means of three or more groups or if the explanatory variables in a linear regression model significantly predict the outcome variable, respectively. It is calculated based on the ratio of two types of variances (mean squares): the variance explained by the model (Mean Square Model, MSM) and the variance unexplained by the model (Mean Square Error, MSE). Here’s how it’s calculated in terms of the residual and model sum of squares:
- Model Sum of Squares (SSM): This represents the variance explained by the model. It’s the sum of squared differences between the predicted values and the overall mean of the dependent variable. It shows how much of the total variation in the dependent variable is explained by the independent variable(s).
- Residual Sum of Squares (SSE): This represents the unexplained variance by the model. It’s the sum of squared differences between the observed values and the predicted values. It indicates the variation in the dependent variable that the model does not explain.
- Total Sum of Squares (SST): This is the total variance in the dependent variable. It can be calculated as the sum of SSM and SSE, or directly as the sum of squared differences between the observed values and the overall mean of the dependent variable.
- Degrees of Freedom for Model (dfM): This is the number of independent variables in the model. For a simple linear regression, dfM would be 1 (since there’s only one independent variable), but it can be more for multiple regression or ANOVA models.
- Degrees of Freedom for Error (dfE): This is calculated as the total number of observations minus the number of parameters being estimated (the intercept and each predictor’s coefficient). In simpler terms, for a sample size n and p parameters (including the intercept), it would be dfE = n − p.
- Mean Square Model (MSM): This is calculated by dividing the Model Sum of Squares (SSM) by its respective degrees of freedom: MSM = SSM / dfM.
- Mean Square Error (MSE): This is calculated by dividing the Residual Sum of Squares (SSE) by its respective degrees of freedom: MSE = SSE / dfE.
The F statistic is then calculated as the ratio of the two mean squares:

F = MSM / MSE
This ratio follows an F-distribution under the null hypothesis that all group means are equal (in ANOVA) or that all regression coefficients except the intercept are zero (in regression analysis). A higher F value indicates that the model explains a significant portion of the variance in the dependent variable, leading to the rejection of the null hypothesis in favor of the alternative hypothesis that at least one of the group means is different (in ANOVA) or that at least one of the predictors is significantly related to the dependent variable (in regression analysis).
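The calculation above can be verified by hand against the output of aov(). A small sketch using the cabbages model from earlier (the intermediate variable names are ours):

```r
# Recompute the F statistic manually from the sums of squares
# reported by aov() (cabbages comes from the MASS package)
library(MASS)

model <- aov(VitC ~ Date, cabbages)
tab <- summary(model)[[1]]   # the ANOVA table as a data frame

SS <- tab[["Sum Sq"]]        # SSM (model row), SSE (residuals row)
df <- tab[["Df"]]            # dfM, dfE

MSM <- SS[1] / df[1]         # mean square of the model
MSE <- SS[2] / df[2]         # mean square error
F_manual <- MSM / MSE

F_manual                     # matches tab[["F value"]][1]
```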