When outcome data are expected to be more similar within subgroups than across subgroups, the data are typically called clustered outcomes. There are many ways data can be correlated:
- Sampling students within schools. (Within school correlation)
- Repeatedly measuring the same tree in different places. (Within-subject correlation)
- Repeatedly measuring the same tree over time. (Temporal correlation)
- Poverty measurements from different, but neighboring, counties. (Spatial correlation)
In these cases, the assumption of independence between observations is often violated: \(Cor(\epsilon_i, \epsilon_j)\neq 0\) for at least some pairs \(i\neq j\), typically those in the same cluster.
Analyses should take into account such correlation or else conclusions might not be valid.
Correlation is not limited to one level of subgroup; for example, repeated measurements may be taken on students who are themselves grouped within schools.
Models that account for multiple levels of clustering are often called multi-level models.
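To make within-school correlation concrete, the following is a minimal sketch rather than a prescribed analysis. It simulates test scores in which every student in a school shares a common school-level shift, then estimates the intraclass correlation (ICC) with the standard one-way ANOVA estimator for balanced groups. The function names, the baseline score of 70, and the standard deviations are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def simulate_scores(n_schools=23, n_students=30,
                    school_sd=5.0, student_sd=10.0, seed=1):
    """Return one list of scores per school (all values are hypothetical)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_schools):
        # Shift shared by every student in this school: the source of clustering.
        school_effect = rng.gauss(0.0, school_sd)
        data.append([70.0 + school_effect + rng.gauss(0.0, student_sd)
                     for _ in range(n_students)])
    return data

def anova_icc(groups):
    """One-way ANOVA estimate of the ICC, assuming equal group sizes."""
    k = len(groups)        # number of clusters (schools)
    n = len(groups[0])     # observations per cluster (students)
    group_means = [mean(g) for g in groups]
    grand = mean(group_means)  # equals the overall mean when groups are balanced
    msb = n * sum((m - grand) ** 2 for m in group_means) / (k - 1)
    msw = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g) / (k * (n - 1))
    between_var = (msb - msw) / n  # estimated between-school variance
    return between_var / (between_var + msw)

if __name__ == "__main__":
    # True ICC under these assumed settings: 5^2 / (5^2 + 10^2) = 0.2
    print(round(anova_icc(simulate_scores()), 3))
```

Setting `school_sd=0.0` removes the shared shift, so the estimated ICC should then sit near zero, which is the independent-observations case.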
Terminology: Fixed Effects vs Random Effects
- Fixed Effects: The variable is thought to have its own specific effect on the outcome relative to some reference group. The levels are fixed by nature; the levels observed in the data are the only ones that exist.
- The 23 schools in the data make up the entire universe of schools, and we are interested in how one particular school fares.
- Random Effects: The levels of the variable were chosen at random, and the levels observed in the data are thought of as representative of all possible levels that could be sampled. We are interested in the distribution of the effects rather than the effect of any one specific level.
- The 23 schools in the data set are a sample from a larger population of schools, and we are interested in the overall distribution of how schools affect the math score, not in any one specific school.
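The two views can be contrasted numerically. The sketch below uses a tiny, made-up data set of three schools; the scores and function names are hypothetical. Under the fixed-effects view, each school gets its own effect, measured against a reference school. Under the random-effects view, school deviations from the grand mean are treated as draws from a common distribution, so the sketch estimates the between-school variance and shrinks each deviation toward zero. The shrinkage formula used here is the usual one for balanced data with ANOVA variance estimates, and it is offered as an illustration rather than a full mixed-model fit.

```python
from statistics import mean

# Hypothetical math scores for three schools (illustrative numbers only).
scores = {"A": [60, 62, 64], "B": [70, 72, 74], "C": [80, 82, 84]}

def fixed_effects(scores, reference):
    """Each school's mean relative to the chosen reference school."""
    ref_mean = mean(scores[reference])
    return {s: mean(v) - ref_mean for s, v in scores.items()}

def random_effect_predictions(scores):
    """Shrunken school deviations and between-school variance (balanced data)."""
    k = len(scores)                        # number of schools
    n = len(next(iter(scores.values())))   # students per school
    means = {s: mean(v) for s, v in scores.items()}
    grand = mean(list(means.values()))
    msb = n * sum((m - grand) ** 2 for m in means.values()) / (k - 1)
    msw = sum((x - means[s]) ** 2
              for s, v in scores.items() for x in v) / (k * (n - 1))
    between_var = max((msb - msw) / n, 0.0)  # truncate negative estimates at 0
    # Weight on the school's own data; tends to 1 as n grows.
    # (Undefined if both variance estimates are 0; not handled in this sketch.)
    shrink = between_var / (between_var + msw / n)
    predictions = {s: shrink * (m - grand) for s, m in means.items()}
    return predictions, between_var

if __name__ == "__main__":
    print(fixed_effects(scores, reference="A"))
    preds, var_b = random_effect_predictions(scores)
    print(preds, var_b)
```

With the reference set to school A, the fixed-effects output answers "how does school B fare relative to A?", while `between_var` summarizes how much schools differ in general, which is the quantity of interest under the random-effects view.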