5.3 (Q~B) Two means: T-Test
It is common to compare means from different samples. For instance, we might investigate the effectiveness of a certain educational intervention by looking for evidence of greater reading ability in the treatment group against a control group. That is, our research hypothesis is that reading ability of a child is associated with an educational intervention.
The null hypothesis states that there is no relationship, or no effect, of the educational intervention (binary explanatory variable) on the reading ability of the child (quantitative response variable). This can be written in symbols as follows:
H0:μ1=μ2 or H0:μ1−μ2=0
where μ1 is the average reading score for students in the control group (no intervention) and μ2 be the average reading score for students in the intervention group. Notice it can be written as one mean equals the other, but also as the difference between two means equaling zero. The alternative hypothesis HA states that there is a relationship:
HA:μ1≠μ2 or HA:μ1−μ2≠0
5.3.1 Assumptions
- The data distribution for each group is approximately normal.
- The scores are independent within each group.
- The scores from the two groups are independent of each other (i.e. the two samples are independent).
5.3.2 Sampling Distribution for the difference
We use ˉx1−ˉx2 as a point estimate for μ1−μ2, which has a standard error of
SEˉx1−ˉx2=√SE2ˉx1+SE2ˉx2=√σ21n1+σ22n2
So the equations for a Confidence Interval is,
(ˉx1−ˉx2)±tα2,df√σ21n1+σ22n2
and Test Statistic is,
t∗=(ˉx1−ˉx2)−d0(√σ21n1+σ22n2)
Typically it is unlikely that the population variances σ21 and σ22 are known so we will use sample variances s21 and s22 as estimates.
While you may never hand calculate these equations, it is important to see the format, or structure, of the equations. Equation ??? has the same format of
point estimate±2∗standard error
regardless what it is we’re actually trying to estimate. Thus in a pinch, you can calculate approximate confidence intervals for whatever estimate you are trying to understand, given only the estimate and standard error, even if the computer program does not give it to you easily or directly.
5.3.3 Example: Smoking and BMI
We would like to know, is there convincing evidence that the average BMI differs between those who have ever smoked a cigarette in their life compared to those who have never smoked? This example uses the Addhealth dataset.
1. Identify response and explanatory variables.
- The quantitative response variable is BMI (variable )
- The binary explanatory variable is whether the person has ever smoked a cigarette (variable )
2. Visualize and summarize bivariate relationship.
⚠️ Using na.omit()
is dangerous! This will remove ALL rows with ANY missing data in it. Regardless if the missing values are contained in the variables you are interested in.
The example below employs a trick/work around to not have NA values show in the output.
We take the data set addhealth
and then select
the variables we want to plot, and then we use na.omit()
to delete all rows with missing data. Then that is saved as a new, temporary data frame specifically named for this case (plot.bmi.smoke
).
note for later. Move this explanation into data viz section.
plot.bmi.smoke <- addhealth %>% select(eversmoke_c, BMI) %>% na.omit()
ggplot(plot.bmi.smoke, aes(x=eversmoke_c, y=BMI, fill=eversmoke_c)) +
geom_boxplot(width=.3) + geom_violin(alpha=.4) +
labs(x="Smoking status") +
scale_fill_viridis_d(guide=FALSE) +
stat_summary(fun.y="mean", geom="point", size=3, pch=17,
position=position_dodge(width=0.75))
plot.bmi.smoke %>% group_by(eversmoke_c) %>%
summarise(mean=mean(BMI, na.rm=TRUE),
sd = sd(BMI, na.rm=TRUE),
IQR = IQR(BMI, na.rm=TRUE))
## # A tibble: 2 × 4
## eversmoke_c mean sd IQR
## <fct> <dbl> <dbl> <dbl>
## 1 Non Smoker 29.7 7.76 9.98
## 2 Smoker 28.8 7.32 9.02
Smokers have an average BMI of 28.8, smaller than the average BMI of non-smokers at 29.7. Nonsmokers have more variation in their BMIs (sd 7.8 v. 7.3 and IQR 9.98 v. 9.02), but the distributions both look normal, if slightly skewed right.
3. Write the relationship you want to examine in the form of a research question.
- Null Hypothesis: There is no relationship between BMI and smoking status.
- Alternate Hypothesis: There is a relationship between BMI and smoking status.
4. Perform an appropriate statistical analysis.
I. Let μ1 denote the average BMI for nonsmokers, and μ2 the average BMI for smokers.
μ1−μ2=0 There is no difference in the average BMI between smokers and nonsmokers. μ1−μ2≠0 There is a difference in the average BMI between smokers and nonsmokers.
We are comparing the means between two independent samples. A Two-Sample T-Test for a difference in means will be conducted. The assumptions that the groups are independent is upheld because each individual can only be either a smoker or nonsmoker. The difference in sample means ¯x1−¯x2 is normally distributed – this is a valid assumption due to the large sample size and that differences typically are normally distributed. The observations are independent, and the variability is roughly equal (IQR 9.9 v. 9.0).
We use the
t.test
function, but use model notation of the formatoutcome
∼category
. Here,BMI
is our continuous outcome that we’re testing across the (binary) categorical predictoreversmoke_c
.
t.test(BMI ~ eversmoke_c, data=addhealth)
##
## Welch Two Sample t-test
##
## data: BMI by eversmoke_c
## t = 3.6937, df = 3395.3, p-value = 0.0002245
## alternative hypothesis: true difference in means between group Non Smoker and group Smoker is not equal to 0
## 95 percent confidence interval:
## 0.3906204 1.2744780
## sample estimates:
## mean in group Non Smoker mean in group Smoker
## 29.67977 28.84722
We have very strong evidence against the null hypothesis, p=0.0002.
5. Write a conclusion in the context of the problem.
On average, nonsmokers have a significantly higher BMI by 0.83 (0.39, 1.27) compared to nonsmokers (p=0.0002).
⚠️ Always check the output against the direction you are testing. R always will calculate a difference as group 1 - group 2, and it defines the groups alphabetically. For example, for a factor variable that has groups A and B, R will automatically calculate the difference as A-B. In this example it is Nonsmoker - Smoker.