Comparing two population means is very common. Often, we want to find out whether the two populations under study have the same mean or whether the two population means differ. The approach we take when studying two population means depends on whether the samples are independent or matched. When the samples are independent, the approach also depends on whether or not we know the population standard deviations.
Two populations are independent if the sample taken from population 1 is not related in any way to the sample taken from population 2. In this situation, any relationship between the samples or populations is entirely coincidental.
Throughout this section, we will use subscripts to identify the values for the means, sample sizes, and standard deviations for the two populations:
Symbol for: | Population 1 | Population 2 |
Population Mean | [latex]\mu_1[/latex] | [latex]\mu_2[/latex] |
Population Standard Deviation | [latex]\sigma_1[/latex] | [latex]\sigma_2[/latex] |
Sample Size | [latex]n_1[/latex] | [latex]n_2[/latex] |
Sample Mean | [latex]\overline{x}_1[/latex] | [latex]\overline{x}_2[/latex] |
Sample Standard Deviation | [latex]s_1[/latex] | [latex]s_2[/latex] |
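In order to construct a confidence interval or conduct a hypothesis test on the difference in two population means ([latex]\mu_1-\mu_2[/latex]), we need to use the distribution of the difference in the sample means [latex]\overline{x}_1-\overline{x}_2[/latex]:
The mean of the distribution of the difference in the sample means is [latex]\mu_1-\mu_2[/latex].
The standard deviation of the distribution of the difference in the sample means is [latex]\displaystyle{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}[/latex].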
As we saw when working with confidence intervals and hypothesis tests for a single population, when the population standard deviation is unknown we estimate it with the sample standard deviation and use a [latex]t[/latex]-distribution. We do the same thing when working with two population means. When the population standard deviations [latex]\sigma_1[/latex] and [latex]\sigma_2[/latex] are unknown, we use the sample standard deviations [latex]s_1[/latex] and [latex]s_2[/latex] as estimates, and we use a [latex]t[/latex]-distribution for the distribution of the difference in the sample means. The [latex]t[/latex]-score and the degrees of freedom are:
[latex]\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ \\ df & = & \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2} \end{eqnarray*}[/latex]
The degrees of freedom formula is somewhat complicated, but a computer makes the calculation manageable. The output from the degrees of freedom formula is rarely a whole number. After calculating the value of [latex]df[/latex] using the formula, round the value down to the next whole number to get the degrees of freedom for the [latex]t[/latex]-distribution.
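The degrees of freedom calculation can also be scripted. The following is a minimal Python sketch, not part of the original text: the function name `welch_df` is hypothetical, and the numbers are the photocopier sample values used in the confidence interval example later in this section.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Unrounded degrees of freedom for two independent samples
    with unknown population standard deviations."""
    v1 = s1 ** 2 / n1  # s1^2 / n1
    v2 = s2 ** 2 / n2  # s2^2 / n2
    numerator = (v1 + v2) ** 2
    denominator = v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1)
    return numerator / denominator

# Photocopier A: s1 = 19.4, n1 = 60. Photocopier B: s2 = 18.8, n2 = 70.
raw_df = welch_df(19.4, 60, 18.8, 70)
df = math.floor(raw_df)  # always round down to the next whole number
print(raw_df, df)
```

Using `math.floor` rather than `round` matches the rule of always rounding the degrees of freedom down.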
Suppose a sample of size [latex]n_1[/latex] with sample mean [latex]\overline{x}_1[/latex] and standard deviation [latex]s_1[/latex] is taken from population 1 and a sample of size [latex]n_2[/latex] with sample mean [latex]\overline{x}_2[/latex] and standard deviation [latex]s_2[/latex] is taken from population 2 where the populations are independent and the population standard deviations are unknown. The limits for the confidence interval with confidence level [latex]C[/latex] for the difference in the population means [latex]\displaystyle{\mu_1-\mu_2}[/latex] are:
[latex]\begin{eqnarray*} \text{Lower Limit} & = & \overline{x}_1-\overline{x}_2-t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \\ \text{Upper Limit} & = & \overline{x}_1-\overline{x}_2+t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \end{eqnarray*}[/latex]
where [latex]t[/latex] is the positive [latex]t[/latex]-score of the [latex]t[/latex]-distribution with [latex]\displaystyle{df=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}[/latex] so that the area under the curve in between [latex]-t[/latex] and [latex]t[/latex] is [latex]C\%[/latex].
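To find the [latex]t[/latex]-score to construct a confidence interval with confidence level [latex]C[/latex], use the t.inv.2t(area in the tails, degrees of freedom) function. For the area in the tails, enter [latex]1-C[/latex] with [latex]C[/latex] written as a decimal. For the degrees of freedom, enter the rounded-down value of [latex]df[/latex].
The output from the t.inv.2t function is the [latex]t[/latex]-score needed to construct the confidence interval.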
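For readers working outside a spreadsheet, the same [latex]t[/latex]-score can be approximated numerically. The sketch below is an illustration only and is not how the spreadsheet function is implemented: it integrates the [latex]t[/latex]-distribution density with Simpson's rule and then uses bisection to find the [latex]t[/latex] whose central area equals the confidence level. All function names are hypothetical, and only the Python standard library is assumed.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=2000):
    """Area under the curve between -t and t (Simpson's rule, steps is even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # double the area between 0 and t

def t_score_for_confidence(confidence, df):
    """Positive t with central area equal to confidence (bisection search)."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if central_area(mid, df) < confidence:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# 95% confidence with 123 degrees of freedom
t_score = t_score_for_confidence(0.95, 123)
print(t_score)
```

The search brackets the answer between 0 and 100, which is an assumption that holds for the confidence levels and degrees of freedom used in this section.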
A company that manufactures and services photocopiers wants to study the difference in the average repair time for the two different models of photocopiers they make. In a sample of 60 repairs of photocopier A, the mean repair time was 84.2 minutes with a standard deviation of 19.4 minutes. In a sample of 70 repairs of photocopier B, the mean repair time was 91.6 minutes with a standard deviation of 18.8 minutes. Construct a 95% confidence interval for the difference in the mean repair times of the two models.
Solution:
Photocopier A | Photocopier B |
[latex]n_1=60[/latex] | [latex]n_2=70[/latex] |
[latex]\overline{x}_1=84.2[/latex] | [latex]\overline{x}_2=91.6[/latex] |
[latex]s_1=19.4[/latex] | [latex]s_2=18.8[/latex] |
To find the confidence interval, we need to find the [latex]t[/latex]-score for the 95% confidence interval. This means that we need to find the [latex]t[/latex]-score so that the area in the tails is [latex]1-0.95=0.05[/latex].
[latex]\begin{eqnarray*} df & = & \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2} \\ & = & \frac{\left(\frac{19.4^2}{60}+\frac{18.8^2}{70}\right)^2}{\frac{1}{60-1} \times \left(\frac{19.4^2}{60}\right)^2+\frac{1}{70-1} \times \left(\frac{18.8^2}{70}\right)^2} \\ & = & 123.68\ldots \\ & \Rightarrow & 123 \end{eqnarray*}[/latex]
Function | t.inv.2t | Answer |
Field 1 | 0.05 | 1.9794… |
Field 2 | 123 |
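With the [latex]t[/latex]-score in hand, the limits of the interval follow from the sample statistics. The short Python sketch below is an added illustration: the function name `confidence_limits` is hypothetical, and the [latex]t[/latex]-score 1.9794 is taken from the table above as a truncated value, so the limits are approximate.

```python
import math

def confidence_limits(x1, s1, n1, x2, s2, n2, t_score):
    """Lower and upper limits for the difference in two population means."""
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    margin = t_score * std_error
    difference = x1 - x2
    return difference - margin, difference + margin

# Photocopier A is population 1, photocopier B is population 2
lower, upper = confidence_limits(84.2, 19.4, 60, 91.6, 18.8, 70, t_score=1.9794)
print(f"{lower:.2f} to {upper:.2f}")  # about -14.06 to -0.74 (minutes)
```

Both limits are negative, which is consistent with the mean repair time for photocopier A being lower than the mean repair time for photocopier B.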
Assuming that the population standard deviations are unknown, the p-value for a hypothesis test on the difference in two independent population means is the area in the tail(s) of the [latex]t[/latex]-distribution.
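If the p-value is the area in the left tail, use the t.dist(t-score, degrees of freedom, true) function. The output is the area to the left of the [latex]t[/latex]-score.
If the p-value is the area in the right tail, use the t.dist.rt(t-score, degrees of freedom) function. The output is the area to the right of the [latex]t[/latex]-score.
If the p-value is the sum of the area in the two tails, use the t.dist.2t(t-score, degrees of freedom) function, entering the positive value of the [latex]t[/latex]-score. The output is the sum of the areas in the two tails.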
The degrees of freedom for a [latex]t[/latex]-distribution must be a whole number. The output from the degrees of freedom formula [latex]\displaystyle{df=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}[/latex] is almost never a whole number. After calculating the value of [latex]df[/latex] using the formula, round the value down to the next whole number. Remember to enter the rounded-down value of [latex]df[/latex] for the degrees of freedom in the t.dist functions.
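The three tail cases can be sketched in Python as follows. This is an illustrative approximation, not the spreadsheet implementation: it integrates the [latex]t[/latex]-distribution density with Simpson's rule using only the standard library, and the names `area_left` and `p_value` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def area_left(t, df, steps=2000):
    """Area to the left of t, using Simpson's rule on [0, |t|] and symmetry."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3  # area between 0 and |t|
    return 0.5 + half if t >= 0 else 0.5 - half

def p_value(t, df, tail):
    """tail is 'left', 'right', or 'two'."""
    if tail == "left":
        return area_left(t, df)
    if tail == "right":
        return 1 - area_left(t, df)
    if tail == "two":
        return 2 * (1 - area_left(abs(t), df))
    raise ValueError("tail must be 'left', 'right', or 'two'")

print(p_value(-3.228, 51, "left"))   # left-tailed test
print(p_value(1.3975, 23, "right"))  # right-tailed test
print(p_value(2.0, 10, "two"))       # two-tailed test
```

The two-tailed case takes the absolute value of the [latex]t[/latex]-score first, mirroring the requirement to enter a positive [latex]t[/latex]-score in t.dist.2t.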
A researcher wants to study the difference between the average amount of time boys and girls aged seven to eleven spend playing sports each day. In a sample of 9 girls, the average number of hours spent playing sports per day is 2 hours with a standard deviation of 0.866 hours. In a sample of 16 boys, the average number of hours spent playing sports per day is 3.2 hours with a standard deviation of 1 hour. Both populations have a normal distribution. At the 5% significance level, is there a difference in the mean amount of time boys and girls aged seven to eleven play sports each day?
Solution:
Let girls be population 1 and boys be population 2. These populations are independent because there is no relationship between the two groups. From the question, we have the following information:
Girls | Boys |
[latex]n_1=9[/latex] | [latex]n_2=16[/latex] |
[latex]\overline{x}_1=2[/latex] | [latex]\overline{x}_2=3.2[/latex] |
[latex]s_1=0.866[/latex] | [latex]s_2=1[/latex] |
Hypotheses:
[latex]\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \neq 0 \end{eqnarray*}[/latex]
This is a test on the difference in two population means where the population standard deviations are unknown. So we use a [latex]t[/latex]-distribution to calculate the p-value. Because the alternative hypothesis is a [latex]\neq[/latex], the p-value is the sum of the areas in the two tails of the distribution.
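To use the t.dist.2t function, we need to calculate the [latex]t[/latex]-score and the degrees of freedom:
[latex]\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ & = & \frac{(2-3.2)-0}{\sqrt{\frac{0.866^2}{9}+\frac{1^2}{16}}} \\ & = & -3.1423\ldots \\ \\ df & = & \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2} \\ & = & \frac{\left(\frac{0.866^2}{9}+\frac{1^2}{16}\right)^2}{\frac{1}{9-1} \times \left(\frac{0.866^2}{9}\right)^2+\frac{1}{16-1} \times \left(\frac{1^2}{16}\right)^2} \\ & = & 18.84\ldots \\ & \Rightarrow & 18 \end{eqnarray*}[/latex]
Function | t.dist.2t | Answer |
Field 1 | 3.1423… | 0.0056 |
Field 2 | 18 |
So the p-value[latex]=0.0056[/latex].
Conclusion:
Because p-value[latex]=0.0056 \lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough evidence to suggest that there is a difference in the mean amount of time boys and girls aged seven to eleven play sports each day.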
A professor at a large community college taught both an online section and a face-to-face section of his statistics course. The professor wants to study the difference in the average score on the final exam, believing that the mean score for the online section would be lower than the face-to-face section. The professor randomly selected 30 final exam scores from each section and recorded the scores in the tables below.
Online Section:
67.6 | 41.2 | 85.3 | 55.9 | 82.4 | 91.2 | 73.5 | 94.1 | 64.7 | 64.7 |
70.6 | 38.2 | 61.8 | 88.2 | 70.6 | 58.8 | 91.2 | 73.5 | 82.4 | 35.5 |
94.1 | 88.2 | 64.7 | 55.9 | 88.2 | 97.1 | 85.3 | 61.8 | 79.4 | 79.4 |
Face-to-Face Section:
77.9 | 95.3 | 81.2 | 74.1 | 98.8 | 88.2 | 85.9 | 92.9 | 87.1 | 88.2 |
69.4 | 57.6 | 69.4 | 67.1 | 97.6 | 85.9 | 88.2 | 91.8 | 78.8 | 71.8 |
98.8 | 61.2 | 92.9 | 90.6 | 97.6 | 100 | 95.3 | 83.5 | 92.9 | 89.4 |
At the 5% significance level, is the mean of the final exam score for the online section lower than the mean of the final exam score for the face-to-face section?
Solution:
Let the online section be population 1 and the face-to-face section be population 2. These populations are independent because there is no relationship between the two groups. From the question, we have the following information:
Online | Face-to-Face |
[latex]n_1=30[/latex] | [latex]n_2=30[/latex] |
[latex]\overline{x}_1=72.85[/latex] | [latex]\overline{x}_2=84.98[/latex] |
[latex]s_1=16.918\ldots[/latex] | [latex]s_2=11.714\ldots[/latex] |
Hypotheses:
[latex]\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \lt 0 \end{eqnarray*}[/latex]
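This is a test on the difference in two population means where the population standard deviations are unknown. So we use a [latex]t[/latex]-distribution to calculate the p-value. Because the alternative hypothesis is a [latex]\lt[/latex], the p-value is the area in the left tail of the distribution.
To use the t.dist function, we need to calculate the [latex]t[/latex]-score and the degrees of freedom:
[latex]\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ & = & \frac{(72.85-84.98)-0}{\sqrt{\frac{16.918^2}{30}+\frac{11.714^2}{30}}} \\ & = & -3.228\ldots \\ \\ df & = & \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2} \\ & = & \frac{\left(\frac{16.918^2}{30}+\frac{11.714^2}{30}\right)^2}{\frac{1}{30-1} \times \left(\frac{16.918^2}{30}\right)^2+\frac{1}{30-1} \times \left(\frac{11.714^2}{30}\right)^2} \\ & = & 51.60\ldots \\ & \Rightarrow & 51 \end{eqnarray*}[/latex]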
Function | t.dist | Answer |
Field 1 | -3.228… | 0.0011 |
Field 2 | 51 | |
Field 3 | true |
So the p-value[latex]=0.0011[/latex].
Conclusion:
Because p-value[latex]=0.0011 \lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough evidence to suggest that the mean final exam score for the online section is lower than the mean final exam score for the face-to-face section.
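When the raw data are available, as in the exam score tables of this example, the summary statistics, the [latex]t[/latex]-score, and the degrees of freedom can all be computed directly from the scores. The following Python sketch is an added illustration using only the standard library; the variable names are hypothetical.

```python
import math
import statistics

online = [67.6, 41.2, 85.3, 55.9, 82.4, 91.2, 73.5, 94.1, 64.7, 64.7,
          70.6, 38.2, 61.8, 88.2, 70.6, 58.8, 91.2, 73.5, 82.4, 35.5,
          94.1, 88.2, 64.7, 55.9, 88.2, 97.1, 85.3, 61.8, 79.4, 79.4]
face_to_face = [77.9, 95.3, 81.2, 74.1, 98.8, 88.2, 85.9, 92.9, 87.1, 88.2,
                69.4, 57.6, 69.4, 67.1, 97.6, 85.9, 88.2, 91.8, 78.8, 71.8,
                98.8, 61.2, 92.9, 90.6, 97.6, 100, 95.3, 83.5, 92.9, 89.4]

n1, n2 = len(online), len(face_to_face)
x1, x2 = statistics.mean(online), statistics.mean(face_to_face)
# statistics.stdev is the sample standard deviation (divides by n - 1)
s1, s2 = statistics.stdev(online), statistics.stdev(face_to_face)

v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
# Under the null hypothesis, mu_1 - mu_2 = 0
t_score = (x1 - x2) / math.sqrt(v1 + v2)
df = math.floor((v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1)))

print(round(x1, 2), round(s1, 3), round(x2, 2), round(s2, 3))
print(t_score, df)
```

The resulting [latex]t[/latex]-score and degrees of freedom are the two values entered into the t.dist function in the solution.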
A study is done to determine if Company A retains its workers longer than Company B. Company A samples 15 workers, and their average time with the company is 5 years with a standard deviation of 1.2 years. Company B samples 20 workers, and their average time with the company is 4.5 years with a standard deviation of 0.8 years. The populations are normally distributed. At the 5% significance level, on average, do workers at Company A stay longer than workers at Company B?
Solution:
Let Company A be population 1 and Company B be population 2.
Hypotheses:
[latex]\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \gt 0 \end{eqnarray*}[/latex]
Function | t.dist.rt | Answer |
Field 1 | 1.3975… | 0.0878 |
Field 2 | 23 |
Conclusion:
Because p-value[latex]=0.0878 \gt 0.05=\alpha[/latex], we do not reject the null hypothesis. At the 5% significance level there is not enough evidence to suggest that, on average, workers at Company A stay longer than workers at Company B.
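The whole test can be collected into one small routine that works from summary statistics. The Python sketch below is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the null hypothesis is taken to be [latex]\mu_1-\mu_2=0[/latex], and the tail area is approximated with Simpson's rule instead of a spreadsheet function. It is run on the Company A and Company B figures from this example.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def right_tail_area(t, df, steps=2000):
    """Area to the right of t, using Simpson's rule on [0, |t|] and symmetry."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3  # area between 0 and |t|
    return 0.5 - half if t >= 0 else 0.5 + half

def two_sample_t_test(x1, s1, n1, x2, s2, n2, tail, alpha):
    """Return (t-score, rounded-down df, p-value, reject H0?)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (x1 - x2) / math.sqrt(v1 + v2)
    df = math.floor((v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1)))
    right = right_tail_area(t, df)
    if tail == "left":
        p = 1 - right
    elif tail == "right":
        p = right
    elif tail == "two":
        p = 2 * min(right, 1 - right)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return t, df, p, p < alpha

# Company A is population 1, Company B is population 2, H_a: mu_1 - mu_2 > 0
t, df, p, reject = two_sample_t_test(5, 1.2, 15, 4.5, 0.8, 20, "right", 0.05)
print(t, df, p, reject)
```

The p-value is compared with the significance level inside the function, so the final element of the result says whether the null hypothesis is rejected.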
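The general form of a confidence interval for the difference in two independent population means with unknown population standard deviations is
[latex]\begin{eqnarray*} \text{Lower Limit} & = & \overline{x}_1-\overline{x}_2-t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \\ \text{Upper Limit} & = & \overline{x}_1-\overline{x}_2+t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \end{eqnarray*}[/latex]
where [latex]t[/latex] is the positive [latex]t[/latex]-score of the [latex]t[/latex]-distribution with [latex]\displaystyle{df=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}[/latex] so that the area under the [latex]t[/latex]-distribution in between [latex]-t[/latex] and [latex]t[/latex] is [latex]C\%[/latex].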
The hypothesis test for the difference in two independent population means with unknown population standard deviations is a well-established process: