ANOVA with SPSS
Never, ever, run any statistical test without performing EDA first!
What's wrong with t-tests?
Nothing, except ...
If you want to compare three groups using t-tests at the usual 0.05 level of significance, you would have to compare the groups pairwise (A to B, A to C, B to C), so the chance of getting at least one false positive result would be:
1 - (0.95 x 0.95 x 0.95) = 14.3%
If you wanted to compare four groups, the chance of a false positive would be 1 - (0.95)^6 = 26%, and for five groups, 40%. Not good, is it? So we use ANOVA. Never perform multiple t-tests: anyone on this module discovered performing multiple t-tests when they should use ANOVA will be shot!
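Where do these figures come from? Comparing k groups pairwise takes k(k-1)/2 t-tests (3 tests for three groups, 6 for four, 10 for five), and each test has a 0.95 chance of avoiding a false positive, so the overall ("familywise") error rate is:

1 - (0.95)^(k(k-1)/2)

which works out to 14.3%, 26.5% and 40.1% for three, four and five groups respectively.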
ANalysis Of VAriance (ANOVA) is such an important statistical method that it would be easy to spend a whole module on this test alone. Like the t-test, ANOVA is a parametric test which assumes:
•data is numerical data representing samples from normally distributed populations
•the variances of the groups are "similar"
•the sizes of the groups are "similar"
•the groups should be independent
so it's important to carry out EDA before starting ANOVA! In fact, ANOVA is quite a robust procedure, so as long as the groups are of broadly similar size and variance, the test is normally reliable.
ANOVA tests the null hypothesis that the means of all the groups being compared are equal, and produces a statistic called F which is analogous to the t-statistic from a t-test (for two groups, F = t²). But there's a catch. If the means of all the groups tested by ANOVA are equal, fine. But if the result tells us to reject the null hypothesis, we still don't know which of the means differ. We solve this problem by performing what is known as a "post hoc" (after the event) test.
Reminder:
•Independent variable: Variables which are experimentally manipulated by an investigator are called independent variables.
•Dependent variable: Variables which are measured are called dependent variables (because they are presumed to depend on the value of the independent variable).
ANOVA jargon:
•Way = an independent variable, so a one-way ANOVA has one independent variable, two-way ANOVA has two independent variables, etc. Simple ANOVA tests the hypothesis that means from two or more samples are equal (drawn from populations with the same mean). Student's t-test is actually a particular application of one-way ANOVA (two groups compared).
•Factor = a test or measurement. Single-factor ANOVA tests whether the means of the groups being compared are equal and returns a yes/no answer; two-factor ANOVA simultaneously tests two factors, e.g. tumour size after treatment with different drugs and/or radiotherapy (drug treatment is one factor and radiotherapy is another). So "factor" and "way" are alternative terms for the same thing (independent variables).
•Repeated measures: Used when members of a sample are measured under different conditions. As the sample is exposed to each condition, the measurement of the dependent variable is repeated. Standard ANOVA is not appropriate here because it fails to take into account the correlation between the repeated measures, violating the assumption of independence. Repeated measures designs arise for several reasons, e.g. longitudinal research, which measures each sample member at each of several ages - age is a repeated factor. This is comparable to a paired t-test.
The array of options for different ANOVA tests in SPSS is confusing, so I'll go through the most important bits using some examples.
One-Way / Single-Factor ANOVA:
Data:
Pain Scores for Analgesics
Drug: Pain Score:
Diclofenac 0, 35, 31, 29, 20, 7, 43, 16
Ibuprofen 30, 40, 27, 25, 39, 15, 30, 45
Paracetamol 16, 33, 25, 32, 21, 54, 57, 19
Aspirin 55, 58, 56, 57, 56, 53, 59, 55
Since it would be unethical to withhold pain relief, there is no control group and we are just interested in knowing whether one drug performs better (lower pain score) than another, so we need to perform a one-way/single-factor ANOVA.
We enter this data into SPSS using dummy values (1, 2, 3, 4) for the drugs so this numeric data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: Compare Means: One-Way ANOVA
Dependent variable: Pain Score
Factor: Drug:
•SPSS allows many different post hoc tests. Click Post Hoc and select the Tukey and Games-Howell tests.
◦The Tukey test is powerful and widely accepted, but is parametric in that it assumes that the population variances are equal. It also assumes that the sample sizes are equal. If this is not the case, you should use Gabriel's procedure, or if the sizes are very different, use Hochberg's GT2.
◦Games-Howell does not assume that population variances or sample sizes are equal, so is a good alternative if the variances or sample sizes turn out to be unequal.
•Click Options and select Homogeneity of Variance Test, Brown-Forsythe and Welch. The homogeneity of variance test is important since this is an assumption of ANOVA, but if this assumption turns out to be broken, the Brown-Forsythe and Welch options will display alternative versions of the F statistic which means you may still be able to use the result.
•Click OK to run the tests.
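If you prefer syntax to the dialogs, the commands pasted from the steps above look roughly like this (a sketch, assuming the variables have been named Pain and Drug - adjust to match your own Variable View):

* One-way ANOVA with Levene's test, robust F tests and two post hoc tests.
ONEWAY Pain BY Drug
  /STATISTICS HOMOGENEITY BROWNFORSYTHE WELCH
  /POSTHOC=TUKEY GH ALPHA(0.05).

Running this syntax produces the same output as the dialogs.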
Output:
Test of Homogeneity of Variances: Pain
Levene Statistic   df1   df2   Sig.
4.837              3     28    .008
The significance value for homogeneity of variances is <.05, so the variances of the groups are significantly different. Since this is an assumption of ANOVA, we need to be very careful in interpreting the outcome of this test:
ANOVA: Pain
Sum of Squares df Mean Square F Sig.
Between Groups 4956.375 3 1652.125 11.967 .000
Within Groups 3865.500 28 138.054
Total 8821.875 31
This is the main ANOVA result. The significance value comparing the groups (drugs) is <.05, so we could reject the null hypothesis (there is no difference in the mean pain scores with the four drugs). However, since the variances are significantly different, this might be the wrong answer. Fortunately, the Welch and Brown-Forsythe statistics can still be used in these circumstances:
Robust Tests of Equality of Means: Pain
Statistic df1 df2 Sig.
Welch 32.064 3 12.171 .000
Brown-Forsythe 11.967 3 18.889 .000
The significance value of these are both <.05, so we still reject the null hypothesis. However, this result does not tell us which drugs are responsible for the difference, so we need the post hoc test results:
Multiple Comparisons
Dependent Variable: Pain
(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD 1 2 -8.750 5.875 .457 -24.79 7.29
3 -9.500 5.875 .386 -25.54 6.54
4 -33.500(*) 5.875 .000 -49.54 -17.46
2 1 8.750 5.875 .457 -7.29 24.79
3 -.750 5.875 .999 -16.79 15.29
4 -24.750(*) 5.875 .001 -40.79 -8.71
3 1 9.500 5.875 .386 -6.54 25.54
2 .750 5.875 .999 -15.29 16.79
4 -24.000(*) 5.875 .002 -40.04 -7.96
4 1 33.500(*) 5.875 .000 17.46 49.54
2 24.750(*) 5.875 .001 8.71 40.79
3 24.000(*) 5.875 .002 7.96 40.04
Games-Howell 1 2 -8.750 6.176 .513 -27.05 9.55
3 -9.500 7.548 .602 -31.45 12.45
4 -33.500(*) 5.194 .001 -50.55 -16.45
2 1 8.750 6.176 .513 -9.55 27.05
3 -.750 6.485 .999 -20.09 18.59
4 -24.750(*) 3.471 .001 -36.03 -13.47
3 1 9.500 7.548 .602 -12.45 31.45
2 .750 6.485 .999 -18.59 20.09
4 -24.000(*) 5.558 .014 -42.26 -5.74
4 1 33.500(*) 5.194 .001 16.45 50.55
2 24.750(*) 3.471 .001 13.47 36.03
3 24.000(*) 5.558 .014 5.74 42.26
* The mean difference is significant at the .05 level.
The Tukey test relies on homogeneity of variance, so we ignore these results. The Games-Howell post hoc test does not rely on homogeneity of variance (this is why we used two different post hoc tests) and so can be used. SPSS kindly flags (*) which differences are significant!
Result: Drug 4 (Aspirin) produces a significantly different result from the other three drugs.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the pain scores for aspirin and the other three drugs tested, F(3,28) = 11.97, p < .05.
Two-Factor ANOVA
Do anti-cancer drugs have different effects in males and females?
Data:
Drug:         cisplatin       vinblastine     5-fluorouracil
Gender:       Female  Male    Female  Male    Female  Male
Tumour Size:  65      50      70      45      55      35
70 55 65 60 65 40
60 80 60 85 70 35
60 65 70 65 55 55
60 70 65 70 55 35
55 75 60 70 60 40
60 75 60 80 50 45
50 65 50 60 50 40
We enter this data into SPSS using dummy values for the drugs (1, 2, 3) and genders (1,2) so the coded data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: General Linear Model: Univariate
Dependent variable: Tumour Diameter
Fixed Factors: Gender, Drug:
Also select:
Post Hoc: Tukey and Games-Howell:
Options:
Display Means for: Gender, Drug, Gender*Drug
Descriptive Statistics
Homogeneity tests:
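For reference, the pasted syntax for these choices looks roughly like this (a sketch, assuming the variables have been named Diameter, Gender and Drug):

* Two-factor ANOVA (GLM Univariate) with descriptives and Levene's test.
* Tukey and Games-Howell post hoc tests are requested on Drug.
UNIANOVA Diameter BY Gender Drug
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /POSTHOC=Drug(TUKEY GH)
  /EMMEANS=TABLES(Gender)
  /EMMEANS=TABLES(Drug)
  /EMMEANS=TABLES(Gender*Drug)
  /PRINT=DESCRIPTIVE HOMOGENEITY
  /CRITERIA=ALPHA(.05)
  /DESIGN=Gender Drug Gender*Drug.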
Output:
Levene's Test of Equality of Error Variances(a)
Dependent Variable: Diameter
F       df1   df2   Sig.
1.462   5     42    .223
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a Design: Intercept+Gender+Drug+Gender * Drug
The significance result for homogeneity of variance is >.05, which shows that the error variance of the dependent variable is equal across the groups, i.e. the assumption of the ANOVA test has been met.
Tests of Between-Subjects Effects
Dependent Variable: Diameter
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
Corrected Model 3817.188(a) 5 763.438 10.459 .000
Intercept 167442.188 1 167442.188 2294.009 .000
Gender 42.188 1 42.188 .578 .451
Drug 2412.500 2 1206.250 16.526 .000
Gender * Drug 1362.500 2 681.250 9.333 .000
Error 3065.625 42 72.991
Total 174325.000 48
Corrected Total 6882.813 47
a R Squared = .555 (Adjusted R Squared = .502)
The Drug and Gender * Drug effects are significant (<.05), but there is no main effect of gender (p = .451). Again, this does not tell us which drugs behave differently, so again we need to look at the post hoc tests:
Multiple Comparisons
Dependent Variable: Diameter
(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD cisplatin vinblastine -1.25 3.021 .910 -8.59 6.09
5-fluorouracil 14.38(*) 3.021 .000 7.04 21.71
vinblastine cisplatin 1.25 3.021 .910 -6.09 8.59
5-fluorouracil 15.63(*) 3.021 .000 8.29 22.96
5-fluorouracil cisplatin -14.38(*) 3.021 .000 -21.71 -7.04
vinblastine -15.63(*) 3.021 .000 -22.96 -8.29
Games-Howell cisplatin vinblastine -1.25 3.329 .925 -9.46 6.96
5-fluorouracil 14.38(*) 3.534 .001 5.64 23.11
vinblastine cisplatin 1.25 3.329 .925 -6.96 9.46
5-fluorouracil 15.63(*) 3.699 .001 6.50 24.75
5-fluorouracil cisplatin -14.38(*) 3.534 .001 -23.11 -5.64
vinblastine -15.63(*) 3.699 .001 -24.75 -6.50
Based on observed means.
* The mean difference is significant at the .05 level.
In this example, we can use the Tukey or Games-Howell results. Again, SPSS helpfully flags which results have reached statistical significance. We already know from the main ANOVA table that the effect of gender is not significant, but the post hoc tests show which drugs produce significantly different outcomes.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the tumour diameter for 5-fluorouracil and the other two drugs tested, F(5,42) = 10.46, p < .05.
Repeated Measures ANOVA
Remember that one of the assumptions of ANOVA is independence of the groups being compared. In lots of circumstances, we want to test the same thing repeatedly, e.g:
•Patients with a chronic disease after 3, 6 and 12 months of drug treatment
•Repeated sampling from the same location, e.g. spring, summer, autumn and winter
•etc
This type of study reduces variability in the data and so increases the power to detect effects, but violates the assumption of independence, so as with the paired t-test, we need to use a special form of ANOVA called repeated measures. In a parametric test, the assumption that the variances of the differences between all pairs of conditions are equal is called "sphericity". Violating sphericity means that the F statistic cannot be compared to the usual tables of F, and so the software cannot calculate an accurate significance value. SPSS includes a procedure called Mauchly's test which tells us if the assumption of sphericity has been violated:
•If Mauchly’s test statistic is significant (i.e. p < .05), we conclude that the condition of sphericity has not been met.
•If Mauchly’s test statistic is non-significant (i.e. p > .05), it is reasonable to conclude that the variances of the differences between conditions are not significantly different.
If Mauchly’s test is significant then we cannot trust the F-ratios produced by SPSS unless we apply a correction (which, fortunately, SPSS helps us to do).
One-Way Repeated Measures ANOVA
i.e. one independent variable, e.g. pain score after surgery:
Patient1 Patient2 Patient3
1 3 1
2 5 3
4 6 6
5 7 4
5 9 1
6 10 3
This data can be entered directly into SPSS. Note that each column represents a repeated measures variable (patients in this case). There is no need for a coding variable (as with between-group designs, above):
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret! Next:
Analyze: General Linear Model: Repeated Measures
Within-Subject factor name: Patient
Number of Levels: 3 (because there are 3 patients)
Click Add, then Define (factors):
There are no proper post hoc tests for repeated measures variables in SPSS. However, via the Options button, you can use the paired t-test procedure to compare all pairs of levels of the independent variable, and then apply a Bonferroni correction to the probability at which you accept any of these tests. The resulting probability value should be used as the criterion for statistical significance. A ‘Bonferroni correction’ is achieved by dividing the probability value (usually 0.05) by the number of tests conducted, e.g. if we compare all levels of the independent variable of these data, we make three comparisons and so the appropriate significance level is 0.05/3 = 0.0167. Therefore, we accept t-tests as being significant only if they have a p value <0.0167.
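The pasted syntax for this design looks roughly like this (a sketch, assuming the three columns have been named Patient1, Patient2 and Patient3; the EMMEANS subcommand produces the Bonferroni-adjusted pairwise comparisons just described):

* One-way repeated measures ANOVA.
* COMPARE ADJ(BONFERRONI) gives Bonferroni-adjusted pairwise comparisons.
GLM Patient1 Patient2 Patient3
  /WSFACTOR=Patient 3 Polynomial
  /METHOD=SSTYPE(3)
  /EMMEANS=TABLES(Patient) COMPARE ADJ(BONFERRONI)
  /PRINT=DESCRIPTIVE
  /WSDESIGN=Patient.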
Output:
Mauchly's Test of Sphericity
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon:
                                                                        Greenhouse-Geisser   Huynh-Feldt   Lower-bound
patient                  .094          9.437                2    .009   .525                 .544          .500
Mauchly’s test is significant (p <.05) so we conclude that the assumption of sphericity has not been met.
Tests of Within-Subjects Effects
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
patient Sphericity Assumed 44.333 2 22.167 8.210 .008
Greenhouse-Geisser 44.333 1.050 42.239 8.210 .033
Huynh-Feldt 44.333 1.088 40.752 8.210 .031
Lower-bound 44.333 1.000 44.333 8.210 .035
Error(patient) Sphericity Assumed 27.000 10 2.700
Greenhouse-Geisser 27.000 5.248 5.145
Huynh-Feldt 27.000 5.439 4.964
Lower-bound 27.000 5.000 5.400
Because the significance values are <.05, we conclude that there was a significant difference between the three patients, but this test does not tell us which patients differed from each other. The next issue is which of the three corrections to use. Going back to Mauchly's test:
•If epsilon is >0.75, use the Huynh-Feldt correction.
•If epsilon is <0.75, or nothing is known about sphericity at all, use the Greenhouse-Geisser correction.
•In this example, the epsilon values from Mauchly's test are 0.525 (Greenhouse-Geisser) and 0.544 (Huynh-Feldt), both <0.75, so we use the Greenhouse-Geisser corrected values. Using this correction, F is still significant because its p value is 0.033, which is <.05.
Post Hoc Tests:
Pairwise Comparisons
(I) patient   (J) patient   Mean Difference (I-J)   Std. Error   Sig.(a)   95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -2.833(*) .401 .003 -4.252 -1.415
3 .833 .946 1.000 -2.509 4.176
2 1 2.833(*) .401 .003 1.415 4.252
3 3.667 1.282 .106 -.865 8.199
3 1 -.833 .946 1.000 -4.176 2.509
2 -3.667 1.282 .106 -8.199 .865
Based on estimated marginal means
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
Formal reporting:
Mauchly’s test indicated that the assumption of sphericity had been violated (chi-square = 9.44, p < .05), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (epsilon = 0.53). The results show that the pain scores of the three patients differed significantly, F(1.05, 5.25) = 8.21, p < .05. Post hoc tests revealed that although the pain score of Patient2 was significantly higher than that of Patient1 (p = .003), Patient3's score was not significantly different from either of the other patients (both p > .05).
Two-Way Repeated Measures ANOVA
i.e. two independent variables:
In a study of the best way to keep fields free of weeds for an entire growing season, a farmer treated test plots in 10 fields with either five different concentrations of weedkiller (independent variable 1) or five different-length blasts with a flamethrower (independent variable 2). At the end of the growing season, the number of weeds per square metre was counted. To exclude bias (e.g. a pre-existing seedbank in the soil), the following year the farmer repeated the experiment, but this time the treatments the fields received were reversed:
Treatment: Weedkiller Flamethrower
Severity: 1 2 3 4 5 1 2 3 4 5
Field1 10 15 18 22 37 9 13 13 18 22
Field2 10 18 10 42 60 7 14 20 21 32
Field3 7 11 28 31 56 9 13 24 30 35
Field4 9 19 36 45 60 7 14 9 20 25
Field5 15 14 29 33 37 14 13 20 22 29
Field6 14 13 26 26 49 5 12 17 16 33
Field7 9 12 19 37 48 5 15 12 17 24
Field8 9 18 22 31 39 13 13 14 17 17
Field9 12 14 24 28 53 12 13 21 19 22
Field10 7 11 21 23 45 12 14 20 21 29
SPSS Data View:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret:
Analyze: General Linear Model: Repeated Measures
Define Within Subject Factors (remember, "factor" = test or treatment):
Treatment (2 treatments: weedkiller or flamethrower) (SPSS only allows 8 characters for the name, which is why it appears as "treatmen" in the output)
Severity (5 different severities):
Click Define and define Within Subject Variables:
As above, there are no post hoc tests for repeated measures ANOVA in SPSS, but via the Options button, we can apply a Bonferroni correction to the probability at which you accept any of the tests:
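Again for reference, the pasted syntax looks roughly like this (a sketch, assuming the ten columns have been named wk1-wk5 for the five weedkiller severities and fl1-fl5 for the five flamethrower severities; the column order must match the factor order, with the first-named factor varying slowest):

* Two-way repeated measures ANOVA: treatmen (2 levels) x severity (5 levels).
* Bonferroni-adjusted pairwise comparisons are requested for severity.
GLM wk1 wk2 wk3 wk4 wk5 fl1 fl2 fl3 fl4 fl5
  /WSFACTOR=treatmen 2 Polynomial severity 5 Polynomial
  /METHOD=SSTYPE(3)
  /EMMEANS=TABLES(severity) COMPARE ADJ(BONFERRONI)
  /WSDESIGN=treatmen severity treatmen*severity.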
Output:
Mauchly's Test of Sphericity
Measure: MEASURE_1
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon:
                                                                        Greenhouse-Geisser   Huynh-Feldt   Lower-bound
treatmen 1.000 .000 0 . 1.000 1.000 1.000
severity .092 17.685 9 .043 .552 .740 .250
treatmen * severity .425 6.350 9 .712 .747 1.000 .250
The outcome of Mauchly’s test is significant (p <.05) for the severity of treatment, so we need to correct the F-values for this, but not for the treatments themselves.
Tests of Within-Subjects Effects
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
treatmen Sphericity Assumed 1730.560 1 1730.560 34.078 .000
Greenhouse-Geisser 1730.560 1.000 1730.560 34.078 .000
Huynh-Feldt 1730.560 1.000 1730.560 34.078 .000
Lower-bound 1730.560 1.000 1730.560 34.078 .000
Error(treatmen) Sphericity Assumed 457.040 9 50.782
Greenhouse-Geisser 457.040 9.000 50.782
Huynh-Feldt 457.040 9.000 50.782
Lower-bound 457.040 9.000 50.782
severity Sphericity Assumed 9517.960 4 2379.490 83.488 .000
Greenhouse-Geisser 9517.960 2.209 4309.021 83.488 .000
Huynh-Feldt 9517.960 2.958 3217.666 83.488 .000
Lower-bound 9517.960 1.000 9517.960 83.488 .000
Error(severity) Sphericity Assumed 1026.040 36 28.501
Greenhouse-Geisser 1026.040 19.880 51.613
Huynh-Feldt 1026.040 26.622 38.541
Lower-bound 1026.040 9.000 114.004
treatmen * severity Sphericity Assumed 1495.240 4 373.810 20.730 .000
Greenhouse-Geisser 1495.240 2.989 500.205 20.730 .000
Huynh-Feldt 1495.240 4.000 373.810 20.730 .000
Lower-bound 1495.240 1.000 1495.240 20.730 .001
Error(treatmen*severity) Sphericity Assumed 649.160 36 18.032
Greenhouse-Geisser 649.160 26.903 24.129
Huynh-Feldt 649.160 36.000 18.032
Lower-bound 649.160 9.000 72.129
Since there was no violation of sphericity for the treatment factor, we can look at the comparison of the two treatments without any correction. The significance value (.000) shows that there was a significant difference between the two treatments; and since this factor has only two levels, the main effect alone tells us that weedkiller and flamethrower differ - no post hoc test is needed here.
The output also tells us the effect of the severity of treatments, but remember there was a violation of sphericity here, so we must look at the corrected F-ratios. All of the corrected values are highly significant and so we can use the Greenhouse-Geisser corrected values as these are the most conservative.
Pairwise Comparisons
(I) severity   (J) severity   Mean Difference (I-J)   Std. Error   Sig.(a)   95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -4.200(*) .895 .011 -7.502 -.898
3 -10.400(*) 1.190 .000 -14.790 -6.010
4 -16.200(*) 1.764 .000 -22.709 -9.691
5 -27.850(*) 2.398 .000 -36.698 -19.002
2 1 4.200(*) .895 .011 .898 7.502
3 -6.200(*) 1.521 .028 -11.810 -.590
4 -12.000(*) 1.280 .000 -16.723 -7.277
5 -23.650(*) 2.045 .000 -31.197 -16.103
3 1 10.400(*) 1.190 .000 6.010 14.790
2 6.200(*) 1.521 .028 .590 11.810
4 -5.800 1.690 .075 -12.036 .436
5 -17.450(*) 2.006 .000 -24.852 -10.048
4 1 16.200(*) 1.764 .000 9.691 22.709
2 12.000(*) 1.280 .000 7.277 16.723
3 5.800 1.690 .075 -.436 12.036
5 -11.650(*) 1.551 .000 -17.373 -5.927
5 1 27.850(*) 2.398 .000 19.002 36.698
2 23.650(*) 2.045 .000 16.103 31.197
3 17.450(*) 2.006 .000 10.048 24.852
4 11.650(*) 1.551 .000 5.927 17.373
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
This shows that there was only one pair of severity levels for which there was no significant difference: levels 3 and 4 (p = .075). The differences for all the other pairs are significant. So both the type of treatment (weedkiller or flamethrower) and its severity (how much weedkiller, or how long a burst of flame) make a difference to weed control.
Formal report:
There was a significant main effect of the type of treatment, F(1, 9) = 34.08, p < .001.
There was a significant main effect of the severity of treatment, F(2.21, 19.88) = 83.49, p < .001.
There was also a significant treatment * severity interaction, F(4, 36) = 20.73, p < .001.