ANOVA with SPSS
Never, ever, run any statistical test without performing EDA first!
What's wrong with t-tests?
Nothing, except ...
If you want to compare three groups using t-tests at the usual 0.05 level of significance, you would have to compare the groups pairwise (A to B, A to C, B to C), so the chance of getting at least one false positive result would be:
1 - (0.95 x 0.95 x 0.95) = 14.3%
If you wanted to compare four groups, the chance of a false positive would be 1 - (0.95)^6 = 26%, and for five groups, 40%. Not good, is it? So we use ANOVA. Never perform multiple t-tests: anyone on this module discovered performing multiple t-tests when they should use ANOVA will be shot!
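Where do these figures come from? Comparing k groups pairwise takes k(k-1)/2 t-tests (3 tests for three groups, 6 for four, 10 for five), and each test has a 0.95 chance of avoiding a false positive, so the overall ("familywise") error rate is:

1 - (0.95)^(k(k-1)/2)

which works out to 14.3%, 26.5% and 40.1% for three, four and five groups respectively.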
ANalysis Of VAriance (ANOVA) is such an important statistical method that it would be easy to spend a whole module on this test alone. Like the t-test, ANOVA is a parametric test which assumes:
•data is numerical data representing samples from normally distributed populations
•the variances of the groups are "similar"
•the sizes of the groups are "similar"
•the groups should be independent
so it's important to carry out EDA before starting ANOVA! In fact, ANOVA is quite a robust procedure, so as long as the groups are of broadly similar size and variance, the test is normally reliable.
ANOVA tests the null hypothesis that the means of all the groups being compared are equal, and produces a statistic called F which is analogous to the t-statistic from a t-test (for two groups, F = t²). But there's a catch. If the means of all the groups tested by ANOVA are equal, fine. But if the result tells us to reject the null hypothesis, we still don't know which of the means differ. We solve this problem by performing what is known as a "post hoc" (after the event) test.
Reminder:
•Independent variable: Variables which are experimentally manipulated by an investigator are called independent variables.
•Dependent variable: Variables which are measured are called dependent variables (because they are presumed to depend on the value of the independent variable).
ANOVA jargon:
•Way = an independent variable, so a one-way ANOVA has one independent variable, two-way ANOVA has two independent variables, etc. Simple ANOVA tests the hypothesis that means from two or more samples are equal (drawn from populations with the same mean). Student's t-test is actually a particular application of one-way ANOVA (two groups compared).
•Factor = a test or measurement. Single-factor ANOVA tests whether the means of the groups being compared are equal and returns a yes/no answer; two-factor ANOVA simultaneously tests two factors, e.g. tumour size after treatment with different drugs and/or radiotherapy (drug treatment is one factor and radiotherapy is another). So "factor" and "way" are alternative terms for the same thing (independent variables).
•Repeated measures: Used when members of a sample are measured under different conditions. As the sample is exposed to each condition, the measurement of the dependent variable is repeated. Standard ANOVA is not appropriate here because it fails to take into account the correlation between the repeated measures, violating the assumption of independence. Repeated measures designs arise for several reasons, e.g. longitudinal research, which measures each sample member at each of several ages - age is a repeated factor. This is comparable to a paired t-test.
The array of options for different ANOVA tests in SPSS is confusing, so I'll go through the most important bits using some examples.
One-Way / Single-Factor ANOVA:
Data:
Pain Scores for Analgesics
Drug: Pain Score:
Diclofenac 0, 35, 31, 29, 20, 7, 43, 16
Ibuprofen 30, 40, 27, 25, 39, 15, 30, 45
Paracetamol 16, 33, 25, 32, 21, 54, 57, 19
Aspirin 55, 58, 56, 57, 56, 53, 59, 55
Since it would be unethical to withhold pain relief, there is no control group and we are just interested in knowing whether one drug performs better (lower pain score) than another, so we need to perform a one-way/single-factor ANOVA.
We enter this data into SPSS using dummy values (1, 2, 3, 4) for the drugs so this numeric data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: Compare Means: One-Way ANOVA
Dependent variable: Pain Score
Factor: Drug:
•SPSS allows many different post hoc tests. Click Post Hoc and select the Tukey and Games-Howell tests.
◦The Tukey test is powerful and widely accepted, but is parametric in that it assumes that the population variances are equal. It also assumes that the sample sizes are equal. If this is not the case, you should use Gabriel's procedure, or if the sizes are very different, use Hochberg's GT2.
◦Games-Howell does not assume that population variances or sample sizes are equal, so is a good alternative if the variances or sample sizes turn out to be unequal.
•Click Options and select Homogeneity of Variance Test, Brown-Forsythe and Welch. The homogeneity of variance test is important since this is an assumption of ANOVA, but if this assumption turns out to be broken, the Brown-Forsythe and Welch options will display alternative versions of the F statistic which means you may still be able to use the result.
•Click OK to run the tests.
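If you prefer syntax to the dialogs, the commands pasted from the steps above look roughly like this (a sketch, assuming the variables have been named Pain and Drug - adjust to match your own Variable View):

* One-way ANOVA with Levene's test, robust F tests and two post hoc tests.
ONEWAY Pain BY Drug
  /STATISTICS HOMOGENEITY BROWNFORSYTHE WELCH
  /POSTHOC=TUKEY GH ALPHA(0.05).

Running this syntax produces the same output as the dialogs.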
Output:
Test of Homogeneity of Variances: Pain
Levene Statistic   df1   df2   Sig.
4.837              3     28    .008
The significance value for homogeneity of variances is <.05, so the variances of the groups are significantly different. Since this is an assumption of ANOVA, we need to be very careful in interpreting the outcome of this test:
ANOVA: Pain
Sum of Squares df Mean Square F Sig.
Between Groups 4956.375 3 1652.125 11.967 .000
Within Groups 3865.500 28 138.054
Total 8821.875 31
This is the main ANOVA result. The significance value comparing the groups (drugs) is <.05, so we could reject the null hypothesis (there is no difference in the mean pain scores with the four drugs). However, since the variances are significantly different, this might be the wrong answer. Fortunately, the Welch and Brown-Forsythe statistics can still be used in these circumstances:
Robust Tests of Equality of Means: Pain
Statistic df1 df2 Sig.
Welch 32.064 3 12.171 .000
Brown-Forsythe 11.967 3 18.889 .000
The significance value of these are both <.05, so we still reject the null hypothesis. However, this result does not tell us which drugs are responsible for the difference, so we need the post hoc test results:
Multiple Comparisons
Dependent Variable: Pain
(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD 1 2 -8.750 5.875 .457 -24.79 7.29
3 -9.500 5.875 .386 -25.54 6.54
4 -33.500(*) 5.875 .000 -49.54 -17.46
2 1 8.750 5.875 .457 -7.29 24.79
3 -.750 5.875 .999 -16.79 15.29
4 -24.750(*) 5.875 .001 -40.79 -8.71
3 1 9.500 5.875 .386 -6.54 25.54
2 .750 5.875 .999 -15.29 16.79
4 -24.000(*) 5.875 .002 -40.04 -7.96
4 1 33.500(*) 5.875 .000 17.46 49.54
2 24.750(*) 5.875 .001 8.71 40.79
3 24.000(*) 5.875 .002 7.96 40.04
Games-Howell 1 2 -8.750 6.176 .513 -27.05 9.55
3 -9.500 7.548 .602 -31.45 12.45
4 -33.500(*) 5.194 .001 -50.55 -16.45
2 1 8.750 6.176 .513 -9.55 27.05
3 -.750 6.485 .999 -20.09 18.59
4 -24.750(*) 3.471 .001 -36.03 -13.47
3 1 9.500 7.548 .602 -12.45 31.45
2 .750 6.485 .999 -18.59 20.09
4 -24.000(*) 5.558 .014 -42.26 -5.74
4 1 33.500(*) 5.194 .001 16.45 50.55
2 24.750(*) 3.471 .001 13.47 36.03
3 24.000(*) 5.558 .014 5.74 42.26
* The mean difference is significant at the .05 level.
The Tukey test relies on homogeneity of variance, so we ignore these results. The Games-Howell post hoc test does not rely on homogeneity of variance (this is why we used two different post hoc tests) and so can be used. SPSS kindly flags (*) which differences are significant!
Result: Drug 4 (Aspirin) produces a significantly different result from the other three drugs.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the pain scores for aspirin and the other three drugs tested, F(3,28) = 11.97, p < .05.
Two-Factor ANOVA
Do anti-cancer drugs have different effects in males and females?
Data:
Drug:         cisplatin       vinblastine     5-fluorouracil
Gender:       Female  Male    Female  Male    Female  Male
Tumour Size:  65      50      70      45      55      35
70 55 65 60 65 40
60 80 60 85 70 35
60 65 70 65 55 55
60 70 65 70 55 35
55 75 60 70 60 40
60 75 60 80 50 45
50 65 50 60 50 40
We enter this data into SPSS using dummy values for the drugs (1, 2, 3) and genders (1,2) so the coded data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: General Linear Model: Univariate
Dependent variable: Tumour Diameter
Fixed Factors: Gender, Drug:
Also select:
Post Hoc: Tukey and Games-Howell:
Options:
Display Means for: Gender, Drug, Gender*Drug
Descriptive Statistics
Homogeneity tests:
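For reference, the pasted syntax for these choices looks roughly like this (a sketch, assuming the variables have been named Diameter, Gender and Drug):

* Two-factor ANOVA (GLM Univariate) with descriptives and Levene's test.
* Tukey and Games-Howell post hoc tests are requested on Drug.
UNIANOVA Diameter BY Gender Drug
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /POSTHOC=Drug(TUKEY GH)
  /EMMEANS=TABLES(Gender)
  /EMMEANS=TABLES(Drug)
  /EMMEANS=TABLES(Gender*Drug)
  /PRINT=DESCRIPTIVE HOMOGENEITY
  /CRITERIA=ALPHA(.05)
  /DESIGN=Gender Drug Gender*Drug.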
Output:
Levene's Test of Equality of Error Variances(a)
Dependent Variable: Diameter
F       df1   df2   Sig.
1.462   5     42    .223
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a Design: Intercept+Gender+Drug+Gender * Drug
The significance result for homogeneity of variance is >.05, which shows that the error variance of the dependent variable is equal across the groups, i.e. the assumption of the ANOVA test has been met.
Tests of Between-Subjects Effects
Dependent Variable: Diameter
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
Corrected Model 3817.188(a) 5 763.438 10.459 .000
Intercept 167442.188 1 167442.188 2294.009 .000
Gender 42.188 1 42.188 .578 .451
Drug 2412.500 2 1206.250 16.526 .000
Gender * Drug 1362.500 2 681.250 9.333 .000
Error 3065.625 42 72.991
Total 174325.000 48
Corrected Total 6882.813 47
a R Squared = .555 (Adjusted R Squared = .502)
The Drug and Gender * Drug effects are significant (<.05), but there is no main effect of gender (p = .451). Again, this does not tell us which drugs behave differently, so again we need to look at the post hoc tests:
Multiple Comparisons
Dependent Variable: Diameter
(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD cisplatin vinblastine -1.25 3.021 .910 -8.59 6.09
5-fluorouracil 14.38(*) 3.021 .000 7.04 21.71
vinblastine cisplatin 1.25 3.021 .910 -6.09 8.59
5-fluorouracil 15.63(*) 3.021 .000 8.29 22.96
5-fluorouracil cisplatin -14.38(*) 3.021 .000 -21.71 -7.04
vinblastine -15.63(*) 3.021 .000 -22.96 -8.29
Games-Howell cisplatin vinblastine -1.25 3.329 .925 -9.46 6.96
5-fluorouracil 14.38(*) 3.534 .001 5.64 23.11
vinblastine cisplatin 1.25 3.329 .925 -6.96 9.46
5-fluorouracil 15.63(*) 3.699 .001 6.50 24.75
5-fluorouracil cisplatin -14.38(*) 3.534 .001 -23.11 -5.64
vinblastine -15.63(*) 3.699 .001 -24.75 -6.50
Based on observed means.
* The mean difference is significant at the .05 level.
In this example, we can use the Tukey or Games-Howell results. Again, SPSS helpfully flags which results have reached statistical significance. We already know from the main ANOVA table that the effect of gender is not significant, but the post hoc tests show which drugs produce significantly different outcomes.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the tumour diameter for 5-fluorouracil and the other two drugs tested, F(5,42) = 10.46, p < .05.
Repeated Measures ANOVA
Remember that one of the assumptions of ANOVA is independence of the groups being compared. In lots of circumstances, we want to test the same thing repeatedly, e.g:
•Patients with a chronic disease after 3, 6 and 12 months of drug treatment
•Repeated sampling from the same location, e.g. spring, summer, autumn and winter
•etc
This type of study reduces variability in the data and so increases the power to detect effects, but violates the assumption of independence, so as with the paired t-test, we need to use a special form of ANOVA called repeated measures. In a parametric test, the assumption that the variances of the differences between all pairs of conditions are equal is called "sphericity". Violating sphericity means that the F statistic cannot be compared to the usual tables of F, and so the software cannot calculate an accurate significance value. SPSS includes a procedure called Mauchly's test which tells us if the assumption of sphericity has been violated:
•If Mauchly’s test statistic is significant (i.e. p < .05), we conclude that the condition of sphericity has not been met.
•If Mauchly’s test statistic is non-significant (i.e. p > .05), it is reasonable to conclude that the variances of the differences between conditions are not significantly different.
If Mauchly’s test is significant then we cannot trust the F-ratios produced by SPSS unless we apply a correction (which, fortunately, SPSS helps us to do).
One-Way Repeated Measures ANOVA
i.e. one independent variable, e.g. pain score after surgery:
Patient1 Patient2 Patient3
1 3 1
2 5 3
4 6 6
5 7 4
5 9 1
6 10 3
This data can be entered directly into SPSS. Note that each column represents a repeated measures variable (patients in this case). There is no need for a coding variable (as with between-group designs, above):
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret! Next:
Analyze: General Linear Model: Repeated Measures
Within-Subject factor name: Patient
Number of Levels: 3 (because there are 3 patients)
Click Add, then Define (factors):
There are no proper post hoc tests for repeated measures variables in SPSS. However, via the Options button, you can use the paired t-test procedure to compare all pairs of levels of the independent variable, and then apply a Bonferroni correction to the probability at which you accept any of these tests. The resulting probability value should be used as the criterion for statistical significance. A ‘Bonferroni correction’ is achieved by dividing the probability value (usually 0.05) by the number of tests conducted, e.g. if we compare all levels of the independent variable of these data, we make three comparisons and so the appropriate significance level is 0.05/3 = 0.0167. Therefore, we accept t-tests as being significant only if they have a p value <0.0167.
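The pasted syntax for this design looks roughly like this (a sketch, assuming the three columns have been named Patient1, Patient2 and Patient3; the EMMEANS subcommand produces the Bonferroni-adjusted pairwise comparisons just described):

* One-way repeated measures ANOVA.
* COMPARE ADJ(BONFERRONI) gives Bonferroni-adjusted pairwise comparisons.
GLM Patient1 Patient2 Patient3
  /WSFACTOR=Patient 3 Polynomial
  /METHOD=SSTYPE(3)
  /EMMEANS=TABLES(Patient) COMPARE ADJ(BONFERRONI)
  /PRINT=DESCRIPTIVE
  /WSDESIGN=Patient.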
Output:
Mauchly's Test of Sphericity
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon:
                                                                        Greenhouse-Geisser   Huynh-Feldt   Lower-bound
patient                  .094          9.437                2    .009   .525                 .544          .500
Mauchly’s test is significant (p <.05) so we conclude that the assumption of sphericity has not been met.
Tests of Within-Subjects Effects
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
patient Sphericity Assumed 44.333 2 22.167 8.210 .008
Greenhouse-Geisser 44.333 1.050 42.239 8.210 .033
Huynh-Feldt 44.333 1.088 40.752 8.210 .031
Lower-bound 44.333 1.000 44.333 8.210 .035
Error(patient) Sphericity Assumed 27.000 10 2.700
Greenhouse-Geisser 27.000 5.248 5.145
Huynh-Feldt 27.000 5.439 4.964
Lower-bound 27.000 5.000 5.400
Because the significance values are <.05, we conclude that there was a significant difference between the three patients, but this test does not tell us which patients differed from each other. The next issue is which of the three corrections to use. Going back to Mauchly's test:
•If epsilon is >0.75, use the Huynh-Feldt correction.
•If epsilon is <0.75, or nothing is known about sphericity at all, use the Greenhouse-Geisser correction.
•In this example, the epsilon values from Mauchly's test are 0.525 (Greenhouse-Geisser) and 0.544 (Huynh-Feldt), both <0.75, so we use the Greenhouse-Geisser corrected values. Using this correction, F is still significant because its p value is 0.033, which is <.05.
Post Hoc Tests:
Pairwise Comparisons
(I) patient   (J) patient   Mean Difference (I-J)   Std. Error   Sig.(a)   95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -2.833(*) .401 .003 -4.252 -1.415
3 .833 .946 1.000 -2.509 4.176
2 1 2.833(*) .401 .003 1.415 4.252
3 3.667 1.282 .106 -.865 8.199
3 1 -.833 .946 1.000 -4.176 2.509
2 -3.667 1.282 .106 -8.199 .865
Based on estimated marginal means
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
Formal reporting:
Mauchly’s test indicated that the assumption of sphericity had been violated (chi-square = 9.44, p < .05), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (epsilon = 0.53). The results show that the pain scores of the three patients differed significantly, F(1.05, 5.25) = 8.21, p < .05. Post hoc tests revealed that although the pain score of Patient2 was significantly higher than that of Patient1 (p = .003), Patient3's score was not significantly different from either of the other patients (both p > .05).
Two-Way Repeated Measures ANOVA
i.e. two independent variables:
In a study of the best way to keep fields free of weeds for an entire growing season, a farmer treated test plots in 10 fields with either five different concentrations of weedkiller (independent variable 1) or five different-length blasts with a flamethrower (independent variable 2). At the end of the growing season, the number of weeds per square metre was counted. To exclude bias (e.g. a pre-existing seedbank in the soil), the following year the farmer repeated the experiment, but this time the treatments the fields received were reversed:
Treatment: Weedkiller Flamethrower
Severity: 1 2 3 4 5 1 2 3 4 5
Field1 10 15 18 22 37 9 13 13 18 22
Field2 10 18 10 42 60 7 14 20 21 32
Field3 7 11 28 31 56 9 13 24 30 35
Field4 9 19 36 45 60 7 14 9 20 25
Field5 15 14 29 33 37 14 13 20 22 29
Field6 14 13 26 26 49 5 12 17 16 33
Field7 9 12 19 37 48 5 15 12 17 24
Field8 9 18 22 31 39 13 13 14 17 17
Field9 12 14 24 28 53 12 13 21 19 22
Field10 7 11 21 23 45 12 14 20 21 29
SPSS Data View:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret:
Analyze: General Linear Model: Repeated Measures
Define Within Subject Factors (remember, "factor" = test or treatment):
Treatment (2 treatments: weedkiller or flamethrower) (SPSS only allows 8 characters for the name, which is why it appears as "treatmen" in the output)
Severity (5 different severities):
Click Define and define Within Subject Variables:
As above, there are no post hoc tests for repeated measures ANOVA in SPSS, but via the Options button, we can apply a Bonferroni correction to the probability at which you accept any of the tests:
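Again for reference, the pasted syntax looks roughly like this (a sketch, assuming the ten columns have been named wk1-wk5 for the five weedkiller severities and fl1-fl5 for the five flamethrower severities; the column order must match the factor order, with the first-named factor varying slowest):

* Two-way repeated measures ANOVA: treatmen (2 levels) x severity (5 levels).
* Bonferroni-adjusted pairwise comparisons are requested for severity.
GLM wk1 wk2 wk3 wk4 wk5 fl1 fl2 fl3 fl4 fl5
  /WSFACTOR=treatmen 2 Polynomial severity 5 Polynomial
  /METHOD=SSTYPE(3)
  /EMMEANS=TABLES(severity) COMPARE ADJ(BONFERRONI)
  /WSDESIGN=treatmen severity treatmen*severity.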
Output:
Mauchly's Test of Sphericity
Measure: MEASURE_1
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon:
                                                                        Greenhouse-Geisser   Huynh-Feldt   Lower-bound
treatmen 1.000 .000 0 . 1.000 1.000 1.000
severity .092 17.685 9 .043 .552 .740 .250
treatmen * severity .425 6.350 9 .712 .747 1.000 .250
The outcome of Mauchly’s test is significant (p <.05) for the severity of treatment, so we need to correct the F-values for this, but not for the treatments themselves.
Tests of Within-Subjects Effects
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
treatmen Sphericity Assumed 1730.560 1 1730.560 34.078 .000
Greenhouse-Geisser 1730.560 1.000 1730.560 34.078 .000
Huynh-Feldt 1730.560 1.000 1730.560 34.078 .000
Lower-bound 1730.560 1.000 1730.560 34.078 .000
Error(treatmen) Sphericity Assumed 457.040 9 50.782
Greenhouse-Geisser 457.040 9.000 50.782
Huynh-Feldt 457.040 9.000 50.782
Lower-bound 457.040 9.000 50.782
severity Sphericity Assumed 9517.960 4 2379.490 83.488 .000
Greenhouse-Geisser 9517.960 2.209 4309.021 83.488 .000
Huynh-Feldt 9517.960 2.958 3217.666 83.488 .000
Lower-bound 9517.960 1.000 9517.960 83.488 .000
Error(severity) Sphericity Assumed 1026.040 36 28.501
Greenhouse-Geisser 1026.040 19.880 51.613
Huynh-Feldt 1026.040 26.622 38.541
Lower-bound 1026.040 9.000 114.004
treatmen * severity Sphericity Assumed 1495.240 4 373.810 20.730 .000
Greenhouse-Geisser 1495.240 2.989 500.205 20.730 .000
Huynh-Feldt 1495.240 4.000 373.810 20.730 .000
Lower-bound 1495.240 1.000 1495.240 20.730 .001
Error(treatmen*severity) Sphericity Assumed 649.160 36 18.032
Greenhouse-Geisser 649.160 26.903 24.129
Huynh-Feldt 649.160 36.000 18.032
Lower-bound 649.160 9.000 72.129
Since there was no violation of sphericity for the treatment factor, we can look at the comparison of the two treatments without any correction. The significance value (.000) shows that there was a significant difference between the two treatments; and since this factor has only two levels, the main effect alone tells us that weedkiller and flamethrower differ - no post hoc test is needed here.
The output also tells us the effect of the severity of treatments, but remember there was a violation of sphericity here, so we must look at the corrected F-ratios. All of the corrected values are highly significant and so we can use the Greenhouse-Geisser corrected values as these are the most conservative.
Pairwise Comparisons
(I) severity   (J) severity   Mean Difference (I-J)   Std. Error   Sig.(a)   95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -4.200(*) .895 .011 -7.502 -.898
3 -10.400(*) 1.190 .000 -14.790 -6.010
4 -16.200(*) 1.764 .000 -22.709 -9.691
5 -27.850(*) 2.398 .000 -36.698 -19.002
2 1 4.200(*) .895 .011 .898 7.502
3 -6.200(*) 1.521 .028 -11.810 -.590
4 -12.000(*) 1.280 .000 -16.723 -7.277
5 -23.650(*) 2.045 .000 -31.197 -16.103
3 1 10.400(*) 1.190 .000 6.010 14.790
2 6.200(*) 1.521 .028 .590 11.810
4 -5.800 1.690 .075 -12.036 .436
5 -17.450(*) 2.006 .000 -24.852 -10.048
4 1 16.200(*) 1.764 .000 9.691 22.709
2 12.000(*) 1.280 .000 7.277 16.723
3 5.800 1.690 .075 -.436 12.036
5 -11.650(*) 1.551 .000 -17.373 -5.927
5 1 27.850(*) 2.398 .000 19.002 36.698
2 23.650(*) 2.045 .000 16.103 31.197
3 17.450(*) 2.006 .000 10.048 24.852
4 11.650(*) 1.551 .000 5.927 17.373
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
This shows that there was only one pair of severity levels for which there was no significant difference: levels 3 and 4 (p = .075). The differences for all the other pairs are significant. So both the type of treatment (weedkiller or flamethrower) and its severity (how much weedkiller, or how long a burst of flame) make a difference to weed control.
Formal report:
There was a significant main effect of the type of treatment, F(1, 9) = 34.08, p < .001.
There was a significant main effect of the severity of treatment, F(2.21, 19.88) = 83.49, p < .001.
There was also a significant treatment * severity interaction, F(4, 36) = 20.73, p < .001.