Contingency Table:
Test for Independence

AIMS: In this section we will answer the following questions:

  1. What is the Contingency Table Test (Test for Independence)?

  2. How to carry out the Contingency Table Test (Test for Independence)?

  3. How to use GDC for the Test for Independence?

  4. Can I have some exercises?

  1. If we have data that are measurable (e.g. height, time, weight, age, test score, etc.) then we can use the correlation coefficient (r) to measure the strength of linear association between two sets of data.

  2. If we have two sets of rankings then we can use Spearman's rank correlation to measure the strength of association between them.

  3. However, neither of these techniques helps when you would like to find out whether or not two sets of categorical data are related. For example, is having fair hair independent of having dark-coloured (dark brown or black) eyes? Another example could be: does taking pill X have any effect on being energetic?

  4. Here, we need to use the Contingency Table Test, also called the Test for Independence.

Show me a roadmap of this technique?

  1. Say you would like to find out whether or not sleeping early (defined as on or before 10:30 pm each night) is related to achieving an above average score (75% or above) in Maths. Write down your null hypothesis (H0) and alternative hypothesis (H1) as follows:
    H0: Sleeping early and achieving above average score in Maths are independent.
    H1: Sleeping early and achieving above average score in Maths are NOT independent.
    The null hypothesis (H0) always states that the two factors are independent of each other.

  2. Display your data (often called observations, O) in a table.

  3. Calculate the expected values (E) of each category by assuming that the null hypothesis is true.

  4. Calculate the chi-squared statistic by comparing the observed values with the expected values.

  5. Basically, the chi-squared statistic is a measure of how different the observed values (O) are from the expected values (E). If the difference is big then the value of this statistic will be large, and if the difference is small then the value is small.

  6. If the difference between O and E is small then we fail to reject our null hypothesis. That is, our observations are very similar to the values expected under independence.

  7. If there are substantial differences between the observations O and the expected values E then the chi-squared statistic is large. In that case we reject the null hypothesis and accept the alternative hypothesis. In other words, the two factors are not independent. Thus, there is some sort of relationship between the two factors.
    Having said that, this statistical test is not able to tell us what sort of relationship there is between the two factors. The test only allows us to determine whether or not two factors are independent of each other.

  8. Unlike a t-test or z-test, this is always done as a one-tailed (upper tail) test, even though the alternative hypothesis is "NOT independent of each other."
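
Step 2 of the roadmap (displaying the observations in a table) can be sketched in Python as below. This is only an illustration: the six survey records and the function name observed_table are made up for the sketch and are not data from this page.

```python
from collections import Counter

# Hypothetical survey responses (illustration only, not data from this page):
# each record is (sleep habit, maths result) for one student.
responses = [
    ("early", "above"),
    ("early", "above"),
    ("early", "not above"),
    ("late", "above"),
    ("late", "not above"),
    ("late", "not above"),
]


def observed_table(records, row_levels, col_levels):
    """Count how many records fall into each (row level, column level) cell."""
    counts = Counter(records)
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


# Rows: sleep habit; columns: maths result.
table = observed_table(responses, ["early", "late"], ["above", "not above"])
print(table)
```

Each inner list is one row of the contingency table, so the row and column totals needed later are simple sums over this structure.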

What is the theory behind Expected Frequencies?

  1. Let us say we would like to find out whether or not factor A and factor B are independent. For simplicity, we assume factor A has two levels (m=2): level A1 and level A2. In our example above these could be 'sleeps early' and 'does not sleep early.' Factor B also has two levels (n=2): level B1 and level B2. These could be 'above average math score' and 'not above average math score.'

  2. Write down
    H0: Factor A and factor B are independent.
    H1: Factor A and factor B are NOT independent.

  3. Arrange the observed data (O) or observed frequencies into a table.

    Table 1. Theory:

                B1      B2      total
        A1      a       b       a+b
        A2      c       d       c+d
        total   a+c     b+d     n

    Table 2. Example: Observed Data.

                B1      B2      total
        A1      83      35      118
        A2      22      60      82
        total   105     95      200

  4. The example in this case has 83 observations that have both level A1 and level B1. [83 students said that they sleep early (A1) and achieve an above average score in math (B1).] There are also 22 observations with both level A2 and level B1 [22 students do not sleep early (A2) and achieve an above average math score (B1)], and so on. There are 200 observations in the survey, with 105 in B1 and 95 in B2. There are 118 and 82 observations in A1 and A2 respectively.
    The total number of observations is represented by n, and n = a+b+c+d.

  5. If the null hypothesis is true then factor A and factor B are independent. It follows that A1 and B1 are independent, and the same holds for every other pair of levels (A1 and B2, A2 and B1, A2 and B2). If A1 and B1 are independent then P(A1 ∩ B1) = P(A1) x P(B1). Thus,
    P(A1 ∩ B1) = [ (a+b)/n ] x [ (a+c)/n ] .

  6. However, we are not only interested in P(A1 ∩ B1) but in the expected value of A1 and B1. The expected value of A1 and B1 is the number of observations, out of the total n, that we expect to have both A1 and B1, that is, the probability multiplied by n. Hence,
    the expected value of A1 and B1 = P(A1 ∩ B1) x n
      = [ (a+b)/n ] x [ (a+c)/n ] x n
      = (a+b)(a+c)/n

  7. Similarly, we obtain the expected values for the other cells: each one is (row total) x (column total) / n. Here are the expected frequencies:

                B1                        B2                       total
        A1      (118)(105)/200 = 61.95    (118)(95)/200 = 56.05    118
        A2      (82)(105)/200 = 43.05     (82)(95)/200 = 38.95     82
        total   105                       95                       200

Table 3: Worked Example.

                B1                B2                total
        A1      (a+b)(a+c)/n      (a+b)(b+d)/n      a+b
        A2      (c+d)(a+c)/n      (c+d)(b+d)/n      c+d
        total   a+c               b+d               n

Table 4: Theory.
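
The rule behind Tables 3 and 4 (expected value = row total x column total / n) can be checked with a short Python sketch. The function name expected_frequencies is an illustrative choice; the observed counts are those of Table 2.

```python
def expected_frequencies(observed):
    """Expected count for each cell, assuming the two factors are independent.

    Each expected value is (row total * column total) / grand total,
    as in Table 4.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed data from Table 2 (rows A1, A2; columns B1, B2).
observed = [[83, 35], [22, 60]]
expected = expected_frequencies(observed)

for row in expected:
    print([round(e, 2) for e in row])
```

The function does not assume a 2 by 2 table, so the same sketch should also work for factors with more than two levels.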

How do I complete this Test for Independence?

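  1. The chi-squared statistic (U) is

    $$U = \sum \frac{(O - E)^2}{E}$$

    where the sum runs over all cells of the table and U follows a chi-squared distribution with υ degrees of freedom, χ²(υ), where υ = (m-1)(n-1). Here m and n are the numbers of levels of factor A and factor B (not the total number of observations, which the tables above also call n). Thus, the number of degrees of freedom for the example here is (2-1)(2-1) = 1.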

  2. Using the example above,

    $$U = \frac{(83-61.95)^2}{61.95} + \frac{(35-56.05)^2}{56.05} + \frac{(22-43.05)^2}{43.05} + \frac{(60-38.95)^2}{38.95}$$

    U ≈ 36.73

  3. Well, does this represent a large or a small difference? To figure that out, we need to use a critical value (c). You can think of this critical value c as the cut-off point beyond which the value of U is considered large. The critical value c is decided by two things: the degrees of freedom υ and the level of significance that we want. Usually we choose the 5% significance level, that is, we are only willing to accept a probability of 5% or less that we reject the null hypothesis wrongly, i.e. when the null hypothesis is actually true. In our case, the critical value is determined by 1 degree of freedom at the 5% significance level. From the chi-squared table, the corresponding critical value is 3.841.

    Reading from the chi-squared table is tedious. Is there another way that I can use to determine the size of U?

    The answer is yes. There are three methods using Excel.

    Method 1.

    1. Open up Excel.

    2. Use any cell and click [ fx ] on the ruler. This is the "Paste Function" button.

    3. Select "Statistical-CHIINV." CHIINV(probability,degree_of_freedom) takes two arguments: probability, which is our level of significance, and the degrees of freedom. It returns the critical value. So for the critical value at the 5% significance level with 1 degree of freedom, we enter probability = 0.05 and degree_of_freedom = 1.

    4. Carry out the comparison between U and this critical value as outlined in the procedure above.

    Method 2.

    1. Open up Excel.

    2. Use any cell and click [ fx ] on the ruler. This is the "Paste Function" button.

    3. Select "Statistical-CHIDIST." CHIDIST(x,degree_of_freedom) takes two arguments: x, which is our U value, and the degrees of freedom. It returns the upper-tail probability of U on the chi-squared distribution (the p-value).

    In this example, the returned probability is 1.36E-09. Your report should be as follows:

    U = 36.73 (p < 0.00005) and we thus reject the null hypothesis at the 5% significance level because the p-value of U = 36.73 is smaller than 0.05.

    Method 3.

    Click here to learn how to carry out a contingency table using Excel. [Also study the solutions (in Excel format) to the exercises below. This Excel file can also be used as template to carry out Contingency Test for categorical data with m=2 and n=2 (2 by 2 table).]

  4. Our chi-squared statistic U of 36.73 is larger than the critical value 3.841 at the 5% significance level. Hence, we reject our null hypothesis at the 5% significance level (see diagram) and accept the alternative hypothesis that sleeping early and achieving above average score in Maths are NOT independent. Note that this is a one-tailed (upper tail) test.

  5. This test of independence can only tell us whether or not two factors are independent. If these two factors are not independent then it suggests that there is a relationship between them. However, the test of independence is unable to tell us the form of relationship that may exist between these two factors.
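
The whole decision for the worked example can also be sketched in Python using only the standard library. Two shortcuts are assumptions that hold only for 1 degree of freedom: a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so the critical value is a squared normal quantile and the p-value is erfc(sqrt(U/2)). The function names are illustrative, not part of Excel or the GDC.

```python
from math import erfc, sqrt
from statistics import NormalDist

# Observed counts (Table 2) and expected counts (Table 3).
observed = [[83, 35], [22, 60]]
expected = [[61.95, 56.05], [43.05, 38.95]]


def chi_squared_statistic(obs, exp):
    """U = sum over all cells of (O - E)^2 / E."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(obs, exp)
        for o, e in zip(o_row, e_row)
    )


def critical_value_df1(alpha):
    """Upper-tail critical value; valid ONLY for 1 degree of freedom."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


def p_value_df1(u):
    """Upper-tail probability of U; valid ONLY for 1 degree of freedom."""
    return erfc(sqrt(u / 2))


u = chi_squared_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (m-1)(n-1)
c = critical_value_df1(0.05)
p = p_value_df1(u)
reject_h0 = u > c

print(f"U = {u:.2f}, df = {df}, critical value = {c:.3f}, p-value = {p:.2e}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

For tables larger than 2 by 2 the degrees of freedom exceed 1, so these two shortcuts no longer apply and a chi-squared table, Excel or the GDC is needed for the critical value and p-value.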


How to use GDC for the Test for Independence?


The TI GDC can also be used to carry out the Test for Independence.
(1) Enter the observed data into matrix A: press [2nd][x^-1] for MATRIX and then use the arrow keys to move the cursor right to EDIT, as in Screen 1.
(2) At EDIT 1:[A] press [ENTER] and enter all the relevant information as in Screen 2. Here we have used the data from Table 2 above.
(3) Press [2nd][MODE] to quit.
(4) Now we can move to the next step. Press [STAT] to obtain Screen 3. Use the arrow keys to move the cursor to TESTS and then down to C: χ²-Test as in Screen 4, and press [ENTER].
(5) Screen 5 will now appear. The observed data have already been entered into matrix A, and the expected values in matrix B are calculated by the GDC. Use the arrow keys to move the cursor to Calculate and press [ENTER].



(6) The results of this test are given in Screen 6.
(7) To look at the expected values in matrix B, press [2nd][MODE] to quit, followed by [2nd][x^-1] for MATRIX, and then use the arrow keys to move to 2:[B] as in Screen 7.
(8) Press [ENTER] to obtain the expected values as in Screen 8. These values should be identical to those calculated by hand in Table 3 above.
(9) For completeness we can also look at the χ² distribution. To obtain Screen 9, press [2nd][MODE] to quit matrix B, repeat step (4) above, and at Screen 5 select Draw followed by [ENTER]. The shaded area in the right tail of the graph is so small that the p-value is close to zero (1.36 x 10^-9 in Screen 6).



Exercises:

1. A cake producer believes that the ability to distinguish butter from margarine by taste is related to gender. A group of 400 blindfolded individuals is asked to distinguish butter from margarine by taste. Here are the results:

                                         Gender
                                         Female    Male
    able to tell the difference          120       108
    unable to tell the difference        80        92

Let the null hypothesis be "the ability to distinguish butter from margarine by taste is independent of gender." Use the Contingency Table Test to test the cake producer's belief at the 5% significance level.

 

2. In a clinical trial of a drug for arthritis, 168 patients received treatment with drug X, and a control group of 132 received treatment with placebo (which is non-active). Their conditions were checked after a period of time.

 

                        Drug X    Placebo
    Improved            119       52
    No Improvement      49        80

(a) Write out the null hypothesis and alternative hypothesis.

(b) Use the Contingency Table Test to determine whether or not Drug X is effective at the 5% significance level.

 

3. A teacher conducted a survey to see whether or not students who complete theory assignments regularly also tend to score at or above average in tests. These are her findings:

 

                                 Completes assignments regularly    Does not complete assignments regularly
    score at or above average    21                                 7
    score below average          7                                  12

(a) Write out your null hypothesis and alternative hypothesis.

(b) Use the Contingency Table Test to assess the significance of the above results at the 5% significance level.
[Solution is not provided. Check your answer with the Excel template.]

contingency_solution.xls.
