Contingency Table
AIMS: In this section we will answer the following questions:
If we have data that are measurable (e.g. height, time, weight, age, test score) then we can use the coefficient of correlation (r) to measure the strength of the linear association between two sets of data.
If we have two sets of rankings then we can use Spearman's rank correlation coefficient to measure the strength of association between them.
However, neither of these techniques helps when you want to find out whether or not two sets of categorical data are related. For example, is having fair hair independent of having dark (dark brown or black) eyes? Another example: does taking pill X have any effect on being energetic?
Here, we need to use a contingency table, also known as the test for independence.
Say you want to find out whether or not sleeping early (defined as on or before 10:30 pm each night) is related to achieving an above-average score (75% or above) in Maths. Write down your null hypothesis (H_{0}) and alternative hypothesis (H_{1}) as follows:
H_{0}: Sleeping early and achieving an above-average score in Maths are independent.
H_{1}: Sleeping early and achieving an above-average score in Maths are NOT independent.
The null hypothesis (H_{0}) is always that the two factors are independent of each other.
Display your data (often called observations, O) in a table.
Calculate the expected values (E) of each category by assuming that the null hypothesis is true.
Calculate the chi-squared statistic by comparing the observed values with the expected values.
Basically, the chi-squared statistic is a measure of how different the observed values (O) are from the expected values (E). If the difference is big then the value of this statistic will be large, and if the difference is small then the value will be small.
If the difference between O and E is small then we fail to reject our null hypothesis. That is, our observations are very similar to the values expected under independence.
If there are substantial differences between the observations O and the expected values E then the chi-squared statistic is large. In that case we reject the null hypothesis and accept the alternative hypothesis. In other words, the two factors are not independent, so there is some sort of relationship between them.
Having said that, this statistical test is not able to tell us what sort of relationship there is between the two factors. The test only allows us to determine whether or not the two factors are independent of each other.
Unlike a t-test or z-test, this is always done as a one-tailed (upper-tail) test, even though the alternative hypothesis is "NOT independent of each other." Only large values of the statistic count as evidence against independence.
Let us say we want to find out whether or not factor A and factor B are independent. For simplicity, we assume factor A has two levels (m=2): level A1 and level A2. In our example above these could be 'sleeps early' and 'does not sleep early.' Factor B also has two levels (n=2): level B1 and level B2. These could be 'above-average Maths score' and 'not above-average Maths score.'
Write down:
H_{0}: Factor A and factor B are independent.
H_{1}: Factor A and factor B are NOT independent.
Arrange the observed data (O) or observed frequencies into a table.
Table 1. Theory:
          B1      B2      Total
A1        a       b       a+b
A2        c       d       c+d
Total     a+c     b+d     n

Table 2. Example: Observed Data.
          B1      B2      Total
A1        83      35      118
A2        22      60      82
Total     105     95      200

The example has 83 observations with both level A1 and level B1 (83 students said that they sleep early (A1) and achieve an above-average score in Maths (B1)). There are also 22 observations with both level A2 and level B1 (22 students do not sleep early (A2) and achieve an above-average Maths score (B1)), and so on. There are 200 observations in the survey, with 105 in B1 and 95 in B2. There are 118 and 82 observations in A1 and A2 respectively.
The total number of observations is represented by n, where n = a+b+c+d is the sum of the four cell counts.
If the null hypothesis is true then factor A and factor B are independent. It follows that every level of A is independent of every level of B: for example, A1 and B1 are independent, and so are A2 and B2. If A1 and B1 are independent then P(A1 ∩ B1) = P(A1) x P(B1). Thus,
P(A1 ∩ B1) = [ (a+b)/n ] x [ (a+c)/n ].
However, we are interested not only in P(A1 ∩ B1) but also in the expected value of A1 and B1, that is, the number of the n observations that we expect to fall in both A1 and B1. Hence,
the expected value of A1 and B1 = P(A1 ∩ B1) x n
= [ (a+b)/n ] x [ (a+c)/n ] x n
= (a+b)(a+c)/n
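Similarly, we obtain the expected values for the other cells: each one is the corresponding total for the A level multiplied by the total for the B level, divided by n. Here are the expected frequencies:
          B1              B2
A1        (a+b)(a+c)/n    (a+b)(b+d)/n
A2        (c+d)(a+c)/n    (c+d)(b+d)/n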
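The calculation of expected frequencies can be sketched in a few lines of Python. This is only an illustration: the function name expected_frequencies and the layout of the list (one inner list per level of A, one entry per level of B) are assumptions made here, not part of the original lesson. The observed counts are those of the sleeping-early example.

```python
def expected_frequencies(observed):
    """Expected counts under independence: (A-level total * B-level total) / n."""
    row_totals = [sum(row) for row in observed]          # totals for A1, A2, ...
    col_totals = [sum(col) for col in zip(*observed)]    # totals for B1, B2, ...
    n = sum(row_totals)                                  # total number of observations
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed counts from the example: inner lists are A1 and A2, entries are B1 and B2.
observed = [[83, 35],
            [22, 60]]

expected = expected_frequencies(observed)
for label, row in zip(["A1", "A2"], expected):
    print(label, [round(e, 2) for e in row])
```

For the example data this gives 61.95 and 56.05 for A1, and 43.05 and 38.95 for A2. The same function works for tables with more than two levels per factor.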

The chi-squared statistic (U) is
U = ∑ (O - E)^{2} / E
where the sum is taken over all cells of the table. U follows a chi-squared distribution with υ degrees of freedom, χ^{2}(υ), where υ = (m-1)(n-1). Thus, the number of degrees of freedom for the example here is (2-1)(2-1) = 1.
Using the example above,
U = (83 - 61.95)^{2}/61.95 + (35 - 56.05)^{2}/56.05 + (22 - 43.05)^{2}/43.05 + (60 - 38.95)^{2}/38.95
U ≈ 36.73
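The arithmetic can be checked with a short Python sketch. The function names chi_squared_statistic and degrees_of_freedom are chosen here for illustration only. The observed and expected values are the ones used in the calculation above.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


def degrees_of_freedom(table):
    """(m - 1)(n - 1), where m and n are the numbers of levels of the two factors."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Observed and expected values from the worked example.
observed = [[83, 35], [22, 60]]
expected = [[61.95, 56.05], [43.05, 38.95]]

U = chi_squared_statistic(observed, expected)
df = degrees_of_freedom(observed)
print(f"U = {U:.2f}, degrees of freedom = {df}")
```

Each cell contributes the same squared difference here (21.05 squared), so the cells with the smallest expected values contribute most to U.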
Well, does this represent a large or a small difference? To figure that out, we need to use a critical value (c). You can think of this critical value c as the cutoff point beyond which the value of U is considered large. The critical value c is determined by two things: the number of degrees of freedom υ and the significance level that we want. Usually we choose the 5% significance level; that is, we are only willing to accept a probability of 5% or less of rejecting the null hypothesis wrongly, i.e. when the null hypothesis is actually true. In our case, the critical value is determined by 1 degree of freedom at the 5% significance level. From the chi-squared table, the corresponding critical value is 3.841.
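The decision rule can be written as a small Python sketch. The dictionary CRITICAL_VALUES_DF1 and the function decide are names invented for this illustration. The critical values are the standard upper-tail table entries for 1 degree of freedom, and the second call uses a made-up value of U purely to show the other outcome.

```python
# Upper-tail critical values of the chi-squared distribution with
# 1 degree of freedom, as listed in standard chi-squared tables.
CRITICAL_VALUES_DF1 = {0.10: 2.706, 0.05: 3.841, 0.01: 6.635}


def decide(U, alpha=0.05):
    """Compare U with the critical value c for 1 degree of freedom."""
    c = CRITICAL_VALUES_DF1[alpha]
    return "reject H0" if U > c else "fail to reject H0"


print(decide(36.73))        # the statistic from the worked example
print(decide(2.50))         # hypothetical small value, for contrast only
print(decide(36.73, 0.01))  # same statistic at the stricter 1% level
```

This sketch only covers 2 by 2 tables. Larger tables have more degrees of freedom and need the corresponding row of the chi-squared table.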
Reading from the chi-squared table is tedious. Is there another way to determine whether U is large? The answer is yes. There are three methods using Excel.
Method 1.
Method 2.
In this example, the returned probability is 1.36E-09. Your report should read as follows: U = 36.73 (p < 0.00005), and we thus reject the null hypothesis at the 5% significance level because the p-value of U = 36.73 is smaller than 5%.
Method 3. Click here to learn how to carry out a contingency table test using Excel. [Also study the solutions (in Excel format) to the exercises below. This Excel file can also be used as a template to carry out the contingency test for categorical data with m=2 and n=2 (a 2 by 2 table).]
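The returned probability can also be checked outside Excel. The sketch below is not one of the three Excel methods. It relies on a known identity that holds only for 1 degree of freedom: the upper-tail probability of the chi-squared distribution equals erfc(sqrt(U/2)). The function name p_value_df1 is an assumption made for this illustration.

```python
import math


def p_value_df1(U):
    """Upper-tail probability P(X >= U) for a chi-squared variable with 1 df.

    Valid for 1 degree of freedom only (2 by 2 tables).
    """
    return math.erfc(math.sqrt(U / 2))


p = p_value_df1(36.73)
print(f"p-value = {p:.2e}")

# Sanity check: the 5% critical value should give a probability close to 0.05.
print(f"P(X >= 3.841) = {p_value_df1(3.841):.4f}")
```

The first result is of the order of 1E-09, in line with the probability reported above. For tables with more than 1 degree of freedom a statistics library or Excel is the more practical route.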
Our chi-squared statistic U of 36.73 is larger than the critical value of 3.841 at the 5% significance level. Hence, we reject the null hypothesis at the 5% significance level (see diagram) and accept the alternative hypothesis that sleeping early and achieving an above-average score in Maths are NOT independent. Note that this is a one-tailed (upper-tail) test.
This test of independence can only tell us whether or not two factors are independent. If the two factors are not independent then it suggests that there is a relationship between them. However, the test of independence is unable to tell us the form of the relationship that may exist between the two factors.


Exercises: 1. A cake producer believes that the ability to distinguish butter from margarine by taste is related to gender. A group of 400 blindfolded individuals are asked to distinguish butter from margarine by taste. Here are the results:
Let the null hypothesis be "the ability to distinguish butter from margarine by taste is independent of gender." Use the Contingency Table Test to verify the cake producer's belief at the 5% significance level.
2. In a clinical trial of a drug for arthritis, 168 patients received treatment with drug X, and a control group of 132 received treatment with a placebo (which is non-active). Their conditions were checked after a period of time.
(a) Write out the null hypothesis and alternative hypothesis. (b) Use the Contingency Table Test to verify whether or not drug X is effective at the 5% significance level.
3. A teacher conducted a survey to see whether or not students who complete theory assignments regularly also tend to score at or above average in tests. These are her findings:
(a) Write out your null hypothesis and alternative hypothesis. (b) Use the Contingency Table Test to verify the significance of the above results at the 5% significance level.