This post explains a basic question I encountered, and the statistical concepts behind it.
The real-life problem
Someone asked me to construct data showing that a treatment helps prevent winter colds in 1) kindergarten kids and 2) elementary school kids.
Chi-square and Student’s t-test
First, decide how to formulate the problem using statistical tests. This includes deciding the quantity and statistic to compare.
Basically, I need to compare two groups. Two tests come to mind: Pearson’s chi-square test and the two-sample t-test. This article summarizes the main differences between the two tests in terms of Null Hypothesis, Types of Data, Variations and Conclusions. The following section is largely based on that article.
Null Hypothesis
- Pearson’s chi-square test: tests whether two categorical variables are independent, that is, whether one variable has any effect on the other. e.g. men and women are equally likely to vote Republican, Democrat, Other, or not at all. Here the two variables are “gender” and “voting choice”. The null is “gender does not affect voting choice”.
- Two-sample t-test: tests whether two samples come from populations with the same mean. Mathematically, the null is $\mu_1 = \mu_2$, or equivalently $\mu_1 - \mu_2 = 0$. e.g. boys and girls have the same average height.
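To make the t-test null hypothesis concrete, here is a minimal sketch of the test statistic, using only the Python standard library. It assumes Welch's variant of the two-sample t-test (unequal variances allowed), which the post does not specify. The function name `welch_t` and the height values are hypothetical, made up purely for illustration.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Squared standard error of the difference in means.
    se_sq = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df

# Hypothetical heights in cm, invented for illustration only.
boys = [120.0, 125.0, 130.0, 128.0, 122.0]
girls = [118.0, 124.0, 129.0, 126.0, 121.0]

t_stat, df = welch_t(boys, girls)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A t statistic near zero is what the null of equal means predicts. Turning it into a p-value needs the t distribution, which the standard library does not provide, so this sketch stops at the statistic.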
Types of Data
- Pearson’s chi-square test: usually requires two variables. Each is categorical and can have any number of levels. e.g. one variable is “gender”, the other is “voting choice”.
- Two-sample t-test: requires two variables. One is categorical with exactly two levels (hence “two-sample”), the other is quantitative. e.g. in the example above, one variable is gender, the other is height.
Variations
- Pearson’s chi-square test: a variation applies when the two variables are ordinal rather than purely categorical.
- Two-sample t-test: a variation applies when the two samples are paired rather than independent.
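The paired variation can be sketched as a one-sample t-test on the per-pair differences. This is a minimal illustration using only the standard library. The function name `paired_t` and the measurements are hypothetical, invented to show the mechanics.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Each observation in one sample is matched with one in the other,
    # so the test works on the differences within each pair.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical heights in cm of the same five children, one year apart.
heights_year_1 = [120.0, 125.0, 130.0, 128.0, 122.0]
heights_year_2 = [125.0, 131.0, 134.0, 135.0, 126.0]

t_stat, df = paired_t(heights_year_1, heights_year_2)
print(f"t = {t_stat:.3f}, df = {df}")
```

Pairing removes the child-to-child variation from the comparison, which is why the paired statistic can be large even when the two samples overlap heavily.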
Transform the real-life problem into a statistical problem
Using chi-square test
Variable 1 is “using treatment”. Two levels: use or not.
Variable 2 is “getting winter cold”. Two levels: get cold or not.
For kindergarten kids and for elementary school kids, I thus have two 2 × 2 tables, one per age group.
(Question: can I do a chi-square test on three variables, the third one being “age”?)
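The 2 × 2 setup above can be sketched as follows, with one table per age group. This is a minimal illustration of Pearson's chi-square test without a continuity correction, using only the standard library. The function name `chi_square_2x2` and all counts are hypothetical and invented for illustration; they are not real results.

```python
import math

def chi_square_2x2(table):
    """Pearson's chi-square test of independence for a 2 x 2 table.

    table = [[a, b], [c, d]]: rows are treatment levels, columns are outcomes.
    Returns (statistic, p_value) with 1 degree of freedom, no continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if treatment and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, invented for illustration only.
# Rows: treated, untreated. Columns: got a cold, no cold.
tables = {
    "kindergarten": [[10, 40], [20, 30]],
    "elementary school": [[8, 42], [18, 32]],
}

for group, table in tables.items():
    stat, p = chi_square_2x2(table)
    print(f"{group}: chi2 = {stat:.3f}, p = {p:.4f}")
```

Each age group is tested separately here, so the sketch gives one statistic and one p-value per table. A small p-value means the observed counts are unlikely if the treatment and getting a cold were independent.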
Using two-sample t-test
Variable 1 is “using treatment”. Two levels: use or not.
Variable 2 is supposed to be a numerical variable; here it would be the disease rate. But each group then yields only a single rate, so there are not enough observations to run a t-test.
Thus, I conclude that the chi-square test should be used here.