Friday, May 17, 2013

Histograms or Bar Graphs?

Most textbooks introduce histograms and bar graphs fairly early. The chapter before a chapter on various graphs will usually introduce students to the four scales for data analysis, which are nominal, ordinal, interval, and ratio.

Unfortunately, many learners do not connect the four scales to the correct choice of graph. This problem is especially prevalent for bar graphs and histograms.

Bar graphs should be used for displaying frequencies for nominal data. For review, nominal data is that which can be placed into categories. Gender is an example where the categories are male and female. We could assign numbers such as 0 and 1 for male and female; however those numbers have no quantitative value other than serving as numerical labels. 

You can see that the x-axis below uses a nominal variable (gender). The y-axis then is just the mean time studying (fictional data). A key feature of the bar graph is that the bars are not touching each other. This makes it easier to see that the bars represent separate categories.



Histograms, on the other hand, are for numerical data such as that on the ordinal, interval, and ratio scales. (Actually, one can get away with bar graphs for ordinal data if there are only a few levels such as low, medium, and high. Yes, they are rank-ordered but they can also be viewed as distinct nominal categories.) But what about the previous y-axis variable of study time? In my data, those are numbers ranging from 6 to 23. Those numbers are not nominal categories (unless I go a step further and create categories from the numbers). So, what if I want to get an idea of frequencies for these numbers? I should use a histogram. Usually, the bars will be touching each other. Why is this important? Well, a good reason for histograms is to view the frequencies for numerical data, but an equally important reason is to take a look at the shape of the distribution. If the bars aren't touching, it becomes more difficult to judge the shape.

With this histogram, we can see that this distribution has an odd shape to it. First, we see that one value (23) is not touching the other bars. That's because there are no values between it and the next lowest value. Second, we see that including this 23 (a potential outlier) suggests a distribution that is approaching being normal and bell-shaped while slightly positively skewed.

Note that if we had used a bar graph, we might not even notice this potential outlier of 23. Check this out!



The scaling on the x-axis is misleading. It moves up by twos from 6 to 10, and then the trouble begins. At the end of the x-axis, the value jumps from 15 to 23, but the space between bars is the same as for all others! The bar graph version makes it nearly impossible to determine the shape of the distribution!
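To make the difference concrete, here is a minimal Python sketch. The study times are made up for illustration (only the 6 to 23 range comes from my data), and the bin width of 2 is an arbitrary choice. The bar-graph view keeps one bar per distinct value, so the gap before 23 disappears, while the histogram view keeps the empty bins.

```python
from collections import Counter

# Hypothetical study times (hours); only the 6-23 range comes from the post.
study_times = [6, 8, 8, 10, 10, 10, 11, 12, 12, 13, 14, 15, 23]

# Bar-graph view: one bar per distinct value, so values that never occur
# simply vanish from the axis and 15 ends up right next to 23.
bar_counts = Counter(study_times)
bar_labels = sorted(bar_counts)

# Histogram view: equal-width bins across the full range, empty bins included.
bin_width = 2  # arbitrary choice for this sketch
low = min(study_times)
n_bins = (max(study_times) - low) // bin_width + 1
hist_counts = [0] * n_bins
for t in study_times:
    hist_counts[(t - low) // bin_width] += 1

# Text rendering of the histogram: the empty rows are the visible gap.
for i, count in enumerate(hist_counts):
    start = low + i * bin_width
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * count}")
```

The three empty rows between 16 and 21 are exactly what the bar graph hides.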

 *************************
For assistance with statistics,
visit StatRelief
*************************

Thursday, January 3, 2013

Continuity Corrections for McNemar

I was playing with some repeated measures two-way tables today when I noticed something interesting. I had run the analysis by hand to get a chi-square of 5. When I ran it using mcnemar.test in R, I got 3.2. My first hunch was that this is a corrected value; sure enough, after turning off the continuity correction via correct = FALSE in the syntax, I got my 5. :)

What's interesting here is that there is considerable debate over using this continuity correction. The idea is that it is only necessary for small samples (e.g., fewer than 10) (Agresti, 2006), where it may not be safe to make the leap from the binomial to the normal approximation (chi-squared = z-squared). It is reasonable to argue, "Let's be on the safe side and always use it!" It's also reasonable to argue, "We have exact tests now and never need to use this!"

I think that it can definitely matter. In this case, my uncorrected results led to rejection of the null hypothesis under the typical .05 criterion, p = 0.025. However, with the continuity correction, there is not sufficient evidence to reject the null, p = 0.074. My sample size is 37, which is more than three times the "10" rule of thumb. But why 10?

So, I ran this again letting the exact test play judge, jury, and executioner. The verdict? p = 0.0625 and no rejection of the null. The exact test would lead to the same decision as the more conservative continuity correction test. 
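For anyone who wants to check the three results by hand, here is a minimal sketch in stdlib Python rather than the R call I actually used. I have not shown my table, so the discordant counts of 5 and 0 below are an assumption; they are simply one pair of counts that reproduces the chi-square of 5, the corrected 3.2, and the exact p of 0.0625.

```python
from math import comb, erfc, sqrt

# Assumed discordant-pair counts (not given in the post); 5 and 0 reproduce
# the reported statistics.
b, c = 5, 0
n = b + c


def chi2_p_df1(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


# McNemar statistic without and with the continuity correction.
uncorrected = (b - c) ** 2 / n
corrected = (abs(b - c) - 1) ** 2 / n

# Exact test: two-sided binomial probability (p = 0.5) on the discordant pairs.
k = min(b, c)
exact_p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

print(f"uncorrected: chi-square = {uncorrected:.1f}, p = {chi2_p_df1(uncorrected):.3f}")
print(f"corrected:   chi-square = {corrected:.1f}, p = {chi2_p_df1(corrected):.3f}")
print(f"exact:       p = {exact_p:.4f}")
```

Note that only the discordant pairs enter any of these calculations, no matter how large the full table is.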

Perhaps Agresti (1996) is wise to say that confidence intervals would be more informative.



Sunday, December 30, 2012

The Consequences of Assuming a Population from a Sample

One of the primary reasons for courses in statistics is the fact that when we look at data, it is almost always just a sample from a larger population. This means that any decision that we make is going to be prone to error because we don't have all of the data. So, it becomes our goal to make correct decisions given that we just have some of the data and not all of it.

Let's use an example to see why it matters. The data that I will use are the 2011 science mean scale scores for 11th graders on the Florida Comprehensive Assessment Test (FCAT). The schools reporting a mean are the population (N = 632). On a scale with a range of 100-500, the mean was 296.09 with a population standard deviation of 32.71.

A typical statistics course assignment will use information like this to help you learn how to calculate the likelihood of something happening. For example, what is the likelihood of a school reporting a mean of 315 or less?

You learn that you need to first convert the mean to a standard score (z-score), which can be done using the formula: z = (X - mu) / sigma where
  • X is the raw score of interest
  • mu is the population mean
  • sigma is the population standard deviation
So, z = (315 - 296.09) / 32.71 =  0.578

A table is then used (or a calculator) to determine the probability of something being less than that standard score, and the answer turns out to be approximately .718. In other words, about 72% of schools report means of 315 or less. 
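If you would rather skip the table, the same lookup can be sketched in a few lines of Python. This is just one way to do it, using the standard library's NormalDist and assuming the school means are normally distributed, as the problem does.

```python
from statistics import NormalDist

mu = 296.09     # population mean
sigma = 32.71   # population standard deviation
x = 315         # raw score of interest

# Convert the raw score to a standard score.
z = (x - mu) / sigma

# Area under the standard normal curve to the left of z.
p_below = NormalDist().cdf(z)

print(f"z = {z:.3f}, P(X <= {x}) = {p_below:.3f}")
```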

Now, all of this material is rather standard in an introductory statistics course, but the problem really begins when the learner doesn't realize that the problem deals with a sample, not a population. Suppose that you are given the same information but the problem also states that the data is based on a random sample of 100 schools. Now, we have to make a decision but based only on 100 schools rather than the population of 632 schools. That changes quite a bit. It even changes the formula to 

z = (X - mu) / (sigma / sqrt(n))

The main change here is that now sigma (the standard deviation) has to be divided by the square root of n which is the sample size of 100. Note that, technically, the standard deviation formula is also slightly different for a sample. But, let's keep that number the same because the main concern of this blog entry is what happens when you account for sample size. So, what happens?

z = (315 - 296.09) / (32.71 / sqrt(100))
z = 18.91 / 3.271
z = 5.78

Note that almost 100% of scores in a normal distribution fall between standard scores of -3 and +3. This answer is so far beyond 3 that we conclude that the likelihood of a random sample of 100 schools producing a mean of 315 or less is virtually 100%! (The actual percentage is greater than 99.9999%.)
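Here is the sample version as a short Python sketch, again assuming normality and, as noted above, keeping 32.71 as the standard deviation. The only change from the population calculation is dividing by the square root of n.

```python
from math import sqrt
from statistics import NormalDist

mu = 296.09     # population mean
sigma = 32.71   # standard deviation (kept the same, as in the text)
n = 100         # sample size
x_bar = 315     # sample mean of interest

# Standard error of the mean: the spread of sample means, not of single schools.
standard_error = sigma / sqrt(n)

z = (x_bar - mu) / standard_error
p_below = NormalDist().cdf(z)

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}, P(sample mean <= {x_bar}) = {p_below:.7f}")
```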

In this case, by considering a sample rather than a population, the result changed from 72% to nearly 100%. That's a huge difference in both the answer and how you would interpret this in the real world. And, I have seen problems where the difference in answers is even more severe.

The take-home point? Always look out for whether the problem is addressing a sample or a population.




Tuesday, May 22, 2012

ANOVA post hoc testing in Excel

Many students use the Excel Data Analysis ToolPak to solve statistics problems. A common problem requires the student to perform a one-way ANOVA followed by post hoc tests, if necessary. Unfortunately, Excel only provides the ANOVA results. What does one do?

For the sake of example, I recently tried an iced flavored coffee that is now sold in grocery stores. Personally, I loved it! But I had to wonder...is that because I love both hot and cold coffees anyway? So, imagine that you asked 15 people to rate their liking of this product on a 1-10 scale where 10 means that he or she absolutely loves it. Suppose that of these 15 people, 5 drink hot coffee, 5 drink hot and cold coffee, and 5 don't drink coffee at all.

First, I set the null hypothesis that, in the population, the mean rating is equal for all groups. The alternate hypothesis is that, in the population, at least one group has a mean rating that differs from another group's. The appropriate test for this hypothesis is a one-way between-subjects ANOVA. In Excel, I could set up the data like this:



Hot   HotCold   None
3     8         2
3     10        4
2     9         3
1     9         1
4     10        1

Using the Excel Data Analysis ToolPak, I would choose "Anova: Single Factor" and set it up like this:




This would be my result:

















Look at the p-value. 4.45775E-07 is scientific notation telling us to move the decimal 7 places to the left, which yields .000000445775. This is incredibly close to zero, and I should definitely reject the null hypothesis, F(2,12) = 62.649, p < 0.001.
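If you want to see where Excel's F comes from, here is a minimal Python sketch of the one-way ANOVA arithmetic on the same ratings. It is not how Excel is implemented. The p-value line uses a closed-form shortcut that is valid only when the between-groups degrees of freedom equal 2, which happens to be the case with three groups.

```python
from statistics import mean

ratings = {
    "Hot": [3, 3, 2, 1, 4],
    "HotCold": [8, 10, 9, 9, 10],
    "None": [2, 4, 3, 1, 1],
}

all_scores = [score for group in ratings.values() for score in group]
grand_mean = mean(all_scores)

# Between-groups and within-groups sums of squares.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in ratings.values())
ss_within = sum((score - mean(g)) ** 2 for g in ratings.values() for score in g)

df_between = len(ratings) - 1
df_within = len(all_scores) - len(ratings)

# F is the ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Upper-tail p-value; this closed form holds only when df_between == 2.
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"F({df_between},{df_within}) = {f_stat:.3f}, p = {p_value:.3e}")
```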

But which groups differ from each other?

To answer that, I can simply run independent samples t-tests comparing each group.

There are three comparisons to make here:

1. Hot versus HotCold
2. Hot versus None
3. HotCold versus None

I need to make all three comparisons. In this post, I'm just going to compare those who drink hot coffee with those who drink both hot and cold coffee. I'm going to assume equal variances and set up Excel using the Data Analysis ToolPak like this:





Here are my results:






I'm using a two-tailed test because I'm not hypothesizing that one mean is greater than or less than the other. I just want to know if they are different. These results suggest that the two means are, indeed, significantly different from each other, t(8) = -10.436, p < .001.
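The t statistic itself is easy to verify by hand. Below is a minimal Python sketch of the equal-variance (pooled) calculation for the Hot versus HotCold comparison. It stops at t and the degrees of freedom, since the standard library has no t distribution for the p-value.

```python
from math import sqrt
from statistics import mean, variance

hot = [3, 3, 2, 1, 4]
hot_cold = [8, 10, 9, 9, 10]

n1, n2 = len(hot), len(hot_cold)
df = n1 + n2 - 2

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = ((n1 - 1) * variance(hot) + (n2 - 1) * variance(hot_cold)) / df

# Standard error of the difference between the two means.
std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean(hot) - mean(hot_cold)) / std_error

print(f"t({df}) = {t_stat:.3f}")
```

The sign is negative only because Hot is listed first; swapping the groups flips the sign without changing the conclusion.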

But, wait....my instructor said that I need to adjust the significance level to avoid capitalizing on chance when making this many comparisons! There are many ways to do that. The quickest way is to use the Bonferroni correction. 

We usually set the significance level at .05, right? The Bonferroni correction is made by dividing this .05 by the number of comparisons being made. In this case, as noted above, there are three comparisons to be made, so the Bonferroni correction is simply .05 / 3, which equals .0167.

Are the results for this test still significant? Yes, because the two-tailed p-value of 6.17E-06, which after moving the decimal place six places to the left is .00000617, is still much lower than the new significance level of .0167.

But what if the original p-value had been .04? Without the Bonferroni correction, we would say that the two groups are different because .04 is less than .05. However, with the Bonferroni correction, we could not say this because .04 is not less than the corrected significance level of .0167! So, yes, it can really make a difference.
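The whole decision rule fits in a few lines. Here is a small Python sketch that checks both p-values discussed above (the 6.17E-06 from the t-test and the hypothetical .04) against the original and the corrected significance levels.

```python
alpha = 0.05
n_comparisons = 3

# Bonferroni: split the significance level evenly across the comparisons.
corrected_alpha = alpha / n_comparisons

# 6.17e-06 is the t-test result above; 0.04 is the hypothetical p-value.
decisions = {}
for p in (6.17e-06, 0.04):
    decisions[p] = (p < alpha, p < corrected_alpha)
    print(f"p = {p}: significant at {alpha}? {p < alpha}; "
          f"at {corrected_alpha:.4f}? {p < corrected_alpha}")
```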

