Finding and using expected frequency for goodness of fit hypothesis testing

11/23/2018

Intro

Howdy! I'm Professor Curtis of Aspire Mountain Academy here with more statistics homework help. Today, we're going to learn how to find and use expected frequency for goodness of fit hypothesis testing. Here's our problem statement: Refer to the data in the accompanying table for the heights of females. Complete parts (a) through (d) below.

Part A

OK, Part A is asking us to “enter the observed frequencies in the table below.” So here they’ve got different categories or classes for height, and then we’re asked to fill in the different frequencies. If we look at our data in our table here, notice how we’ve got females mixed in with males. So the gender variable is a dummy variable where you’ve basically got two options: One option is a zero and the other is a one. Here the ones are males, so the zeros must be females. We’ve got to sort through all this data to get just the heights for the females; we weren’t asked for the heights of the males, just the females. Notice also here the height data is actually unsorted itself too.

So we’ve got two different sortings to do, and to do that, I want to open the data in Excel. We could do it in StatCrunch, but StatCrunch is really clunky, especially when it comes to sorting data. The sort feature in StatCrunch only lets you sort one level at a time, whereas Excel will let you sort multiple levels at the same time. And that’s what we want because it makes our job a little easier.

So I’ve already pre-loaded the data here into Excel. And what I want to do now is actually sort this data out. So to do that, I’m going to come up to menu here — I’m coming off screen a little bit so you can’t see, but I’m selecting Data. And then I want to select Sort. And then here in the sort dialog box, I first want to sort by gender so we can get all the males out of the way. And then I’m going to add a level so that within the females I’m going to actually sort by height. This will actually help us to count to get the frequencies that we need to fill in our table here for our answer fields. So I hit OK, and now everything is automatically sorted.

The other thing that is nice about Excel is that it makes counting really easy. And that’s all we’re doing with frequency is we’re getting counts of measurements that fall within each of these different classes or categories. So we want to count the number of data points that are less than 155.15; that’s our first category or class here. So I’m going to select that first cell with my data point here in my data, and then I’m going to scroll down to where I get the — let’s see, 155.15. So 155.15 is going to be every data point up to this one. So I’m going to hold down the Shift key on my keyboard and press the left key on my mouse. So now I’ve selected all those data points that are less than 155.15. And if I look down here at the bottom of my Excel window, I see that the count here is 20. So that made the counting super easy for me. I just put in a 20 there.

And now I’m going to do the same thing for each of the different other classes, so I’m going to select the next cell here, and go down to 161.75. So I scroll down to . . . 161.75 would be there, and 41 is the count. Now, let’s see, 168.35. That would be up to there. The count is 34. “Greater than 168.35" — so this is the last category, so I’m going to go up to the last female data point, which is this one right here. Beyond that you get all the male data points. So this is my last category count — 19. Excellent!
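For anyone who would rather script the sorting and counting than do it by hand in Excel, here is a minimal Python sketch of the same idea. The rows are made-up illustration data, not the table from the problem, and the function name class_counts is hypothetical. It assumes a value that lands exactly on a boundary belongs to the upper class; the boundaries here end in 5 at the second decimal, so that case should not come up with heights recorded to one decimal place.

```python
from bisect import bisect_left

# Hypothetical (gender, height) rows; 1 = male, 0 = female, as in the problem.
rows = [(0, 152.3), (1, 175.0), (0, 158.9), (0, 163.4), (1, 181.2),
        (0, 170.1), (0, 154.0), (0, 166.0)]

# Keep only the females, then sort their heights (the two-level sort in Excel).
female_heights = sorted(height for gender, height in rows if gender == 0)

def class_counts(sorted_values, boundaries):
    """Count how many sorted values fall below, between, and above the boundaries."""
    # Position of each boundary in the sorted list.
    cuts = [bisect_left(sorted_values, b) for b in boundaries]
    edges = [0] + cuts + [len(sorted_values)]
    # The count for each class is the gap between consecutive positions.
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

boundaries = [155.15, 161.75, 168.35]
observed = class_counts(female_heights, boundaries)
print(observed)
```

The counts should add up to the number of female rows, which is a quick check that no data point was skipped or counted twice.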

Part B

Now, Part B says, “Assuming a normal distribution with mean and standard deviation given by the sample mean and standard deviation, find the probability of a randomly selected height belonging to each class.” So the probabilities are going to come out of our distribution calculator in StatCrunch; that’s the easiest way to get this. We want a normal distribution, and the mean and standard deviation are coming from our sample data itself.

So first I’m going to get the sample mean and standard deviation and put those values into StatCrunch. To do that, the first thing I’m going to do is get rid of all the male data points here because we don’t need them. So I’m just going to scroll down here, select the row, delete all that. Now down here under the height column, I’m going to put in an AVERAGE function, I’m going to select all those data points, then close my parenthesis — there’s my average. I’ll put the standard deviation just below it, and there’s my standard deviation. So this is what I need to put into StatCrunch.
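The same two numbers can be computed in Python with the standard library. This is only a sketch: the heights below are made-up values standing in for the female heights, and it assumes the sample standard deviation (dividing by n - 1) is the one wanted, since the problem asks for the sample mean and standard deviation.

```python
from statistics import mean, stdev

# Hypothetical female heights; the real list comes from the data table.
female_heights = [152.3, 154.0, 158.9, 163.4, 166.0, 170.1]

sample_mean = mean(female_heights)   # plays the role of AVERAGE in Excel
sample_sd = stdev(female_heights)    # sample standard deviation (n - 1 in the denominator)

print(sample_mean, sample_sd)
```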

So the easiest way for me to get into StatCrunch is just to put my data in, although once I get my data here into StatCrunch, see, the first thing I’m going to do is get rid of my data because I don’t need the data; I just need StatCrunch. So let’s move this down so we can see a little bit more what we’re doing. Alright, so here we’ve got the data in StatCrunch, and I’m just going to clear that out, and I’m going to clear you out.

So now I just want my distribution calculator. I want the Normal distribution, and the mean and standard deviation are going to come from Excel. So if I move this over here, I can stick in my mean value that I calculated in Excel and the standard deviation value, also from Excel. Notice I’m typing in all the numbers I have. OK, that’s great. So now I’ve got this, and I can get rid of that.

And to get the probability, I want less than 155.15, so less than is here. I just need to put in the value 155.15, and there’s my probability. I’m asked to round to four decimal places. And I’m going to do the same thing for each of the other three categories. The next one is between two values, so I’m going to select the Between option in StatCrunch. And then here we’re going to select 155.15. Here we’re going to select 161.75. There’s my probability. And then I just do the same thing for the columns that remain, again rounding to four decimal places. And I want greater than, so I go back to my Standard option, change this to greater than, and this one’s 168.35. And there’s my final probability. Good job!
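The distribution calculator steps can be sketched in Python with statistics.NormalDist. The mean and standard deviation below are placeholders, because the actual sample values are never read out in the video, so the probabilities this prints will not match the ones in the answer fields.

```python
from statistics import NormalDist

# Placeholder values (assumptions): substitute the sample mean and
# standard deviation computed from the female heights.
sample_mean = 161.0
sample_sd = 7.0
boundaries = [155.15, 161.75, 168.35]

dist = NormalDist(mu=sample_mean, sigma=sample_sd)
cdf = [dist.cdf(b) for b in boundaries]

# Less than the first boundary, between each pair, greater than the last.
probabilities = [cdf[0]]
probabilities += [hi - lo for lo, hi in zip(cdf, cdf[1:])]
probabilities += [1 - cdf[-1]]

# Round to four decimal places, as the problem asks.
probabilities = [round(p, 4) for p in probabilities]
print(probabilities)
```

Because the four classes cover every possible height, the four probabilities should sum to 1, give or take rounding.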

Part C

Now, Part C says, “Using the probabilities found in Part B, find the expected frequency for each category.” This is pretty simple. All we have to do is get the total number of frequencies and then multiply it by each respective probability for each class. So I can add these numbers up, or if I want to be lazy — yeah, I want to be lazy! — so I’m going to go back to Excel, recognize that this first row is taken up with the column headings, so I’ve got 115 minus the one for the column headings is 114. So my sum is 114. If you want you can add these four numbers up, you’ll get the same thing.

So 114 is what I want to multiply by each one of these respective probabilities to get the expected frequency counts. So I’ll pull out my handy dandy calculator here, and let’s move you down a little bit. OK, so, we have 114 times the first probability for the first category here is 0.1971. So there’s my expected frequency for the first class or category. And I just do the same thing with the numbers that remain. So 114 times the next probability gives me the next expected frequency. And I’m just going to finish this out here. Oops! That’s the wrong number. There’s the number I want. Excellent!
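The arithmetic here is just the total count times each probability. A short Python sketch, using the total of 114 and the first probability of 0.1971 from the video; the helper name expected_frequencies is made up for illustration, and the remaining three probabilities would be passed in the same list.

```python
observed = [20, 41, 34, 19]   # counts from Part A
n = sum(observed)             # total number of female heights

def expected_frequencies(total, probabilities):
    """Multiply the total count by each class probability."""
    return [total * p for p in probabilities]

# Only the first probability is quoted in the video.
first_expected = expected_frequencies(n, [0.1971])[0]
print(n, first_expected)
```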

Part D

Now Part D asks for a hypothesis test, and there’s different parts to this, so let’s take a look. The first section in Part D says, “Identify the null and alternative hypothesis for this test.” For goodness of fit, it’s always going to be the same thing. Your null hypothesis will be everything’s equal. The alternative is going to be at least one of them is different.

But you’re not just looking at one part; you’re looking at both parts together. There’s a part in Part A, and then there’s a corresponding part in Part C, because you’re looking at observed frequencies and expected frequencies. And so what we’re saying is that the observed and the expected should be the same. That’s the null hypothesis. And then at least one of those categories, it’s not going to be the same; it’s going to be different. So if I look back here at my answer options, I’m seeing . . . this one — answer option D. So for each class, the answer from Part A equals the one from Part C. And then the alternative is for at least one of them the answer is not equal. So that’s what I want to select. Good job!

Now the next part asks for the test statistic. This part of the problem typically drives students bonkers until you understand that, the way they give this out, it’s easy to get it into StatCrunch, but you can’t just stick the data in. The data is in the wrong format because the data is just the raw data. What StatCrunch needs to do the goodness of fit test is frequency counts. So you’ve got to put in the observed and the expected frequency counts. And you’ve already tabulated those up. Here’s your observed frequency counts, and here’s your expected frequency counts. So all I need to do is go back into StatCrunch and then put those numbers in, and then I can run my goodness of fit test.

So here’s my StatCrunch window back, and I’m going to clear this distribution calculator out. So here in the first column, I’m just going to label this Observed. And then the next one I’m going to label Expected, so I can tell them apart. Now I’m going to come up here, and I’m going to copy these numbers from Part A up here in my Observed column. So I’ve got — whoops! — 20, 41, 34, and 19. We’re going to do the same thing for my Expected column. Make sure they’re in the same order so your answer can come out right.

Now I’ve got the data here in StatCrunch. These are the frequency counts. This is what we need for goodness of fit testing. And it’s really easy. Come up to Stat –> Goodness-of-fit –> Chi-Square Test. The observed is the Observed; the expected, Expected; come down here and hit Compute!, et voilà! There’s our test statistic right there in the results window. So it’s the next to last number there in that first table. How many decimal places do I want? Three? That’s going to be 0.764. Nice work!
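The number StatCrunch reports is the usual chi-square goodness of fit statistic: the sum over the classes of (observed - expected)² / expected. A minimal Python sketch follows. The observed counts are the ones from Part A, but the expected counts are placeholders (an even split of 114), not the values from Part C, so the result will not be 0.764.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes, paired up in the same order."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [20, 41, 34, 19]            # from Part A
expected = [28.5, 28.5, 28.5, 28.5]    # placeholder values, NOT the Part C answers

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

As in StatCrunch, the two lists have to be in the same class order, or the statistic comes out wrong.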

“Identify the P-value.” Well, the P-value is right there next to the test statistic. It again wants three decimal places. Excellent!

And now “state the final conclusion that addresses the original claim.” Well, our P-value is going to be well above any significance level that we’re going to want to test for. Here our significance level is 1%. Here we’ve got 85.8%, so we’re definitely outside the region of rejection, therefore we fail to reject. Whenever we fail to reject, there is not sufficient evidence. Well done!
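As a check on the P-value and the decision, here is a small Python sketch. It assumes 3 degrees of freedom (four classes minus one), which reproduces the 85.8% seen in the results window. For exactly 3 degrees of freedom the upper-tail chi-square probability has a closed form, so no statistics package is needed; the function name chi_square_sf_df3 is made up for this example.

```python
from math import erfc, exp, pi, sqrt

def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

test_statistic = 0.764   # from the results window
alpha = 0.01             # the 1% significance level

p_value = chi_square_sf_df3(test_statistic)

# Reject only when the P-value is at or below the significance level.
decision = "reject" if p_value <= alpha else "fail to reject"
print(round(p_value, 3), decision)
```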

And that's how we do it at Aspire Mountain Academy. Feel free to leave your comments below and let us know how good a job we did or how we can improve. And if your stats teacher is just boring or doesn't want to help you learn stats, then go to aspiremountainacademy.com, where you can learn more about accessing our lecture videos or provide feedback on what you’d like to see. Thanks for watching! We’ll see you in the next video!
