Finding the best regression equation given multiple variables

11/3/2018

Intro

Howdy! I'm Professor Curtis of Aspire Mountain Academy here with more statistics homework help. Today, we're going to learn how to find the best regression equation given multiple variables. Here's our problem statement: The accompanying table provides data for tar, nicotine, and carbon monoxide (CO) contents in a certain brand of cigarette. Find the best regression equation for predicting the amount of nicotine in a cigarette. Why is it best? Is the best regression equation a good regression equation for predicting the nicotine content? Why or why not?

Part 1

OK, the first part of our problem asks us to find the best regression equation, and notice we’ve got three different answer options to select from. And that’s because we’re looking at three different models. So the first model (answer option A) has just the carbon monoxide content. The second answer option (answer option B) — that model has only the tar content. And the last model (answer option C) has both the tar and the carbon monoxide for variables.

So we need to make regression equations for each of these models. And we need to compare values from each of those models. So to make the models, first we need to get the data and dump it into StatCrunch. So to do that, I’m going to click on this icon; this brings up a table with the data. And now I’m going to stick that data into StatCrunch. I’ll resize this window so we can see a little bit better everything that’s going on.

And now, to make these first two models, we could go into Stat –> Regression –> Simple Linear. But we know we’re going to have a model with two variables there at the end. So let’s just use the one menu option of going to Stat –> Regression –> Multiple Linear.

Here in my options window, I can select my Y-variable. This is what comes out of my regression model, which in this case is the nicotine. And then the first model I want to make has just the carbon monoxide, so I’m going to select that here for my X-variable.

There are no interactions with this model. Interactions are where you have more than one variable being multiplied together to make another term in your regression model equation. Here we only have single variables in each individual term, so there are no variables being multiplied together. And so we don’t have any interactions, so there’s nothing to select here.
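To make the idea of an interaction concrete, here is a minimal Python sketch. The function name and the sample numbers are made up for illustration; they are not the data from this problem. It just shows what the extra term would look like if we had one.

```python
def add_interaction(tar, co):
    """Build predictor rows [tar, co, tar*co]; the last entry is the interaction term."""
    return [[t, c, t * c] for t, c in zip(tar, co)]

# Made-up values for illustration only, NOT the data from the problem
rows = add_interaction([10.0, 12.0], [9.0, 11.0])
print(rows)
```

Our models stop at the first two columns, since no term multiplies tar and CO together.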

And these default options here, where these boxes are not selected, are just fine for us. So we press Compute! And out comes this results window that has the results that we need for evaluating this particular regression model.
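If you wanted to see roughly what happens behind that Compute! button, the sketch below fits a least-squares line in plain Python. This is only an illustration under stated assumptions: `fit_linear` is a hypothetical helper, not StatCrunch's actual routine, and the numbers are made up from a known equation so we can check the answer. They are not the cigarette data.

```python
def fit_linear(columns, y):
    """Least-squares fit of y = b0 + b1*x1 + ...; columns is a list of predictor lists."""
    n = len(y)
    # Design matrix: a leading 1 for the intercept, then one value per predictor
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    # The matrix is now diagonal, so each coefficient is a simple ratio
    return [a[i][p] / a[i][i] for i in range(p)]

# Made-up numbers built from a known equation, NOT the cigarette data
tar = [8.0, 10.0, 12.0, 14.0, 16.0]
co = [9.0, 8.0, 13.0, 12.0, 17.0]
nicotine = [0.1 + 0.1 * t - 0.05 * c for t, c in zip(tar, co)]

model_a = fit_linear([co], nicotine)       # CO only
model_b = fit_linear([tar], nicotine)      # tar only
model_c = fit_linear([tar, co], nicotine)  # tar and CO: [intercept, tar slope, CO slope]
```

Each call returns the intercept first and then one slope per predictor, in the order the predictors were given.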

To help us evaluate the models, what I’ve done is gone to Excel and made a little chart here. So what we can do is copy that information over into Excel for each of the models, and then we can compare in one spot which model is the best and then take the values from that model and stick it into the answer fields that are appropriate for our assignment.

So the first thing we’re looking for is the adjusted R-squared value. That’s going to be down here towards the bottom of my results window. The P-value — notice there’s different P-values here in my parameter estimates table. The one that I want, though, is here in the ANOVA table. I want this P-value here in the ANOVA table; that’s the one that I want for the model.
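For reference, here is a short Python sketch of the two model-level numbers we're reading off the results window. It uses the standard textbook formulas; the inputs (R-squared of 0.9, 25 data points, 2 predictors) are hypothetical values chosen for illustration and are not taken from this problem.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictor variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F statistic, the test statistic behind the ANOVA table's P-value."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical inputs for illustration only
adj = adjusted_r_squared(0.9, 25, 2)
f = f_statistic(0.9, 25, 2)
```

The P-value in the ANOVA table comes from comparing that F statistic against an F distribution with k and n - k - 1 degrees of freedom, which is why it speaks for the model as a whole rather than for any one coefficient.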

And then, just in case we end up selecting this model, so we don’t have to go back and redo all of this, let’s just take the values for the intercept and the slope and stick them here in our table in Excel. Our assignment is asking us to round to three decimal places, so I’m going to take these values out to five decimal places. I don’t want to transcribe the entire number, but I also don’t want the shortened value to throw off my final answer when I round it. Carrying two extra decimal places takes care of that, so that means I want five. There. And there’s the first model.
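Here's a small Python sketch of that rounding habit. The helper `round_to` is a made-up name, and the value 0.1234546 is a hypothetical number picked to show the effect; only 0.09596 comes from this problem. Carrying extra digits makes a rounding slip less likely, though it can't rule one out in every case.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_to(value, places):
    """Round half-up to the given number of decimal places."""
    step = Decimal(1).scaleb(-places)  # e.g. 0.001 when places is 3
    return Decimal(value).quantize(step, rounding=ROUND_HALF_UP)

# The slope from this problem, carried to five places, then rounded to three
slope_answer = round_to("0.09596", 3)

# Hypothetical full-precision value, for illustration only
full = "0.1234546"
direct = round_to(full, 3)                    # rounding the full number once
via_five = round_to(round_to(full, 5), 3)     # two extra places, then three
via_four = round_to(round_to(full, 4), 3)     # only one extra place
print(slope_answer, direct, via_five, via_four)
```

With this particular number, keeping five places gives the same answer as rounding the full value, while keeping only four places lands on a different third decimal.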

Now to get the second model. OK, notice what we need for the second model. The second model is we’re just looking at the tar. So I’m going to replace the carbon monoxide with the tar. Instead of going through the menu options again in StatCrunch, I’m just going to come up to the Options button here, click on Edit, and then I’m just going to switch from CO to Tar. Hit Compute! And now I’ve got a new model.

And I can take those values out. So my adjusted R-squared value is down there at the bottom of the screen. And my P-value I get from my ANOVA table. Notice it says we have less than 0.001; that’s for all practical purposes zero. And then I take my intercept value and my slope. And that’s the second model.

Now I’m ready to make the third model. I’m going to go back into my options window to do that. Notice the order in which the variables appear. Tar is first, and then comes CO. So I’m going to put those variables in the same order in my regression model so that when I’m transcribing numbers out, it’ll be easier not to get them confused. And here’s my last model. Adjusted R-squared value goes here. P-value goes there. I’ve got an intercept, slope 0.09596, and the last slope value — notice the negative sign there.

OK, so now I’ve got my values here. Now I can bring that over here, and we can compare and see what we’re looking at here. We want a high value for adjusted R-squared and a low value for the P-value. Well, looking at the P-values, answer option A has a significantly higher P-value than the other two options, so we’re going to take that, and we’re just going to cross that off our list. So we’re not going to look at that any more.

And now we’re choosing between answer options B and C. They have the same P-value, so we look at the adjusted R-squared value. And answer option C has a significantly higher value for adjusted R-squared. So we’re going to select answer option C. If the adjusted R-squared values were reasonably close together, then we would say that adding in this extra variable doesn’t give you that much more benefit from a higher adjusted R-squared value, so it’s not going to make that much better of a model. But this is a ten percentage point difference here; that’s pretty significant. So we’re going to say that answer option C is the one that we’re going to want to select. And if I wanted to highlight that, I could do something like that so I can make sure I get the right numbers out.
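The comparison we just walked through can be sketched as a little Python function. Everything specific here is an assumption for illustration: the function name `pick_best`, the 0.05 cutoff for a "low" P-value, the 0.02 cutoff for "reasonably close," and the table values themselves, which are placeholders and not the StatCrunch output from this problem.

```python
def pick_best(models, alpha=0.05, close=0.02):
    """models: list of (name, number of predictors, adjusted R-squared, P-value)."""
    # Step 1: cross off any model whose overall P-value is not small
    kept = [m for m in models if m[3] < alpha]
    if not kept:
        return None
    # Step 2: find the highest adjusted R-squared among what is left
    top = max(m[2] for m in kept)
    # Step 3: among models within `close` of the top, prefer fewer predictors
    near = [m for m in kept if top - m[2] <= close]
    return min(near, key=lambda m: (m[1], -m[2]))[0]

# Placeholder values for illustration only, NOT the results from this problem
table = [
    ("A: CO only", 1, 0.05, 0.15),
    ("B: tar only", 1, 0.80, 0.0001),
    ("C: tar and CO", 2, 0.90, 0.0001),
]
choice = pick_best(table)
```

With a ten-point gap between B and C, the extra variable earns its place and C wins. If that gap shrank to a point or so, the same rule would fall back to the simpler one-variable model.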

And then I just transcribe my numbers here. So I want three decimal places. I’ve got them rounded to five so I can avoid rounding errors when I’m putting them here in my answer field. So the first value is my intercept, and then I want the first slope, and then I want the second slope. Again, note the negative sign. Good job!

Part 2

Now the second part of our problem asks us, “Why is this equation best?” Well, as we just got done saying, we’ve got a high adjusted R-squared value and a low P-value; those are the main two determinants that we’re looking for. The other thing we look for is the number of variables, and though we’ve got more variables in this equation than we do in the other two, we’ve got a significantly higher adjusted R-squared value that makes adding that extra variable worthwhile.

So we want highest adjusted R-squared value, so looking at my answer options, it could be B, or it could be D. I want low P-value, so we’ve got low P-value here, and low P-value here, so that’s good. This says, for answer option B, “removing either predictor noticeably decreases the quality of the model.” And that’s true. If you take that second variable out, notice you get a ten percentage point drop in adjusted R-squared.

So that’s a possibility, but let’s check D just to be sure. It says only a single predictor variable is in our equation, and we clearly have two. So it can’t be D; it has to be B. Fantastic!

Part 3

And now the last part of our problem asks, “Is the best regression equation a good regression equation for predicting the nicotine content? Why or why not?” Well, here you want to be looking at your P-value. Here the P-value is, for all practical purposes, zero, which is as low as it gets. And so that’s going to tell us that, yes, this model will fit our data pretty well.

So we want the answer option that says, “Yes.” So that’s going to be A or D. And A says, “Small P-value indicates good fit.” Answer option D says, “Large P-value indicates good fit.” Obviously that’s not true, so we want answer option A. Good job!

And that's how we do it at Aspire Mountain Academy. Feel free to leave your comments below and let us know how good a job we did or how we can improve. And if your stats teacher is just boring or doesn't want to help you learn stats, then go to aspiremountainacademy.com, where you can learn more about accessing our lecture videos or provide feedback on what you’d like to see. Thanks for watching! We’ll see you in the next video!
