Finding the best predicted value using the best multiple regression model

5/31/2019

Intro

Howdy! I'm Professor Curtis of Aspire Mountain Academy here with more statistics homework help. Today we're going to learn how to find the best predicted value using the best multiple regression model. Here's our problem statement: The accompanying table shows results from regressions performed on data from a random sample of 21 cars. The response variable is CITY (fuel consumption in miles per gallon), and the predictor or x variables are WT (weight in pounds), DISP (engine displacement in liters) and HWY (highway fuel consumption in miles per gallon). The equation CITY = -3.16 + 0.822HWY was previously determined to be the best for predicting city fuel consumption. A car weighs 2780 pounds. It has an engine displacement of 1.5 liters, and its highway fuel consumption is 42 miles per gallon. What is the best predicted value of the city fuel consumption? Is that predicted value likely to be a good estimate? Is that predicted value likely to be accurate?

Part 1

OK, so there are two parts to this problem. And in the first part, we're asked for the best predicted value for the city fuel consumption. We're already given the regression equation that is the best. But if we wanted, we can actually click on this icon here and take a look at the different regression equations from which that selection was made. In picking the best regression equation, remember that we're having to balance out three different values. We want the best P value, which is the lowest P value. And technically because the equations that we're looking at have different numbers of variables in them, we want to compare adjusted R squared values and not R squared values because adjusted R squared values account for the differences in the number of variables between equations. And the best adjusted R squared value is typically the highest. And then we look at the number of variables in our regression equation. And typically we want the simplest equation. And the simplest equation will have the fewest variables. So there are three different criteria that we're looking to balance between to pick out the right equation.
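To see why adjusted R squared is the fairer comparison, here is a minimal Python sketch of the standard adjustment formula. The sample size of 21 comes from the problem statement, but the R squared value of 0.90 is a made-up number for illustration only; it is not taken from the regression table.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictor variables fit to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


n = 21            # sample size from the problem statement
r_squared = 0.90  # hypothetical R^2, for illustration only

# Same R^2, different numbers of predictor variables
one_var = adjusted_r_squared(r_squared, n, k=1)
two_var = adjusted_r_squared(r_squared, n, k=2)

print(round(one_var, 4))  # 0.8947
print(round(two_var, 4))  # 0.8889
```

With the same R squared, the equation with more variables gets the lower adjusted value. That penalty is what lets us compare equations of different sizes.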

Now the best equation that was selected is this last one here. It has only one variable. It's got the best P value. Actually, all these equations here have the best P value, so we're not really using P value in our assessment. But then look at the adjusted R squared value. You notice it's not the highest. The highest is listed here; it's 0.936. Now me personally, if I were selecting this based on my experience working in industry, I would take the 0.936 because look at this regression equation. It's got two variables in it. For a little more complexity, you get a little bit more adjusted R squared value, so it's a little better predictor. So based on my experience in industry, that's what I would select.

That's not how the author of your textbook is thinking though, and the author of your textbook is the one who actually wrote out these homework problems. If I take my calculator here, I can show you what I think he's looking at. What's the percentage difference between these two adjusted R squared values? If I take the first one (0.936) and subtract out the 0.920, the difference between them is 0.016. If I divide that by the 0.936, see, there's only about a 2% difference between those adjusted R squared values. Now this is just me thinking, I haven't actually talked to the guy, but looking at this 2% difference, you make your equation twice as complex by adding in an extra variable, and you're only getting about a 2% benefit in that adjusted R squared value out of it. So from his line of thinking, that's not worthwhile. And that's why I think he's selecting this last equation here to be the best regression equation.
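The calculator steps above can be sketched in a few lines of Python. The two adjusted R squared values are the ones read off the table in the video; the variable names are just labels chosen for this sketch.

```python
adj_r2_two_var = 0.936  # highest adjusted R^2 (two-variable equation)
adj_r2_one_var = 0.920  # adjusted R^2 of the selected one-variable equation

# Absolute gap, then the gap as a fraction of the larger value
difference = adj_r2_two_var - adj_r2_one_var
relative_difference = difference / adj_r2_two_var

print(f"{difference:.3f}")           # 0.016
print(f"{relative_difference:.1%}")  # 1.7%
```

So the "about 2%" quoted above is 1.7% rounded up.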

That said, we can actually use that regression equation. It's listed here in the problem statement and we'll just use that to make our prediction. So we have -3.16, and we're going to add to that the second term, which is 0.822 times the highway fuel consumption, which here is listed as 42 miles per gallon. And it says here "Type in an integer or a decimal. Do not round." So I'm going to put that in here. And the units on the fuel consumption will be miles per gallon, the same as the highway fuel consumption. You see here, fuel consumption is in miles per gallon. So that's what I'm going to put here. Nice work!
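If you'd rather check the arithmetic with code than with a calculator, here is a minimal Python sketch. The coefficients come straight from the equation in the problem statement; the function name is just one made up for this example. Notice that the weight and the engine displacement never get used, because the best equation only has HWY in it.

```python
def predict_city_mpg(hwy_mpg):
    """Predicted city fuel consumption from CITY = -3.16 + 0.822 * HWY."""
    return -3.16 + 0.822 * hwy_mpg


# The car in the problem has a highway fuel consumption of 42 mpg
prediction = predict_city_mpg(42)

# round() only tidies floating-point noise; the answer is exact to three decimals
print(round(prediction, 3))  # 31.364
```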

Part 2

Now the second part of the problem has some drop down fields. So let's take a look at these. The predicted value is likely to be a good estimate. It's likely to be a good estimate because we used the best regression equation to bring us that estimate, and that regression equation has the best possible P value it could have. It has a really good adjusted R squared value. And it's got the lowest number of variables that you can have in the equation. So it's got the best balance of those three sets of criteria. And therefore it's likely to be a good estimate.

However, that estimate is not likely to be very accurate. The reason why it's not likely to be very accurate is your sample size is really small. You've only got 21 cars. So you've got 21 data points in your data set. That's not a whole lot. So because we've only got a small sample size here, that regression equation is not likely to be very accurate. Nice work!

And that's how we do it at Aspire Mountain Academy. Be sure to leave your comments below and let us know how good a job we did or how we can improve. And if your stats teacher is boring or just doesn't want to help you learn stats, go to aspiremountainacademy.com, where you can learn more about accessing our lecture videos or provide feedback on what you'd like to see. Thanks for watching! We'll see you in the next video.
