LAB 4F: This Model Is Big Enough for All of Us
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Building better models
- So far, in the labs, we've learned how to make predictions using the line of best fit, also known as linear models or regression models.
- We've also learned how to measure our model's prediction accuracy by cross-validation.
- In this lab, we'll investigate the following question:
Will including more variables in our model improve its predictions?
 
Divide & Conquer
- (1) Start by loading the `movie` data, then write and run code splitting it into two sets (see Lab 4C for help).
  – A set named `training` that includes 75% of the data.
  – A set named `test` that includes the remaining 25%.
  – Remember to use `set.seed`.
- (2) Write and run code creating a linear model, using the `training` data, that predicts `gross` using `runtime`.
  – (3) Write and run code calculating the mean squared error (MSE) of the model by making predictions for the `test` data.
- (4) Do you think that a movie's `runtime` is the only factor that goes into how much a movie will make? What else might affect a movie's `gross`?
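
If the split, fit, and MSE steps feel abstract, the sketch below walks through the same sequence on made-up data. It is only an illustration in base R, so it may not match the functions used in Lab 4C. The data frame `toy` and its columns `x` and `y` are invented stand-ins, not the `movie` data.

```r
# Made-up stand-in data; `toy`, `x`, and `y` are hypothetical names.
set.seed(123)                       # makes the random split reproducible
n <- 200
toy <- data.frame(x = runif(n, 60, 180))
toy$y <- 5 + 0.3 * toy$x + rnorm(n, sd = 10)

# Randomly choose 75% of the row numbers for the training set.
train_rows <- sample(seq_len(n), size = round(0.75 * n))
toy_training <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]      # the remaining 25%

# Fit a one-variable linear model using the training rows only.
fit_one <- lm(y ~ x, data = toy_training)

# Predict for the held-out rows, then average the squared errors.
pred_one <- predict(fit_one, newdata = toy_test)
mse_one <- mean((toy_test$y - pred_one)^2)
mse_one
```

The model never sees the test rows while it is being fit, so `mse_one` estimates how well it predicts new observations.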
Including more info
- Data scientists often find that including more relevant information in their models leads to better predictions.
  – (5) Fill in the blanks below to predict `gross` using `runtime` and `reviews_num`.
    `lm(____ ~ ____ + ____, data = training)`
- (6) Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
- (7) Write down the code you would use to include a third variable, of your choosing, in your `lm()`.
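
One possible way to compare a one-variable model with a two-variable model is sketched below on made-up data. The data frame `toy`, its columns `x1`, `x2`, and `y`, and the helper `test_mse` are all hypothetical names chosen for this illustration; none of them come from the `movie` data or from the lab's packages.

```r
# Made-up data in which y really does depend on both x1 and x2.
set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 + 3 * toy$x2 + rnorm(n)

# 75% / 25% split into training and test rows.
train_rows <- sample(seq_len(n), size = round(0.75 * n))
toy_training <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Hypothetical helper: mean squared error of a fitted model on test data.
test_mse <- function(fit, test_data, outcome) {
  mean((test_data[[outcome]] - predict(fit, newdata = test_data))^2)
}

# Additional predictors are joined with `+` on the right side of `~`.
fit_one <- lm(y ~ x1, data = toy_training)
fit_two <- lm(y ~ x1 + x2, data = toy_training)

c(one_variable  = test_mse(fit_one, toy_test, "y"),
  two_variables = test_mse(fit_two, toy_test, "y"))
```

Because `y` was built from both `x1` and `x2`, the two-variable model should show the lower test MSE here. With real data, that is something to check rather than assume.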
On your own
- (8) Write down which other variables in the `movie` data you think would help you make better predictions.
  – (9) Are there any variables that you think would not improve our predictions?
- (10) Write and run code creating a model for all of the variables you think are relevant.
  – (11) Assess whether your model makes more accurate predictions for the `test` data than the model that included only `runtime` and `reviews_num`.
- (12) With your neighbors, determine which combination of variables leads to the best predictions for the `test` data.
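
When several combinations of variables are in play, it can help to keep the candidate formulas in a list and compute the test MSE for each one in the same way. The sketch below does this on made-up data. The columns `a`, `b`, `noise`, and `y` are hypothetical, and `noise` is deliberately unrelated to `y` to stand in for a variable that may not help.

```r
# Made-up data: y depends on a and b, but not on noise.
set.seed(7)
n <- 400
toy <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$a - 1 * toy$b + rnorm(n)

# 75% / 25% split into training and test rows.
train_rows <- sample(seq_len(n), size = round(0.75 * n))
toy_training <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Candidate models, each written as a formula.
candidates <- list(
  a_only     = y ~ a,
  a_and_b    = y ~ a + b,
  everything = y ~ a + b + noise
)

# Fit each candidate on the training rows and score it on the test rows.
mse_by_model <- sapply(candidates, function(f) {
  fit <- lm(f, data = toy_training)
  mean((toy_test$y - predict(fit, newdata = toy_test))^2)
})

sort(mse_by_model)   # smallest test MSE first
```

Scoring every candidate on the same test rows keeps the comparison fair, and sorting puts the best-predicting combination at the top.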