LAB 4D: Interpreting Correlations
Lab 4D - Interpreting correlations
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Some background...
- 
So far, we’ve learned about measuring the success of a model based on how close its predictions come to the actual observations.
 - 
The correlation coefficient is a tool that gives us a fairly good idea of how these predictions will turn out without having to make predictions on future observations.
 - 
For this lab, we will be using the
moviedata set to investigate the following question:Which variables are better predictors of a movie's
critics_ratingwhen the predictions are made using a line of best fit? 
Correlation coefficients
- 
The correlation coefficient describes the strength and direction of the linear trend.
 - 
It's only useful when the trend is linear and both variables are numeric.

 - 
(1) Are these variables linearly related? Why or why not?
 
Correlation review I

- 
Correlation coefficients with values close to 1 are very strong with a positive slope. Values close to -1 means the correlation is very strong with a negative slope.
 - 
(2) Does this plot have a positive or negative correlation?
 
Correlation review II

- 
Recall that if there is no linear relation between two numerical variables, the correlation coefficient is close to 0.
 - 
(3) What do you guess the correlation coefficient will be for these two variables?
 
The movie data
- 
(4) Write and run code loading the
moviedata using thedatacommand. - 
The data comes from a variety of sources like IMDB and Rotten Tomatoes.
– The
critics_ratingcontains values between 0 and 100, 100 being the best.– The
audience_ratingcontains values that range between 0 and 10, 10 being the best.–
n_criticsandn_audiencedescribe the number of reviews used for the ratings.–
grossandbudgetdescibes the amount of money the film made and took to make. 
Calculating Correlation Coefficients!
- 
We can use the
cor()function to find the particular correlation coefficient of the variables from the previous plot, which happen to beaudience_ratingandcritics_rating. - 
But note, the
cor()function removes any observations which contain anNAvalue in either variable. - 
(5) Write and run code calculating the correlation coefficient for these variables using the
cor()function. The inputs to the functions work just like the inputs of thexyplotfunction. 
Now answer the following.
- 
(6) What was the value of the correlation coefficient you calculated?
 - 
(7) How does this actual value compare with the one you estimated previously?
 - 
(8) Does this indicate a strong, weak, or moderate association? Why?
 - 
(9) How would the scatterplot need to change in order for the correlation to be stronger?
 - 
(10) How would it need to change in order for the correlation to be weaker?
 
Correlation and Predictions
- 
(11) Find the two variables that look to have the strongest correlation with
critics_rating.– (12) Compute the correlation coefficients for
critics_ratingand each of the two variables.– (13) Use the correlation coefficient to determine which variable has a stronger linear relationship with
critics_rating. - 
(14) Write and run code fitting two
lm()models to predictcritics_ratingwith each variable and compute the MSE for each.– (15) Use the MSE to determine which variable is a better predictor of
critics_rating. - 
(16) How are the correlation coefficient and the MSE related?
 
On your own
- 
(17) Select two different numerical variables from the
moviedata. Plot the variables using thexyplot()function.– (18) Would calculating a correlation coefficient for the two variables be appropriate? Justify your answer.
– (19) Predict what value you think the correlation coefficient will be. Compare this value to the actual value. Finally, interpret what the actual correlation coefficient means.
 - 
(20) Work with your classmates to determine which two variables have the strongest correlation coefficient.
– (21) Why do you think these variables are so strongly related?
– (22) Is using the correlation coefficient to describe the relationship appropriate and why/why not?