Lab 3B: Confound It All!
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Finding data in new places
- Since your first forays into doing data science, you've used data from two sources:
– Built-in datasets from RStudio.
– Campaign data from the Campaign Manager.
- Data can be found in many other places though, especially online.
- In this lab, we'll read an observational study dataset from a website.
– We'll then use this data to explore what factors are associated with a person's lung capacity.
 
Importing our data
- Rather than exporting the data and then uploading and importing it, we'll pull the data straight from the webpage into R.
- You can find the data online here:
– (Right-click and select Open in New Window)
https://raw.githubusercontent.com/thinkdataed/dataset/main/fev.csv
- Click on the Import Dataset button under the Environment tab.
– Then click on the From Text (readr) option.
– Type or copy/paste the URL into the box.
– Click Update.
- Before importing, change the following Import Options:
– Name: lungs
– Uncheck First Row as Names
– Change Delimiter to Whitespace
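
The Import Dataset button writes R code for you behind the scenes. As a rough sketch of the same idea in code, assuming the file is whitespace-separated with no header row (as the Import Options above suggest), base R's read.table can read from a file path or a URL. The helper name read_lungs is hypothetical, and the code RStudio generates will use readr rather than this base R function.

```r
# Hypothetical helper: read a whitespace-delimited file with no header row.
# `source` can be a local file path or a URL.
read_lungs <- function(source) {
  read.table(
    source,
    header = FALSE,            # same idea as unchecking "First Row as Names"
    sep = "",                  # "" means any run of whitespace separates values
    stringsAsFactors = FALSE
  )
}

# Usage (needs an internet connection), with the URL from this lab:
# lungs <- read_lungs("https://raw.githubusercontent.com/thinkdataed/dataset/main/fev.csv")
# head(lungs)   # columns are called V1, V2, ... until we name them
```

Either route should leave you with a data frame called lungs in your Environment tab.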
 
Our new data
- Variables that were measured include:
– Age in years.
– Lung capacity, measured in liters.
– The youths' heights, in inches.
– Gender: "1" for males, "0" for females.
– Whether the participant was a smoker, "1", or non-smoker, "0".
About the data
- The data come from the Forced Expiratory Volume (FEV) study that took place in the late 1970s.
- The observations come from a sample of 654 youths, aged 3 to 19, in and around East Boston.
- Researchers were interested in answering the research question:
What is the effect of childhood smoking on lung health?
 
Cleaning your data
- Now that we've got the data loaded, we need to clean it to get it ready for use (look at Lab 1F for help). Specifically:
– We want to name the variables: "age", "lung_cap", "height", "gender", "smoker", in that order.
– (1) Write and run code changing the type of variable for gender and smoker from numeric to character.
- After changing the variable types for gender and smoker:
– (2) For gender, write and run code using recode to change "1" to "Male" and "0" to "Female".
– (3) For smoker, write and run code using recode to change "1" to "Yes" and "0" to "No".
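
If the cleaning steps feel abstract, here is a minimal sketch of the pattern on a tiny made-up data frame. The values in toy are invented for illustration and are not from the FEV study. It assumes recode comes from the dplyr package and that dplyr is installed. Only gender is shown; smoker follows the same pattern.

```r
library(dplyr)  # assumed source of recode()

# Made-up rows for illustration only (NOT values from the FEV study),
# laid out like a freshly imported file: five unnamed numeric columns.
toy <- data.frame(V1 = c(10, 15, 17),
                  V2 = c(2.1, 3.4, 3.9),
                  V3 = c(55, 66, 70),
                  V4 = c(0, 1, 1),
                  V5 = c(0, 1, 0))

# Name the variables, in order.
names(toy) <- c("age", "lung_cap", "height", "gender", "smoker")

# Change a variable from numeric to character ...
toy$gender <- as.character(toy$gender)

# ... then swap the codes for readable labels.
toy$gender <- recode(toy$gender, "1" = "Male", "0" = "Female")

toy$gender
```

The type change comes first because the codes being matched, "1" and "0", are written here as character strings.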
Analyzing our data
- Our lungs data is from an observational study.
- (4) Write down a reason the researchers couldn't use an experiment to test the effects of smoking on children's lungs.
- Observational studies are often helpful for analyzing how variables are related.
– (5) Do you think that a person's age affects their lung capacity? Make a sketch of what you think a scatterplot of the two variables would look like and explain.
- (6) Write and run code using the lungs data to create an xyplot of age and lung_cap. Interpret the plot and describe why the relationship between the two variables makes sense.
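
As a reminder of how the formula in an xyplot is laid out, here is a small sketch using a different pair of variables on made-up values (not the FEV data). It assumes xyplot from the lattice package; your class setup may load it through another package.

```r
library(lattice)  # assumed source of xyplot(); ships with standard R installations

# Made-up values for illustration only (NOT the FEV data).
toy <- data.frame(age    = c(5, 8, 10, 12, 15, 18),
                  height = c(43, 50, 55, 59, 65, 68))

# The formula reads "y ~ x": height goes on the vertical axis,
# age goes on the horizontal axis.
p <- xyplot(height ~ age, data = toy)
print(p)
```

Swapping in other variable names and a different data frame follows the same y ~ x pattern.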
Smoking and lung capacity
- (7) Write and run code making a plot that can be used to answer the statistical investigative question:
Do people who smoke tend to have lower lung capacity than those who do not smoke?
- (8) Use your plot to answer the question.
– (9) Were you surprised by the answer? Why?
– (10) Can you suggest a possible confounding factor that might be affecting the result?
 
Let's compare
- (11) Write and run code creating three subsets of the data:
– One that includes only 13-year-olds ...
– One that includes only 15-year-olds ...
– and one that includes only 17-year-olds.
- (12) Write and run code making a plot that compares the lung capacity of smokers and non-smokers for each subset.
- (13) How does the relationship between smoking and lung capacity change as we increase the age from 13 to 15 to 17?
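
Comparing groups within a single age is a way of holding a possible confounding factor fixed. A minimal sketch of that idea on made-up values (not the FEV data) might look like the following. Base R's subset and lattice's bwplot are used here as one option among several, and the name age_10 is hypothetical.

```r
library(lattice)  # assumed source of bwplot(); ships with standard R installations

# Made-up values for illustration only (NOT the FEV data).
toy <- data.frame(
  age      = c(10, 10, 10, 10, 12, 12, 12, 12),
  lung_cap = c(2.0, 2.3, 2.2, 2.5, 2.9, 3.1, 2.8, 3.3),
  smoker   = factor(c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"))
)

# Keep only the rows for one age, so age is the same for everyone compared.
age_10 <- subset(toy, age == 10)

# Side-by-side boxplots of lung capacity for each smoking group.
p <- bwplot(lung_cap ~ smoker, data = age_10)
print(p)
```

Repeating the same two steps for each age gives one plot per subset, which can then be compared side by side.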
 
Sum it up!
- (14) Does smoking affect lung capacity? If so, how?
– Support your answers with appropriate plots.
– Explain why you included the variables you used in your plots.