Lab 3B: Confound It All!

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

Finding data in new places

  • Since your first forays into doing data science, you've used data from two sources:

    – Built-in datasets from RStudio.

    – Campaign data from the Campaign Manager.

  • Data can be found in many other places though, especially online.

  • In this lab, we'll read an observational study dataset from a website.

    – We'll then use this data to explore which factors are associated with a person's lung capacity.

Importing our data

  • Rather than exporting the data and then uploading and importing it, we'll pull the data straight from the webpage into R.

  • You can find the data online here:

    – (Right-click and select Open in New Window)
    https://raw.githubusercontent.com/thinkdataed/dataset/main/fev.csv

  • Click on the Import Dataset button under the Environment tab.

    – Then click on the From Text (readr) option.

    – Type or copy/paste the URL into the box.

    – Click Update.

  • Before importing, change the following Import Options:

    – Name: lungs

    – Uncheck the First Row as Names box.

    – Change Delimiter to Whitespace.
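
The same import can also be sketched in code instead of through the dialog. This is a minimal sketch that assumes base R's read.table (the dialog itself uses readr) and treats the file as having no header row and whitespace-separated values, matching the options above. The rows and the name lungs_sample are made up so the example runs without a network connection.

```r
# Made-up rows laid out the way the import options assume:
# no header line, five values per row separated by whitespace.
# These are NOT values from the study.
sample_rows <- c(
  "10 2.1 58.0 0 0",
  "14 3.4 66.5 1 1",
  "16 3.9 68.0 1 0"
)

# header = FALSE mirrors unchecking "First Row as Names";
# the default sep = "" splits on any run of whitespace.
lungs_sample <- read.table(text = sample_rows, header = FALSE)
str(lungs_sample)

# For the real file, pass the URL in place of `text`:
# lungs <- read.table(
#   "https://raw.githubusercontent.com/thinkdataed/dataset/main/fev.csv",
#   header = FALSE
# )
```

Note that read.table labels unnamed columns V1 through V5, while the readr dialog labels them X1 through X5. Either way, the columns still need proper names.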

Our new data

  • Variables that were measured include:

    – Age in years.

    – Lung capacity, measured in liters.

    – Height, measured in inches.

    – Gender: "1" for males, "0" for females.

    – Smoking status: "1" for smokers, "0" for non-smokers.

About the data

  • The data come from the Forced Expiratory Volume (FEV) study that took place in the late 1970s.

  • The observations come from a sample of 654 youths, aged 3 to 19, in and around East Boston.

  • Researchers were interested in answering the research question:

    What is the effect of childhood smoking on lung health?

Cleaning your data

  • Now that we've got the data loaded, we need to clean it to get it ready for use (see Lab 1F for help). Specifically:

    – We want to name the variables: "age", "lung_cap", "height", "gender", "smoker", in that order.

    (1) Write and run code changing the type of variable for gender and smoker from numeric to character.
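
One possible way to approach the naming and type-change steps is sketched below in base R. Lab 1F may show a different approach. The data frame is a made-up stand-in for the imported data, and its placeholder column names V1 to V5 are assumptions.

```r
# Made-up stand-in for the imported data (NOT study values);
# V1..V5 are placeholder names for the unnamed columns.
lungs <- data.frame(
  V1 = c(10, 14, 16),
  V2 = c(2.1, 3.4, 3.9),
  V3 = c(58.0, 66.5, 68.0),
  V4 = c(0, 1, 1),
  V5 = c(0, 1, 0)
)

# Name the variables in the order listed above.
names(lungs) <- c("age", "lung_cap", "height", "gender", "smoker")

# Convert the two coded variables from numeric to character.
lungs$gender <- as.character(lungs$gender)
lungs$smoker <- as.character(lungs$smoker)

# str() shows each variable's type, so the change can be checked.
str(lungs)
```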

  • After changing the variable types for gender and smoker:

    (2) For gender, write and run code using recode to change "1" to "Male" and "0" to "Female".

    (3) For smoker, write and run code using recode to change "1" to "Yes" and "0" to "No".
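
The two recoding steps might look like the sketch below. It assumes that recode here refers to dplyr's recode() and that it is used inside mutate(). The rows are made up for illustration.

```r
library(dplyr)  # assumption: recode() and mutate() come from dplyr

# Made-up rows (NOT study values), with gender and smoker
# already stored as character values.
lungs <- data.frame(
  age = c(10, 14, 16),
  gender = c("0", "1", "1"),
  smoker = c("0", "1", "0"),
  stringsAsFactors = FALSE
)

# Each "old" = "new" pair replaces one code with a readable label.
lungs <- mutate(
  lungs,
  gender = recode(gender, "1" = "Male", "0" = "Female"),
  smoker = recode(smoker, "1" = "Yes", "0" = "No")
)
lungs
```

Recoding after the conversion to character matters here: the old values are matched as the strings "1" and "0", not as numbers.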

Analyzing our data

  • Our lungs data is from an observational study.

  • (4) Write down a reason the researchers couldn't use an experiment to test the effects of smoking on children's lungs.

  • Observational studies are often helpful for analyzing how variables are related.

    (5) Do you think that a person's age affects their lung capacity? Make a sketch of what you think a scatterplot of the two variables would look like and explain.

  • (6) Write and run code using the lungs data to create an xyplot of age and lung_cap. Interpret the plot and describe why the relationship between the two variables makes sense.
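
As a rough sketch of the plotting call, xyplot takes a formula of the form y ~ x. The example assumes xyplot comes from the lattice package and uses made-up values, so the picture it draws says nothing about the real data.

```r
library(lattice)  # assumption: xyplot() is lattice's xyplot()

# Made-up rows (NOT study values), only so the example runs.
lungs <- data.frame(
  age = c(5, 8, 10, 12, 14, 16, 18),
  lung_cap = c(1.2, 1.9, 2.4, 2.9, 3.3, 3.8, 4.0)
)

# Formula is y ~ x: lung capacity on the vertical axis,
# age on the horizontal axis.
age_plot <- xyplot(lung_cap ~ age, data = lungs)
print(age_plot)
```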

Smoking and lung capacity

  • (7) Write and run code making a plot that can be used to answer the statistical investigative question:

    Do people who smoke tend to have lower lung capacity than those who do not smoke?

  • (8) Use your plot to answer the question.

    (9) Were you surprised by the answer? Why?

    (10) Can you suggest a possible confounding factor that might be affecting the result?
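
One plot that could serve here is a side-by-side boxplot, sketched below with lattice's bwplot. The choice of plot type is an assumption, and the values are made up, so the output does not answer the question for the real data.

```r
library(lattice)  # assumption: bwplot() is lattice's bwplot()

# Made-up rows (NOT study values), only so the example runs.
lungs <- data.frame(
  lung_cap = c(2.0, 2.6, 3.1, 3.5, 2.8, 3.3, 3.0, 3.6),
  smoker = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
)

# One box per smoking group, lung capacity on the vertical axis.
smoke_plot <- bwplot(lung_cap ~ factor(smoker), data = lungs)
print(smoke_plot)

# Group medians give a numeric companion to the picture.
group_medians <- tapply(lungs$lung_cap, lungs$smoker, median)
group_medians
```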

Let's compare

  • (11) Write and run code creating three subsets of the data:

    – One that includes only 13-year-olds ...

    – One that includes only 15-year-olds ...

    – and one that includes only 17-year-olds.

  • (12) Write and run code making a plot that compares the lung capacity of smokers and non-smokers for each subset.

  • (13) How does the relationship between smoking and lung capacity change as we increase the age from 13 to 15 to 17?
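
The subsetting and per-age comparison can be sketched as follows. This version uses base R's subset(), which is an assumption (the lab may intend a different function), and the names age13, age15 and age17 are hypothetical. The values are made up.

```r
library(lattice)  # assumption: bwplot() is lattice's bwplot()

# Made-up rows (NOT study values): four youths at each age.
lungs <- data.frame(
  age = rep(c(13, 15, 17), each = 4),
  lung_cap = c(3.0, 3.2, 3.1, 3.4,
               3.5, 3.8, 3.6, 3.7,
               4.0, 4.3, 3.9, 4.1),
  smoker = rep(c("No", "No", "Yes", "Yes"), times = 3)
)

# Keep only the rows for one age at a time.
age13 <- subset(lungs, age == 13)
age15 <- subset(lungs, age == 15)
age17 <- subset(lungs, age == 17)

# Same plot for each subset, titled so they can be told apart.
plot13 <- bwplot(lung_cap ~ factor(smoker), data = age13, main = "Age 13")
plot15 <- bwplot(lung_cap ~ factor(smoker), data = age15, main = "Age 15")
plot17 <- bwplot(lung_cap ~ factor(smoker), data = age17, main = "Age 17")
print(plot13)
print(plot15)
print(plot17)
```

Comparing smokers and non-smokers within a single age holds age fixed, which is what makes the three plots useful for spotting a confounding factor.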

Sum it up!

  • (14) Does smoking affect lung capacity? If so, how?

    – Support your answers with appropriate plots.

    – Explain why you included the variables you used in your plots.